Blog

Why Hardware Monitoring Needs Infrastructure, Not Just Sensors

Admin — Wed, 18 Mar 2026 09:20:16 GMT

proteanTecs Hardware Monitoring System from Agents & Sensors to Insights

OCP Warns AI Is Compromised by SDC. proteanTecs In-Chip Monitoring Restores Trust | proteanTecs Blog

Admin — Tue, 24 Feb 2026 16:00:00 GMT

Ensuring AI Reliability: Mitigating OCP's Silent Data Corruption Risks.

Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand extensive resources to root cause.

A recent Open Compute Project (OCP) whitepaper, authored by experts from NVIDIA, Google, Meta, Microsoft, and others, underscores the critical impact of SDC on large-scale AI/ML systems.

OCP Says SDC Is on the Rise, Compromising AI Workload Integrity in Data Centers

SDC has emerged as a critical reliability threat to scaling AI training and inference, as it corrupts computations without triggering alerts. Unlike memory bit flips, which error correction codes (ECC) can mitigate, SDCs originate from subtle timing violations, aging effects, or marginal defects that escape standard semiconductor testing and data center monitoring.

The problem has grown worse with GenAI's explosive growth and increasingly complex chip architectures, leading the paper to regard SDC as a "needle in a haystack" challenge. New process nodes push semiconductor boundaries, while the unprecedented scale of intensive AI workloads stresses chipsets to their thermal and timing limits.

The OCP paper walks through multiple stress factors that increase SDC probability in AI hardware, with several significant ones outlined below:

Shrinking Process Geometries Increase Fault Susceptibility
Smaller transistors tighten device margins, increasing vulnerability to transient faults and permanent failures.

Aggressive Voltage and Frequency Scaling Reduces Timing Margins
Dynamic scaling improves performance but narrows timing headroom, making small-delay defects more likely to escape detection.

Increased Current Draw and Power-Delivery Noise Raise Timing Issues
Wider parallel execution and higher clock frequencies increase current draw and PDN noise, making systems more prone to timing-related faults.

Progressive Wear-Out Introduces Time-Dependent Failures
Over time, defects such as electromigration or process marginalities can cause transistors or interconnects to fail occasionally. As a result, hardware that initially passes validation gradually degrades under intensive AI workloads until it fails.

Hardware Faults Remain Hidden Across Software Layers
Errors introduced at the hardware level may surface only after several software transformations, making detection more difficult.

Combined Stress Conditions Amplify Reliability Challenges
The likelihood of SDC grows when several factors, such as voltage droop and high temperature, combine under intensive workloads, making silent errors hard to detect and reproduce. This factor is more significant in AI accelerators, which often operate near the limits of their power and thermal envelopes.

SDC Impact on AI Training and Inference

The OCP whitepaper emphasizes that SDCs pose distinct challenges depending on the type of AI workload. During training, even a single undetected fault can waste months of valuable computational resources by silently corrupting the learning process. In inference deployments, SDCs directly undermine the reliability of AI services by producing incorrect outputs. The impact is especially severe in safety-critical applications such as autonomous vehicles and medical diagnostics.

Workload-specific SDC impacts:

Training: Corrupted Gradients Create an Illusion of Progress

When SDC corrupts values without triggering Not-a-Number (NaN) errors, distributed training propagates this invalid data as legitimate results across multiple cluster accelerators. This contamination can lead to gradient explosion, implosion, or convergence at an incorrect local minimum. Such problems may take a very long time to detect while the training appears to be making forward progress.
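
To see why a NaN check alone is not enough, consider a single flipped exponent bit in a 32-bit gradient value. The sketch below is illustrative only: `flip_bit` and `passes_nan_check` are hypothetical helper names, and the bit position was picked by hand rather than taken from the OCP paper.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` stored as float32 with one bit of its encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def passes_nan_check(grads) -> bool:
    """A finite-value guard: it only rejects NaN and infinity."""
    return all(math.isfinite(g) for g in grads)


clean = [0.5, -0.25, 0.125, 1.0]
corrupted = list(clean)
# Flip the top exponent bit (bit 30) of the first gradient value.
corrupted[0] = flip_bit(clean[0], 30)

print("before:", clean[0], "after:", corrupted[0])
print("passes NaN/Inf check:", passes_nan_check(corrupted))
```

Here 0.5 becomes 2^127 (about 1.7e38), which is still a finite number, so the guard passes and the value would be averaged into the other workers' gradients as if it were legitimate.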

Inference: Persistent Defects Contaminate Thousands of Predictions

Faulty hardware in inference clusters might generate corrupted outputs, potentially affecting thousands of users per hour. Debugging these errors can be highly challenging, as they can bypass detection mechanisms while compromising privacy and integrity policies. Moreover, this troubleshooting process can affect production capacity until the offending node is identified and quarantined.

Why Traditional Controls Miss SDC in AI Fleets

Standard testing methodologies, whether executed in situ or via scheduled maintenance, exhibit notable deficiencies:

In-situ testing, when relying on canary circuits, fails to account for the actual critical-path timing margins, which might decrease due to aging and process variations. This is a particularly vital concern given the rising levels of on-chip variation within a device, a trend highlighted in the 2024 paper, "Manufacturing Roadmap for Heterogeneous Integration and Electronics Packaging."

Periodic maintenance testing often lacks sufficient sensitivity, tending to detect only distinct failures while missing the more subtle issues related to SDC. Furthermore, this method lacks the real-life operational conditions that characterize in-situ monitoring, as the tested devices are temporarily removed from the active fleet.

A canary circuit that monitors design margins is a critical path replicator, which cannot provide accurate data about actual critical path timing.

Given the limited efficacy of current best-known methods, the OCP paper dedicates a whole section to multiple open research questions. It regards SDC as an unresolved challenge with a critical impact on AI systems, calling for novel approaches that capture the nuanced ways in which silent errors occur.

proteanTecs’ In-Chip Monitoring Restores Trust With Real-Time SDC Prevention

Conventional SDC prevention methods typically rely on periodic maintenance, which incurs costly overhead by testing all servers regardless of their health. Even fleet operators who accept the expense of excessive testing are not protected: they still face many SDC cases, which they often detect only after the faults have already impacted the production environment.

proteanTecs takes a different approach, offering predictive maintenance instead of preventive maintenance. This novel technology can identify issues in real time and even correct them. The detected events are not actual faults yet, but they can accumulate into a low chip Health Index, which often precedes SDCs. proteanTecs uses dedicated thresholds to deduce when margins get dangerously low, as depicted below.

proteanTecs provides a real-time indication of a severe margin drop that might cause SDC in a 5nm data center chip (visualization of embedded firmware application).

Unlike canary circuits, proteanTecs uses on-chip Agents that monitor the timing margins of millions of real paths for more informed decisions. These Agents can provide very high coverage of the design’s logic and pinpoint the real critical paths that traditional methods often miss. This approach allows precise action based on real workloads, aging, and IR drops.

Unlike canary circuits (right, in yellow), proteanTecs uses on-chip Agents (left, in blue) that monitor true critical paths.

proteanTecs provides the Health Index by processing on-chip Agent readings alongside other inputs using advanced real-time algorithms. A low index score might trigger an interrupt, allowing the Baseboard Management Controller (BMC) to decide whether to take corrective action given the current system status.

In some configurations, the proteanTecs solutions take corrective action on their own without the BMC, offering prescriptive maintenance as well. Chips equipped with this technology can automatically adjust voltage or frequency to compensate for aging, adapt to workload demands, and help prevent SDC.
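
The escalation flow described above (a low index raises an interrupt for the BMC, and some configurations act on their own) can be sketched as a small decision function. Everything here is a hypothetical illustration: the toy `health_index` formula, the threshold values, and the `Action` names are assumptions made for the example, not the proteanTecs algorithm or API.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY_BMC = "notify_bmc"      # raise an interrupt and let the BMC decide
    SELF_CORRECT = "self_correct"  # adjust voltage or frequency locally


@dataclass
class HealthPolicy:
    warn_threshold: float = 0.5      # assumed value, for illustration only
    critical_threshold: float = 0.2  # assumed value, for illustration only
    autonomous: bool = False         # True models the "without the BMC" setup


def health_index(margins_ps, nominal_margin_ps):
    """Toy index in [0, 1]: worst observed timing margin relative to nominal."""
    worst = min(margins_ps)
    return max(0.0, min(1.0, worst / nominal_margin_ps))


def decide(index, policy):
    """Map a health index to an action under the given policy."""
    if index < policy.critical_threshold and policy.autonomous:
        return Action.SELF_CORRECT
    if index < policy.warn_threshold:
        return Action.NOTIFY_BMC
    return Action.NONE


idx = health_index([40.0, 55.0, 12.0], nominal_margin_ps=50.0)
print(idx, decide(idx, HealthPolicy()))
```

A real index would combine many Agent readings with other inputs; taking the single worst margin is only the simplest stand-in that still shows how a threshold turns a measurement into an action.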

Ensuring AI Reliability: Chip Monitoring as the Answer to SDC

As AI systems continue to scale and process nodes shrink further, SDC will only become more prevalent. The OCP whitepaper makes clear that traditional approaches to mitigating SDC are insufficient for the RAS (reliability, availability, serviceability) demands of modern AI infrastructure.

proteanTecs' runtime monitoring technology represents a fundamental shift in how the industry can address this challenge. By monitoring millions of real critical paths during actual workload execution, it transforms SDC from an invisible threat into a manageable risk.

The ability to detect margin degradation before it causes corruption protects months of training investment and prevents corrupted outputs from reaching inference customers. At AI's current scale and intensity, this capability is no longer optional.

Resilient and Optimized GenAI Systems with proteanTecs and Arm’s Neoverse CSS | proteanTecs Blog

Admin — Mon, 03 Nov 2025 16:00:00 GMT

Next-gen AI demands real-time insight. Discover proteanTecs and Arm integration.

AI and datacenter systems are being pushed to their limits, with soaring complexity, nonstop inference workloads, and rising energy demands. Addressing these pressures requires more than incremental improvements; it calls for collaboration across the ecosystem. That’s why proteanTecs has joined forces with Arm, bringing our real-time monitoring technology into Arm’s Neoverse Compute Subsystems (CSS). The successful integration brings a customer-ready solution designed to improve power efficiency, performance, and reliability at scale.

Challenges Facing Next-Gen AI Infrastructure

The cloud AI landscape is at an inflection point. Explosive growth in model complexity, inference demand, and system scale has strained the very fabric of compute infrastructure. Training runs that once required thousands of GPUs now demand tens of thousands, with costs reaching hundreds of millions of dollars. Inference, once considered “easier,” now drives massive daily workloads that push energy budgets and hardware reliability to the brink.

Power efficiency: AI data centers will consume over 90 TWh annually by 2026. Excessive voltage guard bands, designed for worst-case scenarios, drive unnecessary energy waste.

Performance at scale: Even small throughput inefficiencies cascade at hyperscale. A 10% gain in throughput can reduce training times by weeks and save millions in infrastructure costs.

Reliability and resilience: Silent Data Corruption (SDC) is an invisible risk. A single undetected error can corrupt weights across thousands of GPUs, invalidating billion-dollar training runs.

For hyperscalers, the stakes are clear: every watt saved, every percentage of performance reclaimed, and every silent error prevented translates into millions of dollars and competitive advantage.

Meeting these challenges requires more than node upgrades or incremental optimizations. It demands in-situ visibility into how chips behave under real workloads and operating conditions, and the ability to act on that knowledge in real time.

Growth in transistor density versus the PFLOPS required to train AI models from a 2021 baseline. By 2024, AI compute requirements surged by 6847%, while transistor density grew by only 183%. 2025 value is based on the projected PFLOPS required to train GPT-5. Source: Mollick, E. (2024). Scaling: The state of play in AI. One Useful Thing.

Deep Data Needed to Face these Challenges

Current methods for optimizing performance, power, and reliability all share the same blind spot: they don’t see how chips behave under actual workloads in the field. GenAI cloud operators pay for this lack of real-time visibility through higher power draw, lower throughput, and increased risk of failure. Performance tuning relies on static margins. Power controls are triggered by basic telemetry. Reliability checks happen too late, after failure is already underway. None of these approaches adapts to actual stress and environmental conditions during live operation.

That’s the gap.

proteanTecs closes this gap by providing deep data monitoring solutions that give system designers and operators unprecedented visibility into chip health and performance throughout the lifecycle.

The technology delivers a complete monitoring solution spanning silicon to system. At the hardware level, an on-chip HW IP Monitoring System combines lightweight Agents with built-in infrastructure for seamless access, control, and integration, enabling deep visibility from within the silicon. Complementing this are advanced EDA-based integration and implementation tools that ensure high coverage and smooth deployment with no design impact. On top of the hardware, a suite of machine learning–driven software applications runs in the field and in real time, providing predictive monitoring.

By embedding Agents within the silicon, we enable performance improvements, power reduction, and diagnostics throughout the device’s mission.

The on-chip Agents provide parametric measurements in situ and in functional mode to detect timing issues, operational and environmental effects, aging, and application stress. Among the suite of Agents are the Margin Agents, which monitor the timing margins of millions of real paths for more informed decisions. Margin Agents provide very high coverage of the design’s logic and monitor the real performance-limiting paths that traditional methods often miss. Coverage of these performance-limiting paths (those that set minimum voltage or maximum frequency) is ensured for all devices in the process distribution and for all operating conditions and functional workloads.

Unlike canary circuits (right, in yellow), proteanTecs uses on-chip Margin Agents (left, in blue) that monitor true critical paths.

proteanTecs and Arm CSS: Customer-Ready Integration

Now, in collaboration with Arm, we’re bringing these capabilities directly into the heart of next-generation datacenter and AI infrastructure. As part of Arm Total Design, proteanTecs has successfully integrated its monitoring solutions into Arm’s Neoverse Compute Subsystems (CSS). Our Agent integration is validated and optimized for Neoverse CSS, enabling mutual customers to benefit from seamless integration into their custom SoCs.

This milestone means:

Customer-ready integration: proteanTecs monitoring solutions are now natively available within Neoverse CSS-based custom SoCs.

Preferential access: As a member of Arm Total Design, proteanTecs gains early access to Neoverse CSS, enabling deep integration and joint validation.

Faster time-to-market: Mutual customers benefit from seamless adoption, cutting integration effort, validation cycles, and deployment risk.

The result: system designers can bring powerful AI/datacenter SoCs to market faster, with embedded visibility, power/performance optimization, and reliability monitoring built-in.

Demonstrating Coverage, Efficiency, and Seamless Integration

The integration of proteanTecs monitoring solutions into Arm’s Neoverse CSS has now been validated in practice, and the results underscore the value of a customer-ready reference design.

In this implementation, on an advanced process node, 200 Margin Agents (MAs) were integrated and implemented in one of the most advanced Arm Neoverse CPU cores. proteanTecs’ proprietary algorithms, part of the proteanTecs EDA tools, decide which endpoints each Margin Agent should monitor. This ensures that the true performance-limiting paths are monitored.

This strategic monitoring achieved a coverage result of 96.63% (based on proteanTecs proprietary coverage metrics), a level of visibility that allows customers to make confident, data-driven decisions. For more information about proteanTecs’ coverage methodology, customers are encouraged to reach out to our support team.

Equally important, the addition of monitoring capability had virtually no effect on the design itself. Timing and power measurements remained stable and well within normal run-to-run variation, confirming that the integration does not compromise efficiency. Max timing and power results are shown in the table below.

No manual timing fixes were applied, so the results reflect the true output of the synthesis and place-and-route tools, ensuring transparency and reliability in the process.

Taken together, these findings provide customers with a reference implementation that demonstrates how proteanTecs can be embedded seamlessly into high-speed designs at advanced process nodes, without introducing overhead or risk.

proteanTecs’ solution is an open architecture and can work under partner monitoring frameworks. Among the supported frameworks is the Arm System Monitoring Control Framework (SMCF), which enhances monitoring for Arm CSS solutions. You can learn more about proteanTecs’ integration with SMCF here.

Unlocking Efficiency, Performance, and Reliability

proteanTecs’ suite of applications, now enabled for Neoverse CSS, ensures datacenter operators can optimize at runtime:

AVS Pro™: Workload- and reliability-aware, real-time power reduction, delivering up to 14% lower power with no performance loss while extending the device’s remaining useful life (RUL) by ~20%. To learn more, read the white paper here.

AFS Pro™: Workload- and reliability-aware, real-time frequency increase, capturing frequency headroom for up to a 10% performance boost.

RTHM™: Monitors health in real time, flagging risks before they cascade into SDC or system failures. Read more here.

By embedding these capabilities into Neoverse CSS-based SoCs, mutual customers gain a powerful edge: the ability to scale AI infrastructure with power efficiency, performance, and reliability.

Conclusion: Real-Time Monitoring for Scalable GenAI Chips

As GenAI chips reach unprecedented levels of complexity, chipmakers need visibility into how each chip truly behaves under live workloads.

proteanTecs delivers exactly that, with a new class of in-chip monitoring and applications that dynamically tune each device in real time for optimal efficiency, performance, and RAS. Now, through successful integration with Arm’s Neoverse Compute Subsystems (CSS) as part of Arm Total Design, proteanTecs’ real-time monitoring solutions are validated, optimized, and customer-ready. This seamless integration enables mutual customers to accelerate time-to-market while benefiting from power reduction, performance improvement, and built-in resilience at hyperscale.

Same Chip, Two Destinies: How Power Profiles Improve With On-Chip Monitoring | proteanTecs Blog

Admin — Tue, 09 Sep 2025 15:00:00 GMT

The Impact of On-Chip Telemetry on Peak Power, Average Power, and Di/Dt Noise

What happens to critical power-related considerations when the same chip is handled two different ways, with or without visibility from within?

This article begins by examining how the absence of on-chip monitoring impacts peak power, average power, and Di/Dt noise (rate of current change), as illustrated in the diagram below and the subsequent discussion. It then details how these aspects change when in-chip telemetry is available.

Fig. 1: As the power profile shifts with different modes and switching activity, high Di/Dt noise, peak power, and average power introduce thermal, cost, and reliability penalties.

On-Chip Telemetry OFF: Excessive Peak Power

To improve power and performance specs while reducing chip operational costs, engineers must determine the lowest reliable voltage, known as VDDmin, at a certain frequency of operation, which varies significantly between dies due to the process distribution.

Without on-chip telemetry, chipmakers typically detect VDDmin using VDD search testing, which lowers the voltage step by step until chip failure occurs to identify the last functional VDD. However, this method presents a difficult tradeoff:

Smaller voltage steps improve accuracy but increase test time.
Larger voltage steps are quicker but might overshoot the optimal point.
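
The step-size tradeoff can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the chip is modeled as passing at any voltage at or above a hidden true VDDmin, and the function name `vdd_search` and the millivolt values are invented for the example.

```python
def vdd_search(true_vddmin_mv, start_mv, step_mv):
    """Lower VDD in fixed steps until the simulated chip fails.

    Returns (last_passing_mv, tests_run). The chip is assumed to pass at any
    voltage at or above `true_vddmin_mv`, which a real tester cannot see.
    """
    v = start_mv
    tests_run = 0
    last_pass = None
    while v > 0:
        tests_run += 1
        if v < true_vddmin_mv:
            break  # first failure: stop the search
        last_pass = v
        v -= step_mv
    return last_pass, tests_run


for step in (5, 25):
    found, tests = vdd_search(true_vddmin_mv=683, start_mv=800, step_mv=step)
    print(f"step={step} mV: last functional VDD={found} mV, "
          f"excess={found - 683} mV, tests={tests}")
```

With these made-up numbers, the 5 mV sweep lands 2 mV above the true VDDmin but needs 25 test steps, while the 25 mV sweep finishes in 6 steps and leaves 17 mV of unnecessary voltage on the part.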

Fig. 2: Voltage search plots. Determining an accurate VDDmin using this method often requires an impractically long time and high test cost, leading to painful compromises.

As a compromise, many chipmakers divide all chips into a few bins, such as slow/fast/typical, setting a single voltage level per bin. However, due to the substantial variation in each bin, many units are assigned higher-than-required VDDmin, leading to excessive peak power and power density that have significant downsides, including:

Higher case temperature (Tcase)
Higher Thermal Design Power (TDP)
More expensive cooling
Reduced reliability
Shorter product lifetime

TDP dictates the form factor, cooling architecture, and rack density. When chips operate above their true minimum voltage, dynamic power increases sharply. That power converts to heat, resulting in higher TDP, expensive cooling solutions, higher failure rates, and shorter lifetime.

On-Chip Telemetry OFF: Excessive Di/Dt noise

Current spikes go undetected without on-chip telemetry, forcing engineers to compensate with more expensive packaging, on-die and off-die decoupling capacitance, and on-die active droop mitigation solutions, all designed to absorb Di/Dt noise and reduce voltage droop, and all of which raise chip cost. But that cost is only part of the tradeoff. Without visibility into current transients, designers must raise voltage or apply large safety margins to prevent failures in marginal paths.

These decisions suppress frequency and harm performance. Meanwhile, higher power turns into heat, increasing cooling demands and pushing thermal limits.

What begins as an invisible current fluctuation ends in performance loss and higher costs:

Higher risk of droop
Higher cost
Performance penalty

On-Chip Telemetry OFF: Excessive Average Power

Without on-chip monitoring, voltage adjustment in the field relies on guesswork rather than real timing data. Typically, Adaptive Voltage Scaling (AVS) uses canary circuits based on ring oscillators (ROSC). This method attempts to mimic critical paths but fails to reflect actual workload and reliability stress, or aging effects on the real logic.

Fig. 3: A canary circuit that monitors design margins is a critical path replicator, which cannot provide accurate data about actual critical path timing.

To compensate for the inaccuracy, designers must apply conservative guard bands to prevent failures, leading to higher voltages that cause excessive average power and reduced performance.

These overprotective settings inflate operational costs and compromise long-term reliability, while offering no visibility into when and where timing issues may arise.

Excessive average power also affects performance by raising thermal load and limiting voltage-frequency optimization. Both effects force the system to reduce operating frequency to remain within power and thermal limits.

The effects of excessive average power carry several long-term drawbacks:

Inefficient power-performance solution
High power cost
Shorter battery life (when applicable)
Shorter product lifetime
Reliability degradation

Power optimization: A solution that sees what others can’t

Chipmakers face three power-related constraints in every design: peak power, average power, and Di/Dt. Without visibility into real device behavior, these factors are managed through best-known assumptions and worst-case settings.

To compensate for these blind spots, engineers divide dies into broad voltage bins, apply conservative voltage guard bands, and use expensive packages designed to absorb transient noise.

These choices increase test time, inflate cost, reduce performance, and shorten system life, among other drawbacks.

To address these severe implications, proteanTecs has introduced a novel approach with its on-chip Agents, which are specialized monitoring IPs embedded during design. These Agents provide accurate measurements of critical parameters such as real logic timing margins during actual operation.

The rich telemetry data can also feed the proteanTecs advanced data analytics software, including ML models, to guide vital decisions throughout the device production lifecycle. This level of accuracy enables meaningful reductions in cost, greater reliability, and measurable improvements in power and performance.

Table 1: The impact of the proteanTecs on-chip monitoring solutions on three key optimization goals.

On-Chip Telemetry ON: Optimized Peak Power

With proteanTecs VDDmin Prediction for static operational voltage setting per device, voltage is accurately predicted per die and mapped to a much finer bin based on actual measured behavior. No more time-consuming voltage sweeps that lead to unnecessary overhead. Production cases have demonstrated a ~70% reduction in test steps with no accuracy impact, resulting in decreased costs and accelerated time-to-market. This VDDmin prediction can be done both at the tester level and at the system level, using real application workloads.

Fig. 4: proteanTecs VDDmin Prediction: Measured VDDmin (Y-axis) vs. predicted VDDmin (X-axis) comparison demonstrates exceptional accuracy with 0.15 NRMSE.

VDDmin Prediction is based on an ML model, trained on accurate data from the on-chip Agents. During chip-level high volume production testing, the model is integrated into the test program software and used in real time, per device, on the test floor – to predict the optimal voltage. The prediction is tested and after a minimal number of search steps, the operational voltage is fused in the device.
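
The "predict, test, then search a few steps" flow can be sketched as a search seeded at the predicted voltage. This is only an illustration under assumptions: `seeded_search` and `chip_passes` are hypothetical names, the pass/fail function stands in for running a test pattern on the tester, and the model that produces the prediction is not shown.

```python
def seeded_search(chip_passes, predicted_mv, step_mv, max_tests=10):
    """Start at the predicted VDDmin and walk in fine steps to the boundary.

    `chip_passes(v)` stands in for one tester run at voltage v (in mV).
    Returns (lowest_passing_mv, tests_run); the voltage is None if no
    passing point was found within `max_tests`.
    """
    tests_run = 1
    v = predicted_mv
    if chip_passes(v):
        # Prediction is safe: step down while the next lower voltage passes.
        while tests_run < max_tests:
            tests_run += 1
            if not chip_passes(v - step_mv):
                break
            v -= step_mv
        return v, tests_run
    # Prediction was slightly low: step up until the chip passes.
    while tests_run < max_tests:
        tests_run += 1
        v += step_mv
        if chip_passes(v):
            return v, tests_run
    return None, tests_run


def chip_passes(v_mv):
    """Simulated die with a hidden true VDDmin of 683 mV (made-up value)."""
    return v_mv >= 683


print(seeded_search(chip_passes, predicted_mv=690, step_mv=5))
print(seeded_search(chip_passes, predicted_mv=675, step_mv=5))
```

In both cases the search settles on 685 mV after three tests, whether the prediction was a little high or a little low. A blind sweep from 800 mV at the same 5 mV step would need 25 tests to reach the same answer for this simulated die.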

Voltage reduction has a substantial effect on peak power, which in turn lowers the Tcase and the cooling solution cost:

As peak power reduction translates to lower thermal load, it has a system-wide impact:

Lower Tcase
Lower TDP
Cheaper cooling solution
Better reliability
Increased lifetime
For example, these are the quantified benefits when VDDmin Prediction reduces voltage by 3%-5%:

∆P [W] is within -6% to -10%
∆T_case [°C] is within -3% to -5%
∆TDP [W] is within -3% to -5%
∆Cooling Cost [$] is within -3% to -5%
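
The power figure in this list is consistent with the standard CMOS dynamic-power relation, in which power scales with the square of the supply voltage at a fixed frequency. The article does not state which model it used, so treat the relation and the helper name `dynamic_power_change` below as assumptions made for this quick check.

```python
def dynamic_power_change(voltage_change, frequency_change=0.0):
    """Relative change in dynamic power, assuming P is proportional to V^2 * f.

    Inputs and output are fractions, e.g. -0.03 means a 3% reduction.
    """
    return (1.0 + voltage_change) ** 2 * (1.0 + frequency_change) - 1.0


for dv in (-0.03, -0.05):
    print(f"voltage {dv:+.0%} -> dynamic power {dynamic_power_change(dv):+.2%}")
```

A 3% voltage reduction gives roughly a 5.9% power reduction and a 5% reduction gives roughly 9.75%, which matches the -6% to -10% range quoted for ∆P.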

These optimizations are critical because cooling systems already consume 30% to 55% of datacenter power budgets. Reducing chip power directly cuts thermal load, which translates into real savings in infrastructure.

In high-density racks, advanced liquid cooling can cost between $1,000 and $2,000 per kW cooled, which can add up to millions of dollars annually. Every watt saved at the silicon level reduces that burden.

On-Chip Telemetry ON: Optimized Di/Dt Noise

High current swings trigger voltage droop, which can disrupt timing and cause failures. In the absence of accurate real-time monitoring, engineers compensate by using higher voltages, wider margins, on-die droop mitigation solutions (which incur a performance penalty), and added die cost to absorb these transients.

proteanTecs VDDmin Prediction makes these compensations unnecessary by lowering VDD per die, which improves signal integrity through reduced current swings and Di/Dt noise. Lower voltage also makes room for higher frequencies that boost performance, as captured by this equation:

These improvements enable:

Safer operation (reduced noise)
Cheaper package
Better performance

For example, these are the quantified benefits when VDDmin Prediction reduces voltage by 3%-5%:

∆I [mA] is within -3% to -5%
∆V_noise [mV] is within -3% to -5%
∆F [MHz] is within 3% to 5%

In addition, proteanTecs provides real-time voltage droop sensors to protect the device in mission-mode. They provide real-time hardware signals that can trigger a clock throttling event to avoid failure and reduce Di/Dt.

On-Chip Telemetry ON: Optimized Average Power

Unlike canary circuits, proteanTecs AVS Pro uses Agents that monitor true logic paths for more informed decisions. The technology provides high coverage of the performance limiters, enabling precise guard-band tuning based on real workloads, aging, and IR drops.

This approach enables safer voltage scaling, avoiding worst-case guard-bands and allowing the device to operate closer to its actual limits without compromising functionality, performance, or reliability. As demonstrated below, AVS Pro safely reduced the power consumption of a production 5nm SoC by 12.5%. At the same time, it extended the predicted lifetime by 18%.

Fig. 5: AVS Pro, visualized here, enables 12.51% power saving through safer voltage scaling, leading to 18% projected lifetime extension.

proteanTecs AVS Pro continuously adjusts voltage based on real-time Agent data. When the device operates with a surplus of timing margin, AVS Pro reduces the voltage. When more stressful functional workloads run or degradation reduces timing margins, AVS Pro increases the voltage only as much as needed to maintain safe operation.
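
The behavior described here is a closed control loop: measure margin, then nudge the voltage down when there is surplus and up when margin runs short. The sketch below is a generic hysteresis controller written for illustration; the function name `avs_step`, the margin thresholds, and the step size are assumptions and do not describe how AVS Pro is actually implemented.

```python
def avs_step(vdd_mv, margin_ps, low_ps=20, high_ps=40, step_mv=5,
             vmin_mv=600, vmax_mv=900):
    """Return the next supply voltage given the worst measured timing margin.

    Below `low_ps` the voltage is raised, above `high_ps` it is lowered, and
    in between it is held, so the loop does not oscillate on small changes.
    All thresholds are made-up values for illustration.
    """
    if margin_ps < low_ps:
        return min(vmax_mv, vdd_mv + step_mv)
    if margin_ps > high_ps:
        return max(vmin_mv, vdd_mv - step_mv)
    return vdd_mv


vdd = 750
trace = []
# Margin readings: light workload first, then a heavier one arrives.
for margin in [55, 50, 45, 30, 15, 25]:
    vdd = avs_step(vdd, margin)
    trace.append(vdd)

print(trace)
```

The voltage drifts down while margin is plentiful, holds inside the dead band, and steps back up by a single increment when the heavier workload pulls margin below the lower threshold.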

This continuous response avoids both oversupply and undershoot, providing substantial benefits:

Optimized power-performance
Reduced power cost
Longer battery life (when applicable)
Increased lifetime
Better reliability

The chart below shows how AVS Pro delays degradation over time. The device maintains safe performance levels for longer, pushing the wear-out point further into the product lifecycle.

This type of lifetime extension has significant financial implications. Hyperscalers like Amazon, Alphabet, and Microsoft publicly attribute billions in annual net income to extending server lifespans by just one to two years. proteanTecs AVS Pro supports similar CAPEX reduction strategies by delaying degradation without compromising performance.

To learn more about the benefits of using AVS Pro for chip lifetime extension, read the full paper here.

Fig. 6: An example of chip lifetime extension enabled by AVS Pro: 5nm delay degradation simulations [%] at nominal conditions: T junction 85 °C, V=0.75V

Conclusion – A tale of two chips

Many love underdogs, but as this article shows, products with on-chip telemetry win by a knockout.

proteanTecs provides visibility from within that spans production and deployment. With VDDmin Prediction and AVS Pro, power optimization begins at production test and continues throughout system operation.

VDDmin Prediction reduces peak power and Di/Dt noise by tuning VDD with personalization and precision. Dedicated voltage droop sensors protect the device in real time when unexpected workloads arrive. AVS Pro cuts average power through safer voltage scaling in the field. Together, they improve critical aspects of power, performance, and cost:

For end users such as data center operators: lower energy costs, better performance, improved reliability, and longer lifetime.
For system providers: lower system power, lower TDP, a cheaper and more compact system, improved reliability, and longer lifetime.
For chip vendors: improved power-performance, reduced VDD noise, a cheaper package, improved reliability, and longer lifetime.

Ready to realize the full benefits of on-chip telemetry? Contact our team here or download our whitepaper to see how proteanTecs enhances performance, efficiency, reliability, and product lifetime from test to deployment.

Thermal Sensing Headache Finally Over for 2nm and Beyond | proteanTecs Blog

Admin — Mon, 01 Sep 2025 15:00:00 GMT

Silicon-Proven LVTS for 2nm: A New Era of Accuracy and Integration in Thermal Monitoring

Effective thermal management is crucial to prevent overheating and optimize performance in modern SoCs. Inadequate temperature control due to inaccurate thermal sensing compromises power management, reliability, processing speed, and lifespan, leading to issues like electromigration, hot carrier injection, and even thermal runaway.

Unfortunately, precise thermal monitoring reached an inflection point at 2nm, with traditional solutions proving less practical below 3nm. To tackle the issue, this article delves into a novel approach, accurate to ±1.0°C, that overcomes this critical challenge.

proteanTecs now offers a customer-ready, silicon-proven solution for 5nm, 3nm and 2nm nodes. In fact, our latest silicon reports demonstrate robust performance, validating that accurate and scalable thermal sensing is achievable in the most advanced nodes.

Accurate Thermal Sensing in Advanced Process Nodes: A Growing Challenge

As process nodes scale to 2nm and below, accurately measuring on-chip temperature has become increasingly difficult. Traditional Voltage and Temperature sensors based on diodes are less practical in these nodes due to their high-voltage requirements. This gap in temperature measurement creates risks that compel chipmakers to seek future-ready solutions. The challenge is magnified in designs that leverage DVFS techniques.

Why Traditional Solutions Fall Short

Traditional thermal sensing technologies are hitting hard limitations in precision and overall feasibility when moving beyond 3nm:

Temperature sensors based on BJT diodes

Analog thermal diodes built with Bipolar Junction Transistors (BJTs) have been a go-to option for accurate thermal sensing. However, they rely on high I/O voltages, which makes them inapplicable for nodes beyond 3nm: Gate-All-Around (GAA) technology doesn't support high I/O (analog) voltages, and BJT support may also be discontinued in the future.

PNP BJT in a diode-connected configuration. The base-emitter junction has a predictable transfer function that depends on temperature, making it suitable for thermal sensing. However, analog thermal diodes are a no-go for nodes beyond 3nm.

Even before GAA, thermal diodes suffered from low coverage as they were hard to integrate. Their design restricted placement to chip edges near the I/O power supply, leaving vital internal areas unmonitored due to analog routing limitations. Furthermore, they consumed more power than low-voltage alternatives due to their high-voltage requirement.

Digital Temperature Measurements based on Ring oscillators

Ring oscillators are scalable to advanced nodes, but their temperature measurement error can be as high as ±10°C. They are inadequate where accuracy is paramount. One example concerns using thermal sensing to determine voltage or frequency adjustments (e.g. DVFS), as even slight temperature variations can significantly degrade performance.

Ring oscillator temperature error for different calibration techniques. The error can exceed 10°C in magnitude, which is too high for many use cases.

The limitations above underscore the need for an accurate thermal sensing solution designed with core transistors only to fit advanced nodes.

A Thermal Sensor Built for the Future

proteanTecs LVTS™ (Local Voltage and Thermal Sensor) is purpose-built for precision thermal sensing in advanced nodes without relying on I/O transistors, high analog I/O voltages, or even BJTs. It measures temperature with an accuracy of ±1.0°C while using core transistors exclusively and operating across a wide range of core voltages, combining precision with future readiness for GAA nodes.

Key features of LVTS:

Temperature measurement accuracy of +/-1°C (3-sigma)

Voltage measurement accuracy of +/-1.5% (3-sigma)

Over temperature fast alert

Wide range of operational voltages (650-950 mV)
High-speed measurement

proteanTecs LVTS measurements demonstrate an accuracy of ±1°C across a wide range of voltages (0.65V SSG to 1.05V FFG) and temperatures (-40°C to 125°C).

Unmatched Benefits Across All Critical Parameters

LVTS operates with low VDD core rather than high I/O voltage while maintaining superb accuracy, unlike Digital Thermal sensors based on ring oscillators. This unique design enables easy integration anywhere on the chip, providing more granular voltage and temperature monitoring than thermal diodes. Additionally, its smaller size and lower power consumption minimize the impact on PPA compared to BJT-based solutions.

LVTS compared with thermal diodes and ring oscillators (ROSC)


An additional capability of LVTS provides real-time warnings and critical alerts in the form of HW signals when predetermined thermal thresholds are breached. This feature enables immediate corrective action, reducing the risk of overheating to maintain chip integrity.

LVTS Flavors for Enhanced Flexibility

In addition to the standard LVTS described above, proteanTecs offers two specialized variants to address diverse design needs:

An extended flavor - includes external voltage measurement to extend the measured voltage range down to zero volts.

A distributed flavor - designed as a Core VDD-only, analog thermal and DC voltage level sensor hub, it supports extremely small remote thermal sensors for precise temperature measurements at hot spots.

These two versions complement the regular LVTS, allowing chipmakers to tailor their thermal sensing approach for maximum coverage, precision, and responsiveness in critical areas of the design.

Complementing Deep Data Analytics with Accurate Voltage and Temperature Sensing

LVTS is already silicon-proven in 5nm, 3nm, and now also in 2nm, with a detailed silicon report available, making it the industry-leading, future-proof, customer-ready solution.

This innovation was warmly embraced by multiple chipmakers concerned about the absence of accurate and reliable thermal sensing in next-generation silicon.

These customers use LVTS alongside other proteanTecs products, as it complements the broader deep data monitoring and analytics solutions explored here.

LVTS is seamlessly integrated into proteanTecs’ HW Monitoring System, enabling accurate DC voltage and thermal measurements in real time. This makes LVTS a vital addition to chipmaker power and reliability strategies.

Want to know more about how LVTS can help scale your design to advanced nodes with accurate voltage and temperature sensing? Contact us here.

Critical Optimization Factors for GenAI Chipmakers | proteanTecs Blog

Admin — Sat, 05 Jul 2025 15:00:00 GMT

The diverse approaches and innovative solutions shaping the future of AI hardware, essential for win

Today’s GenAI arms race is fought with novel chip architectures and packaging. Specialized hardware designs are proliferating in the form of GPUs, TPUs, NPUs, and more, all tuned for parallelism and matrix-heavy AI math.

In this hyper-competitive landscape, chip vendors scramble to differentiate their products on multiple fronts. They promise some mix of better performance, efficiency, or scalability, but the specific strategies vary widely:

Performance

Some chipmakers aim to outgun the competition with sheer performance. Flagship GPUs, for example, focus on FLOPS and huge memory throughput. While memory is a critical factor in GenAI performance, this paper focuses on compute throughput bottlenecks.

One approach that chipmakers employ to win this category is advanced packaging, connecting multiple silicon chiplets in a single heterogeneous device to increase performance density.

Even a 10% speed improvement will have a profound impact due to the immense scale. For example, training a model like LLaMA 3.1 405B involved 16,000 GPUs, consumed approximately 27 megawatts, and required an estimated 40 billion PFLOPS [23]. That level of optimization can reduce training time by several weeks and eliminate the need for thousands of GPU-days, translating to millions of dollars in infrastructure savings.

In large-scale AI inference operations, even modest throughput enhancements can lead to significant cost reductions. For instance, OpenAI's GPT-4 processes approximately 50 billion queries annually, incurring an estimated $144 million in compute costs [24]. Implementing a 10% throughput improvement could decrease the number of required servers, resulting in an estimated $14.4 million in annual savings.
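The savings estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the second model (server count scaling inversely with per-server throughput) is an assumption added here for comparison, not a figure from the cited source.

```python
def linear_savings(annual_cost: float, throughput_gain: float) -> float:
    """Savings if cost falls in direct proportion to the throughput gain."""
    return annual_cost * throughput_gain


def inverse_scaling_savings(annual_cost: float, throughput_gain: float) -> float:
    """Savings if the number of servers needed scales as 1 / (1 + gain)."""
    return annual_cost * (1.0 - 1.0 / (1.0 + throughput_gain))


ANNUAL_COST_USD = 144e6  # estimate quoted in the text [24]
GAIN = 0.10              # 10% throughput improvement

print(f"linear model:  ${linear_savings(ANNUAL_COST_USD, GAIN) / 1e6:.1f}M")
print(f"inverse model: ${inverse_scaling_savings(ANNUAL_COST_USD, GAIN) / 1e6:.1f}M")
```

The linear model reproduces the $14.4 million quoted above, while the inverse-scaling assumption gives roughly $13.1 million. Either way, the saving is on the order of $13 to $14 million per year.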


A dramatic increase in inference latency from 73 ms/token in OpenAI gpt-3.5-turbo to 196 ms/token in OpenAI gpt-4 [25].

Throughput optimization also reduces inference latency, which is a critical factor in user experience. For example, the response time of OpenAI's GPT-4 model has been measured at approximately 196 milliseconds per generated token [25]. Enhancing throughput by 10% could proportionally reduce this latency, leading to faster response times and improved user satisfaction.

Performance improvements typically begin with design-time architecture exploration and RTL optimization, such as pipeline depth, compute unit allocation, and dataflow design. On top of that, chipmakers apply techniques like standard Adaptive Frequency Scaling (AFS) to push efficiency under dynamic conditions in the field.

However, these runtime methods are generally static and not workload-aware, leading to suboptimal performance in real-world deployments. Frequency scaling is also done conservatively to preserve thermal and functional stability. While these approaches help extract more performance within safe limits, they may fall short of what GenAI workloads demand.

Power Efficiency

GenAI’s exponential growth in computational requirements urges chipmakers to pay closer attention to power consumption. Beyond immediate consequences, such as thermal problems, excessive wattage has severe implications for customers’ operational costs.

As a consequence, design wins increasingly revolve around Total Cost of Ownership (TCO). This metric factors in not only the upfront hardware cost but also ongoing expenses like power, cooling, and infrastructure. Solutions that deliver more compute per watt can significantly reduce TCO and make large-scale AI deployments more sustainable.

Furthermore, reducing the power consumption of individual devices directly expands infrastructure performance. Every watt saved per chip frees up headroom within the data center’s fixed power budget, enabling higher system utilization across the fleet.

This power reduction allows operators to run more workloads, serve more users, or deploy additional systems without breaching energy limits. Improving performance per watt (PPW) at the chip level becomes a strategic lever for maximizing performance within existing power constraints.
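To make the headroom argument concrete, here is a minimal sketch of the arithmetic. Every number in it (the facility budget, per-chip power, per-chip throughput, and the size of the power reduction) is a hypothetical value chosen for illustration, and the function names are not from any proteanTecs tool.

```python
import math


def performance_per_watt(tflops: float, watts: float) -> float:
    """PPW expressed as TFLOPS delivered per watt consumed."""
    return tflops / watts


def chips_within_budget(budget_watts: float, chip_watts: float) -> int:
    """How many chips fit under a fixed facility power budget."""
    return math.floor(budget_watts / chip_watts)


BUDGET_W = 1_000_000   # hypothetical fixed power budget
CHIP_W = 700.0         # hypothetical per-chip power
CHIP_TFLOPS = 1000.0   # hypothetical per-chip throughput
REDUCTION = 0.125      # hypothetical per-chip power reduction

reduced_w = CHIP_W * (1.0 - REDUCTION)
print(chips_within_budget(BUDGET_W, CHIP_W))     # chips before the reduction
print(chips_within_budget(BUDGET_W, reduced_w))  # chips after the reduction
print(performance_per_watt(CHIP_TFLOPS, reduced_w)
      / performance_per_watt(CHIP_TFLOPS, CHIP_W))  # PPW ratio
```

With these inputs, the same budget hosts 1,632 chips instead of 1,428, and PPW improves by a factor of 1/0.875 (about 1.14) with no change in per-chip throughput.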

To explore how this dynamic plays out across real data center deployments, read the full blog post here.


PPW can grow by increasing performance within the power envelope or by reducing wattage without impacting FLOPS.

Power efficiency is typically optimized through a combination of design-time techniques and runtime control. Clock gating, power gating, and multi-voltage domains are widely used at the architecture and implementation levels to reduce dynamic and leakage power.

At runtime, methods like Dynamic Voltage and Frequency Scaling (DVFS) and Adaptive Voltage Scaling (AVS) are applied to adjust power consumption based on static models or basic telemetry, such as temperature or process variation. These standard techniques are not workload-aware and typically apply uniform guard bands across all chips to ensure stability across all devices and workloads.

As a result, they leave significant excess guard bands that cause unnecessary power consumption, undermining PPW. This inefficiency calls for more precise, real-time approaches that optimize power without compromising performance or reliability.
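A rough sense of what an excess guard band costs comes from the standard first-order relation that dynamic (switching) power scales with the square of the supply voltage at a fixed frequency. The sketch below applies only that relation; the voltages are hypothetical, leakage power is ignored, and the function name is illustrative.

```python
def dynamic_power_saving(v_guardbanded: float, v_trimmed: float) -> float:
    """Fractional dynamic-power saving from lowering VDD at fixed frequency.

    Uses the first-order model P_dynamic ~ C * V^2 * f, so the saving
    depends only on the ratio of the two voltages.
    """
    if not 0.0 < v_trimmed <= v_guardbanded:
        raise ValueError("expected 0 < v_trimmed <= v_guardbanded")
    return 1.0 - (v_trimmed / v_guardbanded) ** 2


# Hypothetical example: trimming a 30 mV guard band from a 0.75 V supply.
saving = dynamic_power_saving(0.75, 0.72)
print(f"{saving:.2%}")
```

In this example a 4% voltage reduction yields roughly a 7.8% dynamic-power saving, which is why uniform worst-case guard bands are expensive when most chips do not need them.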

Reliability

A chip’s reliability at large scales is just as critical as its raw performance. Defective parts per million (DPPM) measures the fraction of chips that exhibit failures post-manufacturing, directly impacting system uptime and operational costs. While semiconductor testing filters out detectable defects, latent issues stay hidden until real workloads expose them. As GenAI compute infrastructure scales to millions of deployed chips, even a low DPPM might translate to frequent failures with substantial consequences.
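The scaling effect can be illustrated with a short calculation. This is a sketch under simplifying assumptions: the DPPM value and fleet size are hypothetical, defects are treated as independent across chips, and the function names are made up for the example.

```python
def expected_defective(fleet_size: int, dppm: float) -> float:
    """Expected number of defective chips in a fleet at a given DPPM."""
    return fleet_size * dppm / 1_000_000


def prob_cluster_has_defect(cluster_size: int, dppm: float) -> float:
    """Probability that at least one chip in a cluster is defective,
    assuming defects occur independently."""
    p_good = 1.0 - dppm / 1_000_000
    return 1.0 - p_good ** cluster_size


# Hypothetical DPPM of 100 applied to a large fleet and a 25,000-chip cluster.
print(expected_defective(1_000_000, 100))
print(f"{prob_cluster_has_defect(25_000, 100):.1%}")
```

At a hypothetical 100 DPPM, a million-chip fleet is expected to contain about 100 defective parts, and a 25,000-chip cluster has a probability above 90% of containing at least one of them.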

Furthermore, Silent Data Corruption (SDC) has emerged as a critical reliability threat to scaling GenAI training, as it corrupts computations without triggering alerts. Unlike memory bit flips, for example, which are mitigated by error correction codes (ECC), SDCs originate from subtle timing violations, aging effects, or marginal defects that escape standard semiconductor testing.

These errors leave no trace, yet a single one can distort model weights across interdependent nodes, quietly derailing a training run that may span weeks, involve over 25,000 GPUs, and cost more than $100 million [12]. In training clusters, even a single faulty processor can jeopardize the entire job. These workloads run across tightly coupled systems, each contributing to shared model parameters. If one chip introduces a silent error during synchronization, that corruption spreads throughout the cluster.

Download White Paper: Outsmarting Silent Data Corruption in AI Processors with Two-Stage Detection

Ensuring reliability has traditionally relied on periodic field testing to uncover potential failures. While effective for basic quality assurance, these methods may miss latent defects, workload-driven faults, accelerated aging, and SDCs. They are also time-consuming and difficult to streamline within data center environments running high-intensity GenAI. The limitations of these offline techniques point to the need for continuous, in-situ monitoring to maintain reliability at hyperscale.

Despite these diverse optimization strategies, all chipmakers share a common challenge. They must set conservative operating guard bands to ensure reliability. This necessity presents an overlooked opportunity for significant optimization that can shape who wins the GenAI race.

proteanTecs Real-Time Monitoring for Scalable GenAI Chips

As GenAI chips reach unprecedented levels of complexity, traditional design-time assumptions and static controls are no longer enough. Standard runtime methods such as AVS, DVFS, and AFS are static and rely on conservative guard bands. These approaches waste power, limit throughput, and fail to detect real-time reliability issues.

What chipmakers need is visibility into how each chip behaves under actual workloads. Not just design-time guard bands or environmental telemetry, but in-situ insights into timing margins, aging, and stress.

proteanTecs closes this critical gap by enabling a new class of in-chip applications that optimize each chip by tuning it in real time according to actual workloads.

By embedding agents inside the chip, proteanTecs delivers precise monitoring of the timing margins of real, performance-limiting paths, along with application stress, operational and environmental effects, aging, latent defects, and process variation. This approach uncovers insights undetectable by legacy methods. With dedicated algorithms, these insights power three breakthrough applications:

AVS Pro dramatically reduces power consumption by safely trimming excess voltage guard bands, improving Performance-Per-Watt (PPW) and lowering TCO while guaranteeing reliability.

RTHM provides continuous health tracking of the device, detecting marginal behavior before it leads to functional failures or SDC. This capability is especially crucial for billion-dollar model training runs.

AFS Pro extracts extra performance by reclaiming hidden frequency headroom, dynamically tuning each chip closer to its unique threshold for maximum throughput, while maintaining a functionality safety net.

In the GenAI era, chipmakers must strategically balance unprecedented performance, stringent power efficiency, and rock-solid reliability. The complexity of achieving these goals calls for real-time, workload-aware optimization techniques beyond conventional guard bands and static methods. As GenAI continues its rapid evolution, embedding advanced monitoring and dynamic tuning capabilities directly within chips emerges not only as a differentiator but a necessity—shaping who will ultimately lead this high-stakes technological revolution.

Together, these solutions turn conservative margins into a competitive advantage, allowing GenAI chipmakers and cloud operators to scale faster, safer, and smarter.

Want to learn how these capabilities deliver up to 12.5% power reduction and 8% higher performance? 👉 Read the full white paper to see how real-time in-chip optimization is redefining what’s possible in GenAI infrastructure.

This is part 3 of a 3-part blog series:

Click here for part 1 - GenAI's Breakneck Pace is Reshaping the Semiconductor Industry

Unpacks the semiconductor shake-up driven by generative AI’s explosive rise: generative AI is outpacing Moore’s Law, models are racing toward superintelligence, and chipmakers are scrambling to keep up.

Click here for part 2 - The Painful Reality of Scaling Cloud AI

Delving deeper into the painful realities of scaling cloud AI infrastructure. We'll examine practical obstacles chipmakers face—including hardware failures and reliability issues such as Silent Data Corruption (SDC), surging power demands, and workload growth that continues to outpace Moore's Law.

The Painful Reality of Scaling Cloud AI | proteanTecs Blog

Admin — Mon, 30 Jun 2025 15:00:00 GMT

GenAI Workload Demands are Growing Orders of Magnitude Faster than Transistor Density

The shift to Generative AI (GenAI) has overwhelmed existing infrastructure, transforming previously rare issues into daily operational realities. Skyrocketing costs, intense energy consumption, and hardware failures at unprecedented scales illustrate the strain of current AI workloads. With models like GPT-4 costing tens of millions and GPT-5 projected to surpass a billion-dollar threshold, the economic and energy implications are staggering. In this section, we'll explore these critical challenges, detailing the escalating pressure on infrastructure as GenAI rapidly evolves and highlighting the urgent need for innovative solutions to scale AI sustainably and reliably.

The shift to GenAI has outpaced the infrastructure it runs on. What were once rare exceptions are now daily operations: high model complexity, non-stop inference demand, and intolerable cost structures. The numbers are no longer abstract. They’re a warning.

Training a model like GPT-4 reportedly consumed 25,000 GPUs over nearly 100 days, with costs reaching $100 million [12]. GPT-5 is expected to break the $1 billion mark [13]. Energy usage is just as daunting. Training GPT-4 drew an estimated 50 GWh, enough to power over 23,000 U.S. homes for a year [14]. Even with all that investment, reliability is fragile. A 16,384-GPU run experienced hardware failures every three hours, posing a threat to the integrity of weeks-long workloads [15].

Projected AI power consumption grows from 8 TWh in 2024 to 652 TWh by 2030 (an 8,050% increase), driven by both training and a rapidly growing share of inference. Based on Wells Fargo data via IO Fund [16].

Inference isn’t easier. ChatGPT now serves more than one billion queries daily, with operational costs nearing $700K per day [17]. Each response, priced at just fractions of a cent, adds up to an infrastructure bill that outpaces most business models. That pressure is made worse by performance gaps. Users frequently report over 20-second delays for answers [18]. At this scale, even slight inefficiencies multiply into real dollars and degraded user experience.

These are not isolated incidents. They are signs of systemic strain. Massive training runs, crushing query volumes, rising failure rates, and mounting electricity costs—this is the environment GenAI must thrive in. What's needed isn’t incremental optimization. It’s a way to reclaim control and scale effectively.

The table below outlines the core challenges behind these risks. Each is backed by hard data. Together, they show just how steep the hill has become.

Key operational challenges in cloud AI workloads.

Why Moore’s Law Is No Longer Enough

Moore’s Law predicts that the number of transistors in an IC doubles approximately every two years. The law was accurate for decades, yet recent fabrication challenges slowed it to around 2.5 years for each new node [19]. More importantly, even the original rate couldn’t keep up with GenAI's computational requirements, which double much faster than transistor density.

It took 2.6 years to move from 5nm to 3nm, yet the reported performance gain at the same power was only about 10-15%, with 25-30% improvements in power efficiency at the same speed [20]. Meanwhile, GenAI workload demands are growing orders of magnitude faster.


Growth in transistor density versus the PFLOPS required to train AI models from a 2021 baseline. By 2024, AI compute requirements surged by 6847%, while transistor density grew by only 183%. The 2025 value is based on the projected PFLOPS required to train GPT-5 [21].
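The two growth figures in the caption can be related to doubling periods with a short calculation. The sketch below assumes a three-year window (2021 to 2024) and simple exponential growth; the function names are illustrative.

```python
import math


def growth_percent(doubling_period_years: float, elapsed_years: float) -> float:
    """Percentage growth after elapsed_years for a given doubling period."""
    return (2.0 ** (elapsed_years / doubling_period_years) - 1.0) * 100.0


def implied_doubling_period(growth_pct: float, elapsed_years: float) -> float:
    """Doubling period (in years) implied by a percentage growth over a window."""
    factor = 1.0 + growth_pct / 100.0
    return elapsed_years / math.log2(factor)


YEARS = 3.0  # assumed window: 2021 baseline to 2024

print(f"{growth_percent(2.0, YEARS):.0f}%")                     # two-year doubling
print(f"{implied_doubling_period(6847.0, YEARS):.2f} years")    # AI compute demand
```

Doubling every two years over three years gives about 183% growth, matching the transistor-density figure. A 6847% increase over the same window implies that compute demand doubled roughly every six months.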


Still, chipmakers manage to keep up with GenAI advancements, which marks a departure from the traditional scaling model. In some cases, a chip can be 30 times faster than its predecessor, which was announced less than a year earlier [22]. Such relentless demands force chipmakers to constantly seek new ways to optimize their products.

In Part III of this series, we will discuss the critical optimization factors for GenAI chipmakers. We will explore how chipmakers differentiate their products using novel architectures, packaging strategies, and optimization techniques that target performance, power efficiency, and reliability. This next installment will detail the diverse approaches and innovative solutions shaping the future of AI hardware, essential for winning in today's hyper-competitive GenAI arms race.

This is part 2 of a 3-part blog series:

Click here for part 1 - GenAI's Breakneck Pace is Reshaping the Semiconductor Industry

Click here for part 3 - Critical Optimization Factors for GenAI Chipmakers

Discussing the critical optimization factors for GenAI chipmakers. We will explore how chipmakers differentiate their products using novel architectures, packaging strategies, and optimization techniques that target performance, power efficiency, and reliability.

GenAI's Breakneck Pace is Reshaping the Semiconductor Industry | proteanTecs Blog

Admin — Wed, 25 Jun 2025 15:00:00 GMT

GenAI’s Explosive Pace Is Shattering the Semiconductor Landscape

Can Your ATPG Do This? Cut Defects Escaping Detection With ML | proteanTecs Blog

Admin — Thu, 01 May 2025 15:00:00 GMT

Identify early indicators of risk by analyzing timing margin data from within the chip.

Chipmakers worldwide consider Automatic Test Pattern Generation (ATPG) their go-to method for achieving high test coverage in production. ATPG generates test patterns designed to detect faults in the silicon and ensures they are applied effectively using the chip’s Design-for-Test (DFT) infrastructure. This combination enhances fault detection while optimizing test efficiency.

These patterns are injected by Automatic Test Equipment (ATE) into each die during high-volume manufacturing (HVM), enabling solid quality control through large-scale testing of all chips.

ATPG at-speed tests target different kinds of faults (e.g., transition faults, small delay faults) and have earned their spot in the semiconductor testing hall of fame. But what about their limitations? This article explores the risks of ATPG drawbacks and their remedies to help you create a robust test program that cuts defects without affecting yield.

Understanding ATPG’s limitations and their impact

If you’re worried about your test patterns letting defects slip through, you’re not alone. Despite its advantages, conventional ATPG may not catch small, latent, and marginal defects, and it can even produce false positives and false negatives:

Latent/marginal defects: A threat to product reliability

One of the major concerns is defects that are too subtle for the pass/fail granularity of ATPG results. The marginal performance of such chips is just enough to pass all patterns on ATE, yet they are “walking-wounded” devices.

These issues often escape detection until customers discover them in the field. For example, undetected defects that potentially cause Silent Data Corruption (SDC) might lead to costly post-release issues that jeopardize product reliability and customer trust. They can also cost as much as $50,000 per return merchandise authorization (RMA), not counting lost reputation and the resources diverted from other projects to investigate. You can read more about such faults and their remedies in this whitepaper.

Misalignment between ATPG and real-life conditions

Another inherent limitation is the potential misalignment between test patterns and real-world scenarios, raising doubts about whether ATPG truly reflects the conditions a chip will face during lifetime operation.

To compensate for this limitation, chipmakers may tighten test thresholds, but this can lead to two risks. Overly stringent testing (overkill) may generate unrealistic patterns that cause unnecessary failures at ATE, reducing yield without real benefit. On the other hand, insufficiently representative patterns (underkill) may overlook defects that could emerge under actual workloads, leading to field failures.

Striking the right balance is critical to ensuring both high yield and long-term reliability.

What if you had high coverage timing margin data from within the chip?

Many latent faults in the field exhibit abnormal behavior that can evolve into future timing violations. These defects often escape detection due to ATPG’s limitations in capturing subtleties. Thankfully, by analyzing timing margin data from within the chip, it’s possible to identify early indicators of risk, addressing blind spots and strengthening confidence in the test program.

Parametric margin data from within the chip mitigates ATPG limitations by tackling their causes.


The result? Imagine a robust test program that catches all those marginal issues in advance. Thanks to powerful machine learning (ML) algorithms, you could analyze high-coverage timing margin data with unprecedented visibility into every die. The ML model can be loaded onto the ATE to eliminate the blind spots of your ATPG patterns automatically.

Timing margin visibility: Enhancing quality with ML precision

Using proteanTecs’ Margin Agents (MA), designed to boost quality without compromising yield during structural tests, the minimum margin to operating frequency of millions of paths is measured, and critical issues are pinpointed per die. By analyzing parametric timing data, these Margin Agents tackle the inherent limitations of ATPG head-on.

The solution learns the normal behavior by processing margin agent readings using ML and can identify anomalies undetectable by ATPG.

The solution includes a cloud-based deep data analytics platform and edge software deployed on the ATE. It leverages advanced machine learning algorithms in the cloud to analyze timing margin measurements. It trains on extensive data to profile normal behavior across different operating conditions and the process distribution. The trained models are then deployed to the edge for inline decisions on the test floor. By generating highly accurate predicted timing margin values across the chip, the solution can detect subtle deviations that ATPG would miss. If the measured timing margin deviates from the predicted value, the chip is flagged as an outlier, allowing preventive action before it reaches the field.
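The predict-then-compare flow described above can be sketched in a few lines. This is not the proteanTecs algorithm: a one-variable least-squares fit stands in for the trained ML model, the "speed index" feature and all data values are hypothetical, and the threshold `k` is an arbitrary choice for the example.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_model(reference):
    """Learn 'normal' margin behavior from (speed_index, margin_ps) pairs."""
    xs = [speed for speed, _ in reference]
    ys = [margin for _, margin in reference]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Robust spread estimate: scaled median absolute deviation (MAD).
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    return {"slope": slope, "intercept": intercept, "sigma": 1.4826 * mad}


def is_outlier(model, speed_index, measured_margin_ps, k=4.0):
    """Flag a die whose margin falls more than k sigma below the prediction."""
    predicted = model["slope"] * speed_index + model["intercept"]
    return (predicted - measured_margin_ps) > k * model["sigma"]


# Hypothetical reference population: (process speed index, timing margin in ps).
REFERENCE = [
    (0.90, 30.5), (0.95, 34.0), (1.00, 40.8), (1.05, 44.6),
    (1.10, 50.9), (1.15, 54.2), (1.20, 60.3), (1.25, 64.7),
]
MODEL = train_model(REFERENCE)

# Both dies have positive margin, so both would pass a plain pass/fail check.
print(is_outlier(MODEL, 1.10, 49.0))
print(is_outlier(MODEL, 1.10, 43.0))
```

In this toy data set the first die sits close to its predicted margin and is not flagged, while the second still has positive margin but falls well below what its process speed predicts, so it is flagged as an outlier. A median-based spread is used so that a few abnormal parts in the reference population do not inflate the threshold.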


Combining on-chip agent readings with precise machine learning models deployed on the ATE.

The solution integrates seamlessly with your workflow:

On-chip timing margin monitors: proteanTecs Margin Agents capture real-time timing margin data from millions of logic paths, which serves as a baseline for ML model creation.
Cloud-based deep data analytics platform: Processes massive datasets with ML to train a model that learns the normal behavior, enabling the detection of anomalies beyond the scope of ATPG’s pass/fail metrics.
Edge software on the ATE: Automates the detection and classification of faulty dies on the ATE by combining real-time margin measurements with a trained model. This enables identification of latent defects and eliminates ATPG blind spots during high-volume manufacturing.

This powerful combination ensures unprecedented visibility into every die, reducing DPPM, preventing costly RMAs, and driving confidence in your test program.

Eliminating your ATPG blind spots to reduce DPPM and RMA-related costs

proteanTecs MA-based outlier detection can prevent the escapes of marginal and latent defects characteristic of complex designs and advanced nodes. Such issues might pass conventional ATPG tests as they are too subtle to detect, yet they can cause hardware failures in the field. The shift left that timing margin measurements enable directly reduces DPPM and RMA-related costs, by moving detection from the field to production testing.

As depicted below, the new data can help to make informed decisions regarding quality. A close examination of the wafer-level testing results on the left reveals a faulty outlier: a die that had enough margin to pass all ATPG patterns, including at-speed patterns, yet deviates clearly from the expected behavior. Following the detection of the outlier die, the software pinpoints the location in the chip where the problem occurred.

Reducing DPPM while simplifying defect investigation: proteanTecs MA-based outlier detection uses ML to identify faulty outliers undetectable by ATPG and then pinpoints the exact location of the problem in the chip.


Customers report a significant DPPM reduction thanks to proteanTecs MA, detecting chips with timing margin issues that passed all ATPG tests.

Customers report a significant DPPM reduction thanks to proteanTecs MA-based outlier detection. For one datacenter chipmaker, some high-risk devices passed all ATPG tests, and in fact all production tests, because their performance was marginal rather than unacceptable. After the chipmaker integrated proteanTecs’ solution, the same chips showed lower-than-expected timing margin measurements, leading to their disqualification. If undetected, these units were likely to suffer timing violations that could cause Silent Data Errors after some in-field usage.

Correlating your ATPG and functional tests to reflect real-life conditions

During New Product Introduction (NPI), it is essential to establish a solid test program for High Volume Manufacturing (HVM) testing with ATPG patterns and functional system-level tests (SLT), or even System tests. As explained above, the ATPG patterns might not reflect real workloads, unlike functional tests, potentially hurting yield and DPPM.

To mitigate this misalignment, proteanTecs helps to correlate ATPG patterns and functional workloads by comparing their timing margin measurements, provided by the Margin Agents, on the same devices. There are two options for the alignment process depending on the comparison results:

ATPG timing margins are worse than functional test ones: In this case, ATPG results may be overstressing (from a performance point of view). For example, running ATPG at-speed patterns on the entire chip can cause unnatural IR drops that won’t occur in functional tests. To fix the problem, the patterns can be adjusted to reduce false fallout without compromising quality.
Functional timing margins are worse than ATPG ones: This case is dangerously misleading, making it seem like the chip is doing well, as it passed all test patterns successfully. However, timing margin measurements would reveal insufficient ATPG at-speed coverage instead, calling for additional test patterns that reflect actual functionality.
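The two cases above amount to a simple comparison of per-device margins. The sketch below is a hypothetical illustration of that decision, not proteanTecs software: the function name, the tolerance, the labels, and the sample margins (in ps) are all assumptions.

```python
from statistics import mean


def classify_alignment(atpg_margins, functional_margins, tolerance=2.0):
    """Compare ATPG and functional timing margins measured on the same devices.

    Returns:
      "atpg_overstress"    - ATPG margins are worse (lower) than functional ones
      "atpg_undercoverage" - functional margins are worse than ATPG ones
      "aligned"            - the average difference is within the tolerance
    """
    if not atpg_margins or len(atpg_margins) != len(functional_margins):
        raise ValueError("need one ATPG and one functional margin per device")
    deltas = [a - f for a, f in zip(atpg_margins, functional_margins)]
    avg_delta = mean(deltas)
    if avg_delta < -tolerance:
        return "atpg_overstress"
    if avg_delta > tolerance:
        return "atpg_undercoverage"
    return "aligned"


# Hypothetical per-device margins in picoseconds.
print(classify_alignment([10.0, 12.0, 11.0], [20.0, 21.0, 19.0]))
print(classify_alignment([30.0, 32.0], [20.0, 21.0]))
```

The first call corresponds to patterns that overstress the chip and may need relaxing; the second corresponds to insufficient at-speed coverage, which calls for additional patterns.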

The proteanTecs solution correlates margin agent data of wafer-level chip probing (left bar) and system-level test (right bar) to help reflect real-life conditions in ATPG patterns.

For example, the Margin Agent measurements above show that wafer-level ATPG timing margins are much higher than functional ones on average. These results imply that ATPG patterns fail to reflect real workloads, potentially leading to systematic failures in the field. Once the chipmaker noticed this gap, the test engineering team worked to extend ATPG patterns until their margins were aligned with functional ones.

Taking ATPG to the field

You can also use timing margin monitoring when the chip is in the field, beyond NPI and HVM. This approach is aligned with the trend of running ATPG in the field at pre-defined test cycles or during SLT, known as “In-System Test.”

With In-System Test in the field, the timing margin information provided by the Margin Agents can once again show how close to failure a device is, even if it passes the test. Because the Margin Agents can also measure while the device executes real workloads, timing margin monitoring is available both during normal operation and when running deterministic ATPG tests during In-System Test cycles.

If a malfunctioning chip is returned as an RMA, you can compare its timing margins across three different measurements:

Original ATPG results during HVM
Functional mode in the field
Post-RMA ATPG results

This approach can accelerate root-cause analysis, supporting test program improvements and design optimizations.

Ready to cut your DPPM and shift left defect detection? Download our exclusive whitepaper or contact our team today at this link.

By integrating with Arm SMCF, proteanTecs strengthens its offering of predictive deep data solutions that span the entire deployment of Neoverse CSS-based custom SoCs. | proteanTecs Blog

Admin — Mon, 20 Jan 2025 11:15:00 GMT

Rising Complexity in System Design

In an era where system complexity is scaling rapidly, real-time monitoring and predictive analytics play a pivotal role in maintaining lifetime performance and reliability. At proteanTecs, we are committed to enabling advanced diagnostics, predictive maintenance, and on-chip actionable visibility for today’s mission-critical systems, across high-performance industries.