How to Prevent Memory Failures in Your Data Center

Jul 11, 2022

This blog post discusses how to prevent memory failures in your data center and maintain reliability, availability, and serviceability (RAS).

Cloud data center managers have their hands full dealing with hardware failures that can impact service availability and revenue. Unfortunately, as data center operators know all too well, memory failures are among the most common of them. Unlike some other hardware failures, a memory failure can have a devastating effect without giving enough warning to take preemptive action.

By using machine learning to analyze real-time memory health data, it is possible to predict such failures ahead of time. Machine learning finds hidden patterns in data sets, so applying it to memory health data can reveal issues early and indicate when a failure is likely to occur. This gives data center operators the time they need to act and prevent an outage, which in turn leads to better uptime for data center operations.
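To make the idea concrete, here is a minimal sketch in Python of turning a log of correctable memory errors into per-module features and flagging modules that look risky. Everything in it is an assumption for illustration: the `CorrectableError` record, the feature names, and the fixed thresholds in `at_risk` are hypothetical, and the threshold rule is only a placeholder for a trained model. It is not how Intel's Memory Resilience Technology works internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """Hypothetical record of one correctable memory error."""
    dimm: str         # module label, e.g. "CPU0_DIMM_A1" (assumed naming)
    timestamp: float  # seconds since an arbitrary epoch
    row: int          # DRAM row where the error was seen (assumed field)


def extract_features(errors, now, window_s=24 * 3600):
    """Summarize the errors seen per module within the last `window_s` seconds."""
    recent = defaultdict(list)
    for e in errors:
        if now - window_s <= e.timestamp <= now:
            recent[e.dimm].append(e)
    return {
        dimm: {
            "count": len(events),
            "distinct_rows": len({e.row for e in events}),
        }
        for dimm, events in recent.items()
    }


def at_risk(features, max_count=10, max_rows=3):
    """Placeholder rule standing in for a trained model (thresholds are made up)."""
    return sorted(
        dimm
        for dimm, f in features.items()
        if f["count"] >= max_count or f["distinct_rows"] >= max_rows
    )
```

In a real deployment the rule in `at_risk` would be replaced by a model trained on historical failure data. The feature extraction step is the part that usually carries over: it condenses a raw error stream into a few numbers per module that a model can learn from.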

Intel’s Memory Resilience Technology predicts these failures before they happen, using pattern matching based on historical data. It uses a multi-dimensional model and algorithms to predict when a memory module is likely to fail. Memory Resilience Technology is a core technology that every data center and cloud service provider should utilize to reduce total cost of ownership and improve system uptime. This ultimately results in improved data center SLAs, reduced memory failure rates and proactive memory health evaluation.

When it comes to tracking and analyzing memory errors, you need a BIOS that can work closely with your BMC firmware. That’s where the AMI solution comes in. AMI’s Aptio UEFI captures errors and passes the relevant data to our MegaRAC BMC firmware. AMI’s MegaRAC then uses Intel’s Memory Resilience Technology engine to calculate a health score for the affected memory module. This way, AMI’s technology tracks each memory module’s health over time and exposes the results for the data center operator to review.
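The flow described above (error captured, handed to the BMC, scored, tracked over time) can be sketched roughly as follows. This is a simplified illustration and not AMI's or Intel's actual interface: `ModuleHealthTracker`, `report_error`, and `toy_score` are hypothetical names, and the scoring function is a deliberately naive stand-in for the real engine.

```python
from collections import defaultdict


class ModuleHealthTracker:
    """Hypothetical BMC-side store of health scores per memory module."""

    def __init__(self, score_fn):
        self._score_fn = score_fn          # stand-in for the scoring engine
        self._errors = defaultdict(list)   # module -> error records seen so far
        self._history = defaultdict(list)  # module -> [(timestamp, score), ...]

    def report_error(self, module, timestamp, record):
        """Accept an error record (as the BIOS would pass it) and rescore the module."""
        self._errors[module].append(record)
        score = self._score_fn(self._errors[module])
        self._history[module].append((timestamp, score))
        return score

    def latest(self, module):
        """Most recent score for a module, or None if it has no recorded errors."""
        history = self._history.get(module)
        return history[-1][1] if history else None

    def history(self, module):
        """Copy of the (timestamp, score) series an operator could review."""
        return list(self._history.get(module, []))


def toy_score(records):
    """Illustrative only: start at 100 and lose 5 points per recorded error."""
    return max(0, 100 - 5 * len(records))
```

Passing the scoring function in as a parameter mirrors the division of labor in the text: the firmware collects and stores the data, while the scoring engine is a separate component that can be swapped without changing how history is kept.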

So, what are you waiting for? With Memory Resilience Technology, we’ve got you covered whether you’re dealing with a few isolated errors or a full-blown memory crisis.

Resources

Memory Resilience Technology overview video co-presented by Intel and AMI (English, Mandarin, Japanese)
Data Center cost savings calculator using Memory Resilience Technology
AMI Firmware Solutions for Intel Memory Resilience Technology Data Sheet

