How to Prevent Memory Failures in Your Data Center

Jul 11, 2022 | Tech Blog

Dos Terasaka

Aptio Product Manager, Global Product Group

This blog post discusses how to prevent memory failures in your data center while maintaining reliability, availability, and serviceability (RAS).

Cloud data center managers have their hands full dealing with hardware failures that can impact service availability and revenue. As data center operators know all too well, memory failures are among the most common of these. Unlike some other hardware failures, a memory failure can have a devastating effect while giving too little warning of the coming outage for operators to take preemptive action.

By using machine learning to analyze real-time memory health data, it is possible to predict such failures ahead of time. Machine learning finds hidden patterns in data sets and uses them to predict future events. Applied to memory health data, it can detect issues early and estimate when a failure is likely to occur. This gives data center operators the time they need to act and prevent an outage, which in turn leads to better uptime for data center operations.

Intel’s Memory Resilience Technology predicts these failures before they happen, using pattern matching based on historical data. It uses a multi-dimensional model and algorithms to predict when a memory module is likely to fail. Memory Resilience Technology is a core technology that every data center and cloud service provider should consider to reduce total cost of ownership and improve system uptime. This ultimately results in improved data center SLAs, reduced memory failure rates, and proactive memory health evaluation.

When it comes to tracking and analyzing memory errors, you need a BIOS that can work closely with your BMC firmware. That’s where the AMI solution comes in. AMI’s Aptio UEFI captures errors and passes the relevant data to our MegaRAC BMC firmware. AMI’s MegaRAC then uses Intel’s Memory Resilience Technology engine to calculate a health score for the affected memory module. This way, AMI’s technology tracks each memory module’s health over time and exposes the results for the data center operator to review.
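To make the flow above more concrete, here is a minimal Python sketch of per-module health tracking: error records arrive (as they would from the BIOS), are grouped by memory module, and are reduced to a score an operator can review. Everything in it is an assumption made for illustration. The `CorrectableError` fields, the `DimmHealthTracker` class, and the simple "recent error count" scoring rule are hypothetical and are not the Aptio, MegaRAC, or Intel Memory Resilience Technology interfaces or algorithm, which rely on a far richer multi-dimensional model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CorrectableError:
    """One correctable-error record (hypothetical fields, for illustration only)."""
    dimm: str         # module locator, e.g. "CPU0_DIMM_A1"
    timestamp: float  # seconds since an arbitrary epoch
    row: int
    column: int


class DimmHealthTracker:
    """Toy per-module tracker. NOT Intel's Memory Resilience Technology algorithm."""

    def __init__(self, window_seconds: float = 3600.0, errors_for_zero: int = 20):
        # Assumed policy: only errors inside a sliding time window count, and
        # `errors_for_zero` errors inside that window drive the score to 0.
        self.window_seconds = window_seconds
        self.errors_for_zero = errors_for_zero
        self._events = defaultdict(list)

    def record(self, err: CorrectableError) -> None:
        """Store an error record under the module that reported it."""
        self._events[err.dimm].append(err)

    def health_score(self, dimm: str, now: float) -> int:
        """Return 0..100, where 100 means no recent errors were recorded."""
        recent = [
            e for e in self._events.get(dimm, [])
            if now - e.timestamp <= self.window_seconds
        ]
        penalty = min(len(recent), self.errors_for_zero) / self.errors_for_zero
        return round(100 * (1 - penalty))

    def at_risk(self, now: float, threshold: int = 80) -> list:
        """List modules whose score is below the threshold, sorted by name."""
        return sorted(
            d for d in self._events if self.health_score(d, now) < threshold
        )


if __name__ == "__main__":
    tracker = DimmHealthTracker()
    for i in range(5):
        tracker.record(CorrectableError("CPU0_DIMM_A1", 1000.0 + i, row=1, column=i))
    print(tracker.health_score("CPU0_DIMM_A1", now=2000.0))
    print(tracker.at_risk(now=2000.0))
```

A real predictor would weigh much more than a raw count, such as where in the module the errors occur and how the pattern evolves over time. The sketch only shows the shape of the idea: keep a history per module, then reduce it to a score that can be tracked and reviewed.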

So, what are you waiting for? With Memory Resilience Technology, we’ve got you covered whether you’re dealing with a few isolated errors or a full-blown memory crisis.

About AMI

AMI is Firmware Reimagined for modern computing. As a global leader in Dynamic Firmware for security, orchestration and manageability solutions, AMI enables the world’s compute platforms from on-premises to the cloud to the edge. AMI’s industry-leading foundational technology and unwavering customer support have generated lasting partnerships and spurred innovation for some of the most prominent brands in the high-tech industry. AMI is also a critical provider to the Open Compute ecosystem and is a member of numerous industry associations and standards groups, such as the Unified EFI Forum (UEFI), PICMG, National Institute of Standards and Technology (NIST), National Cybersecurity Excellence Partnership (NCEP), and the Trusted Computing Group (TCG).
