Dos Terasaka
Aptio Product Manager, Global Product Group
This blog post discusses how to prevent memory failures in your data center and maintain reliability, availability, and serviceability (RAS).
Cloud data center managers have their hands full dealing with hardware failures that can impact service availability and revenue. As data center operators know all too well, memory failures are among the most common of them. Unlike some other hardware failures, a memory failure can have a devastating effect while giving too little warning of an impending outage for operators to take preemptive action.
Applying machine learning to real-time memory health data makes it possible to predict such failures ahead of time. Machine learning finds hidden patterns in data sets and uses them to predict future events, so it can detect memory issues early and estimate when a failure is likely to occur. This gives data center operators the time they need to act and prevent an outage, which in turn leads to better uptime for data center operations.
When it comes to tracking and analyzing memory errors, you need a BIOS that can work closely with your BMC firmware. That’s where the AMI solution comes in. AMI’s Aptio UEFI captures errors and passes the relevant data to our MegaRAC BMC firmware. AMI’s MegaRAC then uses Intel’s Memory Resilience Technology engine to calculate a health score for the affected memory module. This way, AMI’s technology tracks each memory module’s health over time and exposes the results for the data center operator to review.
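To make that flow a little more concrete, here is a minimal Python sketch of the idea: error records arrive from the BIOS side, a BMC-side tracker scores the affected module, and the score history is kept for an operator to review. Everything in it is an assumption made for illustration. The `MemoryErrorRecord` fields, the `HealthTracker` class, the DIMM naming, and especially `placeholder_score` are hypothetical and do not reflect the actual Aptio, MegaRAC, or Intel Memory Resilience Technology interfaces or scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryErrorRecord:
    """Hypothetical error record handed from the BIOS to the BMC."""
    dimm: str          # module locator, e.g. "CPU0_DIMM_A1" (made-up naming)
    timestamp: int     # seconds since boot
    correctable: bool  # True for a correctable error, False for uncorrectable


def placeholder_score(records):
    """Stand-in for the scoring engine.

    NOT Intel's algorithm: it simply starts at 100 and subtracts
    1 per correctable error and 50 per uncorrectable error.
    """
    penalty = sum(1 if r.correctable else 50 for r in records)
    return max(0, 100 - penalty)


class HealthTracker:
    """Plays the BMC role: stores records and a score history per module."""

    def __init__(self, score_fn=placeholder_score):
        self._score_fn = score_fn
        self._records = defaultdict(list)
        self._history = defaultdict(list)

    def report(self, record):
        """Accept one error record and rescore the affected module."""
        self._records[record.dimm].append(record)
        score = self._score_fn(self._records[record.dimm])
        self._history[record.dimm].append((record.timestamp, score))
        return score

    def history(self, dimm):
        """Return the (timestamp, score) history for one module."""
        return list(self._history[dimm])


tracker = HealthTracker()
tracker.report(MemoryErrorRecord("CPU0_DIMM_A1", 10, correctable=True))
tracker.report(MemoryErrorRecord("CPU0_DIMM_A1", 25, correctable=True))
tracker.report(MemoryErrorRecord("CPU0_DIMM_A1", 40, correctable=False))
print(tracker.history("CPU0_DIMM_A1"))
```

The scoring function is passed in as a parameter to mirror the division of labor described above: the tracker only collects records and keeps history, while the scoring itself is delegated to a separate engine.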
So, what are you waiting for? With Memory Resilience Technology, we’ve got you covered whether you’re dealing with a few isolated errors or a full-blown memory crisis.
Resources
- Memory Resilience Technology overview video co-presented by Intel and AMI (English, Mandarin, Japanese)
- Data Center cost savings calculator using Memory Resilience Technology
- AMI Firmware Solutions for Intel Memory Resilience Technology Data Sheet