Rami Radi
Sr. Product Manager and Solution Architect
Challenges of Managing AI Cluster Density
AI clusters demand careful management of power, temperature, and component health to prevent issues like overheating and downtime. Traditional data center tools were not built for such densely packed clusters with thousands of GPUs and compute nodes. Managing these environments efficiently requires a new, integrated approach to keep every component optimized.
The Wiwynn-AMI Solution
AMI and Wiwynn have collaborated so that AMI’s Data Center Manager (DCM) and Wiwynn’s Universal Management System (UMS100) work in tandem as a management platform built specifically for high-density, liquid-cooled AI clusters. Wiwynn’s UMS100 manages the liquid cooling units, providing real-time monitoring to keep GPUs and other components at ideal operating temperatures, which reduces the risk of thermal issues and prolongs equipment life. AMI DCM v6.0 (soon to be released) serves as the central platform that tracks power, thermal, health, and carbon metrics across the entire cluster. Together, they allow data center managers to monitor and adjust resources in real time, ensuring efficient and reliable performance.
Key capabilities of AMI DCM include:
- Enhanced GPU Management in DCM v6.0: Given the importance of GPUs in AI clusters, DCM v6.0 introduces new GPU management features, including the ability to monitor GPU utilization, temperatures, and power use, along with diagnostic and control actions such as GPU resets and power capping. Combined with UMS100 capabilities, these features provide additional insights such as leak detection, reservoir level notifications, and changes in flow rates, so operators can take preemptive action.
- Meeting Performance and Sustainability Goals: Balancing AI cluster performance with environmental responsibility is a key challenge. Together, AMI DCM and Wiwynn UMS100 optimize power and cooling use, helping data centers reduce excess energy consumption while maintaining efficient operation. This integrated approach helps data centers achieve both operational excellence and sustainability goals, both of which are crucial for meeting today’s industry standards.
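To make the cooling insights mentioned above more concrete, here is a minimal sketch of how leak, reservoir level, and flow rate signals might be turned into alerts. It is an illustration only and does not represent the DCM or UMS100 interfaces: the `CoolantReading` fields, the `check_cooling` function, and the threshold values are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoolantReading:
    """One sample from a liquid cooling unit (hypothetical field names)."""
    unit_id: str
    leak_detected: bool
    reservoir_level_pct: float  # reservoir fill level, 0-100
    flow_rate_lpm: float        # coolant flow in liters per minute


def check_cooling(prev: Optional[CoolantReading],
                  cur: CoolantReading,
                  min_reservoir_pct: float = 20.0,
                  max_flow_change_pct: float = 15.0) -> List[str]:
    """Return alert messages for the current reading.

    The thresholds are illustrative defaults, not vendor recommendations.
    """
    alerts = []
    if cur.leak_detected:
        alerts.append(f"{cur.unit_id}: leak detected")
    if cur.reservoir_level_pct < min_reservoir_pct:
        alerts.append(
            f"{cur.unit_id}: reservoir low ({cur.reservoir_level_pct:.0f}%)")
    # Compare flow against the previous sample, if there is one.
    if prev is not None and prev.flow_rate_lpm > 0:
        change = (abs(cur.flow_rate_lpm - prev.flow_rate_lpm)
                  / prev.flow_rate_lpm * 100)
        if change > max_flow_change_pct:
            alerts.append(f"{cur.unit_id}: flow rate changed {change:.0f}%")
    return alerts


previous = CoolantReading("cdu-1", False, 80.0, 40.0)
current = CoolantReading("cdu-1", False, 15.0, 30.0)
for message in check_cooling(previous, current):
    print(message)
```

In this example the reservoir level falls below the assumed 20% floor and the flow rate drops by a quarter between samples, so both conditions are reported before they become a thermal problem.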
Addressing the Firmware Conundrum
By centralizing and automating firmware management, AMI DCM eliminates the complexities of manual tracking and updates, enabling IT administrators to maintain consistent firmware versions across all devices. Consistent, current firmware improves security and performance, while batch and scheduled updates minimize downtime. Because AMI DCM’s firmware management is done out-of-band, administrators can check firmware versions and perform updates without physical access to servers and without installing any software on the servers’ operating systems. This is particularly useful in distributed environments where accessing systems directly is difficult or impractical.
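The idea of keeping firmware versions consistent and updating in batches can be sketched as follows. This is a simplified illustration under stated assumptions, not AMI DCM's implementation: the inventory dictionary, the host names, the version strings, and the `find_firmware_drift` and `plan_batches` helpers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_firmware_drift(inventory: Dict[str, str],
                        target: str) -> Dict[str, List[str]]:
    """Group hosts by reported firmware version, skipping hosts on target."""
    drift = defaultdict(list)
    for host, version in sorted(inventory.items()):
        if version != target:
            drift[version].append(host)
    return dict(drift)


def plan_batches(hosts: List[str], batch_size: int) -> List[List[str]]:
    """Split hosts into fixed-size batches for staged updates."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Hypothetical inventory: host name -> reported firmware version.
inventory = {
    "node-01": "2.1.0",
    "node-02": "2.0.3",
    "node-03": "2.1.0",
    "node-04": "2.0.3",
    "node-05": "1.9.8",
}

drift = find_firmware_drift(inventory, target="2.1.0")
stale_hosts = sorted(host for hosts in drift.values() for host in hosts)
batches = plan_batches(stale_hosts, batch_size=2)
print(drift)
print(batches)
```

Updating in small batches rather than all at once limits how many nodes are out of service at any moment, which is one way scheduled updates can reduce downtime.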
Learn More
Download the whitepaper and discover how AMI and Wiwynn are redefining AI cluster management with DCM v6.0. For more information about AMI DCM, download the brochure and schedule a demo today!