Discover how the HPE Cray XD670 delivers strong results in the MLPerf Training v4.0 benchmark, introduces support for NVIDIA H200 Tensor Core GPUs, and advances sustainability with optional direct liquid cooling.
In the race to deploy and successfully implement AI environments, two key requirements are becoming increasingly critical for service providers and AI model builders:
- Achieving the highest possible performance with the latest accelerator technologies
- Addressing the escalating cooling needs of these accelerators
On the first requirement, we are pleased to share that HPE’s premier AI training platform, the HPE Cray XD670, has once again demonstrated strong performance results, this time in the recently published MLPerf Training v4.0 benchmarks. HPE submitted nine performance results against five AI models in three categories: LLM fine-tuning, NLP training, and computer vision training. HPE Cray XD670, in single- and double-node configurations, delivered strong on-premises training and fine-tuning performance, including:
- Overall #2 fastest single-node system when compared to other 8x NVIDIA H100 SXM5 80GB servers in both NLP (BERT) training and LLM (Llama 2 70B for LoRA) fine-tuning.
- The 2-node HPE Cray XD670 configuration, with a total of 16 NVIDIA H100 SXM5 80GB GPUs, outperformed a 16-node server configuration with a total of 64 NVIDIA L40S GPUs on Llama 2 fine-tuning tasks.
Based on the limited comparable results, HPE Cray XD670 was the overall fastest 2-node configuration in all submitted models.
These results are in addition to prior MLPerf Inference v4.0 benchmark results published in March, where HPE Cray XD670 achieved the #1 spot for natural language processing (NLP with BERT 99.0, Offline scenario) and was also a top performer in all the categories in which it participated, including GenAI, computer vision, and large language models. (Read our blog “Boost AI performance with the leading server for natural language processing” for more details.)
Implementing the latest accelerator technology is another contributing factor to delivering high performance, so we expect future benchmark results to get even better as HPE Cray XD670 now supports eight NVIDIA H200 SXM Tensor Core GPUs. Additionally, our portfolio of AI training solutions will continue to evolve, and we plan to be time-to-market with future key GPU releases, as shared by our CEO Antonio Neri during his HPE Discover keynote.
Liquid cooling: Efficient today, essential tomorrow
With powerful CPUs and GPUs soon to draw over 500 watts, traditional air-cooling setups are being strained. Organizations are starting to realize that a different way to cool these environments is necessary, especially as energy requirements are only expected to rise as technology evolves. HPE Cray XD670 comes with a direct liquid cooling option that addresses the power and cooling needs of today and tomorrow.
This option applies direct liquid cooling to the hottest components in the server, such as the GPUs and CPUs, which accounts for about 70% of the cooling, while air cooling handles the remaining 30% or so for the low-heat components. The racks come pre-filled with coolant and ready to plug into facility water connections.
These racks are fully integrated, installed, and supported by HPE. They can become 100% liquid cooled when combined with HPE liquid-to-air cooling solutions: HPE Rear Door Heat Exchanger (RDHX) or HPE Adaptive Rack Cooling Solution (ARCS). These solutions use facility chilled water to deliver cold air where it is needed most in the rack.
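To make the roughly 70/30 split described above concrete, here is a minimal sketch of the arithmetic. The 10 kW heat load is a hypothetical figure chosen only for illustration, not an HPE Cray XD670 specification, and the function name `split_heat_load` is likewise an assumption.

```python
def split_heat_load(total_kw: float, liquid_fraction: float = 0.70) -> tuple[float, float]:
    """Split a heat load into the part removed by liquid and the part left to air.

    liquid_fraction defaults to the approximate 70% share cited in the article.
    """
    if total_kw < 0:
        raise ValueError("total_kw must be non-negative")
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    liquid_kw = total_kw * liquid_fraction
    air_kw = total_kw - liquid_kw
    return liquid_kw, air_kw


# Hypothetical 10 kW heat load, used purely as an example value.
liquid_kw, air_kw = split_heat_load(10.0)
print(f"Liquid-cooled share: {liquid_kw:.1f} kW, air-cooled share: {air_kw:.1f} kW")
```

Under these assumptions, about 7 kW of a 10 kW load would be handled by liquid and about 3 kW by air. Actual shares depend on the configuration.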
Some of the benefits of direct liquid cooling over air cooling include:
- Lower operational costs and improved efficiencies. In an analysis conducted by HPE, liquid cooling was shown to deliver about 20% more performance per kW and reduce chassis power requirements by about 15% over 5 years. Although this study was performed using a different HPE platform, the results are representative of the expected benefits of a liquid-cooled setup.
- Reduced environmental impact. Reducing power consumption with more efficient cooling can help organizations meet environmental, social, and governance (ESG) goals and reduce their data center’s CO2 equivalent (CO2e) footprint.
- Higher density that can defer expensive data center upgrades. In space-constrained data centers, liquid cooling can enable denser rack configurations, helping maximize available space.
- Improved reliability and predictability. Liquid cooling can prolong component life by providing stable operating temperatures, avoiding overheating conditions, and improving overall availability.
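As a rough illustration of the efficiency figures cited in the list above (about 20% more performance per kW and about 15% lower chassis power), the sketch below applies them to an air-cooled baseline. The baseline values (10 kW chassis power, 100 performance units per kW) and the function name `liquid_cooled_estimate` are hypothetical and are not taken from the HPE analysis.

```python
def liquid_cooled_estimate(
    air_chassis_kw: float,
    air_perf_per_kw: float,
    power_reduction: float = 0.15,   # ~15% lower chassis power, per the article
    perf_per_kw_gain: float = 0.20,  # ~20% more performance per kW, per the article
) -> tuple[float, float]:
    """Return (chassis power in kW, performance per kW) for a liquid-cooled setup."""
    liquid_chassis_kw = air_chassis_kw * (1.0 - power_reduction)
    liquid_perf_per_kw = air_perf_per_kw * (1.0 + perf_per_kw_gain)
    return liquid_chassis_kw, liquid_perf_per_kw


# Hypothetical air-cooled baseline, chosen only to show the arithmetic.
chassis_kw, perf_per_kw = liquid_cooled_estimate(air_chassis_kw=10.0, air_perf_per_kw=100.0)
print(f"Estimated chassis power: {chassis_kw:.2f} kW")
print(f"Estimated performance per kW: {perf_per_kw:.1f} units")
```

Because the article notes that the study used a different HPE platform, treat the output as an indicative estimate rather than a measured result.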
The analysis conducted by HPE points to opportunities for an 87.3% reduction in carbon emissions and power consumption due to cooling, and a potential 77.5% reduction in data center space requirements. The results are summarized in the chart below.
HPE’s future-focused cooling strategy advances sustainability and increases efficiencies. Our expertise in liquid cooling and our leadership portfolio are second to none, a culmination of over five decades of experience and innovation. Today, we continue to lead the industry in preparing for the next generation of liquid cooling.
Achieve leading-edge, sustainable performance for AI training and tuning
In today’s fast-paced and complex world of AI deployments, performance and sustainability are critical for organizations looking to get ahead of their competition. HPE’s proven expertise in setting up high-performing, large-scale AI clusters with bespoke cooling solutions positions us as a leader to help you in your journey. The HPE Cray XD670 proves this, with the latest MLPerf benchmarks demonstrating exceptional performance results in AI training and fine-tuning across various models. Additionally, our commitment to liquid cooling addresses escalating cooling demands, not only with immediate efficiency gains but also by future-proofing against rising energy requirements.
Ready for more?
Visit the webpage to see the explainer video, view a demo, and read the solution brief about the HPE Cray XD670.
Check out Jason Zeiler’s Tech Talk from HPE Discover, “The future of liquid cooling for data centers,” for an overview of HPE’s liquid cooling solutions.
This article, republished with permission, originally appeared at https://community.hpe.com/t5/ai-unlocked/maximize-performance-sustainably-for-ai-training-with-hpe-cray/ba-p/7222696 on August 12, 2024.