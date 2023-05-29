



New technology demonstrates the synergies that can be achieved in AI by combining CPUs, DPUs, and GPUs, especially for very large AI models.

After leading a 25% gain in NVIDIA shares last week, CEO Jensen Huang flew to Computex in Taipei to show how the company will continue to lead as it approaches the $1 trillion market capitalization milestone. He launched a number of new products that show how NVIDIA intends to do that. These products alone won’t get the company to $2 trillion, but they show how important generative AI is to getting there.

What did NVIDIA announce?

NVIDIA’s major announcements are usually reserved for the annual GTC event in Silicon Valley. But this year is different. This is the year that AI graduates from being a great technology to being a must-have solution with near-human intelligence and boundless knowledge that every Global 500 company on the planet needs to master.

NVIDIA announced products ranging from AI supercomputers to talking game characters, but here we focus on three technologies that directly affect this AI moment: the Grace Hopper-based, 256-GPU DGX GH200; the MGX platform for system builders; and the new Spectrum-X Ethernet networking that ties it all together.

GH200

NVIDIA CEO Jensen Huang has long told us that NVIDIA is in the business of selling optimized data centers, not chips and components. If you mistook that for an exaggeration, you probably missed the boat. NVIDIA has been selling DGX SuperPods since the A100, and while these systems could get pretty big, the NVLink-enabled optimizations (shared memory) were limited to 8 GPUs: programmers could treat the 8 GPUs as one big GPU sharing a large memory pool.

But when GPT-4, rumored to contain a trillion parameters, hit the street, the definition of large changed dramatically. Training these huge AI models and running inference queries against them requires large amounts of high-speed shared memory (HBM). Customers such as Google, Meta, and Microsoft/OpenAI need a much larger footprint.

This time, Jensen announced the DGX GH200, a large-memory supercomputer for generative AI. It leverages Grace Hopper and NVLink to train large-scale (that word again) AI models and drive AI innovation. What DGX did for small AI, the DGX GH200 can do for builders of large AI. Interconnected with NVLink, the DGX GH200 delivers 1 exaflop of AI (low-precision) performance and 144 terabytes of shared memory, nearly 500 times more than the previous-generation NVIDIA DGX A100, which launched in 2020. Compared with the latest PCIe technology, the Grace Hopper design provides a 7x increase in bandwidth between GPU and CPU, reduces interconnect power consumption by more than 5x, and gives the DGX GH200 supercomputer a 600 GB Hopper architecture GPU building block.
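As a rough sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. The 144 TB, 256-GPU, and 1 exaflop values come from the text; the 320 GB figure for DGX A100 GPU memory (8 GPUs at 40 GB each) and the 1 TB = 1,024 GB conversion are our own assumptions, not numbers from the announcement.

```python
# Figures quoted in the article.
TOTAL_SHARED_MEMORY_TB = 144
NUM_SUPERCHIPS = 256
AI_PERFORMANCE_EXAFLOPS = 1.0

# Assumptions (not from the article):
DGX_A100_GPU_MEMORY_GB = 320  # assumed: 8 GPUs x 40 GB in the original DGX A100
GB_PER_TB = 1024              # assumed binary conversion
PETAFLOPS_PER_EXAFLOP = 1000


def memory_per_superchip_gb() -> float:
    """Shared memory contributed by each Grace Hopper building block."""
    return TOTAL_SHARED_MEMORY_TB * GB_PER_TB / NUM_SUPERCHIPS


def memory_ratio_vs_dgx_a100() -> float:
    """How many times more shared memory than the assumed DGX A100 figure."""
    return TOTAL_SHARED_MEMORY_TB * GB_PER_TB / DGX_A100_GPU_MEMORY_GB


def petaflops_per_superchip() -> float:
    """Low-precision AI performance per building block."""
    return AI_PERFORMANCE_EXAFLOPS * PETAFLOPS_PER_EXAFLOP / NUM_SUPERCHIPS


if __name__ == "__main__":
    print(f"Memory per superchip: {memory_per_superchip_gb():.0f} GB")
    print(f"Memory vs. DGX A100:  {memory_ratio_vs_dgx_a100():.1f}x")
    print(f"AI performance each:  {petaflops_per_superchip():.2f} PFLOPS")
```

Under these assumptions, each building block works out to 576 GB, in line with the roughly 600 GB quoted, and the memory ratio lands at about 460x, consistent with "nearly 500 times."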

The DGX GH200 is based on the Grace CPU, Hopper GPU, and NVLink switch, and increases system memory nearly 500-fold for research and deployment of very large AI models.

Nvidia

When it comes to hyperscalers, NVIDIA named Google, Meta, and Microsoft, all of which provided glowing endorsements in NVIDIA’s press release. We would be shocked if these AI leaders didn’t deploy multiple DGX GH200s. Amazon AWS was notably MIA. We believe AWS intends to go it alone in favor of its own AI chips, Inferentia and Trainium. We wish them the best, but these chips simply cannot compete with NVIDIA’s, and AWS networking does not compete with NVLink, not to mention the new NVIDIA Ethernet technology announced at Computex.

The NVLink-connected DGX GH200 can deliver 2-6 times more AI performance than H100 clusters connected with InfiniBand.

Nvidia

As always, NVIDIA is its own biggest customer for the GH200. NVIDIA announced that it is building the Helios supercomputer to advance its AI research and development. This private, Selene-like supercomputer will feature four DGX GH200 systems interconnected with NVIDIA Quantum-2 InfiniBand networking to accelerate data throughput for training large-scale AI models. The system, with 1,024 Grace Hopper Superchips, is expected to be operational by the end of the year.

MGX

NVIDIA also needs to enable the partners who play a vital role in expanding NVIDIA’s market reach into Tier 1 and Tier 2 CSPs. For system vendors such as HPE, Dell, Lenovo, and Penguin, NVIDIA created the HGX reference board, which enables 8-way GPU connectivity. But to make the technology more flexible to mix and match, Jensen unveiled what he calls MGX. MGX enables over 100 unique configurations of NVIDIA CPU, GPU, and DPU components to meet customers’ individual needs. NVIDIA announced that six partners, typically involved in building hyperscale infrastructure, are the first to sign up for MGX.

The new NVIDIA MGX is a reference architecture that standardizes a single design across multiple systems and across CPU, GPU, and DPU generations for system OEM partners.

NVIDIA Spectrum-X

NVIDIA gained InfiniBand technology three years ago when it acquired Mellanox. InfiniBand is ideal for supercomputers because it is lossless, unlike Ethernet, which recovers from a dropped packet by retrying. And again. And so on, until the lost network packet finally reaches its destination. That is fine for cloud services, but not for HPC. And, as it turns out, it is not well suited to large-scale AI either.
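To see why retry-based recovery hurts synchronized workloads, here is a toy model in Python. It is a sketch under our own assumptions (independent drops with a fixed probability, one resend per drop, and hypothetical function names), not a model of TCP/IP or of any NVIDIA product.

```python
import random


def expected_attempts(drop_probability: float) -> float:
    """Mean number of sends per packet when each send is dropped independently."""
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    return 1.0 / (1.0 - drop_probability)


def send_with_retries(num_packets: int, drop_probability: float,
                      rng: random.Random) -> int:
    """Total sends needed to deliver every packet, resending each until it arrives."""
    total_sends = 0
    for _ in range(num_packets):
        while True:
            total_sends += 1
            # rng.random() is in [0, 1); the send succeeds unless it falls
            # below the drop probability.
            if rng.random() >= drop_probability:
                break
    return total_sends


def slowest_flow(num_flows: int, packets_per_flow: int,
                 drop_probability: float, seed: int = 0) -> int:
    """Sends needed by the slowest of several parallel flows.

    A synchronized step cannot finish until every flow has finished,
    so the maximum is what determines the step time in this toy model.
    """
    rng = random.Random(seed)
    return max(send_with_retries(packets_per_flow, drop_probability, rng)
               for _ in range(num_flows))


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05):
        print(f"drop={p:.2f}  mean sends/packet={expected_attempts(p):.3f}  "
              f"slowest of 64 flows={slowest_flow(64, 1000, p)} sends")
```

In this model, a drop probability of zero makes every flow take exactly the same number of sends, while any nonzero drop rate makes the slowest flow, and therefore the whole step, take longer and vary from run to run.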

Massive AI needs InfiniBand’s performance and lossless packet delivery, but data center operators prefer low-cost, ubiquitous Ethernet networking. As shown in the upper right of the figure below, the TCP/IP protocol tolerates frequent packet drops, which results in large fluctuations in Ethernet bandwidth. And that’s a problem for giant AI.

Spectrum-4 Ethernet switches provide lossless Ethernet with consistent double performance.

Nvidia

NVIDIA’s solution is to offer these customers Spectrum-X, which combines the new Spectrum-4 Ethernet switch with the high-performance BlueField-3 DPU. The combination of Spectrum-4 switches, BlueField NICs, and the NVIDIA networking software stack delivers a 1.7x improvement in overall AI performance and power efficiency, along with the consistency provided by lossless networking. That’s right: NVIDIA promises an Ethernet network that never drops packets. It sounds like magic to me, but this is exactly what enterprise and hyperscale data centers have been asking for, for decades.

Conclusion

NVIDIA takes a holistic approach to training and running large-scale AI models and has laid out a reference architecture that provides 256-GPU building blocks for large-scale AI users to adopt. Combine that with the recent Dell announcements, Grace’s traction in supercomputing, Microsoft Azure support for NVIDIA AI Enterprise, the new MGX reference architecture, NVLink, and the Spectrum-X lossless Ethernet solution, and anyone doing serious AI work has only one logical choice: NVIDIA.

Could this change? Yes. The competition will always be chasing NVIDIA. AMD plans to enter the market in earnest with the MI300 later this year. RISC-V solutions from companies such as Tenstorrent and Esperanto are gaining attention and traction, but not in the market Jensen is focused on: foundation models at scale. Intel may pull a rabbit out of its hat with Gaudi3 or the long-delayed Ponte Vecchio GPU.

But, as I have said before, NVIDIA’s superior hardware, combined with the widest breadth of NVIDIA software for optimizing AI and HPC applications, will likely leave all competitors with a combined 10% of the market. In a $75 billion market, that might be enough to float a few more boats.

But NVIDIA is the only multi-trillion dollar player, and we don’t expect that to change.

