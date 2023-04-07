



A new scientific paper from Google details the performance of the Cloud TPU v4 supercomputing platform, claiming it delivers exascale performance for machine learning with improved efficiency.

The authors of the research paper claim that the TPU v4 is 1.2x to 1.7x faster and uses 1.3x to 1.9x less power than the Nvidia A100 in similarly sized systems. The paper notes that Google did not compare the TPU v4 with Nvidia’s newer H100 GPU, citing the H100’s limited availability and its 4nm architecture (versus the TPU v4’s 7nm architecture).
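To see what the two quoted ranges imply together, the short Python sketch below multiplies speedup by power reduction to get a naive range for performance per watt. It is illustrative arithmetic only: the function name is hypothetical, and the paper's figures vary by workload, so the two extremes need not occur on the same benchmark.

```python
def efficiency_gain_bounds(speedup_range, power_reduction_range):
    """Return naive (low, high) bounds on the performance-per-watt gain.

    Performance per watt scales as speedup * power reduction, so the
    bounds multiply the matching ends of the two ranges.
    """
    low = speedup_range[0] * power_reduction_range[0]
    high = speedup_range[1] * power_reduction_range[1]
    return low, high


# Ranges quoted in the article for TPU v4 versus the Nvidia A100.
low, high = efficiency_gain_bounds((1.2, 1.7), (1.3, 1.9))
print(f"performance per watt: roughly {low:.2f}x to {high:.2f}x")
```

Under these assumptions the combined gain falls between roughly 1.56x and 3.23x.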

As machine learning models grow in size and complexity, so does their need for computing resources. Google’s Tensor Processing Unit (TPU) is a specialized hardware accelerator used to build machine learning models, especially deep neural networks. TPUs are optimized for tensor operations and can greatly improve the efficiency of training and inference for large-scale ML models. Google says the TPU supercomputer’s performance, scalability, and availability make it the mainstay of its large language models such as LaMDA, MUM, and PaLM.

The TPU v4 supercomputer contains 4,096 chips interconnected via a proprietary optical circuit switch (OCS). Google claims the OCS is faster, cheaper, and uses less power than InfiniBand, another popular interconnect technology. Google also claims that the OCS technology accounts for less than 5% of the TPU v4 system’s cost and power, and that it dynamically reconfigures the supercomputer’s interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance.

Google engineers and paper authors Norm Jouppi and David Patterson wrote in a blog post that, thanks to key innovations in interconnect technology and domain-specific accelerators (DSAs), Google Cloud TPU v4 delivers a nearly 10x leap in scaling ML system performance over TPU v3. They added that it also improves energy efficiency by about 2-3x compared with contemporary ML DSAs and reduces CO2e by about 20x compared with DSAs in typical on-premise data centers.

The TPU v4 system has been operating at Google since 2020. The TPU v4 chip was announced at the company’s 2021 I/O developer conference. According to Google, the supercomputers are actively used by leading AI teams for ML research and production across language models, recommender systems, and other generative AI.

As for recommendation systems, Google says its TPU supercomputers are the first with hardware support for embeddings, a key component of deep learning recommendation models (DLRMs) used in advertising, search ranking, YouTube, and Google Play. This is because each TPU v4 includes SparseCores, dataflow processors that accelerate embedding-dependent models by 5x to 7x while using only 5% of die area and power.
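For readers unfamiliar with embeddings, the following Python sketch shows the basic operation a recommendation model performs: it looks up one table row per sparse feature ID and pools the rows into a single vector. This is a minimal illustration of the access pattern only, not a description of how SparseCores implement it; the function name, table contents, and IDs are hypothetical.

```python
def embedding_bag(table, ids):
    """Sum the embedding rows selected by a list of sparse feature IDs."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for i in ids:
        row = table[i]  # gather: one small, irregular memory read per ID
        for d in range(dim):
            pooled[d] += row[d]  # pool: accumulate into one dense vector
    return pooled


# Hypothetical 4-row table of 3-dimensional embeddings.
table = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [4.0, 0.0, 1.0],
]

# Example: a user who interacted with items 1 and 3.
pooled = embedding_bag(table, [1, 3])
print(pooled)
```

Production tables can hold a very large number of rows, so the lookups are dominated by scattered memory reads rather than dense arithmetic, which is the kind of workload that dedicated embedding hardware targets.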

One-eighth of a TPU v4 pod in Google’s ML cluster in Oklahoma, which the company claims runs on about 90% carbon-free energy. (Source: Google)

Midjourney, a text-to-image AI startup, recently chose TPU v4 to train the fourth version of its image generation model. Midjourney founder and CEO David Holz said in a Google blog post that, from training the fourth version of its algorithm on the latest v4 TPUs with JAX to running inference on GPUs, the company has been impressed by the speed at which TPU v4 lets users bring their vibrant ideas to life.

The TPU v4 supercomputer is available to AI researchers and developers at Google Cloud’s ML cluster in Oklahoma, which opened last year. With a total peak performance of 9 exaflops, Google believes the cluster is the largest publicly available ML hub, and it operates on 90% carbon-free energy. Read the TPU v4 research paper here.
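As a rough sanity check on the 9 exaflops figure, the Python sketch below works out the per-chip peak it would imply. The 4,096 chips per supercomputer and the 9 exaflops total come from the article; the pod count of 8 is an assumption that the article does not state, and the function name is hypothetical.

```python
CHIPS_PER_POD = 4096        # from the article
CLUSTER_PEAK_FLOPS = 9e18   # 9 exaflops, from the article
ASSUMED_PODS = 8            # assumption: not stated in the article


def implied_per_chip_peak(cluster_flops, pods, chips_per_pod):
    """Divide the cluster's peak evenly across every chip."""
    return cluster_flops / (pods * chips_per_pod)


per_chip = implied_per_chip_peak(CLUSTER_PEAK_FLOPS, ASSUMED_PODS, CHIPS_PER_POD)
print(f"implied peak: about {per_chip / 1e12:.0f} TFLOPS per chip")
```

With 8 pods the implied figure is about 275 TFLOPS per chip; a different pod count would scale the result proportionally.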

