



Google may be buying heaven only knows how many GPUs to run HPC and AI workloads on its eponymous public cloud, and it has recently encouraged the industry to innovate at the SoC level, but the company is still building its own compute engine, the Tensor Processing Unit (TPU for short), to support the TensorFlow machine learning framework and the applications it drives within Google and as a service for Google Cloud customers.

If you were hoping for a big reveal of the TPUv4 architecture from the search engine giant and machine learning pioneer at this week's Google I/O 2021 conference, you were definitely disappointed, as we were. In the two-hour keynote, Sundar Pichai, chief executive of Google and of its parent company Alphabet, talked only briefly about the TPUv4 custom ASIC, which is designed by Google and presumably manufactured by Taiwan Semiconductor Manufacturing Corp, like all of the other leading-edge compute engines on the planet. As the name implies, the TPUv4 chip is Google's fourth generation of machine learning Bfloat processing beast, which it combines with host systems and networks to create the equivalent of a custom supercomputer.

This is the fastest system Google has ever deployed and a historic milestone for the company, Pichai explained in the keynote. Previously, to get to an exaflops, you had to build a custom supercomputer, he said, but Google already has many of these systems deployed today. Soon, dozens of TPUv4 pods will be installed in its datacenters, many of them running at or near 90 percent carbon-free energy. In addition, the TPUv4 pods will be available to cloud customers later this year. It is very exciting to see this pace of innovation, Pichai added.

First, no matter what Pichai says, what Google is building when it installs TPU pods in its datacenters to run its own AI workloads, and to let others run theirs as a service using Google Cloud and its AI platform stack, is absolutely a custom supercomputer. In fact, it is the very definition of a custom supercomputer. As you can see from our typos and broken sentences, we certainly have days here at The Next Platform when we need more coffee, but we are running at high speed every day, and we do not have a team of speechwriters or the luxury of a pre-recorded event. Have a little more coffee, Sundar. We will send you a Starbucks card. Have a good swig and then really tell us about the new TPUv4 chip. (In fact, Urs Hölzle, senior vice president of technical infrastructure at Google, has promised us a briefing on TPUv4, and we are officially reminding him of that right here.)

Pichai did not say much about the TPUv4 architecture, but we can infer a few things from what he did say, and we do not even need a TPU ASIC to make the inferences.

This chart literally irked us with its sparseness and strange inaccuracy, unless we can guess what Pichai must have meant. There is something ridiculously oversimplified about it, and given that this is supposed to be the Google I/O 2021 geekfest, we are a little disappointed, as we said. In any case, the chart actually shows TPUv3 with 5 units of performance and TPUv4 with 10 units of performance, which is exactly twice the performance. The label, however, says it is more than twice as fast, which may confuse some people.

If this had been a real technical presentation, Pichai might have said that TPUv4 has twice as many compute elements running at the same clock speed, thanks to a process shrink that allows each TPU socket to hold twice as many compute units, balanced with at least twice the HBM2 memory and at least twice the aggregate bandwidth. But Pichai said none of that.

But we are saying it, and that is what we think Google has done in essence. And frankly, if that is all Google has done to move from TPUv3 to TPUv4, technically it is not that big of a deal. Hopefully there is more to it. At its heart, the TPU is a scalar/vector processor with a 128×128 Bfloat16 matrix math engine and some HBM2 memory.

Some review is perhaps in order, and then we will get to what "more than twice as fast" might mean.

This is a chart that summarizes the previous TPUv2 and TPUv3 units and the server boards that used them.

The basic TPU core is a scalar/vector unit, which is what we call a CPU these days given that Intel, AMD, Power, and Arm processors all have a combination of these elements, with a Bfloat matrix math unit attached to it that Google calls an MXU. There are two cores on a TPU chip. The MXU can process 16,384 Bfloat format floating point operations per clock, and with the TPUv2 core it could drive 23 teraflops of Bfloat operations, or 46 teraflops per chip. We never did know the clock speed, but we guess that it is somewhere north of 1 GHz and south of 2 GHz, just like a GPU. In fact, our estimate for TPUv2 is 1.37 GHz and for TPUv3 it is about 1.84 GHz. We have covered the TPUv2 and TPUv3 architectures in detail elsewhere, along with the intricacies of the very clever Bfloat format. Our earlier wattage estimates for TPUv3 turned out to be too low. We think that TPUv2 was etched in a 20 nanometer process and TPUv3 in a 16 nanometer or perhaps 12 nanometer process. We are guessing that Google has shrunk to 7 nanometers with TPUv4 while staying within the 450 watt thermal envelope per socket that its TPUv3 pods required. We do not think there is much thermal room to raise the clock speed with TPUv4, sorry. It might go as high as 500 watts with more memory.

Anyway, with TPUv3, the process shrink allowed Google to put two MXUs against each scalar/vector unit, doubling the raw performance per core at a given frequency. We think Google was also able to goose the clock speed a little. TPUv3 has two cores per chip, and it doubled the HBM2 memory to 16 GB per core compared to 8 GB per core on the TPUv2 chip.
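The peak throughput arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-envelope model: the clock speeds are this article's estimates rather than Google's published numbers, and the function name `peak_teraflops` is made up for illustration. It multiplies the 16,384 operations per clock per MXU by the clock speed, the MXUs per core, and the cores per chip.

```python
# Operations per clock for one MXU, as quoted in the article (128 x 128 = 16,384).
OPS_PER_CLOCK_PER_MXU = 128 * 128


def peak_teraflops(clock_ghz, mxus_per_core, cores_per_chip):
    """Peak Bfloat teraflops for one chip under the article's assumptions."""
    ops_per_second = (
        OPS_PER_CLOCK_PER_MXU * clock_ghz * 1e9 * mxus_per_core * cores_per_chip
    )
    return ops_per_second / 1e12


# Clock speeds are estimates from the article, not confirmed figures.
tpuv2_chip_tf = peak_teraflops(clock_ghz=1.37, mxus_per_core=1, cores_per_chip=2)
tpuv3_chip_tf = peak_teraflops(clock_ghz=1.84, mxus_per_core=2, cores_per_chip=2)

print(f"TPUv2 chip: {tpuv2_chip_tf:.1f} teraflops")
print(f"TPUv3 chip: {tpuv3_chip_tf:.1f} teraflops")
```

With the estimated clocks, the model lands close to, but slightly under, the 46 teraflops and 123 teraflops per chip quoted in this article, which suggests the clock estimates are approximate.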

So, using our handy dandy ruler and a 2X multiplier, we think it is likely that Google has moved to 7 nanometers and put four cores on the die. It could do this by creating a monolithic TPUv4 die, or it could experiment with chiplets and create an interconnect that links two or four chiplets to each other within a socket. It really depends on how latency sensitive the workloads are within a socket. Because the HBM2 memory hangs off the MXUs, as long as each MXU has its own HBM2 controller, this may not matter that much. So if we were doing this, and we wanted to increase the yield on the TPUv4 die and lower the cost of the chip (while paying some of that back in chiplet packaging), we would take four TPUv3 cores, break them into chiplets, and make a TPUv4 socket. However, Google seems to be sticking with a monolithic design.

Google also pushes the thermals up as high as they can go. TPUv2 weighed in at 280 watts, and TPUv3 cranked that up to 450 watts to deliver 123 teraflops of performance. (This implies a 33.7 percent increase in clock speed moving from TPUv2 to TPUv3, paid for with a 60.7 percent increase in power, from 280 watts to 450 watts.)
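The two percentages in the parenthetical can be reproduced as follows. This sketch assumes the clock speed gain was derived from throughput per MXU (one MXU per core on TPUv2, two on TPUv3, two cores per chip on both); that derivation is our reading of the numbers, not something stated outright. The teraflops per watt figures at the end are derived here for illustration and do not appear in the text.

```python
def pct_increase(old, new):
    """Percentage increase going from old to new."""
    return (new / old - 1.0) * 100.0


# Per-chip peak teraflops and watts quoted in the article.
TPUV2_TF, TPUV2_WATTS, TPUV2_MXUS = 46.0, 280.0, 2 * 1  # 2 cores x 1 MXU
TPUV3_TF, TPUV3_WATTS, TPUV3_MXUS = 123.0, 450.0, 2 * 2  # 2 cores x 2 MXUs

# Throughput per MXU is proportional to clock speed (assumed derivation).
tpuv2_tf_per_mxu = TPUV2_TF / TPUV2_MXUS
tpuv3_tf_per_mxu = TPUV3_TF / TPUV3_MXUS

clock_gain_pct = pct_increase(tpuv2_tf_per_mxu, tpuv3_tf_per_mxu)
power_gain_pct = pct_increase(TPUV2_WATTS, TPUV3_WATTS)

# Derived for illustration only: peak teraflops per watt for each chip.
tpuv2_tf_per_watt = TPUV2_TF / TPUV2_WATTS
tpuv3_tf_per_watt = TPUV3_TF / TPUV3_WATTS

print(f"Implied clock gain: {clock_gain_pct:.1f} percent")  # 33.7 percent
print(f"Power increase:     {power_gain_pct:.1f} percent")  # 60.7 percent
```

Even with the steep rise in power, the doubling of MXUs per core means peak teraflops per watt still improves from TPUv2 to TPUv3 under these figures.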

We think the HBM memory on the TPUv4 device has doubled, but that the HBM2 memory per core could stay the same at 16 GB. That works out to 64 GB per device, which is a fair amount. (Yes, we know that Nvidia can do 80 GB per device.) Google could push this up to 128 GB per device, or 32 GB per core. It really depends on thermals and cost. But what we do know is that Google and other AI researchers want more HBM2 memory available on these devices. We think it is highly unlikely that the clock speed of the TPUv4 device will rise significantly. Who wants a 600 watt part?

Now, let's talk about that "more than twice as fast" comment from above. Last July, Google released some early data comparing the performance of TPUv4 with TPUv3 devices on the MLPerf suite of AI benchmarks. Take a look:

The performance increases moving from a 64-chip (128-core) TPUv3 machine to a 64-chip (and 128-core) TPUv4 machine on various components of the MLPerf machine learning training benchmark ranged from 2.2X to 3.7X, and averaged about 2.7X across these five tests. So that may be the "more than twice as fast" that Pichai is talking about. But that is not what his chart showed. The difference between the 2X increase in peak hardware performance and the 2.7X average increase in MLPerf performance is presumably down to software optimization.
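To put a rough number on the software contribution, one can divide each measured MLPerf speedup by the presumed 2X hardware gain. This is a simplification: it assumes the hardware and software gains multiply cleanly and that every test benefits from the full 2X peak, neither of which Google has confirmed. The dictionary keys are illustrative labels, not MLPerf test names.

```python
# Presumed peak hardware gain from TPUv3 to TPUv4 (an inference, not confirmed).
HARDWARE_PEAK_SPEEDUP = 2.0

# MLPerf training speedups quoted in the article for 64-chip systems.
mlperf_speedup = {"low": 2.2, "average": 2.7, "high": 3.7}

# Residual speedup attributed to software, assuming the gains multiply.
software_factor = {
    label: speedup / HARDWARE_PEAK_SPEEDUP
    for label, speedup in mlperf_speedup.items()
}

for label, factor in software_factor.items():
    print(f"{label:>7}: {factor:.2f}X beyond the hardware peak gain")
```

On that reading, software and system tuning would account for roughly another 1.35X on average, on top of the doubling of raw peak performance.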

The TPU pods are effectively carved up as follows. Here is the TPUv2 pod:

And here is the TPUv3 pod:

The largest TPUv2 image was 512 cores and 4 TB of HBM2 memory, and the largest TPUv3 image was 2,048 cores and 32 TB of memory.
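These aggregate memory figures follow directly from the core counts and the per-core HBM2 capacities given earlier (8 GB per core for TPUv2, 16 GB per core for TPUv3). A minimal check, using binary terabytes (1 TB = 1,024 GB) and a helper name invented for this sketch:

```python
def pod_hbm_terabytes(cores, gb_per_core):
    """Aggregate HBM2 capacity of a pod image in TB (1 TB = 1,024 GB)."""
    return cores * gb_per_core / 1024


tpuv2_pod_tb = pod_hbm_terabytes(cores=512, gb_per_core=8)
tpuv3_pod_tb = pod_hbm_terabytes(cores=2048, gb_per_core=16)

print(f"TPUv2 pod: {tpuv2_pod_tb:.0f} TB")  # 4 TB
print(f"TPUv3 pod: {tpuv3_pod_tb:.0f} TB")  # 32 TB
```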

Now, Pichai said that the TPUv4 pod has 4,096 chips, and assuming he did not mean cores, that means there are 4,096 sockets, each with a monolithic chip. This is consistent with what Pichai said about the TPUv4 pod delivering a little more than 1 exaflops at Bfloat16 precision. (By comparison, the TPUv2 pod could only scale to 256 chips and 11.8 petaflops, and the TPUv3 pod could only scale to 1,024 chips and 125.9 petaflops.) That 1 exaflops figure assumes that the TPUv4 socket runs at about the same clock speed and within the same thermals as the TPUv3 socket, and that Google has quadrupled the number of sockets.
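The pod-level figures can be checked by multiplying chip counts by peak teraflops per chip. In this sketch the TPUv2 and TPUv3 per-chip numbers are the ones quoted in this article, while the TPUv4 number is purely an assumption: twice the TPUv3 figure, in line with the 2X inference above. Google has not published it.

```python
TPUV3_CHIP_TF = 123.0

# (chips per pod, peak teraflops per chip)
# The TPUv4 per-chip figure is an assumption: 2X TPUv3, not a published number.
PODS = {
    "TPUv2": (256, 46.0),
    "TPUv3": (1024, TPUV3_CHIP_TF),
    "TPUv4": (4096, 2 * TPUV3_CHIP_TF),
}


def pod_petaflops(chips, teraflops_per_chip):
    """Peak Bfloat petaflops for a full pod."""
    return chips * teraflops_per_chip / 1000.0


pod_pf = {name: pod_petaflops(*spec) for name, spec in PODS.items()}

for name, petaflops in pod_pf.items():
    print(f"{name} pod: {petaflops:,.3f} petaflops")
```

Under these assumptions the TPUv4 pod comes out at just over 1,000 petaflops, which squares with the "a little over 1 exaflops" claim. The TPUv3 result is 125.952 petaflops, so the 125.9 quoted above looks truncated rather than rounded.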

We also believe that a TPUv4 instance will be able to scale across all 4,096 chips and sockets in a single system image, with at least 64 TB of aggregate HBM2 memory. And, thanks to software improvements, more of that peak will actually drive workloads. How much more, we will see when Google actually tells us more.

Finally, Pichai also said that the TPUv4 pods have 10X the interconnect bandwidth per chip at scale compared to other networking technologies. Looking at the TPUv4 server card compared to the TPUv3 card in the figure above, it looks like each TPUv4 socket has its own network interface, whereas the TPUv3 card had four sockets sharing two interconnects. (Or so it appears. We are not sure this is correct, and those could be two-port router chips.) We look forward to learning more about the TPUv4 interconnect.

