



NVIDIA is working with Google as a launch partner to deliver Gemma, a new, optimized family of open models built from the same research and technology used to create the Gemini models. An optimized release with TensorRT-LLM lets users run Gemma on desktops with NVIDIA RTX GPUs.

Created by Google DeepMind, Gemma 2B and Gemma 7B, the first models in the series, deliver high throughput and state-of-the-art performance. Gemma is accelerated by TensorRT-LLM, an open source library for optimizing inference performance, and is compatible across NVIDIA AI platforms, from the data center and cloud to your local PC.

Previously, optimizing and deploying LLMs was very complex. TensorRT-LLM's simplified Python API makes quantization and kernel compression easy. Python developers can customize model parameters, reduce memory footprint, increase throughput, and improve inference latency for the most popular LLMs. The Gemma models use a vocabulary size of 256K and support context lengths up to 8K.

Safety is built into the Gemma models through extensive data curation and safety-focused training methodologies. Personally identifiable information (PII) filtering removes identifiers (such as social security numbers) from the pre-training and instruction-tuning datasets. Furthermore, extensive fine-tuning and reinforcement learning from human feedback (RLHF) align the instruction-tuned models with responsible behavior.

Because Gemma was trained on over 6 trillion tokens, developers can confidently build and deploy high-performing, reliable, and advanced AI applications.

TensorRT-LLM accelerates Gemma models

TensorRT-LLM includes numerous optimizations and kernels that improve inference throughput and latency. Three features specific to TensorRT-LLM that improve Gemma's performance are FP8, XQA, and INT4 Activation-aware Weight quantization (INT4 AWQ).

FP8 is a natural progression for accelerating deep learning applications beyond the 16-bit formats common in modern processors. FP8 increases throughput for matrix multiplication and memory transfers without sacrificing accuracy. This benefits both small batch sizes, where models are limited by memory bandwidth, and large batch sizes, where compute density and memory capacity are crucial.

TensorRT-LLM also provides FP8 quantization for the KV cache. The KV cache differs from regular activations because it occupies non-negligible persistent memory with large batch sizes or long context lengths. Switching to an FP8 KV cache lets you run 2-3x larger batch sizes with improved performance.
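To make the KV cache point concrete, here is a minimal back-of-the-envelope sketch of how cache size scales with element width. The model dimensions, sequence length, and memory budget below are illustrative assumptions, not figures from this post, and the helper names are hypothetical. The sketch only captures the storage effect: halving the bytes per element roughly doubles the number of sequences that fit in a fixed budget. It does not model any other factor behind the batch-size range quoted above.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to hold keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem


def max_batch(budget_bytes, seq_len, layers, kv_heads, head_dim, bytes_per_elem):
    """Largest number of full-length sequences whose cache fits in the budget."""
    per_sequence = kv_cache_bytes(1, seq_len, layers, kv_heads, head_dim, bytes_per_elem)
    return budget_bytes // per_sequence


# Assumed, illustrative values (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 28, 16, 256
SEQ_LEN = 8192                 # an 8K context
BUDGET = 40 * 1024**3          # 40 GiB reserved for the KV cache

fp16_batch = max_batch(BUDGET, SEQ_LEN, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_batch = max_batch(BUDGET, SEQ_LEN, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(f"FP16 KV cache: up to {fp16_batch} sequences")
print(f"FP8 KV cache:  up to {fp8_batch} sequences")
```

With these assumed numbers, each full-length sequence needs 3.5 GiB of cache at 16 bits and 1.75 GiB at 8 bits, so the same budget holds twice as many sequences.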

XQA is a kernel that supports both grouped-query attention and multi-query attention. A new kernel developed by NVIDIA AI, XQA provides optimizations during the generation phase and also optimizes beam search. On NVIDIA GPUs, it reduces data loading and conversion times, increasing throughput within the same latency budget.

INT4 AWQ is also supported by TensorRT-LLM. AWQ provides good performance for workloads with small (<= 4) batch sizes. It reduces the memory footprint of the network, significantly improving performance for applications limited by memory bandwidth. AWQ is a low-bit, weight-only quantization method that reduces quantization error by using activations to protect salient weights.

Combining the strengths of INT4 and AWQ, the TensorRT-LLM custom kernel for INT4 AWQ compresses LLM weights down to 4 bits based on relative importance and performs computations in FP16. This provides greater precision than other 4-bit methods, while reducing the memory footprint and providing significant speedups.
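The idea behind activation-aware 4-bit weight quantization can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the TensorRT-LLM kernel: the function names are hypothetical, the per-channel scale uses a fixed exponent `alpha` rather than a searched value, a whole weight row is treated as one quantization group, and Python floats stand in for FP16 arithmetic.

```python
def quantize_group(weights, bits=4):
    """Asymmetric uniform quantization of one weight group to `bits` bits."""
    levels = 2 ** bits - 1                 # 15 steps for INT4
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def awq_style_dot(weight_row, x, act_magnitude, alpha=0.5):
    """Dot product of x with a weight row stored as activation-scaled INT4.

    act_magnitude holds a positive, typical activation size per input
    channel (an assumed calibration statistic). Channels that see large
    activations get their weights scaled up before quantization, and the
    activations are scaled down by the same factor at compute time.
    """
    s = [m ** alpha for m in act_magnitude]
    scaled_weights = [w * sj for w, sj in zip(weight_row, s)]
    codes, scale, lo = quantize_group(scaled_weights)
    w_hat = dequantize_group(codes, scale, lo)
    # Weights live as 4-bit codes; the arithmetic itself stays in floating point.
    return sum(wq * (xj / sj) for wq, xj, sj in zip(w_hat, x, s))


weight_row = [0.12, -0.40, 0.05, 0.33]
x = [0.1, 8.0, -0.2, 0.3]
act_magnitude = [0.1, 8.0, 0.2, 0.3]       # hypothetical calibration statistics

exact = sum(w * xj for w, xj in zip(weight_row, x))
approx = awq_style_dot(weight_row, x, act_magnitude)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the scale applied to the weights is divided back out of the activations, the product is mathematically unchanged before rounding; only the placement of the quantization grid relative to each channel differs.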

Real-time performance of over 79,000 tokens per second

Powered by NVIDIA H200 Tensor Core GPUs, TensorRT-LLM delivers superior performance on both Gemma 2B and Gemma 7B models. A single H200 GPU delivers over 79,000 tokens per second on the Gemma 2B model and approximately 19,000 tokens per second on the larger Gemma 7B model.

To put this performance into context, a Gemma 2B model with TensorRT-LLM deployed on one H200 GPU can serve over 3,000 concurrent users, all with real-time latency.
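As a rough sanity check on those two figures, the sketch below divides the aggregate throughput by the number of concurrent users. It assumes the throughput is shared evenly across users, which the post does not state, and both inputs are quoted as "over" values, so the result is only an order-of-magnitude indication.

```python
# Figures quoted above for Gemma 2B on a single H200 GPU.
aggregate_tokens_per_s = 79_000
concurrent_users = 3_000

# Assumption: throughput is divided evenly among all users.
per_user_tokens_per_s = aggregate_tokens_per_s / concurrent_users
print(f"{per_user_tokens_per_s:.1f} tokens/s per user")
```

Under that assumption, each user would receive on the order of 26 tokens per second.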

Get started now

Experience Gemma directly from your browser with NVIDIA AI Playground. Coming soon, you'll also be able to experience Gemma in the NVIDIA Chat with RTX demo app.

Figure 1. NVIDIA-optimized Gemma small language model support

Several Gemma-2B and Gemma-7B model checkpoints optimized for TensorRT-LLM, including pre-trained and instruction-tuned versions, are now available on NGC for NVIDIA GPUs, including consumer-grade RTX systems. You can run the optimized model with

Coming soon, you'll be able to experience FP8 quantized versions of TensorRT-LLM-optimized models in Hugging Face's Optimum-NVIDIA library. Integrate fast LLM inference with just one line of code.

Developers can use the NVIDIA NeMo framework to customize Gemma and deploy it into production. The NeMo framework includes common customization techniques such as supervised fine-tuning, parameter-efficient fine-tuning with LoRA, and RLHF. It also supports 3D parallelism for training. Check out the notebook and start coding with Gemma and NeMo.

Start customizing Gemma using the NeMo framework today.

