March 26, 2024

Google’s Gemma LLM Explained: Revolutionizing Large Language Model Benchmarks

Author:

Edwin Lisowski

CSO & Co-Founder

Reading time:

15 minutes

The large language model (LLM) market is poised for significant growth, with experts predicting that it may reach $259,817.73 million by 2030, up from $1,590.93 million in 2023. [1] This comes as no surprise considering how widely LLMs are used across industries, owing to their ability to handle complex linguistic tasks.

Google, in its efforts to keep up with the ever-growing competition in the field, has released yet another open-access large language model called the Gemma model. Based on the Gemini models, Gemma shows significant potential to revolutionize LLM utilization across various sectors and alleviate some of the challenges associated with previous models, particularly around performance.

This guide will dive into the inner workings of the Gemma models, evaluating everything from their architecture to how they’re trained and how to use them.

What is Gemma?

Gemma is a recent lineup of LLMs built on the research and technology behind the Gemini models. It comes as a family of lightweight large language models, much smaller than the Gemini models, trained on up to 6 trillion tokens of text.

The Gemma large language model family comprises two sizes, each suited to specific applications and devices. The lighter version, a 2-billion parameter model, is designed for CPU and on-device applications, while the larger, 7-billion parameter model can be efficiently deployed on GPU and TPU.

One of the greatest selling points of the Gemma family is its strong generalist capabilities and state-of-the-art performance in understanding and reasoning tasks at scale. When pitted against similar-sized models, Gemma exhibits better performance in various natural language processing tasks, including commonsense reasoning, question answering, science, mathematics, and coding.

During its development, Google leveraged some of the most recent advancements in transformers, sequence models, large-scale training, and deep learning at scale to create a model capable of outperforming most models on the market.

According to Google, Gemma models are meant to provide equitable access to LLM technology, improve the safety of frontier models, pave the way for the rigorous evaluation and analysis of current technologies, and promote the development of future technologies.

What is Gemma’s model architecture?

The Gemma models use a decoder-only variant of the transformer architecture, which was introduced in 2017. Both the 7-billion and 2-billion parameter models have a context length of 8192 tokens and a vocabulary size of 256,000 tokens.

It also includes recent advancements in transformer model architectures, including:

RoPE Embeddings

Most transformer-based models use Absolute Position Encodings. [2] Unfortunately, this presents a few limitations including limited sequence length and independence of positional embeddings – factors that could negatively impact the performance of LLMs.

In a bid to overcome these limitations, Google chose to employ rotary positional embeddings in each layer, which provide several benefits, including the flexibility to expand to any sequence length.

This significantly contributes to the model’s performance in natural language processing tasks, particularly around text generation, summarization, question answering, and machine translation.
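To illustrate the idea behind rotary embeddings, here is a minimal pure-Python sketch. It is not Gemma's implementation: the function names, the base of 10000, and the toy vectors are assumptions chosen for illustration. Each pair of vector components is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on how far apart the two positions are.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.3, -0.7, 1.0]

# The same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(query, 5), rope(key, 2))
score_far = dot(rope(query, 105), rope(key, 102))
print(abs(score_near - score_far) < 1e-9)
```

Because the attention score depends on the relative offset rather than the absolute position, the same rotation rule can be applied at any position in the sequence.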

Multi-Query Attention

The 2-billion parameter model uses multi-query attention, a variant of attention in which all query heads share a single key head and a single value head. [3]

This shrinks the key/value cache and significantly increases the speed of generating tokens in the decoder, with little impact on model quality.

In contrast, the 7-billion parameter model uses multi-head attention. Multi-head attention runs the attention mechanism several times in parallel, then concatenates the outputs and linearly transforms them into the expected dimensions. [4]

Ultimately, this enables the model to handle various input sequence segments in various ways resulting in better performance.
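One practical difference between the two mechanisms is the size of the key/value cache kept during generation. Multi-query attention stores one key head and one value head per layer, while multi-head attention stores one per query head. The sketch below is simple arithmetic; the layer count, head count, and head size are hypothetical values chosen for illustration, not Gemma's actual configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
layers, query_heads, head_dim, seq_len = 16, 8, 128, 8192

multi_head = kv_cache_bytes(layers, num_kv_heads=query_heads,
                            head_dim=head_dim, seq_len=seq_len)
multi_query = kv_cache_bytes(layers, num_kv_heads=1,
                             head_dim=head_dim, seq_len=seq_len)

# The ratio equals the number of query heads.
print(multi_head // multi_query)
```

A smaller cache means less memory traffic per generated token, which is where the decoding speed-up comes from.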

GeGLU Activations

Unlike most models, which use the regular ReLU activation function, the Gemma models utilize the GeGLU activation function, resulting in better performance.

In a nutshell, the GeGLU activation function marries the capabilities of GELU and GLU activations, resulting in a unique mechanism for controlling the flow of information through a network. [5]

When employed in a large language model, GeGLU activations can help neural networks learn more complex patterns, resulting in better performance in natural language processing tasks.
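As a rough sketch of how GeGLU gates information, the pure-Python example below applies GELU to one linear projection of the input and multiplies it elementwise with a second linear projection. The weight matrices and the tanh approximation of GELU are assumptions made for illustration; this is not taken from the Gemma code base.

```python
import math


def gelu(x):
    """Tanh approximation of the GELU activation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def geglu(x, w_gate, w_value):
    """GeGLU(x) = GELU(W_gate x) * (W_value x), multiplied elementwise."""
    gate = [gelu(g) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


# Toy input and hypothetical weights (3 hidden units, 2 input features).
x = [1.0, -2.0]
w_gate = [[0.5, 0.0], [0.0, 0.0], [1.0, 1.0]]
w_value = [[1.0, 1.0], [2.0, 0.0], [0.0, -1.0]]

hidden = geglu(x, w_gate, w_value)
print(hidden)
```

The second hidden unit has a gate input of zero, so its value is blocked entirely, which shows how the gate controls what flows through the layer.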

Normalizer Location

The standard practice in earlier language models has been to normalize either the input or the output of each transformer sub-layer, but not both.

Gemma, on the other hand, uses RMSNorm to normalize both the input and the output of each transformer sub-layer, which helps stabilize training and improve model convergence. [6]
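For readers unfamiliar with RMSNorm, here is a minimal sketch of the generic technique in pure Python. The function name, the epsilon value, and the toy activations are assumptions for illustration; Gemma's own implementation may differ in details such as how the learned scale is parameterized.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` so its root mean square is about 1, then apply a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]


activations = [3.0, -4.0, 0.0, 5.0]
normalized = rms_norm(activations, weight=[1.0] * len(activations))

# Root mean square of the normalized vector is close to 1.
output_rms = math.sqrt(sum(v * v for v in normalized) / len(normalized))
print(round(output_rms, 4))
```

Unlike layer normalization, RMSNorm does not subtract the mean, which makes it slightly cheaper to compute.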

What training datasets were employed in Gemma?

The Gemma models were trained on a massive text dataset, encompassing up to 6 trillion tokens. The datasets comprised several key components, including:

Web Documents
Google used a massive array of web text sources to expose the model to a rich spectrum of vocabulary, linguistic styles, and topics.
Mathematics
Incorporating mathematical text in the models’ training data helps the Gemma LLM acquire skills related to symbolic representation, logical reasoning, and handling mathematical queries, resulting in a well-rounded model capable of performing diverse tasks.
Code
Gemma's training data included a variety of code samples so that the models could grasp programming language syntax and patterns. By doing so, Google developers significantly enhanced the models' capability to comprehend code-related queries and generate code.

How was Gemma trained?

The 2-billion and 7-billion parameter models were trained on 2 trillion and 6 trillion tokens respectively, consisting primarily of English data from web documents, code, and mathematics. During training, Google developers took a different approach from the Gemini models, which are optimized for multilingual tasks and contain multimodal elements.

Instead, the developers focused on training the models to perform English tasks. Before exposing the model to training data, the data was first filtered to remove any unwanted or unsafe content, including sensitive data and personal information. The filtering process primarily involved model-based classifiers and heuristic methods to ensure the safety and quality of the training dataset.

Both models also underwent reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT) to further enhance their performance. In both instances, the supervised fine-tuning involved a mix of English-only, text-only, synthetic, and human-generated prompt-response pairs.

To further enhance model performance, developers and data scientists at Google carefully selected the data mixtures for fine-tuning based on LM-based side-by-side evaluations. In this method, different prompt sets are designed to highlight specific model capabilities including factuality, instruction following, safety, and creativity.

The synthetic data used in the training process also underwent several stages of filtering to remove any examples containing toxic inputs and personal information, much like what happened with the Gemini models. Ultimately, this resulted in improved model performance without negatively impacting safety.

For the reinforcement learning stage, developers collected preference pairs from human raters and trained a reward function on them under the Bradley-Terry model. They then optimized the policy against this reward function using a variant of REINFORCE, taking steps to mitigate potential issues like reward hacking while improving the models’ performance.
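The Bradley-Terry model turns two reward scores into the probability that one response is preferred over the other. The sketch below shows that formula and the corresponding training loss in pure Python. The function names and example rewards are hypothetical, and this is a textbook formulation rather than Google's training code.

```python
import math


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human preference under the model."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Equal rewards mean the model has no preference.
print(preference_probability(1.0, 1.0))

# The loss is small when the reward model agrees with the rater, large otherwise.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
print(agrees < disagrees)
```

Minimizing this loss over many rated pairs pushes the reward function to score preferred responses higher than rejected ones.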

Gemma’s benchmarks and performance metrics

When compared to Mistral, Gemma outperforms it on five out of six benchmarks, with the only exception being the HellaSwag benchmark, where both models get similar results. The lead is most visible in tasks like TruthfulQA and ARC-c, where Gemma surpasses Mistral by 2.5% and 2% respectively.

On MMLU, which is scored by multiple-choice accuracy rather than perplexity, Gemma also achieves the higher score, indicating strong knowledge and language understanding.

Considering these results, Gemma is a capable language model that handles a range of natural language processing tasks efficiently and accurately.

Getting started with Gemma

Gemma LLM has a pretty straightforward interface, but you still need some technical knowledge to use it correctly. There are several ways you can access and utilize both models, including Kaggle, Vertex AI, and Hugging Face.

We chose to go with Kaggle in this tutorial because it offers free access and does not require a cloud account.

To use Kaggle, you’ll first need to register at kaggle.com, open the Gemma model card, and request access. You’ll also have to accept the terms and conditions stipulated on the website.

Once you have your Kaggle account all set up, you can proceed with the following steps in the installation process.

Install Dependencies

pip install --upgrade keras-nlp
pip install --upgrade keras

To effectively run Gemma LLM on Kaggle, you first need to install KerasNLP. However, you may need to reinstall Keras 3 since TensorFlow is currently pinned to Keras 2. The reasoning behind this crucial step is that KerasNLP currently relies on TensorFlow-text. TensorFlow-text automatically installs TensorFlow 2.15, which, unfortunately, overwrites your Keras installation with Keras 2.15. [7] However, you won’t need to do this once TensorFlow 2.16 is released.

Import Packages

import os

Choose a Backend

os.environ["KERAS_BACKEND"] = "torch"  # Or "tensorflow" or "jax".

# Import Keras only after the backend is set; it cannot be changed afterwards.
import keras
import keras_nlp

Keras 3 gives you the flexibility to select your preferred backend. In this regard, you have three choices: JAX, TensorFlow, or PyTorch. Additionally, when utilizing Keras 3, you can instantiate a model as a PyTorch module, export it as a TensorFlow SavedModel, or even instantiate it as a stateless JAX function.

This versatility allows you to maintain a single implementation of your components and use it across all frameworks with consistent numerical results.

Create the Model

KerasNLP features various implementations of popular model architectures. For Gemma, you’ll find components like the GemmaPreprocessor layer, GemmaCausalLM model, GemmaTokenizer, GemmaBackbone model, and GemmaCausalLMPreprocessor layer.

The GemmaBackbone component serves as the foundation for the Gemma models. It encompasses both transformer layers and embedding lookups. Rather than generating predictions across the entire vocabulary space, it produces final hidden states for each token.

Because working with the backbone directly is more involved, we chose to use GemmaCausalLM for this tutorial. GemmaCausalLM is an end-to-end Gemma model that’s specifically tailored for causal language modeling, which involves predicting the subsequent token based on previous tokens.

To instantiate the GemmaCausalLM model, you’ll first need to choose a preset architecture along with any weights associated with it.

GemmaCausalLM.from_preset()

The arguments required for this function include:
preset: this is a string specifying one of the ‘gemma_2b_en’, ‘gemma_7b_en’, ‘gemma_instruct_7b_en’, or ‘gemma_instruct_2b_en’ presets.

load_weights: this parameter determines whether pre-trained weights should be loaded into the model.

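gemma_model = keras_nlp.models.GemmaCausalLM.from_preset(
    "gemma_2b_en",
    # False builds the architecture with randomly initialized weights;
    # set it to True to load the pre-trained weights.
    load_weights=False
)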

You can also get more information on the model using a summary.

gemma_model.summary()

Generating Text

As a large language model, Gemma provides an easy way to generate various types of text through the generate method. You can also specify the maximum length of the generated sequence of text using the max_length argument.

Here’s how it works:

The generate method produces text based on the inputs you give it, using the sampling strategy you set through the compile() method. If you provide your inputs as a tf.data.Dataset, the model generates its output batch by batch and then concatenates the results. Otherwise, it treats all inputs as a single batch.

If you attach a preprocessor to the Gemma model, it preprocesses the input within the generate() function. In that case, the inputs should match the structure expected by the preprocessor layer, which is typically raw strings. However, if you don’t attach a preprocessor, all inputs should align with the structure expected by the backbone.

prompt = "i am listening to"

# Choose the sampling strategy before generating.
gemma_model.compile(sampler="top_k")

gemma_model.generate(prompt, max_length=10)
# Output (gibberish, because load_weights=False leaves the weights random):
# 'i am listening toांत Painting poses perbaikan poses'

# Change the sampler to beam search with two beams.
gemma_model.compile(sampler=keras_nlp.samplers.BeamSampler(num_beams=2))

gemma_model.generate(prompt, max_length=10)
# Output:
# 'i am listening to leds leds RSVP RSVP RSVP'

The model also provides the flexibility to recompile it with various keras_nlp.samplers objects to control how text is generated. By default, Gemma utilizes ‘greedy’ sampling, but you can also experiment with other samplers, including beam, random, top_k, and top_p, or a custom sampler.
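To make the difference between samplers concrete, here is a small pure-Python sketch of greedy and top-k selection over a list of logits. It is a simplified illustration, not the KerasNLP implementation, and the function names and toy logits are assumptions.

```python
import math
import random


def greedy_pick(logits):
    """Always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_pick(logits, k, rng):
    """Sample one token index from the k highest-scoring tokens."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Softmax weights restricted to the top-k candidates.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)

greedy_token = greedy_pick(logits)
sampled_tokens = [top_k_pick(logits, k=2, rng=rng) for _ in range(20)]
print(greedy_token, sorted(set(sampled_tokens)))
```

Greedy decoding is deterministic, while top-k trades some of that determinism for variety by sampling among the most likely candidates.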

You can also assess the quality of a generated sequence from its token IDs using the score method. To keep the example easy to follow, we will artificially create the target IDs as a TensorFlow tensor.

GemmaCausalLM.score(
    token_ids,
    padding_mask=None,
    scoring_mode="logits",
    layer_intercept_fn=None,
    target_ids=None,
)

The arguments required for this function include:

token_ids: this is a tensor with the shape [batch_size, num_tokens] that contains the tokens to be scored. This tensor generally captures the output from a call to GemmaCausalLM.generate() and contains tokens for both the input text and the generated text.

padding_mask: this is a tensor with the shape [batch_size, num_tokens] that indicates which tokens should be preserved during text generation. However, it primarily serves as an artifact required by the GemmaBackbone and hence does not significantly impact the computation of this function. If you don’t include it, the function will generate a tensor of appropriate shape using keras.ops.ones().

scoring_mode: this argument specifies the type of scores to return. It can be either ‘logits’ or ‘loss’, and both are computed per input token.

layer_intercept_fn: this is an optional function designed to augment activations with additional computation. As such, it is typically used for interpretability research. The function receives the activations as its first parameter and a numeric index associated with the specific backbone layer as its second.

target_ids: this is a tensor with the shape [batch_size, num_tokens]. It contains the predicted tokens that the model computes the loss against. If you provide a span of tokens, the model will compute the loss as an aggregate of the tokens.

import tensorflow as tf

generations = gemma_model.generate(
    ["what is", "Where are you"],
    max_length=10
)
# Convert the generated strings back into token IDs and a padding mask.
preprocessed = gemma_model.preprocessor.generate_preprocess(generations)
generation_ids = preprocessed["token_ids"]
padding_mask = preprocessed["padding_mask"]
# Random int32 target IDs with the same [batch_size, num_tokens] shape as the token IDs.
target_ids = tf.random.uniform(
    shape=tuple(generation_ids.shape), dtype=tf.int32, minval=0, maxval=8192
)
losses = gemma_model.score(
    token_ids=generation_ids,
    padding_mask=padding_mask,
    scoring_mode="loss",
    target_ids=target_ids,
)

On the downside, to get an output with these arguments, you may need to upgrade to Google Cloud AI Platform Notebooks. The example above is only meant to give you a sense of how you can calculate the losses.
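To show what a per-token loss means, the sketch below computes a cross-entropy loss for each token position from raw logits in pure Python. It assumes the ‘loss’ scoring mode corresponds to a standard cross-entropy per token; the function name and toy values are illustrative and do not come from KerasNLP.

```python
import math


def token_losses(logits, target_ids):
    """Cross-entropy loss for each position, given one row of logits per token."""
    losses = []
    for row, target in zip(logits, target_ids):
        peak = max(row)
        # Log of the softmax denominator, computed in a numerically stable way.
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_norm - row[target])
    return losses


# Two token positions over a toy vocabulary of four tokens.
logits = [
    [0.0, 0.0, 0.0, 0.0],  # the model is undecided
    [0.0, 5.0, 0.0, 0.0],  # the model strongly favors token 1
]
losses = token_losses(logits, target_ids=[2, 1])
print(losses)
```

The undecided position has a loss of log(4), while the confident and correct prediction has a loss close to zero.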

Using Gemma with Transformers

To do this, you’ll first need to log into the Hugging Face platform and request access to Gemma. Once you’re all set up, use the following code:

!pip install -U "transformers==4.38.1"
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!huggingface-cli login --token <Huggingface_token>
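Once the packages are installed and you are logged in, loading the model and generating text might look roughly like the sketch below. The model ID "google/gemma-2b" and the helper name generate_with_gemma are assumptions for illustration; check the model card for the exact identifier you were granted access to.

```python
# Assumed Hub model ID; access is gated and requires accepting the license.
MODEL_ID = "google/gemma-2b"


def generate_with_gemma(prompt, max_new_tokens=20):
    """Load Gemma through Transformers and continue the given prompt."""
    # Imported here so the function can be defined before the libraries are installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" relies on the accelerate package installed above.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# print(generate_with_gemma("I am listening to"))
```

The call is left commented out because it downloads several gigabytes of weights the first time it runs.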

Fine-Tuning Gemma 7B with QLoRA and Unsloth

Before you can delve into the fine-tuning process, you must first allow the Notebook to install the Unsloth library and import the necessary modules.

!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git" -q

Once the library is installed, you instantiate an Unsloth component called FastLanguageModel. The component is loaded with specific configurations such as data type, maximum sequence length, and 4-bit loading.

PEFT (Parameter-Efficient Fine-Tuning)

PEFT strategies fine-tune only a small number of additional parameters while leaving the majority of the pre-trained LLM parameters frozen. Ultimately, this significantly reduces storage and computational costs and addresses the catastrophic forgetting commonly observed when fine-tuning LLMs.

The FastLanguageModel object provides a get_peft_model method that allows you to configure various parameters for fine-tuning, including target modules, the LoRA rank, LoRA alpha, dropout rate, and more.
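To see why this is cheaper, consider how LoRA, the adapter method behind QLoRA, changes the number of trainable parameters for a single weight matrix. The sketch below is plain arithmetic; the matrix size and rank are hypothetical values picked for illustration.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and LoRA rank.
d_in = d_out = 3072
rank = 16

frozen = full_params(d_in, d_out)
trainable = lora_params(d_in, d_out, rank)
fraction = trainable / frozen
print(frozen, trainable, round(fraction, 4))
```

In this example only about 1% of the parameters in the layer are trained, while the original weights stay frozen.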

Data Preparation and Formatting

To effectively fine-tune the model, you need to give it access to an unseen dataset with a prompt template. For instance, you can use the databricks/databricks-dolly-15k [8] text generation dataset.

However, to make the data accessible by the model, you have to map all data points into a standardized format. This typically involves ensuring a consistent prompt structure with labeled inputs and outputs. For instance, you can categorize them as context, instruction, and response.
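A minimal sketch of such a mapping is shown below. The template text and the function name format_example are assumptions for illustration; the field names instruction, context, and response follow the columns of the Dolly dataset.

```python
# Hypothetical prompt template; adjust the wording to suit your use case.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n{response}"
)


def format_example(example):
    """Map one dataset record to a single training string."""
    text = PROMPT_TEMPLATE.format(
        instruction=example["instruction"].strip(),
        # Many records have no context, so fall back to a placeholder.
        context=example.get("context", "").strip() or "N/A",
        response=example["response"].strip(),
    )
    return {"text": text}


record = {
    "instruction": "Name the two Gemma model sizes. ",
    "context": "",
    "response": "2 billion and 7 billion parameters.",
}
formatted = format_example(record)
print(formatted["text"])
```

With the Hugging Face datasets library, a function like this would typically be applied to every record through the dataset's map method.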

Fine-tuning the LLM

The final step in the fine-tuning process is initializing a supervised fine-tuning trainer to aid in the fine-tuning process. The trainer initializes the model along with the training dataset, a tokenizer, and all the required arguments such as learning rate, weight decay, learning steps, and optimization.

Final Thoughts

Gemma’s open-access nature, along with its remarkable performance, has generated considerable excitement within the LLM community. Consequently, it may lead to further advancements in the LLM landscape and provide an optimistic outlook for the future of the technology.

Additionally, by offering open access and flexibility, Gemma will provide advanced AI capabilities to a wider audience, fueling innovation and collaboration, which will ultimately advance progress in natural language processing and shape the future of AI.

References

[1] Linkedin.com. Large Language Model (LLM) Market Growth Analysis, Market Dynamics, Key Players and Innovations, Outlook and Forecast 2024-2030. URL: https://www.linkedin.com/pulse/large-language-model-llm-market-growth-analysis-wfjef. Accessed on March 16, 2024
[2] Paperswithcode.com. Absolute Position Encodings. URL: https://paperswithcode.com/method/absolute-position-encodings. Accessed on March 16, 2024
[3] Medium.com. Understanding Attention and Multi Query Attention. URL: https://medium.com/@qinliu.cn/understanding-attention-and-multi-query-attention-7b931fd10e53. Accessed on March 16, 2024
[4] Paperswithcode.com. Multi-Head Attention. URL: https://paperswithcode.com/method/multi-head-attention. Accessed on March 16, 2024
[5] Linkedin.com. Unlocking the Power of GeGLU: Advanced Activation Functions in Deep Learning. URL: https://tiny.pl/dtn78. Accessed on March 17, 2024
[6] Github.com. URL: https://github.com/bzhangGo/rmsnorm. Accessed on March 17, 2024
[7] Keras.io. Getting started with Keras. URL: https://keras.io/getting_started. Accessed on March 17, 2024
[8] Huggingface.co. URL: https://huggingface.co/datasets/databricks/databricks-dolly-15k. Accessed on March 17, 2024