April 24, 2024

Introducing DBRX: Exploring Databricks Open-Source LLM

Author: Edwin Lisowski, CSO & Co-Founder

Reading time: 11 minutes


The past few years have seen a surge in the number of open LLMs released by top tech companies. This new era of open LLMs is poised to transform how AI is used across sectors. By democratizing access to advanced AI tools, it enables more people to build innovative AI applications, paving the way for further developments.

Databricks, better known for its data processing, management, and governance tools, has recently released DBRX, a state-of-the-art language model that outperforms most established models on various benchmarks.


Built on a fine-grained MoE architecture, DBRX is highly efficient in both training and inference. Its inference is up to two times faster than LLaMA2-70B [1], and it is about 60% smaller than Grok-1 in terms of both total and active parameter counts.

Overall, DBRX introduces a unique blend of technologies and superior performance, setting a new standard for open-source LLMs. This guide will explore DBRX in its entirety, evaluating everything from how it was built to its features and how it compares to other leading open-source models.

What is DBRX?

DBRX is an open-source LLM developed by teams from MosaicML and Databricks. It uses a transformer-based, decoder-only architecture trained with next-token prediction. DBRX employs a fine-grained mixture-of-experts (MoE) design with 132 billion total parameters, of which 36 billion are active on any given input.
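To make the total-versus-active distinction concrete, here is a minimal top-k routing sketch in plain NumPy. It is an illustrative toy under simple assumptions, not DBRX’s actual implementation: the router scores every expert for a token, but only the k highest-scoring experts (and therefore only a fraction of the layer’s parameters) are used to process it.

```python
import numpy as np

def moe_forward(x, expert_weights, router_weights, k=4):
    """Toy top-k MoE layer: route one token to k of n experts.

    x              : (d,) token representation
    expert_weights : list of n (d, d) matrices, one per expert
    router_weights : (d, n) router projection
    Only the k selected experts' parameters are "active" for this token.
    """
    logits = x @ router_weights                  # one score per expert
    top_k = np.argsort(logits)[-k:]              # indices of the k best experts
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                         # softmax over the selected experts
    # Weighted sum of the selected experts' outputs
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top_k))

# Toy usage mirroring DBRX's router configuration: 16 experts, 4 active per token
d, n = 8, 16
x = np.random.randn(d)
experts = [np.random.randn(d, d) for _ in range(n)]
router = np.random.randn(d, n)
y = moe_forward(x, experts, router, k=4)
```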

Read more: Best practices for Databricks PoC (Proof of Concept)

The fine-grained nature of DBRX enables it to use a larger number of smaller experts, which improves model quality. Compared to other open MoE models such as Grok-1 and Mixtral, DBRX can form up to 65 times more combinations of experts: Grok-1 and Mixtral have eight experts and activate two per token, while DBRX has 16 experts and activates four.
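The “65 times more combinations” figure follows from simple combinatorics: choosing 4 of 16 experts allows far more possible expert subsets per token than choosing 2 of 8. A quick check:

```python
from math import comb

dbrx = comb(16, 4)     # 1820 possible expert subsets per token
mixtral = comb(8, 2)   # 28 possible expert subsets per token
print(dbrx / mixtral)  # 65.0
```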

Additionally, DBRX uses gated linear units (GLU), which help the model emphasize the features most relevant for predicting the next token, resulting in more coherent and accurate outputs. [2] DBRX also uses grouped query attention (GQA) and rotary position encodings (RoPE), which reduce computation requirements and improve handling of varying sequence lengths, respectively. [3][4]
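As a rough illustration of the gating idea (a minimal sketch, not DBRX’s exact feed-forward block), a gated linear unit multiplies a linear projection of the input by a learned sigmoid gate, letting the network suppress features that are irrelevant to the next-token prediction:

```python
import numpy as np

def glu(x, W, V, b, c):
    """Gated linear unit: GLU(x) = (xW + b) * sigmoid(xV + c).

    The sigmoid gate decides, per feature, how much of the linear
    projection is allowed to pass through.
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ V + c)))
    return (x @ W + b) * gate

# Toy usage with random weights
d_in, d_out = 8, 16
x = np.random.randn(d_in)
W, V = np.random.randn(d_in, d_out), np.random.randn(d_in, d_out)
b, c = np.zeros(d_out), np.zeros(d_out)
out = glu(x, W, V, b, c)   # shape (16,)
```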

DBRX was pre-trained on 12 trillion tokens of text and code data, with a maximum context length of 32,000 tokens. According to its developers, the data used to train DBRX is two times better, token for token, than the data used to pre-train the MPT family of models.

The training dataset employed in DBRX was created using a full suite of Databricks tools, including Unity Catalog for data management and governance, Databricks notebooks for data processing, MLflow for experiment tracking, and Apache Spark™.

How was DBRX built?

DBRX is the product of months of research, data experimentation, and scaling work. Its creation leveraged Databricks’ extensive experience in developing large language models, such as the MPT and Dolly projects, not to mention the thousands of models the company has brought into production with its customers.

DBRX was trained on 3,072 NVIDIA H100 GPUs connected over 3.2 Tbps InfiniBand. The main creation process, including pre-training, post-training, evaluation, refinement, and red-teaming, took only three months, showcasing the company’s ability to bring a large, competitive language model to market quickly.

During the building process, the company leveraged the same catalog of Databricks tools available to its customers. For instance, they governed and managed the training data using Unity Catalog, processed and cleaned it using Databricks notebooks and Apache Spark™, and explored it using the newly acquired Lilac AI.

DBRX was trained using optimized versions of Databricks’ open-source training libraries, including LLM Foundry, Composer, Streaming, and MegaBlocks. The company also used the Mosaic AI training service to run large-scale model training and fine-tuning across thousands of GPUs, with results logged in MLflow.

After the building process was complete, the development teams used the Databricks Playground to experiment with DBRX manually. According to the company, these tools proved best in class for their respective purposes, which helps explain how effectively they supported model development and training.

Read more: Cost optimization in Databricks through resource optimization

DBRX Training Efficiency

When evaluating model quality, one must also consider how easy the model is to train and use. This matters especially at Databricks, which builds models in part to establish a process its customers can follow to train their own foundation models.

When training DBRX, the developers found that MoE models deliver significant improvements in training compute efficiency.

For instance, DBRX MoE-B, a smaller member of the DBRX model family with 23.5 billion total parameters (6.6 billion active), required 1.7 times fewer FLOPs to reach a score of 45.5% on the Databricks LLM Gauntlet than LLaMA2-13B needed to reach 43.8%. DBRX MoE-B also has about half as many active parameters as LLaMA2-13B.

Historically, the company’s end-to-end model pre-training has become nearly four times more compute-efficient over the past year. On May 5, 2023, the company released MPT-7B, a 7-billion-parameter model trained on 1 trillion tokens, which reached a Databricks LLM Gauntlet score of 30.9%.

Similarly, DBRX MoE-A, a smaller member of the DBRX model family with 7.7 billion total parameters (2.2 billion active), achieved a Databricks Gauntlet score of 30.5% with 3.7 times fewer FLOPs.

This marked improvement in efficiency can be attributed to the MoE architecture, better optimization strategies, better pre-training data, and other changes to the network architecture.

Using better pre-training data also had a significant impact on model quality. For instance, the company trained DBRX Dense-A, a 7-billion-parameter model, on 1 trillion tokens of the DBRX pre-training data. The resulting model achieved a Databricks Gauntlet score of 39.0%, compared to MPT-7B’s 30.9%.

Read more: Mastering Databricks Deployment: A Step-by-Step Guide

According to the company’s model development team, the new pre-training data is at least two times better token-for-token than the data used to train MPT-7B, meaning only half as many tokens are needed to reach the same model quality. Since the same approach was applied when training DBRX, it’s no wonder the model performs so impressively.

How does DBRX compare to other models?

According to Databricks, DBRX Instruct is the leading model on numerous benchmarks, surpassing established models like GPT-3.5 and rivaling Gemini 1.0 Pro.

In the following sections, we’ll evaluate how DBRX models perform on various benchmarks against leading open and closed models:

Quality of Benchmarks: DBRX Instruct vs Open Models

When evaluated against some of the leading open-source models, DBRX Instruct outperforms them on several benchmarks, spanning programming, mathematics, MMLU, and composite benchmarks. It also surpasses the leading open instruction-finetuned and chat models on standard benchmarks.

  • Composite Benchmarks

The Databricks development team evaluated DBRX Instruct on two composite benchmarks: the Databricks Model Gauntlet and the Hugging Face Open LLM Leaderboard.

The Databricks Model Gauntlet comprises a suite of more than 30 tasks spanning six categories: world knowledge, commonsense reasoning, language understanding, reading comprehension, symbolic problem solving, and programming.

The Hugging Face Open LLM Leaderboard, on the other hand, reports the average of ARC-Challenge, HellaSwag, WinoGrande, MMLU, TruthfulQA, and GSM8k.

Among the open models evaluated, DBRX Instruct has the highest score on both composite benchmarks. On the Databricks Gauntlet, it scores 66.8% against 60.7% for the next-best model, Mixtral Instruct. On the Hugging Face Open LLM Leaderboard, DBRX Instruct scores 74.5% against Mixtral Instruct’s 72.7%.

  • Programming and Mathematics

When it comes to programming and mathematics, few open models come close to DBRX. For instance, it scores much higher than the models the Databricks team evaluated on HumanEval, with 70.1% against Mixtral Instruct’s 54.8%, Grok-1’s 63.2%, and 32.2% for LLaMA2-70B’s best-performing variant.

On the GSM8k benchmark, DBRX Instruct scores 66.9% against Grok-1’s 62.9%, Mixtral Instruct’s 61.1%, and 54.1% for LLaMA2-70B’s best-performing variant.

DBRX Instruct’s performance against these models is all the more impressive given that some of them are specifically tuned to excel on certain benchmarks. For instance, it outperforms Grok-1 even though Grok-1 has 2.4 times more parameters. Similarly, DBRX Instruct surpasses CodeLLaMA-70B Instruct, a model built specifically for programming tasks, on the HumanEval benchmark, scoring 70.1% against CodeLLaMA-70B Instruct’s 67.8%. [5]

  • MMLU

DBRX scores 73.7% on the MMLU benchmark, higher than the other open models evaluated: the Mixtral Instruct and Mixtral Base models score 71.4% and 71.9% respectively, and Grok-1 scores 73.0%.

Quality on benchmarks: DBRX vs. closed models

DBRX Instruct’s impressive performance doesn’t stop at open models; it surpasses several closed models as well. According to the scores reported by their respective developers, DBRX surpasses GPT-3.5 and is competitive with Mistral Medium and Gemini 1.0 Pro.

Across all the benchmarks considered by the Databricks team, DBRX Instruct surpasses, or at worst matches, GPT-3.5. For instance, on the MMLU benchmark, which evaluates general knowledge, DBRX Instruct scores 73.7% against GPT-3.5’s 70.0%.

Similarly, DBRX Instruct scores 89.9% on HellaSwag, which evaluates commonsense reasoning, against GPT-3.5’s 85.5%. On the WinoGrande benchmark, DBRX Instruct closely matches GPT-3.5, scoring 81.8% against GPT-3.5’s 81.6%.

When it comes to mathematics and programming, DBRX Instruct surpasses GPT-3.5 by a wide margin. It scores 70.1% on the HumanEval benchmark against GPT-3.5’s 48.1%, and 72.8% on GSM8k against GPT-3.5’s 57.1%.

The only models that seem to give DBRX Instruct a run for its money are Mistral Medium and Gemini 1.0 Pro. Against Gemini 1.0 Pro, DBRX scores higher on several benchmarks, including HumanEval, Inflection Corrected MT-Bench, HellaSwag, and MMLU, while Gemini 1.0 Pro surpasses DBRX Instruct on GSM8k.

When compared to Mistral Medium, Mistral Medium outperforms DBRX Instruct on the MMLU benchmark but falls slightly behind DBRX Instruct on Inflection Corrected MT-Bench and HumanEval.

Quality on Long-Context Tasks and RAG

When compared with Mixtral Instruct and the latest versions of GPT-3.5 Turbo and GPT-4 Turbo, GPT-4 Turbo remains the strongest model on long-context benchmarks. However, DBRX Instruct outperforms GPT-3.5 Turbo at all context lengths and all parts of the sequence, and its overall performance on these benchmarks is similar to Mixtral Instruct’s. [6]

Retrieval Augmented Generation (RAG) is one of the most effective ways to leverage a model’s context. In RAG, content relevant to a prompt is retrieved from a database and supplied alongside the prompt, giving the LLM more information than it would otherwise have.

On RAG tasks, DBRX Instruct achieves scores comparable to those of Mixtral Instruct, LLaMA2-70B Chat, and the current version of GPT-3.5 Turbo.
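Below is a minimal sketch of the RAG flow described above. The `embed` and `generate` functions, the document store, and the similarity search are placeholders for whatever embedding model, vector database, and LLM endpoint you actually use:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, top_k=3):
    """Return the top_k documents most similar to the query (cosine similarity)."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [docs[i] for i in np.argsort(sims)[-top_k:][::-1]]

def build_rag_prompt(question, embed, doc_vecs, docs):
    """Retrieve relevant context and prepend it to the user's question."""
    context = retrieve(embed(question), doc_vecs, docs)
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

# The assembled prompt is then sent to the LLM (e.g., DBRX Instruct) as usual:
# answer = generate(build_rag_prompt("What is DBRX?", embed, doc_vecs, docs))
```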

Getting started with DBRX on Databricks

You can get started with DBRX Instruct right away through the Databricks Mosaic AI Foundation Model APIs. The company offers pay-as-you-go pricing and lets you query DBRX from the AI Playground chat interface.
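As a hedged sketch of what a pay-per-token query might look like, the snippet below posts a chat request to a Databricks model serving endpoint with the `requests` library. The environment variable names, the endpoint name, and the exact payload shape are assumptions here; confirm them against the Foundation Model API documentation for your workspace:

```python
import os
import requests

# Assumed configuration: replace with your own workspace URL, token, and endpoint name.
WORKSPACE_URL = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]          # personal access token
ENDPOINT = "databricks-dbrx-instruct"           # assumed pay-per-token endpoint name

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Summarize DBRX in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```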

If you plan to use DBRX models in production applications, the company offers a provisioned throughput option with support for fine-tuned models, performance guarantees, and additional security and compliance features.


Wrapping up

By democratizing access to AI tools, tech companies are paving the way for widespread AI utilization and the development of more innovative applications. In this regard, DBRX has set an industry standard by surpassing established open models on compute efficiency, performance, and ease of training.

Databricks has also demonstrated how high-quality pre-training data coupled with innovative technologies like the MoE architecture can significantly improve model quality and performance. In the near future, we expect to see other tech companies taking a similar approach to improve their models.

References

[1] Slashdot.org. DBRX Reviews. URL: https://slashdot.org/software/p/DBRX/. Accessed on April 16, 2024.
[2] Paperswithcode.com. Gated Linear Unit. URL: https://tiny.pl/dwgws. Accessed on April 16, 2024.
[3] Serp.ai. Rotary Position Embedding. URL: https://serp.ai/rotary-position-embedding/. Accessed on April 16, 2024.
[4] Deci.ai. Grouped Query Attention (GQA). URL: https://tiny.pl/dwgwb. Accessed on April 16, 2024.
[5] Ai.meta.com. Introducing Code Llama, a state-of-the-art large language model for coding. URL: https://ai.meta.com/blog/code-llama-large-language-model-coding/. Accessed on April 16, 2024.
[6] Arxiv.org. Lost in the Middle: How Language Models Use Long Contexts. URL: https://arxiv.org/abs/2307.03172. Accessed on April 17, 2024.



Category: Generative AI