
March 06, 2024

Google Gemini vs. GPT-4: Comparison


Edwin Lisowski

CSO & Co-Founder

Reading time: 6 minutes

The battle for the AI throne is on, and Google’s Gemini and OpenAI’s GPT-4 are some of the biggest contenders. Just a few weeks ago, Google released its Gemini Nano and Pro models into the market. One of its biggest selling points was that the Gemini model outperformed GPT-4 in 30 of the 32 most widely used tests to measure LLM capabilities.

Not long after, Microsoft and OpenAI countered these claims by utilizing Medprompt, a series of strategies for prompting LLMs to get better results. [1] The results were phenomenal, with GPT-4 edging out the Gemini Ultra model, 90.10 against 90.04, on the MMLU benchmark.

Considering the numerous similarities between the two models, both in functionality and capabilities, it is difficult to settle on one of them for the top spot. All we can do is figure out which performs better on specific tasks.

In that regard, here is a detailed Google Gemini vs. GPT-4 comparison evaluating the models’ performance and data processing capabilities.

Google Gemini vs. GPT-4: A comparison

Google Gemini is a revolutionary, cutting-edge AI product. This natively multimodal model can process different types of data, including text, images, audio, and video. It also comes as a family of models, each designed to perform specific tasks.

The Gemini Nano model, for instance, is designed to handle on-device applications. Similarly, the Pro model is designed to power AI tools, while the Ultra model is designed to handle more complex tasks.

The Gemini Ultra model performs exceedingly well across different tasks, including code generation and image analysis, and across benchmarks such as MMLU, DROP, HellaSwag, MATH, and Natural2Code. [2]

GPT-4, on the other hand, is the latest development in the GPT series from OpenAI and its partner Microsoft. It is significantly more powerful than its predecessors and can handle more types of data, including images, making it a multimodal model. [3]

One of the greatest perks offered by the GPT model is its impressive natural language processing capabilities. The model can understand, generate, and summarize text in various languages, making it an incredibly useful tool in language-intensive tasks like chatbots.
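Both models are typically accessed through a chat-style API in which the conversation is a list of role-tagged messages (this is the format the OpenAI chat API uses, and the Gemini API follows a broadly similar pattern). A minimal sketch of building such a request for a summarization task, without committing to either vendor's client library:

```python
def make_summarize_messages(text, language="English"):
    """Build a chat-style message list asking for a short summary.
    The role/content dict structure mirrors the OpenAI chat format;
    an actual API call would pass this list to the vendor's client."""
    return [
        {"role": "system",
         "content": f"You are a helpful assistant that summarizes text in {language}."},
        {"role": "user",
         "content": f"Summarize the following in two sentences:\n{text}"},
    ]

messages = make_summarize_messages("Gemini and GPT-4 trade wins across benchmarks.")
```

Keeping prompt construction separate from the API call like this makes it easy to run the same task against both models when comparing them.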

Read more about: Google Gemini: How Can It Be Used?

Performance: Google Gemini vs. GPT-4

Gemini, particularly the Ultra model, outperforms GPT-4 on several benchmarks, achieving state-of-the-art results on MMLU, Big-Bench Hard, and DROP.

The MMLU benchmark evaluates massive multitask language understanding. Here, Gemini achieved a 90.0% score against GPT-4’s 86.4%. It also outperformed GPT-4 on the Big-Bench Hard and DROP benchmarks, by margins of 0.5% and 1.5%, respectively.

That said, GPT-4 managed to outperform its contender in the HellaSwag test, which measures commonsense reasoning about everyday tasks, scoring 95.3% against Gemini’s 87.8%.

The two models are closely matched on mathematical reasoning benchmarks like GSM8K and MATH. The Gemini Ultra model got the upper hand on the GSM8K benchmark, with a score of 94.4% against GPT-4’s 92.0%. Similarly, Gemini came out on top of the MATH benchmark with a 53.2% score against GPT-4’s 52.9%.

When it comes to code generation, the Gemini Ultra model outperforms GPT-4 on both the HumanEval and Natural2Code benchmarks, scoring 74.4% and 74.9% against GPT-4’s 67.0% and 73.9%, respectively.

As a natively multimodal model, Gemini Ultra exhibits sophisticated reasoning capabilities that enable it to extract insights from multiple data sources. It also performs exceedingly well in math, physics, and coding tasks. [4]

That said, GPT-4 outperforms the Gemini Ultra model in advanced reasoning tasks like scheduling meetings based on the availability of multiple individuals.

You might be interested in the article: Google Gemini API vs. Open AI API: Main Differences

Gemini Ultra and GPT-4 in text processing

Up until a few months ago, the GPT series was the top contender for the #1 spot in text processing capabilities. However, the tables seem to have turned based on the latest results from the MMLU benchmark, which evaluates general language understanding across questions drawn from 57 diverse subjects. Gemini scored an impressive 90.0% against GPT-4’s 86.4% in a 5-shot setting [5].

Similarly, Gemini achieves a score of 83.6% against GPT-4’s 83.1% on the Big-Bench Hard benchmark, which assesses a model’s multistep reasoning across a range of challenging tasks.

Gemini also gets the upper hand in reading comprehension, scoring 82.4% against GPT-4’s 80.9% on the DROP (Discrete Reasoning Over Paragraphs) benchmark.

In the GSM8K benchmark, which assesses a model’s ability to solve basic arithmetic and grade-school math problems, Gemini achieves a score of 94.4% against GPT-4’s 92.0% in a 5-shot CoT setting.
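The "5-shot CoT" label describes how the model is prompted: five worked examples, each with its chain-of-thought reasoning, precede the actual question. A minimal sketch of assembling such a prompt (the exact exemplar wording here is illustrative, not taken from either benchmark suite):

```python
def build_few_shot_prompt(exemplars, question):
    """Assemble a k-shot chain-of-thought prompt: each exemplar pairs a
    question with its worked reasoning and answer, and the new question
    is appended last so the model continues the pattern."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("Tom has 3 apples and buys 2 more. How many does he have?",
      "He starts with 3 and adds 2, so 3 + 2 = 5.", "5")],
    "A class has 18 students and 7 leave. How many remain?",
)
```

In a 5-shot setting the exemplar list simply contains five such entries; the model is expected to produce its own reasoning and answer after the final "A:".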

The MATH benchmark is the only test where the two models run nearly neck and neck. The benchmark evaluates a model’s ability to solve more complex math problems, including geometry and algebra. Here, Gemini Ultra slightly outperforms GPT-4 with a score of 53.2% against the contender’s 52.9% in a 4-shot setting.

Gemini Ultra and GPT-4 in multimodal processing

The GPT-4 model is the first in the GPT series with multimodal capabilities. Unlike its predecessors, it can understand and interpret images alongside text. Although it does not process video directly, it can produce a fairly accurate analysis of a video from frame-by-frame still images.
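The frame-by-frame workaround boils down to sampling a handful of evenly spaced stills and sending each one to the image-capable model. A small sketch of the sampling step (the sampling interval of one frame every two seconds is an arbitrary illustrative choice):

```python
def sample_frame_indices(total_frames, fps, seconds_between=2.0):
    """Pick evenly spaced frame indices so a video can be summarized
    by sending a handful of stills to an image-capable model instead
    of the full clip. One frame every `seconds_between` seconds."""
    step = max(1, int(fps * seconds_between))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled every 2 seconds -> 5 stills.
indices = sample_frame_indices(total_frames=300, fps=30)
```

Each selected frame would then be decoded (e.g. with a video library such as OpenCV) and attached as an image input; the model analyzes the sequence of stills rather than a true video stream.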

That said, Gemini, with its native multimodal training, outperforms GPT-4 on several multimodal benchmarks. For instance, the Ultra model achieves a 59.4% score on the MMMU benchmark, which evaluates performance on multi-discipline, college-level reasoning problems; GPT-4 falls slightly behind at 56.8%.

Similarly, on the VQAv2 benchmark, which evaluates natural image understanding through visual question answering, Gemini Ultra slightly surpasses GPT-4 with a score of 77.8% against 77.2%. It also surpasses GPT-4’s document understanding capabilities, scoring 90.9% against 88.4% on the DocVQA benchmark.

One of the areas where Google Gemini shows significant improvements over GPT-4 is in image, video, and audio processing. Its multimodal capabilities give it an upper hand on benchmarks such as MathVista, which tests mathematical reasoning in visual contexts; VATEX, which evaluates English video captioning; and CoVoST 2, which evaluates automatic speech translation.


Final thoughts

The steep competition between Microsoft and OpenAI’s GPT-4 and Google’s Gemini is a testament to rapid advancements in AI. Both models show impressive capabilities in various tasks, evidenced by their high scores in various benchmarks.

For instance, Gemini Ultra’s performance on benchmarks like HumanEval, MMLU, and GSM8K highlights its capabilities in understanding and generating complex code and text. In the same measure, GPT-4 has a competitive edge in the HellaSwag benchmark that evaluates commonsense reasoning for everyday tasks.


[1] Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. URL: Accessed on February 28, 2024.
[2] Code Benchmark Comparison Between Gemini Ultra and GPT-4 in 2024. URL: Accessed on February 28, 2024.
[3] GPT-4. URL: Accessed on February 28, 2024.
[4] Nimblechapps.com. Google Deepmind’s Gemini AI: A Truly Universal AI Model. URL: Accessed on February 28, 2024.
[5] hgs.cxl. Gemini vs ChatGPT: Which is Better? URL: Accessed on February 28, 2024.


