December 11, 2023

Google Gemini AI Explained


Edwin Lisowski

CSO & Co-Founder

Reading time: 5 minutes

Gemini AI is a cutting-edge AI model developed by Google DeepMind, capable of sophisticated reasoning and understanding information across various modalities, including text, images, video, audio, and code. According to Google, it surpasses the capabilities of GPT-4. But is this Google’s wishful thinking, or has the company – although late to the game – cooked up something truly special?

This cutting-edge technology is poised to significantly revolutionize the methodologies employed by developers and business clients in the development and expansion of AI applications.

– stated Demis Hassabis, the co-founder and CEO of Google DeepMind.

The Gemini framework has been tailored in three distinct editions: Gemini Nano, which is fine-tuned for use on mobile devices; Gemini Pro, engineered to handle a broad array of operations; and Gemini Ultra, the most expansive variant of the model, crafted to manage exceptionally intricate tasks.

Gemini Nano

This is a smaller version of the AI model, designed to power on-device smartphone features; it is available now in the Pixel 8 Pro.

Gemini Pro

A version of the Gemini model, called Gemini Pro, is available inside the Bard chatbot, which is accessible for free. It is also available for enterprise customers using Vertex AI, Google’s fully managed machine learning platform.

Gemini Ultra

This is the largest and most capable version of Gemini, and the first AI model to outperform human experts on Massive Multitask Language Understanding (MMLU), a popular method for testing AI models’ knowledge and problem-solving abilities. It is designed to handle complex tasks across text, images, audio, video, and code, making it a truly universal AI model.

What is MMLU?

Massive Multitask Language Understanding (MMLU) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.

It covers 57 subjects across STEM, the humanities, the social sciences, and more, ranging in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability.

The benchmark is used to measure a text model’s multitask accuracy and is well suited to exposing a model’s blind spots: attaining high accuracy requires both extensive world knowledge and strong problem-solving ability.
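To make the evaluation concrete, here is a minimal sketch of how an MMLU-style few-shot prompt is assembled and how multitask accuracy is scored. The questions, field names, and helper functions below are illustrative inventions, not the benchmark’s actual code; the prompt layout follows MMLU’s standard multiple-choice format (a question, four options labeled A–D, and a single correct letter).

```python
def format_item(item, include_answer=True):
    """Render one question in the usual MMLU prompt layout."""
    lines = [item["question"]]
    for letter, option in zip("ABCD", item["options"]):
        lines.append(f"{letter}. {option}")
    # Solved examples show the answer; the test item leaves it blank
    # for the model to complete.
    lines.append("Answer: " + (item["answer"] if include_answer else ""))
    return "\n".join(lines)

def build_few_shot_prompt(dev_items, test_item, subject):
    """Prepend k solved examples (few-shot), then the unanswered test item."""
    header = f"The following are multiple choice questions about {subject}.\n\n"
    shots = "\n\n".join(format_item(it) for it in dev_items)
    return header + shots + "\n\n" + format_item(test_item, include_answer=False)

def accuracy(predictions, items):
    """Multitask accuracy: fraction of items answered with the correct letter."""
    correct = sum(p == it["answer"] for p, it in zip(predictions, items))
    return correct / len(items)

# Illustrative data standing in for one of MMLU's 57 subjects.
dev = [{"question": "What is 2 + 2?",
        "options": ["3", "4", "5", "6"], "answer": "B"}]
test = [{"question": "What is 3 * 3?",
         "options": ["6", "9", "12", "27"], "answer": "B"}]

prompt = build_few_shot_prompt(dev, test[0], "elementary mathematics")
```

In a real run, the prompt would be sent to the model under evaluation, the model’s chosen letter collected for every question in all 57 subjects, and `accuracy` computed over the lot; zero-shot evaluation is the same loop with `dev_items` left empty.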

Multimodality is Google’s keyword

Gemini AI is designed for multimodality, allowing it to reason seamlessly across different forms of input and to produce outputs in various ways.

It was trained from the start on multiple types of data, so it can simultaneously process and understand text, sound, visuals, video content, and computer code. In contrast, competing models such as OpenAI’s ChatGPT primarily focus on text processing and require additional extensions to handle tasks like image analysis and web navigation.

Gemini is not the first AI model developed by Google, but it is apparently the most efficient one. Trained on Google’s own Tensor Processing Units, it is supposed to be faster and more cost-effective compared to models like PaLM.

GPT-4 vs. Gemini

Google compared the two models on 32 well-established benchmarks, covering a wide range of tests from the Massive Multitask Language Understanding benchmark to one assessing each model’s ability to generate Python code.

The results? Gemini outperformed GPT-4 in 30 of the 32 benchmarks.

Gemini’s primary advantage lies in its ability to understand and interact with both video and audio. This built-in multimodality is Gemini’s main distinguishing feature, and it has been Google’s goal from the outset. In contrast, OpenAI created separate models—DALL-E for images and Whisper for audio—whereas Gemini is designed to be multisensory from the ground up.

However, this remains theoretical until proven in practice.

It might be interesting for you: Open-Source Large Language Models (LLM) in 2023: A Comprehensive Guide

Gemini in practice

The initial reception to Google’s Gemini – particularly Gemini Pro – has not been positive, according to users (via TechCrunch).

Despite Google’s claims that Gemini Pro would bring enhanced capabilities to Bard, their ChatGPT rival, with advanced reasoning, planning, and understanding, users have reported issues with the AI’s performance.

Gemini Pro, which was meant to outperform older AI models like GPT-3.5 in certain benchmarks, has been criticized for errors in basic facts, struggles with translation, and outdated responses to news summarization requests.

Specifically, users have noted that Gemini Pro has provided incorrect information about topics such as the 2023 Oscar winners and has failed to perform simple translations correctly. Additionally, it appears to avoid commenting on potentially controversial news topics, suggesting instead that users look up the information themselves.

[Tweet screenshot] Source: @vitor_dlucca

[Tweet screenshot] Source: @benjaminnetter

Moreover, even in the domain of coding, where improvements were promised, Gemini Pro seems to struggle with basic functions, as evidenced by users’ experiences with tasks like writing Python code or creating simple games and clocks in HTML.

What Gemini’s launch means for companies

Gemini, according to Google, possesses superior computing power and a larger dataset, enabling faster processing and more complex analyses. However, given the doubts surrounding Gemini’s outputs, as well as its less-than-impressive demo, it appears that Google may have some refining to do with Gemini.

In contrast, GPT-4 is a more mature and tested product that is currently available to the public. It generates more accurate and consistent text. Yet, assuming Google addresses these issues, Gemini may become a viable option for clients seeking faster processing and the ability to generate more creative and informative content.

For now, if you are considering implementing a Large Language Model (LLM) and wondering which one would be best, the final decision should be made after thoroughly investigating your business needs and the capabilities of your infrastructure.

If you would like to learn more, please don’t hesitate to contact us. We’d be delighted to assist you in making the right decision.
