Multimodal AI is revolutionizing the AI landscape. In just a few years, the world has shifted from simple, analytics-based AI models that once seemed to be the epitome of technological advancement to large language models (LLMs) that can handle a myriad of tasks, sometimes even rivaling human creativity. What’s even more impressive is that these models can themselves streamline the development of more advanced AI models.
Take the Gemini API, for instance. Developed with multimodal capabilities, the LLM can handle various data types, facilitating the development of more advanced, multimodal large language models that will shape the future of AI utilization across the board.
This article will help you create unique large language models using the Gemini API. But first, here’s some helpful information to get you started.
The Gemini Pro model is the latest addition to Google DeepMind’s family of LLMs. It is significantly larger and better performing than the company’s previous model, PaLM, which underperformed compared to similar models in its category.
That said, Gemini boasts a wide range of capabilities, most notable of which is its multimodality, which enables it to handle various data types, including text, images, audio, and video. It also performs quite well in tasks related to physics, math, code, and other technical fields. In fact, Gemini outperformed OpenAI’s GPT-4 in these and several other areas. [1]
Gemini is currently available through integrations with Google’s Pixel 8 Pro, Google Bard, and the Gemini API. According to Google, the company plans to make it available on other Google service platforms in the near future.
Besides its multimodality, the other thing that sets Gemini apart from similar models is its flexibility and scalability. It comes in three sizes, enabling deployment across a range of platforms and architectures, from data centers to mobile devices. The three sizes are:
Image credits: Google
The Nano model is the smallest in the Gemini series. It is designed to run on smartphones like the Google Pixel 8 Pro, where it performs on-device tasks that don’t require a connection to external servers. Currently, the Nano model handles simple tasks such as text summarization and suggesting replies in chat applications. [2]
Gemini Pro is much larger than the Nano model and, as such, cannot run on-device. Instead, it runs in Google’s data centers and is currently integrated into Google Bard. Its more complex architecture and larger training dataset enable it to perform well on tasks that require understanding complex queries and fast response times.
Gemini Ultra is still under development and thus not available for public use. However, Google describes it as its most capable model, exceeding the current state of the art (SoTA) on 30 of the 32 academic benchmarks widely used in LLM research and development.[3]
According to Google, Gemini Ultra is designed for widespread use in complex tasks and is scheduled for release once its current testing phase is complete. The company has not yet set a release date, though a launch is expected in the near future.
The flexibility and performance provided by Gemini’s multimodal capabilities make it one of the most versatile tools for building large language model applications. In the sections below, we will explore how you can develop a unique LLM application using the Gemini API, including LangChain integration and how to leverage the platform effectively for various use cases.
What you need to start with Gemini: a working Python installation, a code editor or terminal, and a free Google API key (covered in Step 3 below).
The Gemini Pro API is set up in a way that enables you to build deployable LLMs seamlessly in just a few simple steps. Here’s how to do it:
Step 1: Set Up Your Development Environment
As with any other AI project, first create a new directory for your project, then navigate to it in your command prompt or terminal, as shown below.
mkdir LLM_Project
cd LLM_Project
Step 2: Install Dependencies
Once you have your development environment up and running, you need to install the dependencies for your LLM project: google-generativeai (the Gemini SDK), langchain-google-genai (LangChain’s Gemini integration), streamlit (for the app interface), and pillow (for image handling). You can install all of them with the command below:
pip install google-generativeai langchain-google-genai streamlit pillow
Alternatively, you can create a virtual environment to install the required libraries and manage your project dependencies as shown below:
python -m venv venv
source venv/bin/activate  # for Ubuntu
venv/Scripts/activate     # for Windows
However, if you go with the virtual-environment method, you also need to create an application file (app.py) in your code editor and install the libraries inside the activated environment. Once that is in place, you can run the application locally with Streamlit, as shown below.
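For example, once app.py exists, you can start the local development server from the activated environment:

streamlit run app.py

Streamlit will then serve the app in your browser, by default at http://localhost:8501.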
If you went with the first, non-virtual-environment method instead, proceed straight to Step 3.
Step 3: Configure the Free API Key and Initialize the Gemini Model
Google currently offers a free API key through Google MakerSuite. After obtaining it, store it in an environment variable; for this walkthrough, we’ll name it “GOOGLE_API_KEY”. To work with the Gemini model, import Google’s genai library, pass the key to its configure function, and initialize the model. Here’s how to do it:
import os
import google.generativeai as genai

# Store your key in an environment variable and pass it to the SDK
os.environ['GOOGLE_API_KEY'] = "Your API Key"
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

# Initialize the Gemini Pro model
model = genai.GenerativeModel('gemini-pro')
Once you have your Gemini model up and running, you can start using it for text generation. To display the output neatly, first import the Markdown class from IPython, which renders the generated response as formatted Markdown.[5]
Additionally, to create a working model, instantiate the GenerativeModel class from genai, then call its generate_content() method to process user queries and generate an appropriate response.
Here’s how to do it:
from IPython.display import Markdown

model = genai.GenerativeModel('gemini-pro')
response = model.generate_content("List 5 planets each with an interesting fact")
Markdown(response.text)
There are currently two models available through the Gemini API: Gemini Pro and Gemini Pro Vision.
Gemini Pro is a unimodal, text-based model: it can only process text inputs and generate output in textual format. The model accepts inputs of up to 30,720 tokens and can generate outputs of up to 2,048 tokens, making it well suited for building chat applications.
The Gemini Pro Vision model is a multimodal model capable of processing both text and image-based inputs.[6] It closely resembles OpenAI’s GPT-4 with vision, with the most significant difference between them lying in context length: Gemini Pro Vision accepts inputs of up to 12,288 tokens and can generate outputs of up to 4,096 tokens.
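If you want to check how much of that input budget a prompt will consume, the Python SDK provides a token counter. Here’s a small sketch, reusing the model object initialized in Step 3:

# Count the tokens in a prompt before sending it
# (Gemini Pro accepts inputs of up to 30,720 tokens)
token_info = model.count_tokens("List 5 planets each with an interesting fact")
print(token_info.total_tokens)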
Depending on your specific use case, you can ask the model (Gemini Pro) to generate text with a simple question such as ‘Name the eight planets of the solar system’, or input an image (for the vision model) and ask the model questions about it. For instance, you can input an image of a cat and ask the model to identify it, then follow up by asking it to elaborate on its answer.
Gemini does not refer to its outputs as “outputs”; instead, the API calls them candidates, so any response from the model may contain that word. According to Google, the model can generate multiple candidates for a single prompt, giving you the chance to pick the best answer from several options. This capability has not yet been enabled, however, so you currently get only a single candidate per query; there is speculation that upcoming updates may allow multiple responses per prompt.
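As a small sketch of what this looks like in practice, you can inspect the candidates list on the response object returned by the earlier text-generation example:

# The response exposes a candidates list; currently it holds a single entry
print(len(response.candidates))

# The candidate's text is nested under content.parts
print(response.candidates[0].content.parts[0].text)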
As a multimodal model, Gemini can process and analyze different types of input data. In this instance, we’ll test it with an image query. Note that only the Gemini Pro Vision model can handle image inputs.
Here’s an example of how you can use the Gemini LLM to process and analyze an image input.
Use the PIL library to load an image.
Create a new vision model, ‘gemini-pro-vision’, using the GenerativeModel class.
Using the GenerativeModel.generate_content() function, pass in the image along with the necessary textual prompt.
Here’s a simple code snippet for this operation:
import PIL.Image

image = PIL.Image.open('random_image.jpg')
vision_model = genai.GenerativeModel('gemini-pro-vision')
response = vision_model.generate_content(["Write a 100 words story from the Picture", image])
Markdown(response.text)
The code above prompts the model to generate text based on the image. However, you can take it a step further by prompting the model to return its response as JavaScript Object Notation (JSON).
In this instance, the model will inspect the objects in the image and deliver an appropriate response. For example, if you input an image of a busy street, the model may be able to name the objects in the image, including people, vehicles, and other elements.
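Here’s a hedged sketch of such a prompt, reusing the vision_model and image objects from the snippet above; the JSON schema in the prompt is our own illustrative choice, not a fixed API feature:

# Ask the vision model to return structured JSON; the model follows the
# schema described in the prompt rather than any built-in API option
response = vision_model.generate_content([
    'List the objects you can identify in this image as a JSON array of '
    '{"object": ..., "count": ...} entries.',
    image
])
print(response.text)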
Besides the regular text generation described above, the Gemini LLM also offers a dedicated chat version, touted to generate coherent, human-like conversational responses. You can launch the chat version of Gemini Pro using the code below:
chat_model = genai.GenerativeModel('gemini-pro')
chat = chat_model.start_chat(history=[])
It is important to note the difference in code between plain text generation and the chat version: instead of calling GenerativeModel.generate_content(), you call GenerativeModel.start_chat() to open a chat session.
As with the GPT-4 model, you can ask the Gemini LLM to elaborate further on a response or tweak it to suit your preference. You can also read chat.history to review all previous messages in the chat session, as shown in the sketch below.
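Here’s a minimal sketch of a multi-turn exchange using the chat object created above; send_message() automatically appends each turn to the session history:

# Each send_message() call carries the conversation context forward
response = chat.send_message("Name the eight planets of the solar system")
print(response.text)

response = chat.send_message("Tell me an interesting fact about the fourth one")
print(response.text)

# Review every message exchanged in this session
print(chat.history)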
LangChain gained widespread popularity for its integration with the OpenAI API, which facilitates the seamless development of chatbots and other LLM applications. Shortly after Gemini was released, LangChain added support for the model on its platform.[7] Here’s how to get started with Gemini in LangChain.
Here’s a simple example code for the operation:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro")
response = llm.invoke("Explain Quantum Computing in 50 words?")
print(response.content)
Building a simple ChatGPT-like application with Gemini LLM and Streamlit is pretty straightforward. Here’s how to do it:
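Below is a minimal sketch of such an app; the file name (app.py), widget labels, and overall structure are our own choices, and it assumes the GOOGLE_API_KEY environment variable from Step 3 is already set:

# app.py - a minimal ChatGPT-like interface built with Streamlit and Gemini Pro
import os
import streamlit as st
import google.generativeai as genai

# Assumes GOOGLE_API_KEY is already set in the environment (see Step 3)
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-pro')

st.title("Gemini Chat")

# Keep one chat session alive across Streamlit reruns
if "chat" not in st.session_state:
    st.session_state.chat = model.start_chat(history=[])

# Replay the conversation so far
for message in st.session_state.chat.history:
    role = "assistant" if message.role == "model" else "user"
    st.chat_message(role).markdown(message.parts[0].text)

# Read a new prompt, send it to Gemini, and display the reply
if prompt := st.chat_input("Ask me anything"):
    st.chat_message("user").markdown(prompt)
    response = st.session_state.chat.send_message(prompt)
    st.chat_message("assistant").markdown(response.text)

Save the file and start it with streamlit run app.py; because the chat session lives in st.session_state, the conversation context persists across turns.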
The Gemini Pro API is poised to revolutionize the AI landscape. Its multimodal capabilities make it perfectly suited for developing advanced LLM applications that can process and analyze various data types. Like the OpenAI API, the Gemini API integrates with numerous platforms, including Streamlit and LangChain, that further streamline the development process.
Currently, developers can only work with the Gemini Pro and Gemini Pro Vision models. However, the Gemini Ultra model is scheduled for release and will offer further development capabilities, perhaps even surpassing OpenAI’s API.
[1] Google DeepMind. Gemini. URL: https://deepmind.google/technologies/gemini/. Accessed on January 12, 2024
[2] Google Store. Gemini Nano Now Running on Pixel 8 Pro. URL: https://store.google.com/intl/en/ideas/articles/pixel-feature-drop-december-2023/. Accessed on January 12, 2024
[3] The Verge. Google’s Bard Chatbot is Getting Way Better Thanks to Gemini. URL: https://www.theverge.com/2023/12/6/23989744/google-bard-gemini-model-chatbot-ai. Accessed on January 13, 2024
[4] Google Blog. Introducing Gemini: Our Largest and Most Capable AI Model. URL: https://blog.google/technology/ai/google-gemini-ai/. Accessed on January 13, 2024
[5] Markdown Guide. Getting Started. URL: https://bit.ly/3SsfmPf. Accessed on January 13, 2024
[6] PyImageSearch. Introduction to Gemini Pro Vision. URL: https://pyimagesearch.com/2024/01/01/introduction-to-gemini-pro-vision/. Accessed on January 13, 2024
[7] Protecto.ai. This Week in AI – LangChain Integrates Gemini Pro and More. URL: https://bit.ly/3SfwnLr. Accessed on January 13, 2024