
January 14, 2022

Computer Vision Case Study: Image Generation Process (Step-By-Step)


Adam Komorowski

Data Scientist

Reading time: 6 minutes

Thanks to the rapid development of artificial intelligence and deep learning, computer vision has become a number-one tool transforming many industries.

Computer vision solutions are based on artificial intelligence and have many practical applications in retail, insurance, security, agriculture, construction and more. According to Forbes, the computer vision market will reach $49 billion in 2022.

Despite the trend of implementing computer vision technology, the computer vision market faces many challenges. Below you will find a computer vision case study that shows one of the biggest challenges in today’s world of artificial intelligence – the problem of generating images based on natural language.

Computer Vision Case Study by Addepto’s Data Scientist

This issue is interesting because it brings together several areas of the machine learning world, such as Natural Language Processing and Computer Vision. Recently, a state-of-the-art solution to this problem was published by the OpenAI team. They named their model DALL-E [1], and it has two key features:

• It is a transformer based on their earlier popular model GPT-3.
• It was trained on a huge dataset (over 200 million records) of image-description pairs, most of which were scraped from the web.

Such models can have many applications, both commercial and scientific. In the world of artificial intelligence, they can help in so-called data augmentation – for example, to generate larger training datasets of images in computer vision problems.

When it comes to commercial applications – models of this type can be used by designers for initial visualization of their ideas or used to generate so-called stock images based on article content.


Figure 1 – Images generated for “A blond girl with a smile” text


This solution consists of three key elements that form a final framework:

• CLIP model
• Evolutionary algorithm
• Pre-trained GAN model (Generative Adversarial Network)

They form a pipeline that generates batches of images corresponding to an input text snippet in English.

CLIP is the part of the DALL-E model that its authors made publicly available – it provides a link between image encodings and text encodings. This allows us to evaluate the quality of the generated images in consecutive iterations of the model by using CLIP to calculate fitness function values.
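The role CLIP plays here – scoring how well a generated image matches the prompt – boils down to a cosine similarity between embeddings. Below is a minimal sketch with random stand-in vectors instead of real CLIP encodings (the 512-dimensional size matches CLIP's ViT-B/32 variant; `clip_fitness` is a hypothetical name, not part of the released model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_fitness(image_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Fitness of a generated image: how close its (CLIP-style) encoding
    lies to the encoding of the user's text prompt."""
    return cosine_similarity(image_embedding, text_embedding)

# Toy example with random stand-in embeddings:
rng = np.random.default_rng(0)
text_vec = rng.normal(size=512)                      # "encoded" prompt
image_vec = text_vec + 0.1 * rng.normal(size=512)    # an image "close" to the prompt
score = clip_fitness(image_vec, text_vec)            # near 1.0 for a good match
```

Images whose encodings point in the same direction as the prompt encoding score close to 1, which is exactly what the evolutionary loop described below tries to maximize.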

The most important element of the whole framework is the pre-trained GAN model. A GAN is a so-called generative network, introduced to the ML world several years ago. These networks can generate various images from the domain defined by their training dataset (for example, pictures of cats, faces, or cars). The larger a GAN's training set, the more advanced the model is and the greater the variety of generated images.

Evolutionary algorithms are population-based heuristic optimization methods designed to replicate the way species evolve in nature. Using numerical equivalents of operations such as mutation, crossover, and reproduction, they aim to produce an ever-fitter population of strong members.
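As an illustration only (not the authors' exact implementation), a minimal evolutionary loop with selection, uniform crossover, and Gaussian mutation might look like this; the toy fitness function and all hyperparameters are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

def evolve(fitness, dim=4, pop_size=40, generations=150,
           mutation_scale=0.2, elite_frac=0.25):
    """Minimal evolutionary loop: keep the fittest members (reproduction),
    recombine them by uniform crossover, and perturb offspring with
    Gaussian mutation."""
    pop = rng.normal(size=(pop_size, dim))            # random initial population
    n_elite = max(2, int(elite_frac * pop_size))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]    # selection of the fittest
        children = []
        while len(children) < pop_size - n_elite:
            a, b = elite[rng.choice(n_elite, size=2, replace=False)]
            mask = rng.random(dim) < 0.5              # uniform crossover
            child = np.where(mask, a, b)
            child = child + mutation_scale * rng.normal(size=dim)  # mutation
            children.append(child)
        pop = np.vstack([elite, np.array(children)])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]

# Toy fitness with its optimum at (3, 3, 3, 3):
best = evolve(lambda x: -np.sum((x - 3.0) ** 2))
```

Because the elite members are carried over unchanged, the best fitness never decreases between generations, and the population gradually concentrates around the optimum.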



Figure 2 – Evolutionary Algorithm scheme

How does it work?

The process begins with the user providing input in English. It can be a single word or a longer text, and it is encoded into a vector using the CLIP model. The first population of images is generated by passing random vectors to the selected GAN model. The evolutionary algorithm then tries to maximize the chosen fitness function by iteratively adjusting new populations of images toward the encoding of the user's initial text. To calculate the fitness function value, each image in the population is encoded with the CLIP model into a vector of the same length as the input text encoding. The algorithm loop repeats until it is stopped or the fitness function converges. The final population of images is the framework's output.
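The steps above can be sketched end to end. Here the GAN generator and the CLIP encoders are replaced by trivial hypothetical stubs (a fixed linear projection and deterministic random embeddings) purely to make the data flow runnable; in the real framework they are large pretrained networks:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, EMBED_DIM = 16, 32

# Hypothetical stand-ins for the pretrained models.
PROJECTION = rng.normal(size=(LATENT_DIM, EMBED_DIM))

def gan_generate(latent):
    """Stub generator: the 'image' is just a linear projection of the latent."""
    return latent @ PROJECTION

def clip_encode_image(image):
    """Stub image encoder: identity mapping."""
    return image

def clip_encode_text(prompt):
    """Stub text encoder: a deterministic random embedding per prompt."""
    seed = abs(hash(prompt)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

def fitness(latent, text_vec):
    """Cosine similarity between the image encoding and the prompt encoding."""
    img_vec = clip_encode_image(gan_generate(latent))
    return float(img_vec @ text_vec
                 / (np.linalg.norm(img_vec) * np.linalg.norm(text_vec)))

def generate_images(prompt, pop_size=64, generations=100, sigma=0.1):
    """Evolve GAN latent vectors so generated images match the prompt encoding."""
    text_vec = clip_encode_text(prompt)
    pop = rng.normal(size=(pop_size, LATENT_DIM))     # first random population
    for _ in range(generations):
        scores = np.array([fitness(z, text_vec) for z in pop])
        elite = pop[np.argsort(scores)[-pop_size // 4:]]          # selection
        offspring = elite[rng.integers(len(elite), size=pop_size - len(elite))]
        offspring = offspring + sigma * rng.normal(size=offspring.shape)  # mutation
        pop = np.vstack([elite, offspring])
    return np.array([gan_generate(z) for z in pop])   # final population of "images"

images = generate_images("a yellow car in the city")
```

Swapping the stubs for a real pretrained GAN and the released CLIP encoders turns this skeleton into the framework the article describes.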



Figure 3 – Final Framework scheme

The first population of images, generated for the phrase “a yellow car in the city”, is presented below. The result is a batch of car images generated randomly by the GAN network.

[Image: the first, randomly generated population of car images from the GAN network]

Over the next 100 iterations, the model adjusts successive populations of images to best reflect the given input text. The results of these operations are presented below.

[Image: populations of images after 100 iterations, adjusted to the phrase “a yellow car in the city”]

The evolutionary algorithm captured the most important features mentioned in the input text, such as the city setting and the yellow color. On the other hand, it clearly missed the context – the fact that only the car was supposed to be yellow.

These images illustrate how the described framework works and hint at the direction in which such solutions may develop in the future.

Addepto specializes in such computer vision solutions as object detection, image and video pre-processing and scene segmentation. With these computer vision services, your company can take advantage of new hi-tech opportunities, improve operations, provide a unique customer experience and most importantly, reduce the cost of employing and training people to do tasks that computers will perform much faster.

How Can Companies Benefit from Computer Vision Technology?

We want to show you our computer vision case study as an example of how companies can benefit from high-tech solutions.

The Challenge: A Polish retailer has thousands of product label images in its database. The main problem was the inability to quickly and automatically locate the area of an image where the product's ingredients are described in detail and convert that area into text.

Our Solution: After analyzing possible approaches, we decided it was best to use a modified VGG16 convolutional neural network to find the blocks containing ingredient information. Using Google OCR, we extracted the fields with text data, then processed the text using NLP methods and classified the product. At the company's request, we also integrated the algorithm with a mobile application.
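The described pipeline chains three stages: detect the ingredients block, OCR it, then classify the text. The sketch below shows that data flow only; every function name, the bounding-box format, and the hard-coded outputs are hypothetical stand-ins for the VGG16 detector, Google OCR, and the NLP classifier:

```python
def detect_ingredient_block(image):
    """Stub for the CNN stage: returns the bounding box of the region
    that lists the product's ingredients."""
    return {"x": 10, "y": 120, "w": 300, "h": 80}

def ocr_text(image, box):
    """Stub for the OCR stage: extracts text from the detected region."""
    return "Ingredients: wheat flour, sugar, milk, hazelnuts"

def classify_product(text):
    """Stub for the NLP stage: flags common allergens and assigns a category."""
    allergens = {"wheat", "milk", "hazelnuts", "peanuts", "soy", "eggs"}
    words = {w.strip(",.:").lower() for w in text.split()}
    return {"category": "bakery" if "flour" in words else "other",
            "allergens": sorted(allergens & words)}

def process_label(image):
    """Full pipeline: image -> ingredients region -> text -> classification."""
    box = detect_ingredient_block(image)
    text = ocr_text(image, box)
    return classify_product(text)

result = process_label(image=None)   # no real image in this sketch
```

In the production system each stub is replaced by the corresponding trained model or API call, but the hand-offs between stages stay the same.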

Benefits: Our technology allows the company to process images considerably faster (with 91% accuracy) than is possible manually, which helps customers with allergies choose the right food. Moreover, the implemented algorithm assists in checking whether a product changes over time.

Computer Vision Case Study Wrap-Up

As you’ve learnt from the computer vision case study presented above, this technology has great potential to help businesses grow, offering benefits like:

  • improving the quality of customer service and user experience,
  • providing retailers with smooth process management and automatic quality control,
  • helping doctors and nurses spend more time on patient care, reducing stress for providers and improving outcomes for patients,
  • helping farmers monitor livestock in large-scale farming and analyze risks of biological hazards,
  • increasing productivity and reducing costs in the construction industry.

If you are looking for a trusted partner to cooperate with, contact us and arrange a consultation on how computer vision technology can be implemented in your business.

