Client: NDA

Case Study: Text Summarization with the OpenAI API

Case study details


Text summarization is the process of condensing a longer piece of text (an article, document, web page, eBook, or any other piece of content) into a shorter version while preserving its core information, main ideas, and critical points.

There are two kinds of text summarization:

  • Extractive summarization involves identifying and extracting essential sentences or phrases directly from the original text to construct the summary.
  • Abstractive summarization generates a summary by understanding the meaning of the text and producing new sentences that convey the main ideas.


Challenge


Text summarization plays a crucial role in news aggregation, document indexing, information retrieval, and content generation by helping users quickly grasp the essence of content without reading the entire text. It aims to save time and effort.



Approach


Summarization techniques are employed in natural language processing (NLP) and artificial intelligence (AI) systems to facilitate automated processing of large volumes of textual data.

With the emergence of Large Language Models (LLMs), we expect an explosion of summarization tools in the coming years, and – given ChatGPT's position – most of them will likely be based on the GPT models developed by OpenAI.



Goal


Take advantage of the OpenAI ecosystem to build a tool that delivers abstractive summarization of the content covered in any given PDF, with an additional question-driven feature designed to extract specific information from the text.



Challenge

Text extraction


Text extraction serves as a necessary preprocessing step for content summarization: it provides access to the textual content, filters out noise, handles various document formats, and offers flexibility for different summarization techniques.

Extracting the text from a document gives the algorithms direct access to the cleaned textual information that needs to be processed. This ensures that the summarization algorithm can work with the raw material, including sentences, paragraphs, headings, and other textual elements.

The extracted text becomes agnostic to the specific document format and can be analyzed further during content summarization.


Addepto Data Scientists opt for PyMuPDF for PDF text extraction


During the extraction process, we perform a series of operations that involve analyzing the text layout, identifying paragraphs or text blocks, and using bounding boxes to determine the exact position and boundaries of the text elements.


PyMuPDF enables efficient text extraction for OpenAI text summarization


The outcome of this step is a JSON file. It provides a standardized and versatile format for representing and sharing the extracted text and associated metadata, allowing for easy processing, integration, and interoperability.
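As a rough illustration, a minimal PyMuPDF sketch of such an extraction step could look like the following; the file names and the JSON schema are illustrative assumptions, not the exact format used in the project.

    import json

    import fitz  # PyMuPDF

    def extract_blocks(pdf_path, out_path):
        """Extract text blocks with bounding boxes from a PDF and dump them to JSON."""
        doc = fitz.open(pdf_path)
        pages = []
        for page in doc:
            blocks = []
            # "blocks" mode yields (x0, y0, x1, y1, text, block_no, block_type) tuples
            for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
                if block_type != 0:  # keep text blocks only, skip image blocks
                    continue
                blocks.append({"bbox": [x0, y0, x1, y1], "text": text.strip()})
            pages.append({"page": page.number + 1, "blocks": blocks})
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"source": pdf_path, "pages": pages}, f, ensure_ascii=False, indent=2)

    extract_blocks("ebook.pdf", "ebook_blocks.json")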


Approach

Text parsing


Text parsing involves breaking down the text into its constituent parts, such as words, sentences, paragraphs, or other meaningful units, to understand their relationships, roles, and characteristics within the context of the text.


During project development, our team took the following approach:


  • Our solution parses text based on the Table of Contents (TOC). Once the TOC is accessed, the parsing process begins.
  • This involves extracting the titles and page numbers from the TOC, organizing them into a structured format, and potentially creating a hierarchical representation of the document's structure.
  • By matching the extracted titles with the actual content, the parser establishes a mapping between the TOC entries and the corresponding sections in the document.
  • Extracting targeted sections: with the mapping established, the parser can selectively extract the content associated with specific sections or chapters based on user requirements or predefined criteria.
  • This involves locating each section's start and end points, usually based on the page numbers or other indicators provided in the TOC.
  • The outcome: chapter chunks (not entire chapters) sized to fit GPT-3.5 Turbo's context window – see the sketch after this list.
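To make the TOC-driven approach concrete, here is a minimal sketch built on PyMuPDF's get_toc(); the helper names are our own, and the chunking itself (shown later with tiktoken) is left out.

    import fitz  # PyMuPDF

    def chapter_spans(pdf_path):
        """Map top-level TOC entries to (title, first_page, last_page) spans."""
        doc = fitz.open(pdf_path)
        toc = [(title, page) for level, title, page in doc.get_toc() if level == 1]
        spans = []
        for i, (title, start) in enumerate(toc):
            # A chapter ends where the next one begins (or at the last page).
            end = toc[i + 1][1] - 1 if i + 1 < len(toc) else doc.page_count
            spans.append((title, start, end))
        return doc, spans

    def chapter_text(doc, start, end):
        # get_toc() page numbers are 1-based, document pages are 0-based.
        return "\n".join(doc[p].get_text() for p in range(start - 1, end))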

Goal

OpenAI chapter summarization based on GPT-3.5 Turbo


Once the text is prepared (extracted and parsed) for summarization, it is time to harness the OpenAI API to gain access to the chosen LLM. In our case, we used GPT-3.5 Turbo.

How to get access to GPT-3.5 Turbo through the OpenAI API (a minimal request sketch follows these steps):

  1. Sign up on the OpenAI website (https://openai.com)
  2. Set up your API credentials, including an API key or token
  3. Integrate the API into your text summarization tool (this involves setting up an HTTP client or SDK so your tool can communicate with the OpenAI API)
  4. Prepare the input for text summarization
  5. Construct API requests to send the input text to the OpenAI API
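A minimal sketch of step 5, assuming the official openai Python SDK; the prompt wording, the temperature, and the function name are illustrative choices, not the project's exact prompt.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize_chunk(chunk):
        """Send one chapter chunk to GPT-3.5 Turbo and return its summary."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You summarize book chapters concisely."},
                {"role": "user", "content": "Summarize the following text:\n\n" + chunk},
            ],
            max_tokens=512,   # cap on the generated completion, as described below
            temperature=0.3,  # keep summaries focused rather than creative
        )
        return response.choices[0].message.content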



After setting up access to GPT-3.5 Turbo, we needed to decide how exactly our text summarization API should work: how long the summaries should be and how in-depth they needed to be.


We used roughly 4k tokens to summarize every chapter chunk: GPT-3.5 Turbo has a limit of 4,096 tokens (approximately 3,000 words) for the combined prompt and the resulting generated completion, and we capped the completion at 512 tokens.


We used tiktoken, the tokenizer built and recommended by OpenAI, to count the tokens.
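A sketch of how counting tokens with tiktoken can drive the chunking; the paragraph-aligned splitting heuristic and the exact budget are assumptions for illustration.

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def count_tokens(text):
        return len(enc.encode(text))

    def split_to_budget(text, budget=3000):
        """Split chapter text into paragraph-aligned chunks under the token budget,
        leaving headroom for the prompt wrapper and the 512-token completion."""
        chunks, current = [], []
        for para in text.split("\n\n"):
            if current and count_tokens("\n\n".join(current + [para])) > budget:
                chunks.append("\n\n".join(current))
                current = []
            current.append(para)
        if current:
            chunks.append("\n\n".join(current))
        return chunks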


Outcome

OpenAI text summarization API - outcome


The last step is to create a summary of each chapter. We used the GPT-4 model (8k context window), as GPT-4 better understands context and distinguishes nuances, resulting in more accurate and coherent responses.
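A sketch of this final pass, reusing the client from the earlier snippet; the prompt is again an illustrative assumption.

    def summarize_chapter(chunk_summaries):
        """Merge the GPT-3.5 Turbo chunk summaries into one coherent chapter summary."""
        joined = "\n\n".join(chunk_summaries)
        response = client.chat.completions.create(
            model="gpt-4",  # 8k context window
            messages=[
                {"role": "system", "content": "You merge partial summaries into one coherent chapter summary."},
                {"role": "user", "content": "Combine these partial summaries into one summary:\n\n" + joined},
            ],
            max_tokens=512,
        )
        return response.choices[0].message.content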


Quiz-like Information Extraction


The general summarization of the eBook is not the final outcome delivered by our solution. We wanted to make sure the user would find it useful, so we designed a feature that generates a quiz-like series of control questions based on the general text summarization. The “quiz” comes in handy when users want to extract the key information from a text and grasp its essence.
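A sketch of how such a quiz generator could be prompted; the question count and the prompt wording are assumptions for illustration.

    def generate_quiz(summary, n_questions=5):
        """Derive quiz-like control questions (with brief answers) from the general summary."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You write short comprehension quizzes."},
                {"role": "user", "content": (
                    "Based on this summary, write " + str(n_questions) + " questions that "
                    "test whether a reader grasped the key points, each with a brief answer:\n\n"
                    + summary
                )},
            ],
            max_tokens=512,
        )
        return response.choices[0].message.content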



Before


Project description

The text summarization API was expected to produce abstractive summaries of any PDF content (including scanned papers, after OCR processing) with a linked Table of Contents (it must contain redirections to individual chapters).

The workflow:

  • Uploading PDF document
  • Getting Text Summarization
  • Information Extraction


After


Addepto approach

The development process was divided into five main steps:

  1. Text extraction with PyMuPDF library
  2. Text parsing
  3. Subchapter summarization based on GPT-3.5 Turbo
  4. General chapters summarization based on GPT-4
  5. Quiz-Like Information Extraction

