Text Summarization with the OpenAI API

Text summarization plays a crucial role in news aggregation, document indexing, information retrieval, and content generation: it helps users quickly grasp the essence of content without reading the entire text, saving time and effort.

Summarization techniques are employed in natural language processing (NLP) and artificial intelligence (AI) systems to facilitate automated processing of large volumes of textual data.

With the rise of Large Language Models (LLMs), summarization tools are set to proliferate in the coming years, and, given the market position of ChatGPT, most of them will likely be based on the GPT models developed by OpenAI.

We took advantage of the OpenAI ecosystem to build a tool that delivers abstractive summarization of any given PDF, with an additional question-driven feature designed to extract specific information from the text.




Case Study Shortcut


Challenge



Handling Diverse Document Formats


Text content can exist in PDFs, Word documents, scanned images, HTML files, and more. Extracting text consistently across these formats is challenging due to varying structures, encodings, and embedded elements like tables or images.


Filtering Noise and Preprocessing Raw Content


Documents often contain irrelevant elements such as headers, footers, watermarks, or metadata. Extracting only meaningful content requires robust preprocessing to remove noise without losing critical information.
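One common way to filter such noise is to drop lines that repeat across most pages, since headers, footers, and watermarks tend to appear on nearly every page while real content does not. The sketch below illustrates that idea; the function name and the 60% threshold are illustrative choices, not the exact preprocessing used in the project.

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines (e.g. headers/footers) that repeat on most pages.

    `pages` is a list of raw page texts; a line appearing on more than
    `threshold` of the pages is treated as boilerplate and dropped.
    """
    line_counts = Counter(
        line.strip()
        for page in pages
        for line in set(page.splitlines())  # count each line once per page
        if line.strip()
    )
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if ln.strip() and line_counts[ln.strip()] <= cutoff]
        cleaned.append("\n".join(kept))
    return cleaned
```

Frequency-based filtering like this preserves one-off content even when it superficially resembles a header, which keeps the risk of losing critical information low.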



Ensuring Clean and Structured Output for Summarization


Summarization algorithms rely on well-structured text: sentences, paragraphs, and semantic cues. Poorly extracted or unstructured text can degrade the quality of summaries, making it essential to maintain textual integrity during extraction.

Goal


Once the text is prepared (extracted and parsed) for summarization, it is time to harness the OpenAI API to gain access to the chosen LLM. In our case, we used GPT-3.5 Turbo.

How to get access to GPT-3.5 Turbo through the OpenAI API:

  1. Sign up on the OpenAI website (https://openai.com)
  2. Set up your API credentials, including an API key or token
  3. Integrate the API into your text summarization tool (this involves setting up an HTTP client or SDK so your tool can communicate with the OpenAI API)
  4. Prepare the input for text summarization
  5. Construct API requests to send the input text to the OpenAI API

  • Enable Access to a Powerful LLM for Summarization

  • Configure and Optimize API Usage for Chapter-Based Content

  • Control Summary Depth and Length Based on Token Constraints
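Steps 4 and 5 above can be sketched as follows. This is a minimal sketch assuming the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` in the environment; the function names and prompt wording are illustrative, not the project's exact code.

```python
def build_summary_messages(chapter_text, question=None):
    """Build the chat messages for a summarization (or question-driven) request."""
    system = "You are a precise assistant that summarizes book chapters."
    if question:
        user = (f"Answer the question based on this chapter.\n\n"
                f"Question: {question}\n\nChapter:\n{chapter_text}")
    else:
        user = f"Summarize the following chapter in a concise paragraph:\n\n{chapter_text}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def summarize(chapter_text, question=None, model="gpt-3.5-turbo"):
    """Send the request to the OpenAI API and return the model's answer."""
    from openai import OpenAI  # requires the openai SDK and OPENAI_API_KEY
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=build_summary_messages(chapter_text, question),
        max_tokens=300,
    )
    return resp.choices[0].message.content
```

The same request builder serves both plain summarization and the question-driven extraction mode, with only the user prompt changing.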

Outcome


To make sure users would find the tool genuinely useful, we designed a feature that generates a quiz-like series of control questions based on the general text summary. The quiz helps users extract the key information from a text and grasp its essence.
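The quiz feature can reuse the same chat-completion pattern, with a prompt that asks for control questions instead of a summary. A minimal sketch, with illustrative function and prompt wording:

```python
def build_quiz_messages(summary, num_questions=5):
    """Ask the model for control questions derived from a chapter summary."""
    user = (
        f"Based on the summary below, write {num_questions} short control "
        "questions that test whether a reader grasped the key points. "
        "Return one question per line.\n\nSummary:\n" + summary
    )
    return [{"role": "system",
             "content": "You generate quiz questions from summaries."},
            {"role": "user", "content": user}]
```

Generating questions from the summary rather than the full text keeps the request small and focuses the quiz on the points the summary already identified as essential.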



Before


The text summarization API was expected to abstract and summarize any PDF content (including paper scans, after OCR processing) that has a linked Table of Contents, i.e. one containing redirections to individual chapters.

The workflow:

  • Uploading PDF document
  • Getting Text Summarization
  • Information Extraction


After


The development process was divided into five main steps:

  • Text extraction with the PyMuPDF library
  • Text parsing
  • Subchapter summarization based on GPT-3.5 Turbo
  • General chapter summarization based on GPT-4
  • Quiz-like information extraction

Integrate those solutions in your company


Contact us below and let us design and integrate solutions tailored to your business needs


Let's talk

Case Study Details


Approach


TOC-Based Parsing Strategy


  • The parsing process begins by accessing and analyzing the Table of Contents (TOC).
  • Titles and page numbers are extracted and structured to form a hierarchical map of the document.
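PyMuPDF exposes the TOC via `Document.get_toc()`, which returns flat `[level, title, page]` triples. The hierarchical map described above can then be built by nesting entries according to their levels; the sketch below shows one way to do this (the function name and dict shape are illustrative):

```python
def build_toc_tree(toc_entries):
    """Nest flat [level, title, page] TOC triples (as returned by
    PyMuPDF's Document.get_toc()) into a hierarchy of dicts."""
    root = {"title": None, "page": None, "children": []}
    stack = [(0, root)]  # (level, node) pairs; root sits at level 0
    for level, title, page in toc_entries:
        node = {"title": title, "page": page, "children": []}
        # Pop back to the nearest ancestor with a smaller level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]
```

The resulting tree mirrors the document's chapter/subchapter structure, which later steps use to decide what to extract and summarize.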

Mapping TOC Entries to Document Sections


  • Extracted TOC entries are matched with the actual content in the document.
  • This establishes a clear correlation between TOC items and corresponding text sections.
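The correlation between TOC items and text sections amounts to turning each entry's start page into a page range: a section runs until the next entry at the same or a higher level begins. A minimal sketch of that mapping (names are illustrative):

```python
def toc_page_ranges(toc_entries, page_count):
    """Map each [level, title, page] TOC entry to its (title, start, end)
    page range, 1-based and inclusive.

    A section runs from its own start page up to the page before the next
    entry at the same or a higher level; the last such section runs to the
    end of the document.
    """
    ranges = []
    for i, (level, title, start) in enumerate(toc_entries):
        end = page_count
        for next_level, _, next_start in toc_entries[i + 1:]:
            if next_level <= level:
                end = next_start - 1
                break
        ranges.append((title, start, end))
    return ranges
```

Note that a chapter's range deliberately includes its subchapters, so both chapter-level and subchapter-level summaries can be driven from the same mapping.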

Targeted Section Extraction


  • With the mapping in place, the parser can selectively extract specific chapters or sections.
  • Extraction is guided by start and end points, often derived from page numbers or content markers in the TOC.
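Given per-page texts (e.g. `[page.get_text() for page in fitz.open(path)]` with PyMuPDF) and a page range from the TOC mapping, selective extraction reduces to a slice. An illustrative helper:

```python
def extract_section(page_texts, start_page, end_page):
    """Join the text of pages start_page..end_page (1-based, inclusive)."""
    return "\n".join(page_texts[start_page - 1:end_page])
```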

Granular Chunking for LLM Input


  • Instead of full chapters, the system extracts manageable chapter chunks, aligning with token limits for GPT-3.5 Turbo.
  • These chunks are optimized for efficient and coherent summarization by the language model.
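Chunking a section to fit the model's context window can be sketched as below. This version uses the rough heuristic of ~4 characters per token and splits on paragraph boundaries so chunks stay coherent; a real tokenizer (e.g. tiktoken) would give exact counts, and the budget values are illustrative:

```python
def chunk_text(text, max_tokens=3000, chars_per_token=4):
    """Split text into paragraph-aligned chunks that fit a token budget."""
    budget = max_tokens * chars_per_token  # rough character budget
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > budget:
            chunks.append(current)   # current chunk is full; start a new one
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Keeping each chunk well under the model's limit leaves token headroom for the prompt and the generated summary itself.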

Take the next step


Schedule an intro call to get to know each other better and understand the way we work


Let's talk

About Addepto



Addepto, a fast-paced, growing company focused on innovations in AI-related and data-oriented areas, supports digital transformation at companies working on electronics manufacturing services.



We help them find ways to use their data effectively with data lakes, data platforms, data engineering and so on.


About us


We are recognized as one of the best AI, BI, and Big Data consultants


We have helped multiple companies achieve their goals, but instead of making hollow marketing claims here, we encourage you to check our Clutch scoring.

Our customers love to work with us
