Client: NDA

Case Study: Text Summarization with the OpenAI API

Case study details


Text summarization is the process of condensing a longer piece of text (an article, document, web page, eBook, or any other piece of content) into a shorter version while preserving its core information, main ideas, and critical points.

There are two kinds of text summarization:

  • Extractive summarization involves identifying and extracting essential sentences or phrases directly from the original text to construct the summary.
  • Abstractive summarization generates a summary by understanding the meaning of the text and producing new sentences that convey the main ideas.


Challenge


Text summarization plays a crucial role in news aggregation, document indexing, information retrieval, and content generation by helping users quickly grasp the essence of content without reading the entire text. It aims to save time and effort.



Approach


Summarization techniques are employed in natural language processing (NLP) and artificial intelligence (AI) systems to facilitate automated processing of large volumes of textual data.

With the emergence of Large Language Models (LLMs), we expect an explosion of summarization tools in the coming years, and – given ChatGPT's position – most of them will likely be based on the GPT models developed by OpenAI.



Goal


Take advantage of the OpenAI ecosystem to build a tool that delivers abstractive summarization of the content covered in any given PDF, with an additional question-driven feature designed to extract specific information from the text.



Challenge

Text extraction


Text extraction serves as a necessary preprocessing step for content summarization: it provides access to the textual content, filters out noise, handles various document formats, and offers flexibility for different summarization techniques.

Extracting the text from a document gives the algorithms direct access to the cleaned textual information that needs to be processed. This ensures that the summarization algorithm can work with the raw material, including sentences, paragraphs, headings, and other textual elements.

The extracted text becomes agnostic to the specific document format and can be analyzed further during content summarization.


Addepto Data Scientists opt for PyMuPDF for PDF text extraction


During the extraction process, we perform a series of operations that involve analyzing the text layout, identifying paragraphs or text blocks, and using bounding boxes to determine the exact position and boundaries of the text elements.


PyMuPDF enables efficient text extraction for OpenAI text summarization


The outcome of this step is a JSON file. It provides a standardized and versatile format for representing and sharing the extracted text and associated metadata, allowing for easy processing, integration, and interoperability.
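As a rough illustration, a minimal PyMuPDF sketch of such an extraction step could look like the following; the file names and the JSON schema are illustrative assumptions, not the exact format used in the project.

    import json

    import fitz  # PyMuPDF

    def extract_blocks(pdf_path, out_path):
        """Extract text blocks with bounding boxes from a PDF and dump them to JSON."""
        doc = fitz.open(pdf_path)
        pages = []
        for page in doc:
            blocks = []
            # "blocks" mode yields (x0, y0, x1, y1, text, block_no, block_type) tuples
            for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
                if block_type != 0:  # keep text blocks only, skip image blocks
                    continue
                blocks.append({"bbox": [x0, y0, x1, y1], "text": text.strip()})
            pages.append({"page": page.number + 1, "blocks": blocks})
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"source": pdf_path, "pages": pages}, f, ensure_ascii=False, indent=2)

    extract_blocks("ebook.pdf", "ebook_blocks.json")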


Approach

Text parsing


Text parsing involves breaking down the text into its constituent parts, such as words, sentences, paragraphs, or other meaningful units, to understand their relationships, roles, and characteristics within the context of the text.


During project development, our team took the following approach:


  • Our solution parses text based on the Table of Contents (TOC). Once the TOC is accessed, the parsing process begins.
  • This involves extracting the titles and page numbers from the TOC, organizing them into a structured format, and potentially creating a hierarchical representation of the document's structure.
  • By matching the extracted titles with the actual content, the parser establishes a mapping between the TOC entries and the corresponding sections in the document.
  • Extracting targeted sections: with the mapping established, the parser can selectively extract the content associated with specific sections or chapters based on user requirements or predefined criteria.
  • This involves locating each section's start and end points, usually based on the page numbers or other indicators provided in the TOC.
  • The outcome: chapter chunks (not entire chapters) sized to fit GPT-3.5 Turbo's context window – see the sketch after this list.
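To make the TOC-driven approach concrete, here is a minimal sketch built on PyMuPDF's get_toc(); the helper names are our own, and the chunking itself (shown later with tiktoken) is left out.

    import fitz  # PyMuPDF

    def chapter_spans(pdf_path):
        """Map top-level TOC entries to (title, first_page, last_page) spans."""
        doc = fitz.open(pdf_path)
        toc = [(title, page) for level, title, page in doc.get_toc() if level == 1]
        spans = []
        for i, (title, start) in enumerate(toc):
            # A chapter ends where the next one begins (or at the last page).
            end = toc[i + 1][1] - 1 if i + 1 < len(toc) else doc.page_count
            spans.append((title, start, end))
        return doc, spans

    def chapter_text(doc, start, end):
        # get_toc() page numbers are 1-based, document pages are 0-based.
        return "\n".join(doc[p].get_text() for p in range(start - 1, end))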

Goal

OpenAI chapter summarization based on GPT-3.5 Turbo


Once the text is prepared (extracted and parsed) for summarization, it is time to harness the OpenAI API to gain access to the chosen LLM. In our case, we used GPT-3.5 Turbo.

How to get access to GPT-3.5 Turbo through the OpenAI API (a minimal request sketch follows these steps):

  1. Sign up on the OpenAI website (https://openai.com)
  2. Set up your API credentials, including an API key or token
  3. Integrate the API into your text summarization tool (this involves setting up an HTTP client or SDK so your tool can communicate with the OpenAI API)
  4. Prepare the input for text summarization
  5. Construct API requests to send the input text to the OpenAI API
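A minimal sketch of step 5, assuming the official openai Python SDK; the prompt wording, the temperature, and the function name are illustrative choices, not the project's exact prompt.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize_chunk(chunk):
        """Send one chapter chunk to GPT-3.5 Turbo and return its summary."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You summarize book chapters concisely."},
                {"role": "user", "content": "Summarize the following text:\n\n" + chunk},
            ],
            max_tokens=512,   # cap on the generated completion, as described below
            temperature=0.3,  # keep summaries focused rather than creative
        )
        return response.choices[0].message.content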



After setting up access to GPT-3.5 Turbo, we needed to decide how exactly our text summarization API should work: how long the summaries should be and how in-depth they needed to be.


We used roughly 4k tokens to summarize every chapter chunk: GPT-3.5 Turbo has a limit of 4,096 tokens (approximately 3,000 words) for the combined prompt and the resulting generated completion, and we capped the completion at 512 tokens.


We used tiktoken, the tokenizer built and recommended by OpenAI, to count the tokens.
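A sketch of how counting tokens with tiktoken can drive the chunking; the paragraph-aligned splitting heuristic and the exact budget are assumptions for illustration.

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def count_tokens(text):
        return len(enc.encode(text))

    def split_to_budget(text, budget=3000):
        """Split chapter text into paragraph-aligned chunks under the token budget,
        leaving headroom for the prompt wrapper and the 512-token completion."""
        chunks, current = [], []
        for para in text.split("\n\n"):
            if current and count_tokens("\n\n".join(current + [para])) > budget:
                chunks.append("\n\n".join(current))
                current = []
            current.append(para)
        if current:
            chunks.append("\n\n".join(current))
        return chunks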


Outcome

OpenAI text summarization API - outcome


The last step is to create a summary of each chapter. We used the GPT-4 model (8k context window), as GPT-4 better understands context and distinguishes nuances, resulting in more accurate and coherent responses.
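A sketch of this final pass, reusing the client from the earlier snippet; the prompt is again an illustrative assumption.

    def summarize_chapter(chunk_summaries):
        """Merge the GPT-3.5 Turbo chunk summaries into one coherent chapter summary."""
        joined = "\n\n".join(chunk_summaries)
        response = client.chat.completions.create(
            model="gpt-4",  # 8k context window
            messages=[
                {"role": "system", "content": "You merge partial summaries into one coherent chapter summary."},
                {"role": "user", "content": "Combine these partial summaries into one summary:\n\n" + joined},
            ],
            max_tokens=512,
        )
        return response.choices[0].message.content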


Quiz-like Information Extraction


The general summarization of the eBook is not the final outcome delivered by our solution. We wanted to make sure the user would find it useful, so we designed a feature that generates a quiz-like series of control questions based on the general text summarization. The “quiz” comes in handy when users want to extract the key information from a text and grasp its essence.
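A sketch of how such a quiz generator could be prompted; the question count and the prompt wording are assumptions for illustration.

    def generate_quiz(summary, n_questions=5):
        """Derive quiz-like control questions (with brief answers) from the general summary."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You write short comprehension quizzes."},
                {"role": "user", "content": (
                    "Based on this summary, write " + str(n_questions) + " questions that "
                    "test whether a reader grasped the key points, each with a brief answer:\n\n"
                    + summary
                )},
            ],
            max_tokens=512,
        )
        return response.choices[0].message.content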



Before


Project description

The text summarization API was expected to produce abstractive summaries of any PDF content (including scanned papers, after OCR processing) with a linked Table of Contents (it must contain redirections to individual chapters).

The workflow:

  • Uploading PDF document
  • Getting Text Summarization
  • Information Extraction


After


Addepto approach

The development process was divided into five main steps:

  1. Text extraction with PyMuPDF library
  2. Text parsing
  3. Subchapter summarization based on GPT-3.5 Turbo
  4. General chapters summarization based on GPT-4
  5. Quiz-Like Information Extraction

