Text summarization is the process of condensing a longer piece of text (an article, document, web page, eBook, or any other piece of content) into a shorter version while preserving its core information, main ideas, and critical points.
There are two different kinds of text summarization: extractive summarization, which selects key sentences verbatim from the source, and abstractive summarization, which generates new sentences that convey the source's main ideas.
Text summarization plays a crucial role in news aggregation, document indexing, information retrieval, and content generation by helping users quickly grasp the essence of content without reading the entire text. It aims to save time and effort.
Summarization techniques are employed in natural language processing (NLP) and artificial intelligence (AI) systems to facilitate automated processing of large volumes of textual data.
With the rise of Large Language Models (LLMs), we will see an explosion of summarization tools in the coming years, and given the market position of ChatGPT, most of them will likely be based on the GPT models developed by OpenAI.
We took advantage of the OpenAI ecosystem to build a tool that delivers abstractive summarization of the contents of any given PDF, with an additional question-driven feature designed to extract specific information from the text.
Text extraction serves as a necessary preprocessing step for content summarization, enabling access to the textual content, filtering noise, preprocessing the text, handling various document formats, and providing flexibility for different summarization techniques.
Extracting the text from a document gives the algorithms direct access to the cleaned textual information that needs to be processed. This ensures that the summarization algorithm can work with the raw material: sentences, paragraphs, headings, and other textual elements.
The extracted text is agnostic to the original document format and can be analyzed directly in content summarization.
During the extraction process, we perform a series of operations that involve analyzing the text layout, identifying paragraphs or text blocks, and using bounding boxes to determine the exact position and boundaries of the text elements.
The outcome of this step is a JSON file. It provides a standardized and versatile format for representing and sharing the extracted text and associated metadata, allowing for easy processing, integration, and interoperability.
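To make this concrete, the snippet below sketches what such a JSON file might look like. The field names (`page`, `bbox`, `type`, `text`) and the overall schema are our assumptions about a plausible layout, not the tool's actual output format.

```python
# Illustrative sketch of the JSON produced by the extraction step.
# The schema (page, bbox, type, text) is an assumed example, not the
# actual output format of the extraction tool.
import json

blocks = [
    {"page": 1,
     "bbox": [72.0, 96.5, 523.0, 140.2],  # x0, y0, x1, y1 in PDF points
     "type": "heading",
     "text": "Chapter 1: Introduction"},
    {"page": 1,
     "bbox": [72.0, 150.0, 523.0, 410.8],
     "type": "paragraph",
     "text": "Text summarization is the process of condensing..."},
]
document = {"source": "ebook.pdf", "blocks": blocks}

# Serialize to JSON for downstream parsing and summarization steps.
print(json.dumps(document, indent=2))
```

Because every block carries its page number and bounding box, downstream steps can reconstruct reading order and map summaries back to their location in the source document.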
Text parsing involves breaking down the text into its constituent parts, such as words, sentences, paragraphs, or other meaningful units, to understand their relationships, roles, and characteristics within the context of the text.
Once the text is prepared (extracted and parsed) for summarization, it is time to harness the OpenAI API to gain access to the chosen LLM. In our case, we used GPT-3.5 Turbo.
How to get access to GPT-3.5 Turbo through the OpenAI API:
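As a sketch of what such access looks like, the snippet below calls the Chat Completions endpoint directly over REST using only the standard library, assuming an `OPENAI_API_KEY` environment variable. The prompt wording and temperature are our illustrative choices, not the production values.

```python
# Hypothetical sketch of calling GPT-3.5 Turbo through the OpenAI
# Chat Completions REST API, using only the standard library.
# Assumes OPENAI_API_KEY is set in the environment.
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(chunk: str) -> dict:
    """Build the Chat Completions payload for summarizing one text chunk."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0.3,  # low temperature keeps the summary faithful
        "messages": [
            {"role": "system",
             "content": "Summarize the text, preserving its main ideas."},
            {"role": "user", "content": chunk},
        ],
    }

def summarize(chunk: str) -> str:
    """Send one chunk to the API and return the generated summary."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(chunk)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

The same pattern works for any chat model: only the `model` field changes when switching, for example, to GPT-4 for the chapter summaries.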
The last step is to create the chapter summaries. We used the GPT-4 model (8K context window), as GPT-4 better understands context and distinguishes nuances, resulting in more accurate and coherent responses.
The general summarization of the eBook is not the final outcome delivered by our solution. We wanted to make sure the user would find it useful, so we designed a feature that generates a quiz-like series of control questions based on the general text summarization. The quiz helps users check that they have extracted the key information from the text and grasped its essence.
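Such a quiz can be driven by a simple prompt built from the summary. The sketch below shows one possible prompt template; the wording and default question count are our assumptions, not the production prompt.

```python
# Hypothetical sketch of the quiz-generation prompt. The wording and
# the default number of questions are illustrative assumptions.
def build_quiz_prompt(summary: str, n_questions: int = 5) -> str:
    """Build a prompt asking the model for control questions on a summary."""
    return (
        f"Based on the summary below, write {n_questions} short control "
        "questions that test whether a reader grasped the key points. "
        "Number them, one per line.\n\nSummary:\n" + summary
    )
```

The resulting prompt can be sent through the same Chat Completions request used for summarization, so the quiz feature adds no new infrastructure.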
Project description
The text summarization API was expected to summarize any PDF content (including paper scans, after OCR processing) with a linked Table of Contents (it must contain redirections to individual chapters).
The workflow:
Addepto approach
The development process was divided into four main steps: