AI
Juan Pablo Balarini • 04 JUN 2024
Text summarization through AI: the power of clustering
Enterprises frequently struggle to manage their documentation repositories, find accurate information, and keep files organized.
Condensing hundreds of pages into a concise summary is tedious work. Today you can hand ChatGPT a 10-page document and ask it to summarize it, but OpenAI charges per token, so costs quickly add up when you have many documents spanning hundreds or thousands of pages. For something the length of a book, it may not be feasible at all.
We’ve faced this difficulty in many of our projects, and this is where AI comes into play: a technique called clustering harnesses AI to distill vast amounts of text into bite-sized summaries, categorize content, and make documentation manageable.
Clustering: how AI helps organize chaos into coherent summaries
Imagine we have 200 pages of text to summarize. Instead of scrutinizing each paragraph one by one, we categorize them into clusters based on their underlying themes. By pinpointing the central idea of each cluster, we effectively capture the essence of multiple paragraphs in one stroke.
Clustering is a family of machine learning algorithms that organizes points into distinct groups, maximizing similarity within each group and dissimilarity between groups. For text, this is achieved by analyzing textual features, such as word frequencies or semantic similarities within each paragraph, but the concept applies to any data points in space. K-means and hierarchical clustering are two common algorithms that make this possible.
Clustering extends naturally to text through embeddings. Embeddings represent text as vectors, capturing its semantic meaning. By leveraging text embeddings, we can cluster documents, identifying similar text fragments within each cluster. This approach enables the organization of textual data based on underlying semantic similarities, enhancing various natural language processing tasks.
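To make this concrete, here is a minimal sketch of clustering short texts through their embeddings. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, which are our illustrative choices; any embedding model works the same way:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# any embedding model can be used; all-MiniLM-L6-v2 is just a small, free example
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The invoice must be paid within 30 days.",
    "Payment is due one month after billing.",
    "Our offices are closed on public holidays.",
]

embeddings = model.encode(texts)  # one vector per text

# two clusters: semantically similar sentences land in the same one
labels = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1]: the two payment sentences share a cluster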
Once the clustering is complete, we can identify the central idea or theme of each cluster (group of paragraphs). This allows us to summarize a large text volume efficiently, as each cluster represents a distinct aspect or topic covered in the document. By focusing on the central idea of each cluster (centroid), we can capture the essence of multiple paragraphs in a condensed form, facilitating easier understanding and analysis of the overall content and allowing different search features to be implemented.
In short, here’s how it works, step by step:
1. Chunking
As in most language processing applications, the first step is to subdivide the large document into smaller text fragments, called chunks. These can then be individually processed, analyzed, or indexed, allowing for more efficient handling and retrieval of information.
2. Vectorization
These fragments, or chunks, are then vectorized into embeddings, capturing the semantic meaning of the text.
3. Clustering
The embeddings are grouped into clusters based on their similarity, using K-Means or a similar algorithm; proximity in the embedding space indicates similarity between the corresponding text fragments.
4. Representative selection
Within each cluster, the most characteristic chunk, usually the one closest to the centroid, is chosen to encapsulate the central idea of that theme.
5. Summarization
These representative chunks are then fed into an LLM, such as GPT, to produce a brief summary.
Example
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
import numpy as np

# any embedding model can be used here; this one is a small, free example
model = SentenceTransformer("all-MiniLM-L6-v2")

def divide_document(document):
    # divide the document into smaller text fragments
    # (a naive split on blank lines; use whichever chunking strategy you prefer)
    return [chunk for chunk in document.split("\n\n") if chunk.strip()]

def embed(chunk):
    # embed the chunk with your favorite embedding model
    return model.encode(chunk)

def cluster_document(embeddings, n_clusters=10):
    # one possible clustering algorithm; n_clusters must not exceed the number of chunks
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    # each label indicates which cluster the corresponding chunk belongs to
    return labels, kmeans.cluster_centers_

def find_most_representatives(chunks, embeddings, labels, centers):
    # find the most representative chunk for each cluster:
    # the closer to the center, the more representative
    representative_elements = []
    # distance between each chunk and the center of its own cluster
    distances = np.linalg.norm(embeddings - centers[labels], axis=1)
    for cluster in np.unique(labels):
        cluster_data = chunks[labels == cluster]
        cluster_data_sorted = cluster_data[np.argsort(distances[labels == cluster])]
        representative_elements.append(cluster_data_sorted[0])
    return representative_elements

document = "Here goes your really long document"
chunks = np.array(divide_document(document))
embeddings = np.array([embed(chunk) for chunk in chunks])
labels, centers = cluster_document(embeddings)
representative_elements = find_most_representatives(chunks, embeddings, labels, centers)
# feed the LLM with the most representative elements to do the summarization
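For that final step, a minimal sketch of the summarization call, assuming OpenAI's Python client and the gpt-4o-mini model (any chat-capable LLM would do):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(representative_elements):
    # concatenate the representative chunks and ask the model for one summary
    joined = "\n\n".join(representative_elements)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following text fragments into a short, coherent summary."},
            {"role": "user", "content": joined},
        ],
    )
    return response.choices[0].message.content

summary = summarize(representative_elements)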
A closer look at clustering: pros and cons
As with any emerging technique, it's crucial to weigh its advantages and disadvantages before incorporating it into a product.
Clustering-based summarization offers several benefits:
Cost-effectiveness
LLM APIs charge per token, so summarization cost scales with text length; clustering cuts costs by sending only the most representative paragraphs to the model.
Efficiency
This technique categorizes text into clusters according to underlying themes, streamlining the summarization process and saving time and effort.
Comprehensiveness
Clustering-based summarization captures the essence of multiple paragraphs within each cluster, providing comprehensive summaries that address various aspects of the text.
Simplification
Complex texts are simplified through clustering-based summarization, condensing them into easily digestible summaries for improved understanding and analysis.
However, because only the representative chunks are summarized, the result may omit minor themes or secondary details, leading to incomplete coverage of the full text.
Implementing clustering in our AI product: DocsHunter
At Eagerworks, we came up against the challenge of summarizing extensive documentation while building DocsHunter. Our product addresses a common issue in large enterprises: navigating thousands or even millions of internal documents to find a specific clause or piece of information. We adopted this approach, inspired by an idea from the online community, and turned it into a practical solution. Here's how it unfolded:
- The concept originated from a Reddit thread that sparked our curiosity. While showcasing the power of online collaboration and idea-sharing, it led us to explore clustering's potential.
- We then tailored the concept to suit our needs, devising a step-by-step process for clustering paragraphs, selecting representatives, and generating summaries.
- By leveraging the capabilities of open Large Language Models (LLMs), we automated the summarization process, making it more efficient and scalable.
We encountered both the benefits and limitations of clustering firsthand during our project.
With a staggering 4 million documents to sift through, equivalent to billions of pages, traditional summarization methods were simply not feasible. While clustering allowed us to condense this vast amount of data, some summaries ended up somewhat incomplete. Despite this drawback, clustering was the most viable option and proved quite effective overall.
Moving forward, this process could be optimized by tuning the number of clusters or by selecting multiple chunks per cluster instead of a single representative, as sketched below.
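A minimal sketch of the second idea, adapting find_most_representatives from the example above to keep the k chunks closest to each centroid (k is a tuning knob, not a value we prescribe):

def find_top_k_representatives(chunks, embeddings, labels, centers, k=3):
    # same idea as before, but keep the k chunks closest to each centroid
    representative_elements = []
    distances = np.linalg.norm(embeddings - centers[labels], axis=1)
    for cluster in np.unique(labels):
        cluster_data = chunks[labels == cluster]
        cluster_data_sorted = cluster_data[np.argsort(distances[labels == cluster])]
        representative_elements.extend(cluster_data_sorted[:k])
    return representative_elements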
Clustering-based summarization leveraging AI
Clustering-based summarization represents a paradigm shift in how we distill information from extensive texts. While offering undeniable benefits in cost and time efficiency, it's essential to remain aware of its limitations regarding details and coverage.
To understand how clustering can complement other methodologies and AI processes, consider RAG (Retrieval-Augmented Generation) as an example. RAG is employed in text summarization tasks to retrieve relevant information from a large corpus, while clustering can help organize this information into coherent themes or topics. By combining these approaches, we can generate more focused and informative summaries that capture the essence of the input text. In practice, we first generate a summary for each document and then run the RAG retrieval over those summaries rather than over the entire documents, pinpointing the ones that closely match the desired content.
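A minimal sketch of that retrieval step, reusing the embed helper from the example above (the summaries list and query string are hypothetical inputs):

def retrieve_top_documents(summaries, query, top_n=5):
    # embed the per-document summaries and the user query
    summary_vectors = np.array([embed(summary) for summary in summaries])
    query_vector = embed(query)
    # cosine similarity between the query and each summary
    similarities = summary_vectors @ query_vector / (
        np.linalg.norm(summary_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    # indices of the documents whose summaries best match the query
    return np.argsort(similarities)[::-1][:top_n]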
As we continue to explore new frontiers in AI-driven text analysis, this technique is one more tool that helps us develop tailored solutions for our partners.
Feel free to dive into our blog to learn more about our experiences with AI and other software advancements.