A Streamlit app that utilizes a pre-trained DistilBERT model and K-Means clustering to generate concise summaries from text data in various formats, including PDFs, images, and plain text.

Nawap1/Nepali_Extractive_Summarizer

Nepali_Extractive_Summarizer

This project is a Streamlit application that uses a pre-trained DistilBERT model and K-Means clustering to extract concise summaries from Nepali text supplied as PDF files, images, or plain text. It combines sentence embeddings for summarization with text extraction and optical character recognition (OCR) for non-text input.

At the core of the application is a DistilBERT model pre-trained on a large corpus of Nepali text. The model produces contextualized embeddings that capture semantic and syntactic relationships in the input text. The sentence-level embeddings derived from them are the input to the summarization step.
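One common way to turn DistilBERT's per-token outputs into a single vector per sentence is mask-aware mean pooling. The sketch below is an illustration under assumptions, not the repository's code: the helper names `mean_pool` and `embed_sentences` are hypothetical, the pooling strategy is a guess, and `model_name` must be supplied by the caller because the README does not name the Nepali checkpoint it uses.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of each sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                    # avoid division by zero
    return summed / counts


def embed_sentences(sentences: list[str], model_name: str) -> np.ndarray:
    """Embed sentences with a Hugging Face checkpoint chosen by the caller.

    model_name is an assumption-free placeholder: pass whichever Nepali
    DistilBERT checkpoint the application is configured with.
    """
    # Heavy dependencies are imported lazily so mean_pool works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq, hidden)
    return mean_pool(hidden.numpy(), inputs["attention_mask"].numpy())
```

Mean pooling is only one option; taking the first token's vector is another. Either way, the result is one fixed-length vector per sentence that a clustering algorithm can consume.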

To extract the most representative sentences from the input text, the application uses K-Means clustering, a widely used unsupervised machine learning algorithm. The sentence embeddings produced by the DistilBERT model are grouped into clusters. K-Means itself takes the number of clusters as an input, so the application chooses that number automatically, and it determines how many sentences the summary contains. From each cluster, the sentence closest to the centroid is selected, and together these sentences form a concise summary of the original text.
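The centroid-based selection described above can be sketched with scikit-learn and NumPy. This is a minimal illustration, not the repository's implementation: `select_summary_indices` and `summarize` are hypothetical names, and the number of clusters is passed in explicitly because the README does not say how the application picks it.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_summary_indices(embeddings: np.ndarray, n_clusters: int, seed: int = 0) -> list[int]:
    """Return the index of the sentence nearest each K-Means centroid, in document order."""
    n_clusters = min(n_clusters, len(embeddings))  # cannot have more clusters than sentences
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)

    chosen = []
    for cluster_id, centroid in enumerate(km.cluster_centers_):
        members = np.flatnonzero(km.labels_ == cluster_id)
        if members.size == 0:  # degenerate case, e.g. many duplicate sentences
            continue
        distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
        chosen.append(int(members[np.argmin(distances)]))
    return sorted(chosen)


def summarize(sentences: list[str], embeddings: np.ndarray, n_clusters: int) -> str:
    """Join the selected sentences, preserving their original order."""
    return " ".join(sentences[i] for i in select_summary_indices(embeddings, n_clusters))
```

Sorting the selected indices keeps the summary in the same order as the source text, which usually reads more naturally than cluster order.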

The application accepts text from three kinds of input. Uploaded PDF files are processed with the PyMuPDF library to extract their text content. For image files, the application uses the EasyOCR library, an optical character recognition (OCR) engine, to extract the text they contain. Users can also enter plain text directly for summarization.
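The three input paths could be dispatched roughly as follows. This is a sketch under assumptions rather than the app's actual code: the function name `extract_text`, the `kind` labels, and the EasyOCR language list `["ne", "en"]` are all illustrative choices, and the PyMuPDF and EasyOCR imports are deferred so the plain-text path has no extra dependencies.

```python
def extract_text(data: bytes, kind: str) -> str:
    """Return the text contained in an uploaded file.

    kind is a hypothetical label: "text", "pdf", or "image".
    """
    if kind == "text":
        return data.decode("utf-8")

    if kind == "pdf":
        import fitz  # PyMuPDF

        # Open the PDF from memory and concatenate the text of every page.
        with fitz.open(stream=data, filetype="pdf") as doc:
            return "\n".join(page.get_text() for page in doc)

    if kind == "image":
        import easyocr

        # Assumed language list: Nepali plus English. Building the reader
        # loads OCR models, so a real app would create it once and reuse it.
        reader = easyocr.Reader(["ne", "en"])
        # detail=0 returns only the recognized strings, without boxes or scores.
        return "\n".join(reader.readtext(data, detail=0))

    raise ValueError(f"unsupported input kind: {kind!r}")
```

Note that `page.get_text()` only returns text embedded in the PDF; a scanned PDF without a text layer would need to go through the OCR path instead.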

The Streamlit framework provides the user interface, where users submit their input and view the summarized text alongside the original. The project shows how a pre-trained language model, an unsupervised learning technique, and a lightweight web framework can be combined for extractive text summarization.

Run Streamlit app

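To run the app, run the following command in your terminal from the directory that contains Summarizer_App.py. The path below uses Windows syntax; on macOS or Linux, use ./Summarizer_App.py instead.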

streamlit run .\Summarizer_App.py
