This project is a Streamlit application that uses a pre-trained DistilBERT model and K-Means clustering to produce extractive summaries of text supplied as PDF files, plain text, or images. It combines natural language processing (NLP) for summarization with optical character recognition (OCR) for reading text from images.
At the core of the application is a DistilBERT model pre-trained on a large corpus of Nepali text. The model generates contextualized word embeddings that capture the semantic and syntactic relationships within the input text. These embeddings are the basis for the summarization step.
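The description mentions word embeddings here and sentence embeddings in the clustering step. One common way to get from one to the other is mean pooling over the token vectors of each sentence. The sketch below assumes that approach; the project may pool differently. The function name `mean_pool` is hypothetical, and the inputs stand in for the model's last hidden state and the tokenizer's attention mask.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    (Hypothetical helper; the pooling used by the app is an assumption.)
    """
    # Broadcast the mask over the hidden dimension so padded tokens contribute zero.
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts
```

Each row of the result is a single fixed-size vector per sentence, which is the shape K-Means expects.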
To extract the most representative sentences from the input text, the application uses K-Means clustering, a widely used unsupervised machine learning algorithm. The sentence embeddings produced by the DistilBERT model are grouped into clusters. K-Means does not determine the number of clusters itself; that number is supplied to the algorithm and sets the length of the summary. From each cluster, the sentence closest to the centroid is selected, so the summary contains one representative sentence per cluster.
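The selection step can be sketched with scikit-learn as follows. This is a minimal illustration, not the application's actual code: the function name `select_summary_sentences`, the `n_clusters` parameter, and the choice to return sentences in their original order are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def select_summary_sentences(sentences, embeddings, n_clusters, random_state=0):
    """Pick the sentence nearest each K-Means centroid (hypothetical helper)."""
    if not sentences:
        return []
    embeddings = np.asarray(embeddings, dtype=float)
    # K-Means cannot produce more clusters than there are sentences.
    n_clusters = min(n_clusters, len(sentences))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    kmeans.fit(embeddings)
    # For every centroid, find the index of the closest sentence embedding.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    # Restore document order (an assumption) and drop duplicate indices.
    chosen = sorted(set(closest.tolist()))
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices keeps the summary in the order the sentences appeared in the source, which tends to read more naturally than cluster order.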
The application accepts text from several sources. Uploaded PDF files are processed with the PyMuPDF library to extract their text content. For image files, the EasyOCR library, an optical character recognition (OCR) engine, extracts the text from the image. Users can also enter plain text directly for summarization.
The Streamlit framework provides the user interface, where users can view the summarized text alongside the original input. The project combines a pre-trained language model, unsupervised clustering, and a web front end to summarize text from several input formats.
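To run the app, run the following command in your terminal from the project directory. The path uses Windows syntax; on macOS or Linux, use ./Summarizer_App.py instead.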
streamlit run .\Summarizer_App.py