This Python script extracts text from documents in various formats and splits it into segments using the unstructured-io library.
- Extracts text from PDF, MOBI, EPUB, and DJVU files.
- Splits documents into segments based on specified settings.
- Saves extracted segments as JSON files.
- Python 3.x
- unstructured-io (with the all-docs package)
- Install dependencies:
  pip install "unstructured[all-docs]"
- Clone or download the repository.
- Run the script:
  python bulk_text_extractor.py <directory>
  - Replace <directory> with the path to your documents directory.
- The script will process each document and save extracted segments in a subdirectory within the specified directory.
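The flow described above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of bulk_text_extractor.py: the function names, the "segments" subdirectory name, and the one-JSON-file-per-document layout are all assumptions made for the example. It relies on partition and chunk_by_title from the unstructured library, and it does not attempt any format conversion the real script may perform.

```python
import json
from pathlib import Path

# Extensions listed in this README; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {".pdf", ".mobi", ".epub", ".djvu"}


def find_documents(directory):
    """Return supported files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def extract_segments(path):
    """Partition one document and chunk it into a list of segment dicts."""
    # Imported lazily so the helpers here work without unstructured installed.
    from unstructured.partition.auto import partition
    from unstructured.chunking.title import chunk_by_title

    elements = partition(filename=str(path))
    chunks = chunk_by_title(elements)
    return [chunk.to_dict() for chunk in chunks]


def save_segments(segments, source_path, output_dir):
    """Write segments to <output_dir>/<file name>.json and return that path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original extension in the name so book.pdf and book.epub
    # do not overwrite each other.
    out_path = output_dir / (Path(source_path).name + ".json")
    out_path.write_text(
        json.dumps(segments, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


def process_directory(directory, extract=extract_segments):
    """Extract every supported document and save the results in a subdirectory."""
    output_dir = Path(directory) / "segments"  # hypothetical subdirectory name
    return [
        save_segments(extract(doc), doc, output_dir)
        for doc in find_documents(directory)
    ]
```

Under these assumptions, calling process_directory("/path/to/documents") would write one JSON file per supported document into /path/to/documents/segments. The extract parameter is there only so the extraction step can be swapped out, for example when testing without the library installed.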
- You can modify the BulkTextExtract class to customize settings such as chunking strategy and page break handling.
- Refer to the unstructured-io documentation for more advanced functionality.
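As an illustration of what adjusting the chunking settings might look like, here is a small sketch. The ChunkSettings class and chunk_with_settings helper are hypothetical and are not taken from BulkTextExtract; the field names mirror keyword arguments accepted by chunk_by_title in the unstructured library, while the default values and the validation rules are simply choices made for this example.

```python
from dataclasses import dataclass, asdict


@dataclass
class ChunkSettings:
    """Hypothetical settings holder; not part of BulkTextExtract itself."""

    max_characters: int = 1000             # hard upper limit on segment length
    new_after_n_chars: int = 800           # soft limit: prefer starting a new segment after this
    combine_text_under_n_chars: int = 200  # merge small sections below this size
    multipage_sections: bool = True        # allow a segment to span page breaks

    def validate(self):
        """Sanity checks chosen for this example, not enforced by the README's script."""
        if self.max_characters <= 0:
            raise ValueError("max_characters must be positive")
        if self.new_after_n_chars > self.max_characters:
            raise ValueError("new_after_n_chars must not exceed max_characters")


def chunk_with_settings(elements, settings):
    """Chunk already-partitioned elements using the given settings."""
    # Imported lazily so ChunkSettings can be used without unstructured installed.
    from unstructured.chunking.title import chunk_by_title

    settings.validate()
    return chunk_by_title(elements, **asdict(settings))
```

Keeping the settings in one small object means a different chunk size can be tried by constructing, say, ChunkSettings(max_characters=500, new_after_n_chars=400) without touching the extraction code.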
MIT License