GitHub - FerrenF/bulkTextPartition: Uses unstructured.io library to batch process documents in a directory specified by command line

Bulk Text Extractor

This Python script extracts text from documents in several formats and splits it into segments using the unstructured-io library.

Features

Extracts text from PDF, MOBI, EPUB, and DJVU files.
Splits documents into segments based on specified settings.
Saves extracted segments as JSON files.

Requirements

Python 3.x
unstructured (the unstructured-io library), installed with the all-docs extra

Usage

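1. Install dependencies: pip install "unstructured[all-docs]" (the quotes keep shells such as zsh from interpreting the square brackets).
2. Clone or download the repository.
3. Run the script: python bulk_text_extractor.py <directory>
   - Replace <directory> with the path to your documents directory.
   - The repository's top-level Python script is listed as main.py, so use that name if bulk_text_extractor.py is not present.
4. The script processes each document and saves the extracted segments in a subdirectory of the specified directory.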
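
The batch step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names, the "segments" subdirectory name, and the one-JSON-file-per-document layout are all assumptions made for the example. The segmenting function is passed in as a parameter so the loop itself needs only the standard library.

```python
import json
from pathlib import Path

# Extensions taken from the Features list above.
SUPPORTED = {".pdf", ".mobi", ".epub", ".djvu"}


def find_documents(directory):
    """Return the supported files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def extract_directory(directory, segment_fn, out_name="segments"):
    """Run `segment_fn` on each document and write one JSON file per document.

    `segment_fn(path)` must return a list of JSON-serializable segments.
    `out_name` is a hypothetical subdirectory name chosen for this sketch.
    """
    out_dir = Path(directory) / out_name
    out_dir.mkdir(exist_ok=True)
    written = []
    for doc in find_documents(directory):
        segments = segment_fn(doc)
        # Keep the original extension in the name so a.pdf and a.epub do not collide.
        target = out_dir / (doc.name + ".json")
        target.write_text(json.dumps(segments, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Calling extract_directory("books", my_segmenter) would write books/segments/<file name>.json for every supported file in books, where my_segmenter stands for any function that turns a path into a list of segments.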

Options

Edit the BulkTextExtract class to change settings such as the chunking strategy and page-break handling.
Refer to the unstructured-io documentation for more advanced functionality.
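
As a rough illustration of the kind of settings involved, the sketch below groups a few options and forwards them to the unstructured library's partition function. It is not the BulkTextExtract class itself: ChunkSettings and segment_file are hypothetical names, and the default values are placeholders. The field names are intended to match keyword arguments that partition accepts (chunking_strategy, max_characters, include_page_breaks); check them against the version of unstructured you have installed.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChunkSettings:
    """Hypothetical settings holder; fields mirror `partition` keyword arguments."""
    chunking_strategy: str = "by_title"   # group elements under their section titles
    max_characters: int = 1500            # illustrative upper bound on chunk size
    include_page_breaks: bool = False     # whether page-break elements are kept


def segment_file(path, settings=None):
    """Partition one document and return its chunks as plain dictionaries."""
    # Imported here so ChunkSettings can be used without the library installed.
    from unstructured.partition.auto import partition

    settings = settings or ChunkSettings()
    elements = partition(filename=str(path), **asdict(settings))
    return [element.to_dict() for element in elements]
```

For example, segment_file("book.epub", ChunkSettings(max_characters=800)) would request smaller chunks, assuming the installed library supports these arguments.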

License

MIT License

Folders and files

Commit history: 18 commits

- dpsprep
- .gitignore
- debug.log
- main.py
- progress.json
- readme.md
- requirements.txt
