BiblioForge is a versatile, cross-platform command-line tool designed to extract text from a wide array of document formats and intelligently organize them. It leverages multiple extraction engines, OCR capabilities, and AI-powered metadata analysis to sort and rename your documents into a structured library.
Status: This project is under active development. While functional, expect ongoing improvements and potential changes.
- Multi-Format Text Extraction: Supports PDF, EPUB, DJVU, MOBI, PowerPoint (PPTX), plain-text formats (TXT, MD), HTML, and many office and ebook formats (DOCX, DOC, RTF, ODT, FB2, PDB, LIT, LRF, CBZ, CBR, CHM, SNB, TCR via Calibre).
- Smart Fallbacks: Employs multiple extraction methods for each format, ensuring the best possible text recovery.
- OCR for Scanned Documents: Integrated OCR engines (Tesseract, PaddleOCR, EasyOCR, DocTR, Kraken) to process image-based PDFs and other scanned documents.
- Table Extraction: Capable of extracting tabular data from PDF files using Camelot.
- AI-Powered Sorting & Renaming:
  - Utilizes Large Language Models (LLMs) to analyze content and extract key metadata (author, title, year, language).
  - Generates shell scripts (`rename_commands.sh`, or `.bat` on Windows) to organize files into an `Author/Year Title.ext` structure.
  - Option to automatically execute the rename script.
- Flexible LLM Integration: Supports local LLMs via Ollama (recommended for privacy), LlamaCPP (direct GGUF loading), local OpenAI-compatible servers, and various cloud-based providers.
- Customizable Processing: Offers fine-grained control over extraction methods, OCR engines, file types, and more.
- Cross-Platform: Designed to run on Windows, macOS, and Linux.
Python 3.8+ is recommended.
Create a virtual environment (recommended):
```shell
python -m venv biblioforge_env

# On macOS/Linux:
source biblioforge_env/bin/activate

# On Windows (cmd):
# biblioforge_env\Scripts\activate.bat

# On Windows (PowerShell):
# biblioforge_env\Scripts\Activate.ps1
```

Install the required Python packages using pip:
```shell
# Core extraction dependencies
pip install pymupdf pdfplumber pypdf pdfminer.six pytesseract pdf2image tqdm \
    ebooklib beautifulsoup4 html2text mobi chardet ftfy lxml requests \
    python-pptx

# LLM providers
pip install ollama openai httpx groq cohere huggingface_hub

# Table extraction (optional but recommended)
pip install "camelot-py[cv]"

# OCR engines (choose based on your needs)
pip install easyocr paddleocr python-doctr ocrmypdf

# Advanced OCR (can be complex to install)
pip install kraken

# Local LLM support (optional)
pip install llama-cpp-python  # May require specific build configuration for GPU support
```

Important Notes:
- Some packages like `paddleocr`, `easyocr`, and `llama-cpp-python` can be complex to install and may require system-level dependencies (CUDA, specific build tools, etc.).
- If you encounter installation issues, consider installing packages individually and consulting their specific documentation.
- For `llama-cpp-python` with GPU support, you may need to install with specific CMAKE arguments or use pre-built wheels.
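As an illustration of those CMAKE arguments: the flag names below come from the llama-cpp-python project's own install notes, but they have changed between releases, so treat this as a sketch and check the current documentation for your version and backend before copying.

```shell
# CUDA build (recent llama-cpp-python releases; older ones used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Metal build on Apple Silicon
CMAKE_ARGS="-DGGML_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```

`--no-cache-dir` forces pip to rebuild the wheel rather than reuse a previously compiled CPU-only build.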
Certain functionalities, especially OCR and some format conversions, rely on external system tools.
Essential Recommendation:
- ebook-converter: For robust conversion of many formats (DOCX, RTF, etc.) to text. Install the ebook-converter application (based on the Calibre codebase). BiblioForge will use the `ebook-converter` command-line tool if it is in your system's PATH.
macOS (Homebrew):

```shell
brew install tesseract poppler ghostscript djvulibre calibre
```

Ensure Tesseract data files for your desired languages are installed:

```shell
brew install tesseract-lang
```

Debian/Ubuntu:

```shell
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-all poppler-utils ghostscript \
    djvulibre-bin calibre libgl1-mesa-glx libglib2.0-0
```

Windows: manual installation of the following is typically required. Ensure they are added to your system's PATH.
- Tesseract OCR: Download installer from UB-Mannheim Tesseract builds. Install language data during setup.
- Poppler for Windows: Download binaries from Poppler for Windows. Add the `bin/` directory to PATH.
- Ghostscript: Download installer from Ghostscript releases. Add its `bin/` and `lib/` directories to PATH.
- DjVuLibre for Windows: Available from the DjView4 distribution, which bundles the command-line tools.
- Calibre: Download from calibre-ebook.com. Ensure `ebook-converter.exe` is in PATH.
Choose one of the following options:
- Install Ollama: Follow instructions at ollama.ai.
- Pull a Model: A smaller, faster model is often sufficient for metadata tasks.
  ```shell
  ollama pull llama3.2:3b   # Fast and efficient for metadata extraction
  # or other options:
  ollama pull qwen2:1.5b    # Very fast
  ollama pull phi3:mini     # Microsoft's efficient model
  ```
- Ensure Ollama Server is Running: It usually runs as a background service after installation. If not, start it manually with `ollama serve`.
Download and use GGUF models directly from HuggingFace:
```shell
python BiblioForge.py --sort --llm-provider=llama_cpp \
    --llamacpp-repo-id="microsoft/Phi-3-mini-4k-instruct-gguf" \
    --llamacpp-gguf-filename="Phi-3-mini-4k-instruct-q4.gguf" \
    documents/
```

For LM Studio, text-generation-webui, or similar local servers:
```shell
python BiblioForge.py --sort --llm-provider=local_openai \
    --local-openai-base-url="http://localhost:1234/v1" \
    --llm-model="local-model" \
    documents/
```

Set the appropriate environment variable for your chosen provider:
```shell
# For OpenAI (e.g., GPT-4, GPT-3.5-turbo)
export OPENAI_API_KEY="your-openai-api-key"

# For Groq (fast LLM inference)
export GROQ_API_KEY="your-groq-api-key"

# For Cohere
export COHERE_API_KEY="your-cohere-api-key"

# For GLHF.chat (HuggingFace models via OpenAI-compatible API)
export GLHF_API_KEY="your-glhf-api-key"

# For HuggingFace Inference API
export HF_API_KEY="your-huggingface-api-key"

# For Poe.com API
export POE_API_KEY="your-poe-api-key"
```

On Windows, use `set` or `setx` in Command Prompt, or `$Env:VAR_NAME = "value"` in PowerShell.
```shell
python BiblioForge.py [options] file_or_pattern1 [file_or_pattern2 ...]
```

```shell
# Extract text from a single PDF to the current directory
python BiblioForge.py document.pdf

# Extract from all PDFs in the current directory to an 'output' subdirectory
python BiblioForge.py -o output/ *.pdf

# Specify a preferred PDF extraction method
python BiblioForge.py --method=pymupdf document.pdf
```

```shell
# Process all supported files recursively in 'my_library'
python BiblioForge.py -r my_library/

# Process only PDF and EPUB files recursively
python BiblioForge.py --file-types="pdf,epub" -r my_library/

# Process PowerPoint files
python BiblioForge.py --file-types="pptx" presentations/
```

```shell
# Force OCR using Tesseract on a scanned PDF
python BiblioForge.py --force-ocr --ocr-method=tesseract scanned_document.pdf

# Use PaddleOCR for better multilingual support
python BiblioForge.py --force-ocr --ocr-method=paddleocr multilingual_doc.pdf
```

```shell
# Analyze PDFs with Ollama, generate a rename script to sort them
python BiblioForge.py --sort --noskip --output-dir ./organized_docs/ *.pdf

# Use a specific cloud provider and execute renames immediately
python BiblioForge.py --sort --llm-provider=groq --execute-rename \
    --output-dir ./groq_sorted/ documents/*.epub

# Use local LlamaCPP with a custom model
python BiblioForge.py --sort --llm-provider=llama_cpp \
    --llamacpp-repo-id="TheBloke/phi-2-GGUF" \
    --llamacpp-gguf-filename="phi-2.Q4_K_M.gguf" \
    --temperature=0.2 documents/

# Use a local OpenAI-compatible server (LM Studio, etc.)
python BiblioForge.py --sort --llm-provider=local_openai \
    --local-openai-base-url="http://localhost:1234/v1" \
    academic_papers/
```

Important Notes:
- `--noskip` is recommended with `--sort` to ensure all files are processed for metadata, even if .txt files already exist.
- `--output-dir` with `--sort` specifies where the new `Author/Year Title.ext` structure will be created.
```shell
# Extract tables from financial reports
python BiblioForge.py --tables financial_report.pdf

# Save results including tables to JSON
python BiblioForge.py --tables --json=results.json quarterly_reports/*.pdf
```

```shell
# See detailed logs for troubleshooting
python BiblioForge.py --debug document.pdf

# Debug with a specific extraction method
python BiblioForge.py --debug --method=pdfminer --ocr-method=tesseract problematic.pdf
```

| Argument | Short | Description | Default |
|---|---|---|---|
| `files` | | Input files or patterns to process (e.g., `*.pdf`, `"docs/*.epub"`). | (None; scans current directory) |
| `--output-dir` | `-o` | Base directory for extracted .txt files and sorted file structure. | `.` (current directory) |
| `--method` | `-m` | Preferred primary extraction method (varies by file type). | (auto-detection) |
| `--ocr-method` | | Preferred OCR method: `auto`, `tesseract`, `paddleocr`, `doctr`, `easyocr`, `kraken`, `kraken_cli`. | `auto` |
| `--force-ocr` | | Force OCR for all pages, even if a text layer is detected. | (False) |
| `--recursive` | `-r` | Process files recursively in subdirectories. | (False) |
| `--password` | `-p` | Password for encrypted documents. | (None) |
| `--tables` | `-t` | Attempt to extract tables (primarily for PDF files using Camelot). | (False) |
| `--json` | `-j` | Save detailed processing results to a JSON file. Specify the file path. | (None) |
| `--workers` | `-w` | Maximum number of worker threads for parallel processing. | (auto-detected) |
| `--noskip` | | Re-process files even if .txt output already exists. Essential for `--sort`. | (False) |
| `--file-types` | | Comma-separated list of file extensions (e.g., `pdf,epub,pptx`). | (All supported) |
| Argument | Description | Default |
|---|---|---|
| `--sort` | Enable LLM-based metadata extraction and sorting. | (False) |
| `--rename-script` | Filename for the generated rename script (relative to `--output-dir`). | `rename_commands.sh` |
| `--execute-rename` | Automatically execute the generated rename script after processing. | (False) |
| `--llm-provider` | LLM provider: `ollama`, `groq`, `cohere`, `openai`, `glhf`, `huggingface`, `poe`, `local_openai`, `llama_cpp`. | `ollama` |
| `--llm-model` | Specific model name for the chosen provider. | (Provider default) |
| `--api-key` | API key for cloud-based providers (if not set as an environment variable). | (None) |
| `--temperature` | LLM temperature for metadata extraction (0.0-2.0). | `0.7` |
| `--max-tokens` | LLM max tokens for metadata extraction. | `300` |
| Argument | Description | Default |
|---|---|---|
| `--ollama-host` | Host for the Ollama server (e.g., `http://localhost:11434`). | (Library default) |
| `--local-openai-base-url` | Base URL for local OpenAI-compatible servers. | `http://localhost:1234/v1/` |
| `--llamacpp-repo-id` | HuggingFace repo ID for the LlamaCPP GGUF model. | `TheBloke/phi-2-GGUF` |
| `--llamacpp-gguf-filename` | Specific GGUF filename from the HF repo. | `phi-2.Q4_K_M.gguf` |
| `--llamacpp-n-ctx` | Context size for LlamaCPP. | `2048` |
| `--llamacpp-n-gpu-layers` | Number of layers to offload to GPU (`-1` for all, `0` for CPU). | `0` |
| `--llamacpp-chat-format` | Chat format for LlamaCPP (e.g., `llama-2`, `chatml`). | `llama-2` |
| Argument | Short | Description | Default |
|---|---|---|---|
| `--verbose` | `-v` | Increase verbosity: `-v` for INFO, `-vv` for DEBUG. | (WARNING level) |
| `--debug` | `-d` | Enable debug logging (shortcut for `-vv`). | (False) |
BiblioForge tries methods in a preferred order, falling back if one fails.
PDF:
- Text Layer: `pymupdf`, `pdfplumber`, `calibre`, `pypdf`, `pdfminer`
- OCR: `tesseract`, `easyocr`, `paddleocr`, `doctr`, `kraken`, `kraken_cli`
- Tables: `camelot` (with lattice and stream methods)

EPUB:
- Methods: `ebooklib`, `bs4` (with html2text), `epub2txt`, `calibre`, `zipfile`

DJVU:
- Methods: `djvulibre` (Python bindings or `djvutxt` CLI), `pdf_conversion` (via `ddjvu`, then PDF extraction), `ocr` (via `ddjvu` to images, then Tesseract)

MOBI:
- Methods: `mobi`, `kindleunpack`, `calibre`, `zipfile`

PowerPoint (PPTX):
- Methods: `pptx` (python-pptx library), `calibre`

HTML:
- Methods: `bs4`, `html2text`, `lxml`, `regex`

Other formats:
- Plain Text (TXT/MD): `direct`, `charset_detection`, `encoding_detection`, `calibre`
- Office Formats (DOCX, DOC, RTF, ODT, FB2, PDB, LIT, LRF, CBZ, CBR, CHM, SNB, TCR): `calibre` (primary), `charset_detection`, `encoding_detection`, `direct`
When using `--sort`, files are organized into the `--output-dir` as follows:

```
<output_dir>/
├── <Author Name 1>/
│   ├── <Year> <Title>.<original_ext>
│   └── <Year> <Title>.txt (extracted text)
├── <Author Name 2>/
│   ├── <Year> <Another Title>.<original_ext>
│   └── <Year> <Another Title>.txt
├── rename_commands.sh (or .bat on Windows)
└── unparseables.lst (files with failed metadata extraction)
```
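To make the generated script concrete, here is a hedged sketch of the kind of commands it emits. The filename, author, year, and title below are invented for the demo; the real script uses metadata extracted by the LLM, and the exact flags it passes to `mv` may differ.

```shell
# Illustrative commands of the kind rename_commands.sh contains
touch scan0042.pdf scan0042.txt   # stand-ins for an input file and its extracted text
mkdir -p "Jane Doe"               # one directory per detected author
# -n (no-clobber) avoids silently overwriting an existing file with the same name
mv -n scan0042.pdf "Jane Doe/2019 Deep Learning Basics.pdf"
mv -n scan0042.txt "Jane Doe/2019 Deep Learning Basics.txt"
```

Note that author and title are embedded in quoted paths, which is why the script must quote every target name.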
BiblioForge supports the following file extensions:
Documents: .pdf, .epub, .djvu, .djv, .mobi, .azw, .azw3, .azw4
Text: .txt, .text, .md
Web: .html, .htm, .xhtml
Office: .docx, .doc, .rtf, .odt
Presentations: .pptx, .ppt
Other Ebook: .fb2, .pdb, .lit, .lrf, .cbz, .cbr, .chm, .snb, .tcr
Dependencies: Double-check that all Python and system dependencies are correctly installed and accessible in your system's PATH.
LLM Issues:
- For Ollama, ensure the server is running (`ollama serve`) and the model is pulled.
- For cloud providers, verify your API key and account status/credits.
- Try a different model or provider if one is consistently failing.
- For LlamaCPP, ensure you have sufficient RAM/VRAM for the model.
OCR Problems:
- Verify Tesseract is installed and language data is available.
- For GPU-based OCR (PaddleOCR, EasyOCR), ensure CUDA is properly configured.
- Large documents may require significant memory; consider processing smaller batches.
File Access:
- Ensure BiblioForge has read/write permissions for input/output directories.
- Check that encrypted PDFs have the correct password provided via `-p`.
Use detailed logging for troubleshooting:
```shell
python BiblioForge.py --debug --sort your_file.pdf > debug_output.log 2>&1
```

- Use `--workers` to control parallelism (fewer workers for memory-constrained systems).
- For large document sets, consider processing in smaller batches.
- Local LLMs (Ollama, LlamaCPP) are generally faster than cloud APIs for batch processing.
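One way to apply the batching advice, assuming a POSIX shell with `find` and `xargs` (the batch size of 200 is an arbitrary starting point, and `my_library`/`output/` are example paths):

```shell
# Feed PDFs to BiblioForge in batches of 200 files to bound peak memory use;
# -print0 / -0 keeps filenames with spaces intact
find my_library -name '*.pdf' -print0 | xargs -0 -n 200 python BiblioForge.py -o output/
```

Because `--noskip` is off here, re-running the same pipeline after an interruption skips files that already have .txt output.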
Contributions, bug reports, and feature requests are welcome! Please open an issue or pull request on the GitHub repository.
This project is licensed under the MIT License - see the LICENSE file for details.
BiblioForge relies on numerous excellent open-source libraries and tools. Credit and thanks to all their developers, including but not limited to:
- PyMuPDF, pdfplumber, pypdf, pdfminer.six for PDF processing
- ebooklib for EPUB handling
- Tesseract, PaddleOCR, EasyOCR, DocTR, Kraken for OCR capabilities
- Camelot for table extraction
- python-pptx for PowerPoint processing
- BeautifulSoup, lxml for HTML/XML parsing
- Ollama, OpenAI, and other LLM providers for metadata extraction
- Calibre project for universal document conversion