Runs on your computer! Quickly reads PDFs and extracts text using Google Tesseract

KSEGIT/QuickPdfOcr

QuickPdfOcr

A simple and intuitive PDF OCR application built with PySide6 (Qt6) and Tesseract OCR.

🚀 Quick Start for End Users

Download and run - no installation required!

The pre-built executables are 100% standalone and include:

  • ✅ Python interpreter
  • ✅ All Python packages
  • ✅ Poppler (PDF processing)
  • ✅ Tesseract OCR (text recognition)

No additional software installation needed! Just download and run.

See Installation below for download links.

Features

  • 📄 Drag & Drop Interface - Simply drag PDF files into the window
  • 📁 File Browser - Or use the file picker to select PDFs
  • 🔍 OCR Processing - Extract text from scanned PDFs using Tesseract
  • 📊 Progress Feedback - Real-time status updates during processing
  • 📋 Copy to Clipboard - One-click copy functionality (macOS/Linux/Windows)
  • 🔄 Error Recovery - Retry or start over options on failure
  • 🎨 Modern UI - Clean, user-friendly interface with visual feedback
  • 📦 Fully Standalone - Zero dependencies, zero installation required

Prerequisites

For Pre-built Binaries (Recommended)

Nothing required! The executable includes everything you need - Python, Poppler, and Tesseract OCR are all bundled.

Just download and run! 🎉

For Running from Source

macOS:

brew install tesseract poppler

Linux (Ubuntu/Debian):

sudo apt-get install tesseract-ocr poppler-utils

Windows:

  • Install Tesseract OCR:
    • Recommended: Using winget: winget install --id UB-Mannheim.TesseractOCR
    • Or download the installer from the Tesseract OCR project
  • Install Poppler
  • Optional: For WSL users, you can also install via: wsl sudo apt-get install tesseract-ocr poppler-utils

Installation

Option 1: Download Pre-built Binary (Recommended)

100% Standalone - No installation required!

  1. Download the latest release for your platform from Releases

    • Windows: QuickPdfOcr.exe
    • macOS: QuickPdfOcr.app (ARM64 or Intel)
    • Linux: QuickPdfOcr
  2. Run the application! That's it! 🎉

What's Included:

  • ✅ Python interpreter (no Python installation needed)
  • ✅ All Python packages (PySide6, pytesseract, pdf2image, Pillow, PyPDF2)
  • ✅ Poppler binaries (for PDF processing)
  • ✅ Tesseract OCR with English language data (for text recognition)

Note: The bundled Tesseract includes English language data by default. For other languages, you can still install Tesseract system-wide and the app will use it instead.
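
The note above says a system-wide Tesseract takes precedence over the bundled one. The following is a minimal sketch of how such a lookup might work, not the application's actual code. The function name `resolve_tesseract`, its parameters, and the `tesseract/` subfolder layout are assumptions made for illustration; `sys._MEIPASS` is the directory PyInstaller sets when running a one-file executable.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional


def resolve_tesseract(bundle_dir: Optional[Path] = None,
                      search_path: Optional[str] = None) -> Optional[str]:
    """Return the Tesseract executable to use, or None if none is found.

    Hypothetical helper: the real application may resolve this differently.
    """
    # 1. Prefer a system-wide install on PATH, so that any extra
    #    language data installed by the user is available.
    system_cmd = shutil.which("tesseract", path=search_path)
    if system_cmd:
        return system_cmd

    # 2. Fall back to the copy shipped inside the executable.
    #    PyInstaller exposes its unpack directory as sys._MEIPASS.
    if bundle_dir is None:
        bundle_dir = Path(getattr(sys, "_MEIPASS", "."))
    exe_name = "tesseract.exe" if os.name == "nt" else "tesseract"
    bundled = bundle_dir / "tesseract" / exe_name  # assumed bundle layout
    return str(bundled) if bundled.is_file() else None
```

If a path is returned, it could be assigned to `pytesseract.pytesseract.tesseract_cmd` before running OCR.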

Option 2: Run from Source

  1. Clone the repository:
git clone https://github.com/KSEGIT/QuickPdfOcr.git
cd QuickPdfOcr
  2. Install Python dependencies:
pip install -r requirements.txt
  3. Install system dependencies (see Prerequisites above)

Option 3: Build Your Own Binary

  1. Clone and install dependencies (see Option 2)

  2. Build executable:

python build.py
  3. Find your executable in the dist/ folder

Usage

GUI Application

Run the graphical interface:

python main.py

Workflow:

  1. Drag and drop a PDF file or click "Open PDF File"
  2. Click "Start OCR" to begin text extraction
  3. Wait for processing (progress updates shown)
  4. Copy extracted text or start over with a new file

Command Line (Legacy)

You can also use the OCR processor directly from command line:

python components/pdf_ocr.py document.pdf output.txt

Options:

  • --dpi <value> - Set DPI for conversion (default: auto-detect)
  • --lang <code> - Set language for OCR (default: eng)

Examples:

# Auto-detect DPI
python components/pdf_ocr.py document.pdf

# Manual DPI and output file
python components/pdf_ocr.py document.pdf output.txt --dpi 400

# French language
python components/pdf_ocr.py document.pdf --lang fra

Common language codes:

  • eng - English
  • fra - French
  • deu - German
  • spa - Spanish
  • chi_sim - Chinese Simplified
  • jpn - Japanese
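
For readers who want to see how the arguments and options above fit together, here is a minimal `argparse` sketch. It is an illustration of the documented interface only; the function name `build_parser` and the help strings are assumptions, and the real `components/pdf_ocr.py` may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command-line interface."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF using OCR")
    parser.add_argument("pdf", help="input PDF file")
    # The output file is optional, as in the first and third examples.
    parser.add_argument("output", nargs="?", default=None,
                        help="optional output text file")
    # None stands for "auto-detect" in this sketch.
    parser.add_argument("--dpi", type=int, default=None,
                        help="DPI for conversion (default: auto-detect)")
    parser.add_argument("--lang", default="eng",
                        help="Tesseract language code (default: eng)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["document.pdf", "output.txt", "--dpi", "400"])
print(args.pdf, args.output, args.dpi, args.lang)
```

Leaving `--dpi` as `None` when it is not given lets the caller distinguish "auto-detect" from an explicit value.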

Project Structure

QuickPdfOcr/
├── main.py                    # Application entry point
├── components/
│   ├── __init__.py
│   ├── pdf_ocr.py            # OCR processor component
│   └── ocr_worker.py         # Background worker for GUI
├── ui/
│   ├── __init__.py
│   └── main_window.py        # Main application window
└── requirements.txt          # Python dependencies

Technologies Used

  • PySide6 - Qt6 framework for Python (GUI)
  • Tesseract OCR - Open-source OCR engine
  • pdf2image - PDF to image conversion
  • PyPDF2 - PDF manipulation and analysis
  • Pillow - Image processing
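
To show how these libraries typically fit together, here is a rough sketch of a PDF-to-text pipeline: pdf2image renders each page to a Pillow image (via Poppler), and pytesseract runs Tesseract on it. The function names `ocr_pdf` and `join_pages`, the `progress` callback, and the default DPI of 300 are assumptions for illustration and are not taken from the project's source.

```python
from typing import Callable, Iterable, Optional


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR results, separated by a blank line."""
    return "\n\n".join(text.strip() for text in page_texts)


def ocr_pdf(pdf_path: str, lang: str = "eng", dpi: int = 300,
            progress: Optional[Callable[[int, int], None]] = None) -> str:
    """Render each PDF page to an image, then run Tesseract on it."""
    # Imported lazily so this module can be loaded before the
    # third-party dependencies are installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # requires Poppler
    texts = []
    for index, image in enumerate(images, start=1):
        texts.append(pytesseract.image_to_string(image, lang=lang))
        if progress is not None:
            progress(index, len(images))  # e.g. update a status label
    return join_pages(texts)
```

A GUI would normally call a function like this from a background worker so the window stays responsive while pages are processed.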

Requirements

System Requirements

  • Tesseract OCR (bundled with pre-built binaries, or install separately if running from source)
  • Poppler (bundled with pre-built binaries, or install separately if running from source)

Python Dependencies (for source installation)

See requirements.txt for Python package versions:

  • pytesseract>=0.3.10
  • pdf2image>=1.16.0
  • Pillow>=10.0.0
  • PyPDF2>=3.0.0
  • PySide6>=6.6.0
  • pyinstaller>=6.0.0 (for building binaries)

License

This project is open source and available under the MIT License.

See the LICENSE file for details.

For third-party component licenses (including Poppler), see THIRD_PARTY_LICENSES.md.

Building & Releases

Local Build

Build for your current platform:

pip install -r requirements.txt
python build.py

The executable will be in the dist/ folder.

Note: To bundle Poppler with your local build, install Poppler on your system (see Prerequisites above). The build script will then detect and bundle it automatically.

Alternatively, you can manually create a poppler_binaries directory in the project root and place the Poppler binaries there before building.

Automated Builds (GitHub Actions)

The project includes a GitHub Actions workflow that automatically builds executables for all platforms when you:

  1. Push a tag starting with v (e.g., v1.0.0)
  2. Manually trigger the workflow

The workflow automatically downloads and bundles Poppler for each platform.

To create a release:

git tag v1.0.0
git push origin v1.0.0

Artifacts will be available in the GitHub release.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

Issue: "Tesseract not found"

  • Make sure Tesseract is installed and in your system PATH
  • macOS: brew install tesseract
  • Linux: sudo apt-get install tesseract-ocr
  • Windows: winget install --id UB-Mannheim.TesseractOCR or download the installer (see Prerequisites above)

Issue: "Failed to convert PDF to images"

  • If using pre-built binary: This should not occur as Poppler is bundled
  • If running from source: Ensure Poppler is installed
    • macOS: brew install poppler
    • Linux: sudo apt-get install poppler-utils
    • Windows: Install Poppler (see Prerequisites above)
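
When running from source, a quick way to diagnose both of the issues above is to check whether the required executables are on your PATH. The snippet below is a small diagnostic sketch and not part of the project; the names `REQUIRED_TOOLS` and `check_tools` are made up for illustration. `pdftoppm` is one of the command-line tools that ships with Poppler.

```python
import shutil
from typing import Dict, Mapping, Optional

# Executable name -> human-readable component name.
REQUIRED_TOOLS = {
    "tesseract": "Tesseract OCR",
    "pdftoppm": "Poppler",
}


def check_tools(tools: Mapping[str, str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable to its full path, or None if it is not on PATH."""
    return {exe: shutil.which(exe) for exe in tools}


for exe, location in check_tools().items():
    status = location or "NOT FOUND on PATH"
    print(f"{REQUIRED_TOOLS[exe]} ({exe}): {status}")
```

If a tool is reported as not found even though it is installed, its install directory probably needs to be added to PATH.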

Issue: Poor OCR quality

  • Try increasing DPI (e.g., --dpi 400)
  • Ensure the PDF has good scan quality
  • The system auto-detects optimal DPI based on page size
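
To give a feel for what DPI auto-detection based on page size could look like, here is one possible heuristic. It is only a sketch under stated assumptions and not the algorithm used by this project: it picks a DPI so that the longer page side renders to roughly `target_px` pixels, then clamps the result. The name `choose_dpi` and all default values are hypothetical.

```python
def choose_dpi(width_pt: float, height_pt: float,
               target_px: int = 3000,
               min_dpi: int = 150, max_dpi: int = 600) -> int:
    """Pick a render DPI from the page size (hypothetical heuristic)."""
    # PDF page sizes are measured in points; there are 72 points per inch.
    longest_in = max(width_pt, height_pt) / 72.0
    if longest_in <= 0:
        raise ValueError("page dimensions must be positive")
    # Aim for about target_px pixels along the longer side.
    dpi = round(target_px / longest_in)
    # Clamp: very large pages get min_dpi, very small pages get max_dpi.
    return max(min_dpi, min(max_dpi, dpi))


# An A4 page is 595 x 842 points.
print(choose_dpi(595, 842))
```

With a heuristic like this, small pages are rendered at a higher DPI so that small print stays legible, while large pages are kept from producing very large images.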

Author

Created by KSEGIT
