A lightweight, dependency-minimal bash script to convert scanned PDFs into searchable PDFs using Tesseract OCR.
copyable-pdf takes a PDF as input, converts each page to an image, runs OCR (Optical Character Recognition) on each image with Tesseract, and merges the resulting pages back into a single searchable PDF document.
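
The three stages described above can be sketched roughly as follows. This is a minimal illustration, not the contents of script.sh: the function name `ocr_pdf`, its argument order, and the `page` file prefix are assumptions made for the example, and the real script may be organized differently.

```bash
# Sketch only: ocr_pdf INPUT OUTPUT [LANG] [DPI] (hypothetical helper).
# The body runs in a subshell so that "set -e" and the cleanup trap stay local.
ocr_pdf() (
  set -e
  input=$1
  output=$2
  lang=${3:-eng}
  dpi=${4:-300}

  workdir=$(mktemp -d)
  trap 'rm -rf "$workdir"' EXIT

  # 1. Rasterize every page to PNG: page-1.png, page-2.png, ...
  pdftoppm -r "$dpi" -png "$input" "$workdir/page"

  # 2. OCR each image; the trailing "pdf" makes tesseract write <base>.pdf
  for img in "$workdir"/page-*.png; do
    tesseract "$img" "${img%.png}" -l "$lang" pdf
  done

  # 3. Merge the per-page PDFs, in glob order, into one document
  pdfunite "$workdir"/page-*.pdf "$output"
)
```

Calling `ocr_pdf scan.pdf scan_ocr.pdf fra 600` would then produce a searchable copy of `scan.pdf`, processing pages one at a time.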
- OCR: Make scanned documents searchable and copyable.
- Parallel Processing: Uses multiple cores for faster OCR.
- Dependency Check: Automatically checks for missing tools.
- Customizable: Set language and DPI.
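
As an illustration of the parallel processing feature, one common way to spread per-page OCR across cores is `xargs -P`. The sketch below assumes page images named `page-*.png` in a working directory; the helper name `ocr_pages_parallel` is hypothetical, and script.sh may use a different mechanism.

```bash
# Sketch only: ocr_pages_parallel DIR LANG JOBS (hypothetical helper).
# Runs up to JOBS tesseract processes at once, one per page image.
ocr_pages_parallel() {
  dir=$1
  lang=$2
  jobs=$3

  # xargs appends one image path per invocation, so inside "sh -c"
  # $1 is the language and $2 is the image path.
  find "$dir" -name 'page-*.png' -print0 |
    xargs -0 -n 1 -P "$jobs" \
      sh -c 'tesseract "$2" "${2%.png}" -l "$1" pdf' _ "$lang"
}
```

For example, `ocr_pages_parallel /tmp/pages eng 4` would OCR the images in `/tmp/pages` with at most four concurrent jobs, leaving a `page-N.pdf` next to each `page-N.png`.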
brew tap maxgfr/homebrew-tap
brew install copyable-pdf

- Clone the repository:
git clone https://github.com/maxgfr/copyable-pdf.git
cd copyable-pdf
- Make the script executable:
chmod +x script.sh
- (Optional) Move to your bin directory:
mv script.sh /usr/local/bin/copyable-pdf
Ensure you have the following installed:
- tesseract: For OCR.
- poppler: For pdftoppm and pdfunite.
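
To verify these prerequisites by hand, a check along the following lines can be used. It is a sketch of the general technique (`command -v`), and the function name `check_deps` is an assumption for this example rather than the script's own code.

```bash
# Sketch only: check_deps TOOL... (hypothetical helper).
# Reports every tool that is not on PATH and returns non-zero if any is missing.
check_deps() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing dependency: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_deps tesseract pdftoppm pdfunite || exit 1
```

Checking all tools before reporting, instead of stopping at the first failure, lets the user install everything that is missing in one go.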
On macOS (Homebrew):
brew install tesseract poppler

On Ubuntu/Debian:
sudo apt-get install tesseract-ocr poppler-utils

copyable-pdf [options] input.pdf

| Option | Description | Default |
|---|---|---|
| `-l, --lang <code>` | Language code (e.g., fra, eng) | eng |
| `-o, --output <path>` | Custom output file path | input_ocr.pdf |
| `-d, --dpi <num>` | DPI resolution for OCR | 300 |
| `-j, --jobs <num>` | Number of parallel jobs | Auto-detect |
| `-t, --text` | Generate an additional .txt file | false |
| `-m, --markdown` | Generate an additional .md file | false |
| `-k, --keep` | Keep temporary files (debug) | false |
| `-v, --verbose` | Verbose output | false |
| `-h, --help` | Show help message | - |
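
Two of the defaults above are computed instead of fixed: the output path and the job count. The sketch below shows one plausible way to derive them. Both helper names are hypothetical, and the exact rules are assumptions: that the default output is the input path with `.pdf` replaced by `_ocr.pdf`, and that auto-detection means the number of online CPU cores.

```bash
# Sketch only: default_output INPUT (hypothetical helper).
# Assumed rule: "<name>.pdf" becomes "<name>_ocr.pdf" in the same directory.
default_output() {
  case $1 in
    *.pdf) printf '%s_ocr.pdf\n' "${1%.pdf}" ;;
    *)     printf '%s_ocr.pdf\n' "$1" ;;
  esac
}

# Sketch only: detect_jobs (hypothetical helper).
# Tries getconf, then nproc (Linux), then sysctl (macOS); falls back to 1.
detect_jobs() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
    n=$(nproc 2>/dev/null) ||
    n=$(sysctl -n hw.ncpu 2>/dev/null) ||
    n=1
  # Guard against empty or non-numeric answers
  case $n in
    '' | *[!0-9]*) n=1 ;;
  esac
  echo "$n"
}
```

With these helpers, `default_output document.pdf` prints `document_ocr.pdf`, and `detect_jobs` prints a positive integer suitable for `-j`.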
Basic usage:
copyable-pdf document.pdf

Specify language (French) and higher DPI:
copyable-pdf -l fra -d 600 document.pdf

Explicitly set output filename:
copyable-pdf -o searchable_doc.pdf scan.pdf

License: MIT