This is a very productive scanning and OCR setup, intended to speed up the scanning process and produce a CBZ file and an archive of extracted text as fast as possible. Just follow these steps:
- install the required packages
- plug in your scanner
- edit config.sh according to your needs (see Configuration)
- run ./1-scan.sh
- do any necessary renaming and extra scanning (see Naming convention)
- run ./2-ocr.sh
- run ./3-bundle.sh
This setup was inspired by the article "How to scan and OCR like a pro with open source tools". The article also explains a few things these scripts do not cover, such as removing page numbers and unnecessary line feeds. Add those steps yourself if you need them.
In Debian:
sudo apt install sane sane-utils imagemagick unpaper tesseract-ocr
Also, install the Tesseract language package(s) you need. Select from:
- apt search tesseract-ocr-
- https://packages.debian.org/search?keywords=tesseract-ocr-
If you have an old version of Debian, install the newer Tesseract language package(s) from backports. Example (for Debian 9): add deb https://deb.debian.org/debian stretch-backports main to /etc/apt/sources.list, then run:
sudo apt -t stretch-backports install tesseract-ocr-eng
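Before scanning, it can help to confirm that the installed packages actually put their commands on your PATH. The following is only a sketch: the helper name missing_tools is made up for illustration, and the command list (scanimage, convert, unpaper, tesseract) is an assumption about what the packages above provide, not a statement of what the scripts call.

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for sane-utils, imagemagick, unpaper and tesseract-ocr.
missing=$(missing_tools scanimage convert unpaper tesseract)

if [ -n "$missing" ]; then
    echo "Missing commands:" $missing
else
    echo "All required commands found."
    # List the OCR languages Tesseract can currently use.
    tesseract --list-langs
fi
```

If the language you need is not in the list printed by tesseract --list-langs, install the matching tesseract-ocr- package first.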
Before using the scripts, you must edit config.sh according to your needs. You need to change at least the following options:
- device: run scanimage -L to find the device id. Ex: device='genesys:libusb:001:004'.
- width and height: measure the pages' width and height in millimeters. Images will be cropped to this size automatically.
- first_page and last_page. first_page can be a negative number, if needed (see below).
Other important options are:
- language: the language setting for OCR must correspond to the document's language. Ex: 'eng' for English, 'ron' for Romanian.
- rotate: angle for clockwise auto-rotation of every page. Possible values are 0, 90, 180 and 270.
- resolution in DPI, defaults to 300.
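As an illustration, the options above might look like this in config.sh. The option names come from the descriptions above; every value except the example device id and the 300 DPI default is invented (here, an A5 book with two unnumbered pages at the front). The two checks at the end, including the valid_rotate helper, are hypothetical additions and not part of the scripts.

```shell
#!/bin/sh
# Example values only; replace them with your own.
device='genesys:libusb:001:004'  # as reported by: scanimage -L
width=148                        # page width in millimeters (A5)
height=210                       # page height in millimeters (A5)
first_page=-1                    # negative: unnumbered pages come before page 1
last_page=120
language='eng'                   # must match an installed Tesseract language
rotate=0                         # clockwise rotation applied to every page
resolution=300                   # DPI

# Hypothetical sanity check: rotate accepts only these four angles.
valid_rotate() {
    case "$1" in
        0|90|180|270) return 0 ;;
        *) return 1 ;;
    esac
}

valid_rotate "$rotate" || echo "rotate must be 0, 90, 180 or 270" >&2
[ "$first_page" -le "$last_page" ] || echo "first_page must not exceed last_page" >&2
```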
For clarity, we want file names to match page numbers: 001.pnm for page 1, etc. As for the unnumbered pages (covers, inserts, folds, etc), we must name them in a way that preserves page order. This is especially important when generating CBZ files, in which page order is determined by file names. We have two main situations:
- The cover and first few pages might not be numbered. In this case, set first_page to a negative number. Before reaching page 1, files will be named 000_1.pnm, 000_2.pnm, etc.
- In case of other unnumbered pages (inserts, folds, etc), skip them on the first run and scan them separately, using the command ./1-scan.sh filename_without_extension. For ordering to be consistent, name the files as in the following examples:
  - If there is an insert between pages 45 and 46, use this convention: 045_0.pnm, 045_1.pnm, 045_2.pnm, 046.pnm. So after the first run, rename 045.pnm to 045_0.pnm, then scan the insert by running ./1-scan.sh 045_1 and ./1-scan.sh 045_2.
  - If leaf 45/46 is folded and actually contains 4 pages, use this convention: 045_1.pnm, 045_2.pnm, 046_1.pnm, 046_2.pnm. So after the first run, rename 045.pnm to 045_1.pnm and 046.pnm to 046_1.pnm, then scan the extra pages in the fold by running ./1-scan.sh 045_2 and ./1-scan.sh 046_2.
  - If leaf -1/0 (the front cover) is folded and actually contains 4 pages, use this convention: 000_1_1.pnm, 000_1_2.pnm, 000_2_1.pnm, 000_2_2.pnm. So after the first run, rename 000_1.pnm to 000_1_1.pnm and 000_2.pnm to 000_2_1.pnm, then scan the extra pages in the fold by running ./1-scan.sh 000_1_2 and ./1-scan.sh 000_2_2.
Important: you must do all renaming before running ./2-ocr.sh!
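The naming convention can be sketched in a few lines of shell. This is a guess at the mapping, not the code in 1-scan.sh: the function page_name is hypothetical, and it assumes that with first_page=-1 the pages -1 and 0 become 000_1 and 000_2, as the front cover example suggests.

```shell
#!/bin/sh
# Hypothetical helper: map a page number to a file name (without extension).
# $1 = page number, $2 = first_page (may be negative).
page_name() {
    page=$1
    first=$2
    if [ "$page" -ge 1 ]; then
        # Numbered pages: zero-padded to three digits.
        printf '%03d\n' "$page"
    else
        # Unnumbered leading pages: 000_1, 000_2, ... counted from first_page.
        printf '000_%d\n' "$((page - first + 1))"
    fi
}

# Names for a scan running from page -1 to page 2.
for p in -1 0 1 2; do
    page_name "$p" -1
done
# prints 000_1, 000_2, 001, 002 (one per line)

# Files for an insert between pages 45 and 46 sort into reading order.
printf '%s\n' 046.pnm 045_2.pnm 045_0.pnm 045_1.pnm | LC_ALL=C sort
# prints 045_0.pnm, 045_1.pnm, 045_2.pnm, 046.pnm (one per line)
```

The second command shows why the suffixes work: a plain byte-wise sort of the file names already yields the page order.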