paragraph tags in hOCR output #65

gitAlwin · 2026-01-30T17:10:13Z

Hello all,

I am very happily using scribejs to convert a scanned book to digital form while keeping the formatting intact. I have a question about the hOCR output. In the web user interface you can choose to outline paragraphs, in addition to words and lines. It identifies the paragraphs really well, but unfortunately they are ignored in the hOCR output. An info icon in the web interface explains that paragraphs are only relevant when exporting to .docx or .txt. The last meta tag in the head of the hOCR file does mention an ocr_par capability, though:

<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf ocrp_lang ocrp_dir ocrp_font ocrp_fsize'/>

Does this mean it is in principle possible to include ocr_par tags in the hOCR output? If it would be a lot of work to implement and there is not really any demand for it, it is not super important to me. In the book I'm converting now, new paragraphs are indented, so I can detect them using the line bounding boxes in the hOCR files. Alternatively, I can use the newlines in the .txt files, which indicate new paragraphs.
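For reference, the indentation workaround could look roughly like the Python sketch below. It is only an illustration under some assumptions: the input is a single page in a single column, every line is a tag whose class includes ocr_line and whose title holds a "bbox x0 y0 x1 y1" entry, and an indent of 20 pixels is enough to mark a new paragraph. The names LineBoxParser, paragraph_starts and min_indent are made up for this example and are not part of scribejs.

```python
import re
from html.parser import HTMLParser
from statistics import median

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collects the bounding box of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []  # list of (x0, y0, x1, y1) in document order

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "ocr_line" not in classes:
            return
        match = BBOX_RE.search(attr.get("title") or "")
        if match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def paragraph_starts(hocr, min_indent=20):
    """Return indices of lines indented past the page's usual left margin.

    min_indent is a hypothetical pixel threshold; tune it per scan.
    """
    parser = LineBoxParser()
    parser.feed(hocr)
    if not parser.boxes:
        return []
    # The median left edge approximates the margin of non-indented lines.
    margin = median(x0 for x0, _, _, _ in parser.boxes)
    return [
        i
        for i, (x0, _, _, _) in enumerate(parser.boxes)
        if x0 - margin >= min_indent
    ]
```

Calling paragraph_starts on the text of one hOCR page returns the indices of the lines that probably open a paragraph. A page where most lines are indented (verse, lists) would confuse the median-based margin, so this only suits ordinary running text.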

But perhaps it is very easy to add?

Either way, thanks for developing this tool :).

P.S. I wasn't sure whether to post this here or in the web interface repo; to me it seems like an algorithm question, not an interface question.


Replies: 0 comments
