I am very happily using scribejs to convert a scanned book to a digital form that keeps the formatting intact. I have a question about the hOCR output. In the web user interface you can choose to outline paragraphs in addition to words and lines. It identifies the paragraphs really well, but unfortunately they are ignored in the hOCR output. An info icon in the web interface explains that paragraphs are only relevant when exporting to .docx or .txt. However, the last meta tag in the head of the hOCR does mention an ocr_par capability.
Does this mean it is in principle possible to include ocr_par tags in the hOCR output? If it would be a lot of work to implement and there is no real demand for it, it is not very important to me. In the book I'm converting now, new paragraphs are indented, so I can detect them from the line bounding boxes in the hOCR files, or from the newlines in the .txt files, which indicate new paragraphs.
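The indentation workaround described above could be sketched roughly as follows. This is only an illustration under several assumptions: the hOCR marks lines with the class ocr_line and stores coordinates as "bbox x0 y0 x1 y1" in the title attribute (as the hOCR format describes), the page is a single column, and the left margin can be taken as the smallest x0 on the page. The names LineBoxParser, paragraph_starts, and min_indent are made up for this example and are not part of scribejs.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 140 100 900 130; baseline 0 -5".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collect the bounding box of every element whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.boxes = []  # list of (x0, y0, x1, y1) tuples in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def paragraph_starts(hocr, min_indent=20):
    """Return indices of lines indented at least min_indent pixels past the left margin.

    Assumption: single-column page, left margin = smallest x0 of any line.
    min_indent is a hypothetical threshold that would need tuning per scan.
    """
    parser = LineBoxParser()
    parser.feed(hocr)
    if not parser.boxes:
        return []
    left_margin = min(box[0] for box in parser.boxes)
    return [i for i, box in enumerate(parser.boxes) if box[0] - left_margin >= min_indent]
```

Calling paragraph_starts on the hOCR text of one page gives the zero-based indices of the lines that look like paragraph openers. Centered headings or block quotes would also be flagged by this heuristic, so it is only a stopgap compared with real ocr_par elements in the output.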
But perhaps it is very easy to add?
Either way, thanks for developing this tool :).
P.S. I wasn't sure whether to post this here or in the web interface repo; to me it seems an algorithm question, not an interface question.