Thank you for sharing this corpus.
Creating ground truth (GT) is not an easy job. I looked at a random sample of the PAGE files in the pageXmlTranskribusCorrected folders.
I noticed the following problems:
- the entire text of a line was encoded at the Word level, as a single Word.
Solution: convert the Word into a line.
- drop capitals are often annotated as Graphic
- many separators are so-called fake separators and should be corrected
- a wish: Transkribus does not create valid PAGE instances. Of course, annotations such as
<TranskribusMetadata docId="188203" .../> can be commented out,
but:
  - empty type="" attributes
  - empty id="" attributes
should be corrected too.
- the ALTO files contain very deeply structured data; unfortunately, this information was not carried over when converting to PAGE XML.
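
To make the first and fourth points concrete, here is a minimal sketch of how such files could be checked and partly repaired with Python's standard library. It is only an illustration under assumptions: the helper names (`local`, `word_text`, `empty_attributes`, `flatten_single_word_lines`) and the `SAMPLE` document are hypothetical, "a single Word holding a whole line" is approximated as "the only Word in a TextLine and its text contains whitespace", and the namespace in the sample is one PAGE version chosen for the example. The code matches on local tag names, so it does not depend on that version.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from the corpus.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="" type="">
      <TextLine id="l1">
        <Coords points="0,0 10,0 10,10 0,10"/>
        <Word id="w1">
          <Coords points="0,0 10,0 10,10 0,10"/>
          <TextEquiv><Unicode>an entire line of text</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def word_text(elem):
    """Return the text of the first Unicode element at or below elem."""
    for node in elem.iter():
        if local(node.tag) == "Unicode":
            return node.text or ""
    return ""


def empty_attributes(root, names=("id", "type")):
    """List (element name, attribute name) pairs whose value is an empty string."""
    return [(local(e.tag), n) for e in root.iter() for n in names if e.get(n) == ""]


def flatten_single_word_lines(root):
    """Move the text of a lone multi-token Word up to its TextLine.

    Returns the ids of the lines that were changed.
    """
    fixed = []
    # Collect the lines first so the tree is not modified while iterating it.
    lines = [e for e in root.iter() if local(e.tag) == "TextLine"]
    for line in lines:
        words = [c for c in line if local(c.tag) == "Word"]
        if len(words) != 1 or len(word_text(words[0]).split()) < 2:
            continue
        word = words[0]
        position = list(line).index(word)
        line.remove(word)
        # Reuse the Word's TextEquiv only if the line has none of its own;
        # inserting at the Word's old position keeps the child order.
        if not any(local(c.tag) == "TextEquiv" for c in line):
            for child in word:
                if local(child.tag) == "TextEquiv":
                    line.insert(position, child)
                    break
        fixed.append(line.get("id"))
    return fixed


root = ET.fromstring(SAMPLE)
print(empty_attributes(root))
print(flatten_single_word_lines(root))
```

This only reports the empty attributes; deciding which id or type value belongs there still needs a human or a project-specific rule. The result should also be validated against the PAGE schema afterwards.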
I would be very happy to help you improve the data, as far as I am able.
Thanks again for everything.