
Bugs and improvements #7

@tboenig


Thank you for sharing this corpus.
Creating GT is not an easy job. I looked at a random sample of the PAGE files in the pageXmlTranskribusCorrected folders.

I noticed the following problems:

  1. The entire text of a line is encoded at the Word level, as a single Word.
    Solution: convert the Word into a line.
  2. Drop capitals are often annotated as Graphic.
  3. Many separators appear to be so-called fake separators and should be corrected.
  4. A wish: Transkribus does not create valid PAGE instances. Of course, annotations such as
    <TranskribusMetadata docId="188203" .../> can be commented out,
    but:
  • empty type="" attributes
  • empty id="" attributes should be corrected too.
    • The ALTO files contain very deeply structured data; unfortunately, this information was not carried over when converting to PAGE-XML.
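
As a rough illustration of points 1 and 4, here is a minimal Python sketch of how such a cleanup could be scripted. It is not a tested fix for this corpus. It assumes that a TextLine with exactly one Word whose text contains several whitespace-separated tokens is a "whole line as one Word" case, and that "converting the Word into a line" means moving the Word's TextEquiv up to the TextLine. The function names `local`, `unicode_text` and `repair_page` are hypothetical. Empty id="" attributes are only counted, because new ids must be unique and may be referenced elsewhere in the file.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the sketch works with any PAGE schema version."""
    return tag.rsplit("}", 1)[-1]


def unicode_text(elem):
    """Return the text of the direct TextEquiv/Unicode child, or None if absent."""
    for child in elem:
        if local(child.tag) == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    return sub.text or ""
    return None


def repair_page(root):
    """Clean a parsed PAGE tree in place and return counts of what was changed."""
    stats = {"words_merged": 0, "empty_type_removed": 0, "empty_id_found": 0}

    # Point 1 (assumed heuristic): one Word holding several tokens = whole line.
    for line in [e for e in root.iter() if local(e.tag) == "TextLine"]:
        words = [c for c in line if local(c.tag) == "Word"]
        if len(words) != 1:
            continue
        word = words[0]
        text = unicode_text(word)
        if text is None or len(text.split()) < 2:
            continue  # a genuine one-word line, leave it alone
        if unicode_text(line) is None:
            # Keep the transcription by moving it from the Word to the line.
            for child in list(word):
                if local(child.tag) == "TextEquiv":
                    line.append(child)
        line.remove(word)
        stats["words_merged"] += 1

    # Point 4: drop empty type="" attributes, report empty id="" attributes.
    for elem in root.iter():
        if elem.get("type") == "":
            del elem.attrib["type"]
            stats["empty_type_removed"] += 1
        if elem.get("id") == "":
            stats["empty_id_found"] += 1
    return stats
```

A file could then be processed with `tree = ET.parse(path)`, `repair_page(tree.getroot())` and `tree.write(out_path, encoding="utf-8", xml_declaration=True)`, where `path` and `out_path` are placeholders. The result should still be validated against the PAGE schema, since the sketch does not enforce the element order the schema requires.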

I would be very glad to help you improve the data as far as I am able.
Thanks again for everything
