
[META] Summary of HPLT requested changes #40

Description

@jelmervdl
  • Further extend the JSONL output to contain all text and metadata so it forms the complete output, without base64 encoding of the document. We'll need proper JSON escaping to deal with non-Unicode data, or a guarantee that all text coming out of warc2text is always valid Unicode (a minimal escaping sketch follows this list). See alternative output format based on JSONlines #34 and Add --jsonl option #35.
  • For each text segment (i.e. line) in the text, also mark the block-level tag it was found in. This should help identify the short <li> and <td> data, although I would not be surprised if we see a lot of <div> (see the tag-tracking sketch after this list). Track html tags #46
  • Output the crawl timestamp with the metadata. Add --jsonl option #35
  • Output the byte offset at which the gzip-compressed WARC record begins (see the offset-tracking sketch after this list).
  • Replace fasttext with fastertext: it's free speed, except that that repo is currently missing the string_view modification.
  • Add an option to skip langid entirely and just write a single output stream; we can then do langid downstream if we decide to. The idea is that any mistake we make with langid in warc2text is irrecoverable: once a document is wrongly classified, the only correction we can make is to remove it at the end; we have no way of moving the document into the correct stream. We discussed improving the langid inside warc2text, but the argument was that developing good langid in C++ alone is harder than doing it downstream.
    Right now you could decide to ignore the language attribute in the JSON output, since that doesn't get split into multiple files anyway. I don't think the current langid is slow enough to warrant a special bypass option.
  • Add an option, à la pdf-pass, to write the robots.txt responses to a separate warc (a small routing sketch follows this list). Also include 404s etc., so we know which domains were asked but did not give us a robots.txt (which we'll interpret as crawling being allowed). Shunt robots.txt responses to separate warc #41
  • Boilerplate detection like trafilatura might work, but it is relatively expensive since it needs to build a proper DOM tree, and it would be a lot of work to port to C++. We will first try some simpler rule/classification-based document prefix/suffix removal on the text data itself (see the trimming sketch below).
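
On the JSON escaping point: a minimal sketch of the escaping half, assuming invalid byte sequences have already been replaced with U+FFFD upstream so the input is valid UTF-8. The function name is illustrative, not existing warc2text API.

```cpp
#include <cstdio>
#include <string>

// Escape a UTF-8 string for embedding in a JSON string literal.
std::string json_escape(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\b': out += "\\b";  break;
            case '\f': out += "\\f";  break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {                   // remaining control chars
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);  // UTF-8 bytes pass through
                }
        }
    }
    return out;
}
```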
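
On the block-level tags: one way to stamp each text segment is a tag stack maintained while walking the HTML, pushing on open tags and popping on matching close tags. A hypothetical sketch; the tag set and names are illustrative, and real HTML needs the usual tolerance for mismatched tags.

```cpp
#include <set>
#include <string>
#include <vector>

// Block-level tags we care about; purely illustrative.
static const std::set<std::string> kBlockTags = {
    "p", "div", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
    "blockquote", "pre", "dd", "dt", "caption", "figcaption",
};

struct Segment { std::string text, tag; };

struct TagTracker {
    std::vector<std::string> stack;

    void open(const std::string &tag) {
        if (kBlockTags.count(tag)) stack.push_back(tag);
    }
    void close(const std::string &tag) {
        // Only pop on a matching close; mismatched HTML is silently ignored.
        if (!stack.empty() && stack.back() == tag) stack.pop_back();
    }
    Segment on_text(const std::string &text) const {
        // Empty tag means no enclosing block-level tag was seen.
        return {text, stack.empty() ? "" : stack.back()};
    }
};
```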
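
On the byte offsets: assuming the .warc.gz stores each record as its own gzip member (the usual layout), the member start offsets can be recorded while decompressing with plain zlib. A sketch under that assumption, with illustrative names:

```cpp
#include <zlib.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Return the byte offset at which each gzip member starts.
std::vector<uint64_t> member_offsets(std::FILE *fp) {
    std::vector<uint64_t> offsets{0};          // first member starts at byte 0
    unsigned char in[1 << 15], out[1 << 15];
    uint64_t consumed = 0;                     // compressed bytes from earlier chunks

    z_stream zs{};
    if (inflateInit2(&zs, 15 + 16) != Z_OK)    // 15+16: expect a gzip wrapper
        return {};

    bool ok = true;
    size_t n;
    while (ok && (n = std::fread(in, 1, sizeof in, fp)) > 0) {
        zs.next_in = in;
        zs.avail_in = static_cast<unsigned>(n);
        while (ok && zs.avail_in > 0) {
            zs.next_out = out;                 // decompressed data is discarded
            zs.avail_out = sizeof out;
            int rc = inflate(&zs, Z_NO_FLUSH);
            if (rc == Z_STREAM_END) {
                // The next member (if any) starts right where this one ended.
                offsets.push_back(consumed + (n - zs.avail_in));
                inflateReset(&zs);             // re-arm for the next member
            } else if (rc != Z_OK) {
                ok = false;                    // corrupt or truncated input
            }
        }
        consumed += n;
    }
    inflateEnd(&zs);
    offsets.pop_back();  // assumes a well-formed file: last entry is EOF, not a member
    return offsets;
}
```

With these offsets recorded in the metadata, a downstream consumer can seek straight to a record and decompress only that member.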
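
On the robots.txt pass: the routing test itself is just a URL path match, which also catches 404 and other non-200 responses for /robots.txt, since they share the URL. A hypothetical sketch:

```cpp
#include <string_view>

// True if the record's target URI points at a robots.txt file.
bool is_robots_txt(std::string_view url) {
    // Strip query string / fragment, then compare the path suffix.
    size_t cut = url.find_first_of("?#");
    if (cut != std::string_view::npos) url = url.substr(0, cut);
    constexpr std::string_view suffix = "/robots.txt";
    return url.size() >= suffix.size() &&
           url.substr(url.size() - suffix.size()) == suffix;
}
```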
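
On the prefix/suffix removal: the simplest rule-based variant trims short segments from both ends of the document until a line with enough words appears. This is deliberately naive (it would also eat short headings), and the threshold is made up for illustration; a real version would likely also look at punctuation, link density, or a small classifier.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

static std::size_t word_count(const std::string &line) {
    std::istringstream iss(line);
    std::string w;
    std::size_t n = 0;
    while (iss >> w) ++n;
    return n;
}

// Drop short lines (likely menus/footers) from the start and end of a document.
std::vector<std::string> trim_boilerplate(const std::vector<std::string> &lines,
                                          std::size_t min_words = 5) {
    std::size_t begin = 0, end = lines.size();
    while (begin < end && word_count(lines[begin]) < min_words) ++begin;   // prefix
    while (end > begin && word_count(lines[end - 1]) < min_words) --end;   // suffix
    return {lines.begin() + begin, lines.begin() + end};
}
```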
