
[META] Summary of HPLT requested changes #40

Description

@jelmervdl
  • Further extend the JSONL output to contain all text and metadata so it forms the complete output, without base64 encoding of the document. We'll need proper JSON escaping to deal with non-Unicode data, or a guarantee that all text coming out of warc2text is always valid Unicode (a minimal escaping sketch follows this list). See alternative output format based on JSONlines #34 and Add --jsonl option #35.
  • For each text segment (i.e. line) in the text, also mark the block-level tag it was found in. This should help identify the short <li> and <td> data, although I would not be surprised if we see a lot of <div> (see the tag-tracking sketch after this list). Track html tags #46
  • Output the crawl timestamp with the metadata. Add --jsonl option #35
  • Output the byte offset at which the gzip-compressed WARC record begins (see the offset-tracking sketch after this list).
  • Replace fasttext with fastertext: it's free speed, except that that repo is currently missing the string_view modification.
  • Add an option to skip langid entirely and just write a single output stream; we can then do langid downstream if we decide to. The idea is that any mistake we make with langid in warc2text is irrecoverable: once a document is wrongly classified, the only correction we can make is to remove it at the end; we have no way of moving the document into the correct stream. We discussed improving the langid inside warc2text, but the argument was that developing good langid in C++ alone is harder than doing it downstream.
    Right now you could decide to ignore the language attribute in the JSON output, since that doesn't get split into multiple files anyway. I don't think the current langid is slow enough to warrant a special bypass option.
  • Add an option, à la pdf-pass, to write the robots.txt responses to a separate warc (a small routing sketch follows this list). Also include 404s etc., so we know which domains were asked but did not give us a robots.txt (which we'll interpret as crawling being allowed). Shunt robots.txt responses to separate warc #41
  • Boilerplate detection like trafilatura might work, but it is relatively expensive since it needs to build a proper DOM tree, and it would be a lot of work to port to C++. We will first try some simpler rule/classification-based document prefix/suffix removal on the text data itself (see the trimming sketch below).
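
On the JSON escaping point: a minimal sketch of the escaping half, assuming invalid byte sequences have already been replaced with U+FFFD upstream so the input is valid UTF-8. The function name is illustrative, not existing warc2text API.

```cpp
#include <cstdio>
#include <string>

// Escape a UTF-8 string for embedding in a JSON string literal.
std::string json_escape(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\b': out += "\\b";  break;
            case '\f': out += "\\f";  break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {                   // remaining control chars
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);  // UTF-8 bytes pass through
                }
        }
    }
    return out;
}
```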
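
On the block-level tags: one way to stamp each text segment is a tag stack maintained while walking the HTML, pushing on open tags and popping on matching close tags. A hypothetical sketch; the tag set and names are illustrative, and real HTML needs the usual tolerance for mismatched tags.

```cpp
#include <set>
#include <string>
#include <vector>

// Block-level tags we care about; purely illustrative.
static const std::set<std::string> kBlockTags = {
    "p", "div", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
    "blockquote", "pre", "dd", "dt", "caption", "figcaption",
};

struct Segment { std::string text, tag; };

struct TagTracker {
    std::vector<std::string> stack;

    void open(const std::string &tag) {
        if (kBlockTags.count(tag)) stack.push_back(tag);
    }
    void close(const std::string &tag) {
        // Only pop on a matching close; mismatched HTML is silently ignored.
        if (!stack.empty() && stack.back() == tag) stack.pop_back();
    }
    Segment on_text(const std::string &text) const {
        // Empty tag means no enclosing block-level tag was seen.
        return {text, stack.empty() ? "" : stack.back()};
    }
};
```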
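
On the byte offsets: assuming the .warc.gz stores each record as its own gzip member (the usual layout), the member start offsets can be recorded while decompressing with plain zlib. A sketch under that assumption, with illustrative names:

```cpp
#include <zlib.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Return the byte offset at which each gzip member starts.
std::vector<uint64_t> member_offsets(std::FILE *fp) {
    std::vector<uint64_t> offsets{0};          // first member starts at byte 0
    unsigned char in[1 << 15], out[1 << 15];
    uint64_t consumed = 0;                     // compressed bytes from earlier chunks

    z_stream zs{};
    if (inflateInit2(&zs, 15 + 16) != Z_OK)    // 15+16: expect a gzip wrapper
        return {};

    bool ok = true;
    size_t n;
    while (ok && (n = std::fread(in, 1, sizeof in, fp)) > 0) {
        zs.next_in = in;
        zs.avail_in = static_cast<unsigned>(n);
        while (ok && zs.avail_in > 0) {
            zs.next_out = out;                 // decompressed data is discarded
            zs.avail_out = sizeof out;
            int rc = inflate(&zs, Z_NO_FLUSH);
            if (rc == Z_STREAM_END) {
                // The next member (if any) starts right where this one ended.
                offsets.push_back(consumed + (n - zs.avail_in));
                inflateReset(&zs);             // re-arm for the next member
            } else if (rc != Z_OK) {
                ok = false;                    // corrupt or truncated input
            }
        }
        consumed += n;
    }
    inflateEnd(&zs);
    offsets.pop_back();  // assumes a well-formed file: last entry is EOF, not a member
    return offsets;
}
```

With these offsets recorded in the metadata, a downstream consumer can seek straight to a record and decompress only that member.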
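
On the robots.txt pass: the routing test itself is just a URL path match, which also catches 404 and other non-200 responses for /robots.txt, since they share the URL. A hypothetical sketch:

```cpp
#include <string_view>

// True if the record's target URI points at a robots.txt file.
bool is_robots_txt(std::string_view url) {
    // Strip query string / fragment, then compare the path suffix.
    size_t cut = url.find_first_of("?#");
    if (cut != std::string_view::npos) url = url.substr(0, cut);
    constexpr std::string_view suffix = "/robots.txt";
    return url.size() >= suffix.size() &&
           url.substr(url.size() - suffix.size()) == suffix;
}
```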
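
On the prefix/suffix removal: the simplest rule-based variant trims short segments from both ends of the document until a line with enough words appears. This is deliberately naive (it would also eat short headings), and the threshold is made up for illustration; a real version would likely also look at punctuation, link density, or a small classifier.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

static std::size_t word_count(const std::string &line) {
    std::istringstream iss(line);
    std::string w;
    std::size_t n = 0;
    while (iss >> w) ++n;
    return n;
}

// Drop short lines (likely menus/footers) from the start and end of a document.
std::vector<std::string> trim_boilerplate(const std::vector<std::string> &lines,
                                          std::size_t min_words = 5) {
    std::size_t begin = 0, end = lines.size();
    while (begin < end && word_count(lines[begin]) < min_words) ++begin;   // prefix
    while (end > begin && word_count(lines[end - 1]) < min_words) --end;   // suffix
    return {lines.begin() + begin, lines.begin() + end};
}
```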
