This repository was archived by the owner on Aug 9, 2024. It is now read-only.

Extracting features from XML dump #420

@leojoubert

Hi,

Thank you for your work on this package. For research purposes, I would like to extract features (and eventually reproduce the classification) for the entire XML dump of French Wikipedia (the 20181101 dump, for instance). Of course, this can hardly be done with API queries.

Is there a way to extract features while parsing an XML dump, for instance with mediawiki-utilities? :) I imagine it could be done by changing this line in the example code:

  extractor = Extractor(mwapi.Session(host="https://en.wikipedia.org",
                                      user_agent="revscoring demo"))

But not being a Python star (more of an R guy!), I'm quite confused. Could you show me a small example of how to parse, for instance, the first 5 revisions of a small dump file?

Thank you again for this work.
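
To make the request concrete, here is a minimal sketch of the dump-reading half only: streaming a MediaWiki XML export and stopping after the first 5 revisions. It uses only the Python standard library rather than mediawiki-utilities, and it does not call revscoring at all. The helper names `local_name` and `iter_revisions`, and the tiny inline `SAMPLE` dump, are made up for illustration.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a real dump file (illustrative data only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>A</title>
    <id>100</id>
    <revision><id>1</id><text>a1</text></revision>
    <revision><id>2</id><text>a2</text></revision>
    <revision><id>3</id><text>a3</text></revision>
  </page>
  <page>
    <title>B</title>
    <id>200</id>
    <revision><id>4</id><text>b1</text></revision>
    <revision><id>5</id><text>b2</text></revision>
    <revision><id>6</id><text>b3</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the XML namespace: '{http://...}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (page_title, revision_id, text) for every <revision> in the stream."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            # Only direct children are inspected, so a contributor's nested
            # <id> is never mistaken for the revision id.
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, text
            elem.clear()  # release the revision text once it has been consumed
        elif name == "page":
            elem.clear()


# Stop after the first 5 revisions instead of reading the whole dump.
first_five = list(itertools.islice(iter_revisions(io.BytesIO(SAMPLE)), 5))
for title, rev_id, text in first_five:
    print(title, rev_id, len(text))
```

For a real compressed dump, the same generator should accept the file object returned by `bz2.open(path, "rb")`, where `path` points to a downloaded dump file. Turning each revision's text into revscoring features is the part this sketch leaves open.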
