PaschalisAg/WikiTextGraph

WikiTextGraph: A Python Tool for Parsing Multilingual Wikipedia Text and Graph Extraction

Zenodo: DOI
Paper: DOI

WikiTextGraph is a Python package for parsing Wikipedia dumps, cleaning article texts, and generating graph representations of Wikipedia's link structure.

WikiTextGraph currently supports the 11 languages listed below, and the list is updated regularly. Language-specific settings are stored in LANG_SETTINGS.yml.

If you want to add a new language and enter the Pantheon of Acknowledgments below, go to Adding a new language or email us.

Supported Languages

Flag Language Code
🇬🇧 English en
🇪🇸 Spanish es
🇬🇷 Greek el
🇵🇱 Polish pl
🇩🇪 German de
Basque eu
🇳🇱 Dutch nl
🇮🇳 Hindi hi
🇮🇹 Italian it
🇻🇳 Vietnamese vi
🇺🇦 Ukrainian uk

Acknowledgements

A huge thank-you to all the native (L1) speakers who generously shared their time and expertise to help evaluate the results for each language during the development of WikiTextGraph.

In the same order as the languages appear in the app’s interface, here are the amazing people who helped us out:

  • Spanish (es) & Basque (eu): Amaia Elizaran Mendarte and Ane Escobar Fernández
  • Polish (pl): Adam Olejniczak and Zuzanna Lawera
  • Italian (it): Valerio Di Lisio
  • Hindi (hi): Anish Rao
  • German (de): Balthasar Braunewell
  • Vietnamese (vi): Phuong Thu Le
  • Ukrainian (uk): Kateryna Domina

We couldn’t have done it without you — thank you all! ❤️

Installation

Prerequisites

  • Python 3.9 or higher

Installation Steps

  1. Clone the repository:

    git clone https://github.com/PaschalisAg/WikiTextGraph.git
    cd WikiTextGraph
  2. Install the dependencies:

    pip install -r requirements.txt

    This will install all the required dependencies listed in requirements.txt.

Usage

For Non-Technical Users (GUI)

  1. Launch the application:

    python wikitextgraph.py
  2. Follow the steps in the GUI:

    • Step 1: Select the compressed XML dump file (*.bz2)
    • Step 2: Select a base directory for output files
    • Step 3: Select your language from the dropdown
    • Step 4: Choose whether to generate a graph
    • Click "Confirm Selection" to start processing

For Technical Users (Command Line)

Use the command-line interface for automation or batch processing:

python main.py --dump_filepath /path/to/dump.bz2 --language_code EN --base_dir /path/to/output --generate_graph

Options:

  • --dump_filepath: Path to the compressed Wikipedia XML dump file
  • --language_code: Language code of the dump (see Supported Languages)
  • --base_dir: Base directory for output files (defaults to current directory)
  • --generate_graph: Flag to generate the graph (optional)

The same options can also be passed to wikitextgraph.py:

python wikitextgraph.py \
  --dump_filepath /path/to/dump.bz2 \
  --language_code en \
  --base_dir /path/to/output \
  --generate_graph
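
For batch processing over several dumps, the command above can be assembled programmatically. The following is a minimal sketch, not part of WikiTextGraph itself: the helper name `build_command` and the dump paths are hypothetical, and it only uses the four flags documented above.

```python
import shlex


def build_command(dump_filepath, language_code, base_dir=".",
                  generate_graph=True, script="main.py"):
    """Build the argument list for one WikiTextGraph run.

    Only the flags documented in this README are used. `script` is the
    entry point to call (main.py by default).
    """
    cmd = [
        "python", script,
        "--dump_filepath", str(dump_filepath),
        "--language_code", language_code,
        "--base_dir", str(base_dir),
    ]
    if generate_graph:
        # --generate_graph is a flag, so it takes no value
        cmd.append("--generate_graph")
    return cmd


# Hypothetical dump locations, one per language code
dumps = {
    "en": "/path/to/en_dump.bz2",
    "el": "/path/to/el_dump.bz2",
}

for code, dump in dumps.items():
    # Print the shell-quoted command for inspection before running anything
    print(shlex.join(build_command(dump, code, "/path/to/output")))
```

To actually run each command, pass the list to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.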

Output

The tool creates the following directory structure:

base_dir/
└── en/
    ├── output/
    │   └── {language_code}_WP_titles_texts.parquet
    └── graph/
        ├── redirects_rev_mapping.pkl.gzip
        ├── {language_code}_id_node_mapping.parquet
        └── {language_code}_wiki_graph.parquet
  • {language_code}_WP_titles_texts.parquet: Contains the titles and cleaned text of each Wikipedia article.
  • redirects_rev_mapping.pkl.gzip: Mappings for redirect resolution.
  • {language_code}_id_node_mapping.parquet: Maps each node id to its corresponding string value for easier lookup.
  • {language_code}_wiki_graph.parquet: The final graph representation, stored as Source/Target pairs.
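
As a quick check after a run, the expected file locations can be derived from the directory tree above. This is a sketch under assumptions: the helper names `expected_outputs` and `load_graph` are hypothetical, the language subdirectory is assumed to match the language code (as in the `en/` example), and loading requires pandas with a parquet engine such as pyarrow.

```python
from pathlib import Path


def expected_outputs(base_dir, language_code):
    """Return the output paths implied by the directory tree in this README."""
    root = Path(base_dir) / language_code
    return {
        "texts": root / "output" / f"{language_code}_WP_titles_texts.parquet",
        "redirects": root / "graph" / "redirects_rev_mapping.pkl.gzip",
        "id_node_mapping": root / "graph" / f"{language_code}_id_node_mapping.parquet",
        "graph": root / "graph" / f"{language_code}_wiki_graph.parquet",
    }


def load_graph(base_dir, language_code):
    """Load the edge list as a DataFrame (assumes pandas and pyarrow are installed)."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(expected_outputs(base_dir, language_code)["graph"])


# Report which of the expected files exist under a hypothetical output directory
for name, path in expected_outputs("/path/to/output", "en").items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

The graph files are only expected when the run was started with --generate_graph.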

Adding a new language

To add a new language:

  1. Edit LANG_SETTINGS.yml and add a new entry with:

    • section_patt: Regular expression for identifying non-content sections
    • filter_out_patterns: Patterns for non-content pages to filter out
    • redirect_keywords: Keywords indicating redirect pages
  2. Update the language choices in the code (in main.py and gui.py)

  3. Open a pull request if you want to contribute your changes to this project. We will review them and, if they align with the project's objectives, merge them.
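
Before submitting a new language, it can help to sanity-check the entry. The sketch below is illustrative only: it assumes an entry is a mapping with the three keys listed in step 1, the function name `check_language_entry` is hypothetical, and the example values are placeholders rather than real settings from LANG_SETTINGS.yml.

```python
import re

# The three settings named in step 1
REQUIRED_KEYS = ("section_patt", "filter_out_patterns", "redirect_keywords")


def check_language_entry(entry):
    """Return a list of problems found in one language entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]

    # section_patt is described as a regular expression, so it should compile
    pattern = entry.get("section_patt")
    if isinstance(pattern, str):
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"section_patt is not a valid regex: {exc}")
    return problems


# Placeholder entry for illustration; the real value types may differ
example_entry = {
    "section_patt": r"==\s*(References|External links)\s*==",
    "filter_out_patterns": ["Category:", "Template:"],
    "redirect_keywords": ["#REDIRECT"],
}

print(check_language_entry(example_entry))  # prints []
```

An entry loaded from the YAML file with a YAML parser could be passed to the same function, provided the file's actual layout matches the assumption above.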

License

WikiTextGraph is licensed under the Apache License 2.0.

Under this license, you are free to:

  • Use the software for any purpose.
  • Modify and distribute the software.
  • Integrate it into your own projects.
  • Commercialize derived works.

However, the following conditions apply:

  • Attribution: You must provide appropriate credit to the original authors, include the license notice, and indicate if changes were made.
  • No Warranty: The software is provided "as is," without any express or implied warranties.
  • Patent Grant: If you contribute to the project, you grant a license to use any of your patents related to the contributed code.

For full details, see the Apache License 2.0.

Contributing

Contributions via pull requests, issue reports, and feature suggestions are highly encouraged. Please adhere to established coding guidelines and conventions. If you find a bug or have a request, open a GitHub issue with a clear description.

Citation

If you use WikiTextGraph in your research, we kindly request that you cite this repository:

@misc{WikiTextGraph,
  author = {Paschalis Agapitos and Juan Luis Suárez and Gustavo Ariel Schwartz},
  title = {WikiTextGraph: A Multi-Language Wikipedia Graph Parser},
  year = {2025},
  howpublished = {\url{https://github.com/PaschalisAg/WikiTextGraph}},
}

Contact

For questions, suggestions, or collaborations, feel free to open an issue or reach out via email at pasxalisag9@gmail.com.