WikiTextGraph is a Python package for parsing Wikipedia dumps, cleaning article texts, and generating graph representations of Wikipedia's link structure.
WikiTextGraph currently supports the 11 languages listed below, and the list is updated regularly. The language-specific settings are stored in `LANG_SETTINGS.yml`.
If you want to add a new language and enter the Pantheon of Acknowledgments below, go to Adding a new language or email us.
| Flag | Language | Code |
|---|---|---|
| 🇬🇧 | English | en |
| 🇪🇸 | Spanish | es |
| 🇬🇷 | Greek | el |
| 🇵🇱 | Polish | pl |
| 🇩🇪 | German | de |
| | Basque | eu |
| 🇳🇱 | Dutch | nl |
| 🇮🇳 | Hindi | hi |
| 🇮🇹 | Italian | it |
| 🇻🇳 | Vietnamese | vi |
| 🇺🇦 | Ukrainian | uk |
A huge thank-you to all the native (L1) speakers who generously shared their time and expertise to help evaluate the results for each language during the development of WikiTextGraph.
In the same order as the languages appear in the app’s interface, here are the amazing people who helped us out:
- Spanish (es) & Basque (eu): Amaia Elizaran Mendarte and Ane Escobar Fernández
- Polish (pl): Adam Olejniczak and Zuzanna Lawera
- Italian (it): Valerio Di Lisio
- Hindi (hi): Anish Rao
- German (de): Balthasar Braunewell
- Vietnamese (vi): Phuong Thu Le
- Ukrainian (uk): Kateryna Domina
We couldn’t have done it without you — thank you all! ❤️
- Python 3.9 or higher
- Clone the repository:

      git clone https://github.com/yourusername/WikiTextGraph.git
      cd WikiTextGraph

- Install the package and its dependencies:

      pip install -r requirements.txt

  This will install all the required dependencies listed in `requirements.txt`.

- Launch the application:

      python wikitextgraph.py

- Follow the steps in the GUI:
  - Step 1: Select the compressed XML dump file (`*.bz2`)
  - Step 2: Select a base directory for output files
  - Step 3: Select your language from the dropdown
  - Step 4: Choose whether to generate a graph
  - Click "Confirm Selection" to start processing

Use the command-line interface for automation or batch processing:

    python main.py --dump_filepath /path/to/dump.bz2 --language_code EN --base_dir /path/to/output --generate_graph

Options:

- `--dump_filepath`: Path to the compressed Wikipedia XML dump file
- `--language_code`: Language code
- `--base_dir`: Base directory for output files (defaults to the current directory)
- `--generate_graph`: Flag to generate the graph (optional)

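For batch processing, the documented flags can be assembled programmatically. The following is a minimal sketch, not part of WikiTextGraph itself: `build_command` is a hypothetical helper, the dump paths are placeholders, and it assumes `main.py` is run from the repository root with the lowercase language codes from the table of supported languages.

```python
import sys


def build_command(dump_filepath, language_code, base_dir=".", generate_graph=True):
    """Return the argument list for one run of the documented CLI."""
    cmd = [
        sys.executable, "main.py",
        "--dump_filepath", str(dump_filepath),
        "--language_code", language_code,
        "--base_dir", str(base_dir),
    ]
    if generate_graph:
        cmd.append("--generate_graph")  # optional flag, takes no value
    return cmd


# Placeholder dump locations (assumption); replace with your own files.
dumps = {
    "en": "/path/to/enwiki-dump.bz2",
    "es": "/path/to/eswiki-dump.bz2",
}

commands = [
    build_command(path, code, base_dir="/path/to/output")
    for code, path in dumps.items()
]

for cmd in commands:
    print(" ".join(cmd))
    # To actually run each job from the repository root:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.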
You can also use the installed command-line tool:

    python wikitextgraph.py \
      --dump_filepath /path/to/dump.bz2 \
      --language_code en \
      --base_dir /path/to/output \
      --generate_graph

The tool creates the following directory structure:

    base_dir/
    └── en/
        ├── output/
        │   └── {language_code}_WP_titles_texts.parquet
        └── graph/
            ├── redirects_rev_mapping.pkl.gzip
            ├── {language_code}_id_node_mapping.parquet
            └── {language_code}_wiki_graph.parquet

- `{language_code}_WP_titles_texts.parquet`: Contains the titles and cleaned text of each Wikipedia article.
- `redirects_rev_mapping.pkl.gzip`: Mappings for redirect resolution.
- `{language_code}_id_node_mapping.parquet`: Maps each node id to its corresponding string value for easier access.
- `{language_code}_wiki_graph.parquet`: The final graph representation as Source/Target pairs.

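When scripting around the tool, it can help to compute where these files should land. The sketch below follows the directory tree shown above; `expected_outputs` and `missing_outputs` are hypothetical helpers, and the sketch assumes that the language folder is named after the lowercase language code and that the `graph/` files are only produced when graph generation is requested.

```python
from pathlib import Path


def expected_outputs(base_dir, language_code, with_graph=True):
    """Return the documented output paths, keyed by a short label."""
    lang_dir = Path(base_dir) / language_code
    paths = {
        "texts": lang_dir / "output" / f"{language_code}_WP_titles_texts.parquet",
    }
    if with_graph:
        # Assumption: these files exist only after a run with --generate_graph.
        graph_dir = lang_dir / "graph"
        paths["redirects"] = graph_dir / "redirects_rev_mapping.pkl.gzip"
        paths["id_node_mapping"] = graph_dir / f"{language_code}_id_node_mapping.parquet"
        paths["graph"] = graph_dir / f"{language_code}_wiki_graph.parquet"
    return paths


def missing_outputs(base_dir, language_code, with_graph=True):
    """List the labels whose files are not on disk yet."""
    outputs = expected_outputs(base_dir, language_code, with_graph)
    return [label for label, path in outputs.items() if not path.exists()]


for label, path in expected_outputs("base_dir", "en").items():
    print(f"{label:16} {path}")
```

A check such as `missing_outputs("/path/to/output", "en")` can then be used to decide whether a language still needs processing.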
To add a new language:

- Edit `LANG_SETTINGS.yml` and add a new entry with:
  - `section_patt`: Regular expression for identifying non-content sections
  - `filter_out_patterns`: Patterns for non-content pages to filter out
  - `redirect_keywords`: Keywords indicating redirect pages
- Update the language choices in the code (in `main.py` and `gui.py`).
- Open a pull request if you want to contribute to this project. We will evaluate your changes and, if they align with the project's objectives, merge them.

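Before opening a pull request, a new language entry can be sanity-checked. The sketch below is illustrative only: `check_language_entry` is a hypothetical helper, the example values are not taken from `LANG_SETTINGS.yml`, and it assumes that `section_patt` is a single regex string, `filter_out_patterns` is a list of regex strings, and `redirect_keywords` is a list of strings. The entry is written as a plain dictionary so the example needs no YAML library.

```python
import re

REQUIRED_KEYS = ("section_patt", "filter_out_patterns", "redirect_keywords")


def check_language_entry(entry):
    """Return a list of problems found in one language entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]

    # Collect every regex the entry declares (types are assumptions, see above).
    patterns = []
    if isinstance(entry.get("section_patt"), str):
        patterns.append(("section_patt", entry["section_patt"]))
    for pattern in entry.get("filter_out_patterns", []):
        patterns.append(("filter_out_patterns", pattern))

    # Each pattern must at least compile.
    for key, pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{key}: invalid regex {pattern!r} ({exc})")

    if "redirect_keywords" in entry and not entry["redirect_keywords"]:
        problems.append("redirect_keywords is empty")
    return problems


# Hypothetical entry for illustration only.
example_entry = {
    "section_patt": r"==\s*(References|External links)\s*==",
    "filter_out_patterns": [r"^Category:", r"^Template:"],
    "redirect_keywords": ["#REDIRECT"],
}

print(check_language_entry(example_entry))
print(check_language_entry({"section_patt": "("}))
```

A check like this only catches structural mistakes; whether the patterns match the right sections still needs review by someone who reads the language.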
WikiTextGraph is licensed under the Apache License 2.0.
Under this license, you are free to:
- Use the software for any purpose.
- Modify and distribute the software.
- Integrate it into your own projects.
- Commercialize derived works.
However, the following conditions apply:
- Attribution: You must provide appropriate credit to the original authors, include the license notice, and indicate if changes were made.
- No Warranty: The software is provided "as is," without any express or implied warranties.
- Patent Grant: If you contribute to the project, you grant a license to use any of your patents related to the contributed code.
For full details, see the Apache License 2.0.
Contributions via pull requests, issue reports, and feature suggestions are highly encouraged. Please adhere to established coding guidelines and conventions. If you find a bug or have a request, open a GitHub issue with a clear description.
If you use WikiTextGraph in your research, we kindly request that you cite this repository:

    @misc{WikiTextGraph,
      author = {Paschalis Agapitos and Juan Luis Suárez and Gustavo Ariel Schwartz},
      title = {WikiTextGraph: A Multi-Language Wikipedia Graph Parser},
      year = {2025},
      howpublished = {\url{https://github.com/PaschalisAg/WikiTextGraph}},
    }

For questions, suggestions, or collaborations, feel free to open an issue or reach out via email at pasxalisag9@gmail.com.
