Add flores analysis #727
Pull request overview
This pull request adds comprehensive analysis and visualization tooling for the Flores-200 Restructured dataset. The changes introduce several Python scripts for data filtering, tokenization analysis, and visualization, along with shell script automation for phoneticization and plot generation.
Changes:
- Added data filtering and tokenization scripts (filter_files_by_script.py, tokenize_and_annotate_sizes.py) for processing language/script metadata
- Added visualization scripts for plotting language file sizes, tokenization ratios, and IPA comparisons grouped by region/script/family
- Automated phoneticization workflow using espeak2ipa.py with support for multiple languages via shell script loops
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| data/template/utils/espeak2ipa.py | Refactored to add text output modes and improved flexibility for JP-style output |
| data/flores200-res/filter_files_by_script.py | New script to extract language/script metadata from files.json |
| data/flores200-res/tokenize_and_annotate_sizes.py | New script to tokenize texts and annotate with size metadata |
| data/flores200-res/plot_*.py | New visualization scripts for various grouped analysis plots |
| data/flores200-res/spm_vocab_freq_dashboard.py | New SentencePiece vocabulary frequency dashboard generator |
| data/flores200-res/phoneticize.sh | Updated to automate IPA conversion for multiple languages |
| data/flores200-res/*.sh | New shell automation scripts for graphs, tokenization, and IPA processing |
| data/flores200-res/README.md | New documentation describing dataset purpose and licensing |
| data/flores200-res/filtered_files.json | New data file with filtered language metadata |
| data/flores200-res/ko_stats.json | New Korean phoneticization statistics |
| data/flores200-res/.gitignore | New gitignore to exclude generated PNG files |
| #!/bin/bash | ||
Copilot
AI
Jan 20, 2026
The filename contains a typo: "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh". This typo makes the purpose of the script unclear and may cause confusion for users.
| #!/bin/bash | |
| #!/bin/bash | |
| # NOTE: This script's filename historically contained a typo: | |
| # "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh". | |
| # It is kept under the old name for compatibility, but new usages should | |
| # prefer the correctly spelled name. | |
| script_name="$(basename "$0")" | |
| if [ "$script_name" = "tokenization_vs_origina.sh" ]; then | |
| echo "Warning: script invoked as 'tokenization_vs_origina.sh'; consider renaming or using 'tokenization_vs_original.sh' instead." >&2 | |
| fi |
| ## Language Codes | ||
| Language Codes here for Flore-200: |
Copilot
AI
Jan 20, 2026
Typo in comment: "Flore-200" should be "Flores-200" for consistency with other references in the documentation.
| Language Codes here for Flore-200: | |
| Language Codes here for Flores-200: |
| # --filtered-json filtered_tiles.json \ | ||
| # --tok-method tiktoken \ | ||
| # --skip-missing-tok |
Copilot
AI
Jan 20, 2026
Typo in the --filtered-json comment on line 17: "filtered_tiles.json" should be "filtered_files.json" to match the expected filename used elsewhere in the codebase.
| # --filtered-json filtered_tiles.json \ | |
| # --tok-method tiktoken \ | |
| # --skip-missing-tok | |
| # --filtered-json filtered_files.json \ | |
| # --tok-method tiktoken \ | |
| # --skip-missing-tok |
| def main(): | ||
| ap = argparse.ArgumentParser() | ||
| ap.add_argument("--json", default="files.json", help="Input files.json") | ||
| ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON") |
Copilot
AI
Jan 20, 2026
The default value for --out in filter_files_by_script.py is "filtered_scripts.json", but other scripts in this PR reference "filtered_files.json". This inconsistency could lead to confusion about which file to use. Consider aligning the default output filename with the actual usage pattern shown in other scripts (tokenize.sh, tokenization_vs_origina.sh, plot_ipa_vs_text.py references).
| ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON") | |
| ap.add_argument("--out", default="filtered_files.json", help="Output JSON") |
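One way to keep the producer's default and the consumers' expectations from drifting apart is to define the filename once and reuse it. The sketch below is illustrative only: `DEFAULT_FILTERED_JSON` and the two parser-builder functions are hypothetical names, not part of the PR, and the consumer parser is assumed to expose the `--filtered-json` flag quoted in the shell comments above.

```python
import argparse

# Hypothetical shared constant; the PR itself does not define one.
DEFAULT_FILTERED_JSON = "filtered_files.json"


def build_filter_parser() -> argparse.ArgumentParser:
    """Parser for the producer step (mirrors the flags quoted in the review)."""
    ap = argparse.ArgumentParser()
    ap.add_argument("--json", default="files.json", help="Input files.json")
    ap.add_argument("--out", default=DEFAULT_FILTERED_JSON, help="Output JSON")
    return ap


def build_consumer_parser() -> argparse.ArgumentParser:
    """Parser for a downstream step that reads the filtered file (assumed shape)."""
    ap = argparse.ArgumentParser()
    ap.add_argument(
        "--filtered-json",
        default=DEFAULT_FILTERED_JSON,
        help="Filtered JSON produced by the filter step",
    )
    return ap


# With no CLI arguments, both sides fall back to the same filename.
producer = build_filter_parser().parse_args([])
consumer = build_consumer_parser().parse_args([])
print(producer.out == consumer.filtered_json)
```

Because both defaults come from a single constant, renaming the file later would only require one change.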
| @@ -0,0 +1,7 @@ | |||
| #!/bin/bash | |||
Copilot
AI
Jan 20, 2026
Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the script may work when sourced or executed with an explicit shell, adding the shebang is a best practice for portability and clarity.
| @@ -0,0 +1,21 @@ | |||
| #!/bin/bash | |||
Copilot
AI
Jan 20, 2026
Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution.
| @@ -0,0 +1,23 @@ | |||
| # include tokenized comparison (uses tokenized_sizes["tiktoken"] from filtered_scripts.json) | |||
Copilot
AI
Jan 20, 2026
Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the file has a comment on line 1, the shebang should be the very first line.
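The three shebang comments above describe a check that is easy to automate. The following is a minimal sketch, assuming the scripts are `*.sh` files in a single directory; the helper names and the two demo files (`good.sh`, `bad.sh`) are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def has_shebang(path: Path) -> bool:
    """Return True if the file's very first two bytes are '#!'."""
    with path.open("rb") as fh:
        return fh.read(2) == b"#!"


def scripts_missing_shebang(directory: Path) -> List[str]:
    """List the names of *.sh files in `directory` that lack a leading shebang."""
    return sorted(p.name for p in directory.glob("*.sh") if not has_shebang(p))


# Demo on throwaway files: one correct script, and one where a comment
# precedes the shebang (the case flagged in the last review comment).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.sh").write_text("#!/bin/bash\necho ok\n")
    (root / "bad.sh").write_text("# a comment first\n#!/bin/bash\necho ok\n")
    missing = scripts_missing_shebang(root)

print(missing)
```

The check reads only the first two bytes, so a shebang that appears on any line other than the first is reported as missing, which matches the reviewer's point that it must be the very first line.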
| import json | ||
| from typing import List, Tuple, Optional, Dict, Any | ||
| from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, TimeElapsedColumn, MofNCompleteColumn | ||
| from typing import List, Optional, Dict, Any, Tuple |
Copilot
AI
Jan 20, 2026
Import of 'Tuple' is not used.
| from typing import List, Optional, Dict, Any, Tuple | |
| from typing import List, Optional, Dict, Any |
| import argparse | ||
| import json | ||
| import os |
Copilot
AI
Jan 20, 2026
Import of 'os' is not used.
| import os |
| except json.JSONDecodeError: | ||
| print(f"Error: Invalid JSON format in '{args.input_file}'.") |
Copilot
AI
Jan 20, 2026
This except block handling JSONDecodeError is unreachable, because an earlier except block for the more general ValueError always subsumes it (json.JSONDecodeError is a subclass of ValueError).
| except json.JSONDecodeError: | |
| print(f"Error: Invalid JSON format in '{args.input_file}'.") |
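The reasoning behind this comment can be checked directly. The sketch below assumes, as the comment implies, that the `ValueError` handler comes before the `JSONDecodeError` handler in the original code; the function names `classify` and `classify_reordered` are hypothetical. The PR suggestion simply deletes the unreachable block; reordering the handlers, shown second, is an alternative when the more specific message is worth keeping.

```python
import json

# json.JSONDecodeError derives from ValueError, so a ValueError handler
# placed first will catch every JSON decoding failure.
print(issubclass(json.JSONDecodeError, ValueError))


def classify(text: str) -> str:
    """Handler order assumed from the review comment: general first."""
    try:
        json.loads(text)
    except ValueError:
        return "ValueError handler"
    except json.JSONDecodeError:  # never reached: subsumed by the block above
        return "JSONDecodeError handler"
    return "ok"


def classify_reordered(text: str) -> str:
    """Alternative ordering: the specific exception is handled first."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "JSONDecodeError handler"
    except ValueError:
        return "ValueError handler"
    return "ok"


print(classify("{bad"))            # ValueError handler
print(classify_reordered("{bad"))  # JSONDecodeError handler
```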
This pull request adds several new scripts and utilities to support analysis and visualization of the Flores-200 Restructured dataset, as well as updates to the phoneticization and dataset handling scripts. The main focus is on improved data filtering, language/script grouping, phoneticization automation, and enhanced documentation for reproducibility.
New scripts and utilities:
- filter_files_by_script.py, a script to process files.json and emit a simplified JSON with key fields for script/language analysis, including size normalization.
- plot_hist_by_region.py, a Python script to visualize language file sizes grouped and color-coded by region, inferring regions from script codes.
- graphs.sh and ipa_scripts.sh, shell scripts to automate generation of various grouped plots and IPA/text comparison plots, respectively.

Phoneticization and statistics automation:
- phoneticize.sh, updated to automate IPA conversion for additional languages using espeak2ipa.py, replacing hardcoded commands with a loop over a language array.
- ko_stats.json, added to store statistics for Korean phoneticization.

Dataset and documentation improvements:
- get_dataset.sh, updated to expand the default language array and provide a commented-out full language list for easier customization.
- README.md, added to describe the purpose of the folder, dataset licensing, and references for language/script codes.
- *.png added to .gitignore to avoid tracking generated plot images.