Conversation

klei22 (Collaborator) commented Jan 20, 2026

This pull request adds several new scripts and utilities to support analysis and visualization of the Flores-200 Restructured dataset, as well as updates to the phoneticization and dataset handling scripts. The main focus is on improved data filtering, language/script grouping, phoneticization automation, and enhanced documentation for reproducibility.

New scripts and utilities:

  • Added filter_files_by_script.py, a script to process files.json and emit a simplified JSON with key fields for script/language analysis, including size normalization (a rough sketch follows this list).
  • Added plot_hist_by_region.py, a Python script to visualize language file sizes grouped and color-coded by region, inferring regions from script codes.
  • Added graphs.sh and ipa_scripts.sh shell scripts to automate generation of various grouped plots and IPA/text comparison plots, respectively. [1] [2]
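
The fields and normalization used by filter_files_by_script.py are not spelled out above, so here is a minimal sketch of that filtering step; the record shape, field names, and normalization scheme are assumptions for illustration, not taken from the script.

```python
# Hypothetical sketch of the filtering step; the real filter_files_by_script.py
# defines its own record shape, field names, and normalization.
import json

def filter_files(in_path="files.json", out_path="filtered_files.json"):
    with open(in_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    # Assumed record shape: {"file": ..., "lang": ..., "script": ..., "size": ...}
    sizes = [r["size"] for r in records if "size" in r]
    max_size = max(sizes) if sizes else 1

    simplified = [
        {
            "lang": r.get("lang"),
            "script": r.get("script"),
            "size": r.get("size"),
            "size_norm": r.get("size", 0) / max_size,  # normalize to [0, 1]
        }
        for r in records
    ]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(simplified, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    filter_files()
```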

Phoneticization and statistics automation:

  • Updated phoneticize.sh to automate IPA conversion for additional languages using espeak2ipa.py, replacing hardcoded commands with a loop over a language array.
  • Added ko_stats.json to store statistics for Korean phoneticization.

Dataset and documentation improvements:

  • Updated get_dataset.sh to expand the default language array and provide a commented-out full language list for easier customization.
  • Added a comprehensive README.md describing the purpose of the folder, dataset licensing, and references for language/script codes.
  • Added *.png to .gitignore to avoid tracking generated plot images.

Copilot AI left a comment

Pull request overview

This pull request adds comprehensive analysis and visualization tooling for the Flores-200 Restructured dataset. The changes introduce several Python scripts for data filtering, tokenization analysis, and visualization, along with shell script automation for phoneticization and plot generation.

Changes:

  • Added data filtering and tokenization scripts (filter_files_by_script.py, tokenize_and_annotate_sizes.py) for processing language/script metadata
  • Added visualization scripts for plotting language file sizes, tokenization ratios, and IPA comparisons grouped by region/script/family (a token-ratio sketch follows this list)
  • Automated phoneticization workflow using espeak2ipa.py with support for multiple languages via shell script loops
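
As a rough illustration of the tokenization-ratio idea, a tokens-per-byte metric with tiktoken could be computed as below; the encoding name and the exact metric are assumptions, since the actual tokenizer configuration lives in tokenize_and_annotate_sizes.py and the plotting scripts.

```python
# Illustrative only: tokens-per-byte with tiktoken, assuming the cl100k_base
# encoding; the PR's scripts define their own tokenizers and metrics.
import tiktoken

def tokens_per_byte(text: str, encoding_name: str = "cl100k_base") -> float:
    """Lower values mean the tokenizer represents the text more compactly."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    n_bytes = len(text.encode("utf-8"))
    return n_tokens / n_bytes if n_bytes else 0.0

print(tokens_per_byte("The quick brown fox jumps over the lazy dog."))
print(tokens_per_byte("다람쥐 헌 쳇바퀴에 타고파."))  # Korean pangram for comparison
```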

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.

Summary per file:

  • data/template/utils/espeak2ipa.py: Refactored to add text output modes and improved flexibility for JP-style output
  • data/flores200-res/filter_files_by_script.py: New script to extract language/script metadata from files.json
  • data/flores200-res/tokenize_and_annotate_sizes.py: New script to tokenize texts and annotate them with size metadata
  • data/flores200-res/plot_*.py: New visualization scripts for various grouped analysis plots
  • data/flores200-res/spm_vocab_freq_dashboard.py: New SentencePiece vocabulary frequency dashboard generator
  • data/flores200-res/phoneticize.sh: Updated to automate IPA conversion for multiple languages
  • data/flores200-res/*.sh: New shell automation scripts for graphs, tokenization, and IPA processing
  • data/flores200-res/README.md: New documentation describing dataset purpose and licensing
  • data/flores200-res/filtered_files.json: New data file with filtered language metadata
  • data/flores200-res/ko_stats.json: New Korean phoneticization statistics
  • data/flores200-res/.gitignore: New gitignore to exclude generated PNG files


Comment on lines +1 to +2
#!/bin/bash

Copilot AI Jan 20, 2026

The filename contains a typo: "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh". This typo makes the purpose of the script unclear and may cause confusion for users.

Suggested change
-#!/bin/bash
+#!/bin/bash
+# NOTE: This script's filename historically contained a typo:
+# "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh".
+# It is kept under the old name for compatibility, but new usages should
+# prefer the correctly spelled name.
+script_name="$(basename "$0")"
+if [ "$script_name" = "tokenization_vs_origina.sh" ]; then
+  echo "Warning: script invoked as 'tokenization_vs_origina.sh'; consider renaming or using 'tokenization_vs_original.sh' instead." >&2
+fi


## Language Codes

Language Codes here for Flore-200:

Copilot AI Jan 20, 2026

Typo in comment: "Flore-200" should be "Flores-200" for consistency with other references in the documentation.

Suggested change
-Language Codes here for Flore-200:
+Language Codes here for Flores-200:

Comment on lines +17 to +19
# --filtered-json filtered_tiles.json \
# --tok-method tiktoken \
# --skip-missing-tok

Copilot AI Jan 20, 2026

Typo in the --filtered-json comment on line 17: "filtered_tiles.json" should be "filtered_files.json" to match the expected filename used elsewhere in the codebase.

Suggested change
-# --filtered-json filtered_tiles.json \
-# --tok-method tiktoken \
-# --skip-missing-tok
+# --filtered-json filtered_files.json \
+# --tok-method tiktoken \
+# --skip-missing-tok

def main():
ap = argparse.ArgumentParser()
ap.add_argument("--json", default="files.json", help="Input files.json")
ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON")

Copilot AI Jan 20, 2026

The default value for --out in filter_files_by_script.py is "filtered_scripts.json", but other scripts in this PR reference "filtered_files.json". This inconsistency could lead to confusion about which file to use. Consider aligning the default output filename with the actual usage pattern shown in other scripts (tokenize.sh, tokenization_vs_origina.sh, plot_ipa_vs_text.py references).

Suggested change
-ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON")
+ap.add_argument("--out", default="filtered_files.json", help="Output JSON")

@@ -0,0 +1,7 @@
#!/bin/bash

Copilot AI Jan 20, 2026

Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the script may work when sourced or executed with an explicit shell, adding the shebang is a best practice for portability and clarity.

@@ -0,0 +1,21 @@
#!/bin/bash

Copilot AI Jan 20, 2026

Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution.

@@ -0,0 +1,23 @@
# include tokenized comparison (uses tokenized_sizes["tiktoken"] from filtered_scripts.json)

Copilot AI Jan 20, 2026

Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the file has a comment on line 1, the shebang should be the very first line.

import json
from typing import List, Tuple, Optional, Dict, Any
from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, TimeElapsedColumn, MofNCompleteColumn
from typing import List, Optional, Dict, Any, Tuple

Copilot AI Jan 20, 2026

Import of 'Tuple' is not used.

Suggested change
-from typing import List, Optional, Dict, Any, Tuple
+from typing import List, Optional, Dict, Any


import argparse
import json
import os

Copilot AI Jan 20, 2026

Import of 'os' is not used.

Suggested change
import os

Comment on lines +448 to +449
except json.JSONDecodeError:
print(f"Error: Invalid JSON format in '{args.input_file}'.")

Copilot AI Jan 20, 2026

This except block handling JSONDecodeError is unreachable: json.JSONDecodeError is a subclass of ValueError, so the preceding except block for the more general ValueError always subsumes it.

Suggested change
-except json.JSONDecodeError:
-    print(f"Error: Invalid JSON format in '{args.input_file}'.")
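
For reference (not part of the PR's code), a minimal illustration of the ordering issue described above:

```python
import json

try:
    json.loads("{not valid json}")
except ValueError:
    # json.JSONDecodeError subclasses ValueError, so the decode error
    # is caught here...
    print("caught as ValueError")
except json.JSONDecodeError:
    # ...and this branch can never run.
    print("caught as JSONDecodeError")
```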
