Add flores analysis #727

klei22 · 2026-01-20T17:23:31Z

This pull request adds several new scripts and utilities to support analysis and visualization of the Flores-200 Restructured dataset, as well as updates to the phoneticization and dataset handling scripts. The main focus is on improved data filtering, language/script grouping, phoneticization automation, and enhanced documentation for reproducibility.

New scripts and utilities:

Added filter_files_by_script.py, a script to process files.json and emit a simplified JSON with key fields for script/language analysis, including size normalization.
Added plot_hist_by_region.py, a Python script to visualize language file sizes grouped and color-coded by region, inferring regions from script codes.
Added graphs.sh and ipa_scripts.sh shell scripts to automate generation of various grouped plots and IPA/text comparison plots, respectively.
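
As a rough illustration of the script-based grouping described above, the sketch below splits FLORES-style codes such as `eng_Latn` into language and script parts and totals file sizes per inferred region. The record layout (`code`, `size_bytes`), the sample sizes, and the `SCRIPT_TO_REGION` table are assumptions made for illustration; they are not taken from files.json or from plot_hist_by_region.py.

```python
from collections import defaultdict

# Hypothetical script-to-region table; the mapping actually used by
# plot_hist_by_region.py may differ.
SCRIPT_TO_REGION = {
    "Hang": "East Asia",
    "Hant": "East Asia",
    "Deva": "South Asia",
}


def split_code(code):
    """Split a FLORES-style code like 'eng_Latn' into (language, script)."""
    lang, _, script = code.partition("_")
    return lang, script


def total_size_by_region(records):
    """Sum sizes per inferred region; unmapped scripts fall into 'Other'."""
    totals = defaultdict(int)
    for rec in records:
        _, script = split_code(rec["code"])
        region = SCRIPT_TO_REGION.get(script, "Other")
        totals[region] += rec["size_bytes"]
    return dict(totals)


# Made-up sizes, for illustration only.
SAMPLE = [
    {"code": "kor_Hang", "size_bytes": 250},
    {"code": "yue_Hant", "size_bytes": 50},
    {"code": "hin_Deva", "size_bytes": 70},
    {"code": "eng_Latn", "size_bytes": 100},
]
```

Calling `total_size_by_region(SAMPLE)` groups the Korean and Cantonese entries together because both scripts map to the same assumed region, while the Latin-script entry lands in the fallback bucket.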

Phoneticization and statistics automation:

Updated phoneticize.sh to automate IPA conversion for additional languages using espeak2ipa.py, replacing hardcoded commands with a loop over a language array.
Added ko_stats.json to store statistics for Korean phoneticization.

Dataset and documentation improvements:

Updated get_dataset.sh to expand the default language array and provide a commented-out full language list for easier customization.
Added a comprehensive README.md describing the purpose of the folder, dataset licensing, and references for language/script codes.
Added *.png to .gitignore to avoid tracking generated plot images.

Copilot

Pull request overview

This pull request adds comprehensive analysis and visualization tooling for the Flores-200 Restructured dataset. The changes introduce several Python scripts for data filtering, tokenization analysis, and visualization, along with shell script automation for phoneticization and plot generation.

Changes:

Added data filtering and tokenization scripts (filter_files_by_script.py, tokenize_and_annotate_sizes.py) for processing language/script metadata
Added visualization scripts for plotting language file sizes, tokenization ratios, and IPA comparisons grouped by region/script/family
Automated phoneticization workflow using espeak2ipa.py with support for multiple languages via shell script loops

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.


File	Description
data/template/utils/espeak2ipa.py	Refactored to add text output modes and improved flexibility for JP-style output
data/flores200-res/filter_files_by_script.py	New script to extract language/script metadata from files.json
data/flores200-res/tokenize_and_annotate_sizes.py	New script to tokenize texts and annotate with size metadata
data/flores200-res/plot_*.py	New visualization scripts for various grouped analysis plots
data/flores200-res/spm_vocab_freq_dashboard.py	New SentencePiece vocabulary frequency dashboard generator
data/flores200-res/phoneticize.sh	Updated to automate IPA conversion for multiple languages
data/flores200-res/*.sh	New shell automation scripts for graphs, tokenization, and IPA processing
data/flores200-res/README.md	New documentation describing dataset purpose and licensing
data/flores200-res/filtered_files.json	New data file with filtered language metadata
data/flores200-res/ko_stats.json	New Korean phoneticization statistics
data/flores200-res/.gitignore	New gitignore to exclude generated PNG files


Copilot · 2026-01-20T17:27:14Z

data/flores200-res/tokenization_vs_origina.sh

+#!/bin/bash
+


The filename contains a typo: "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh". This typo makes the purpose of the script unclear and may cause confusion for users.

Suggested change

-#!/bin/bash
+#!/bin/bash
+# NOTE: This script's filename historically contained a typo:
+#   "tokenization_vs_origina.sh" should be "tokenization_vs_original.sh".
+# It is kept under the old name for compatibility, but new usages should
+# prefer the correctly spelled name.
+script_name="$(basename "$0")"
+if [ "$script_name" = "tokenization_vs_origina.sh" ]; then
+  echo "Warning: script invoked as 'tokenization_vs_origina.sh'; consider renaming or using 'tokenization_vs_original.sh' instead." >&2
+fi

Copilot · 2026-01-20T17:27:15Z

data/flores200-res/README.md

+
+## Language Codes
+
+Language Codes here for Flore-200:


Typo in the README text: "Flore-200" should be "Flores-200" for consistency with other references in the documentation.

Suggested change

-Language Codes here for Flore-200:
+Language Codes here for Flores-200:

Copilot · 2026-01-20T17:27:15Z

data/flores200-res/ipa_scripts.sh

+#   --filtered-json filtered_tiles.json \
+#   --tok-method tiktoken \
+#   --skip-missing-tok


Typo in the --filtered-json comment on line 17: "filtered_tiles.json" should be "filtered_files.json" to match the expected filename used elsewhere in the codebase.

Suggested change

-#   --filtered-json filtered_tiles.json \
-#   --tok-method tiktoken \
-#   --skip-missing-tok
+#   --filtered-json filtered_files.json \
+#   --tok-method tiktoken \
+#   --skip-missing-tok

Copilot · 2026-01-20T17:27:16Z

data/flores200-res/filter_files_by_script.py

+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--json", default="files.json", help="Input files.json")
+    ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON")


The default value for --out in filter_files_by_script.py is "filtered_scripts.json", but other scripts in this PR reference "filtered_files.json". This inconsistency could lead to confusion about which file to use. Consider aligning the default output filename with the actual usage pattern shown in other scripts (tokenize.sh, tokenization_vs_origina.sh, plot_ipa_vs_text.py references).

Suggested change

-    ap.add_argument("--out", default="filtered_scripts.json", help="Output JSON")
+    ap.add_argument("--out", default="filtered_files.json", help="Output JSON")
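
One way to keep the default from drifting again is to define the filename once and reuse it wherever the parser is built. The sketch below assumes a module-level constant named `DEFAULT_FILTERED_JSON` and a `build_parser` helper; both names are hypothetical and do not appear in the PR.

```python
import argparse

# Hypothetical shared constant; other scripts could import it instead of
# repeating the literal filename.
DEFAULT_FILTERED_JSON = "filtered_files.json"


def build_parser():
    """Build the CLI parser with the aligned default output filename."""
    ap = argparse.ArgumentParser()
    ap.add_argument("--json", default="files.json", help="Input files.json")
    ap.add_argument("--out", default=DEFAULT_FILTERED_JSON, help="Output JSON")
    return ap
```

With no arguments, `build_parser().parse_args([]).out` resolves to the same `filtered_files.json` that the other scripts reference, and an explicit `--out` still overrides it.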

Copilot · 2026-01-20T17:27:16Z

data/flores200-res/tokenize.sh

@@ -0,0 +1,7 @@
+#!/bin/bash


Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the script may work when sourced or executed with an explicit shell, adding the shebang is a best practice for portability and clarity.

Copilot · 2026-01-20T17:27:16Z

data/flores200-res/tokenization_vs_origina.sh

@@ -0,0 +1,21 @@
+#!/bin/bash


Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution.

Copilot · 2026-01-20T17:27:16Z

data/flores200-res/ipa_scripts.sh

@@ -0,0 +1,23 @@
+# include tokenized comparison (uses tokenized_sizes["tiktoken"] from filtered_scripts.json)


Missing shebang line. Shell scripts should start with a shebang (#!/bin/bash) for proper execution. While the file has a comment on line 1, the shebang should be the very first line.
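
Shebang findings like the ones above can be verified mechanically before review. The following is a minimal sketch, not part of the PR: the helper names `has_shebang` and `scripts_missing_shebang` are made up here, and the check only looks at the first two bytes of each `*.sh` file.

```python
from pathlib import Path


def has_shebang(path):
    """Return True if the file's very first two bytes are '#!'."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"#!"


def scripts_missing_shebang(directory):
    """List names of *.sh files in `directory` that do not start with '#!'."""
    return sorted(
        p.name for p in Path(directory).glob("*.sh") if not has_shebang(p)
    )
```

A file whose first line is an ordinary comment, with the shebang further down, is reported as missing one, since the shebang only takes effect as the very first line.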

Copilot · 2026-01-20T17:27:17Z

data/template/utils/espeak2ipa.py

 import json
-from typing import List, Tuple, Optional, Dict, Any
-from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, TimeElapsedColumn, MofNCompleteColumn
+from typing import List, Optional, Dict, Any, Tuple


Import of 'Tuple' is not used.

Suggested change

-from typing import List, Optional, Dict, Any, Tuple
+from typing import List, Optional, Dict, Any

Copilot · 2026-01-20T17:27:17Z

data/flores200-res/tokenize_and_annotate_sizes.py

+
+import argparse
+import json
+import os


Import of 'os' is not used.

Suggested change

-import os

Copilot · 2026-01-20T17:27:17Z

data/template/utils/espeak2ipa.py

+    except json.JSONDecodeError:
+        print(f"Error: Invalid JSON format in '{args.input_file}'.")


This except block for json.JSONDecodeError is unreachable: JSONDecodeError is a subclass of ValueError, so the earlier, more general except ValueError block always catches it first.

Suggested change

-    except json.JSONDecodeError:
-        print(f"Error: Invalid JSON format in '{args.input_file}'.")
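
The ordering issue can be reproduced in isolation. The sketch below is not the code from espeak2ipa.py; the function names and return values are illustrative. It shows that a ValueError handler listed first shadows the JSONDecodeError handler, and that listing the subclass first is an alternative to deleting the block if a JSON-specific message is still wanted.

```python
import json


def load_general_first(text):
    """ValueError is listed first, so the JSONDecodeError branch never runs."""
    try:
        return json.loads(text)
    except ValueError:
        return "value-error"
    except json.JSONDecodeError:  # unreachable: already caught above
        return "json-error"


def load_specific_first(text):
    """Listing the subclass first keeps both handlers reachable."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return "json-error"
    except ValueError:
        return "value-error"
```

For malformed input such as `"{"`, the first function takes the ValueError branch and the second takes the JSONDecodeError branch.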

klei22 added 9 commits December 28, 2025 13:21

Add README.md stats and script updates

8a5ec69

Add Yue

8e07a23

Add Yue to get dataset

8346e3a

Add graphs for grouping bytes of languages

fe33152

Add scripts for language analysis

8e9993a

Update ipa visualizations

7449141

Add .gitignore

686f2b2

Add updates to latest scripts

375e5b9

Add additional files for plotting

410f02e

klei22 requested review from Copilot and gkielian January 20, 2026 17:23

Copilot started reviewing on behalf of klei22 January 20, 2026 17:24

Copilot AI reviewed Jan 20, 2026
