
Create build_parallel_corpus.py #22

Open
tanhaow wants to merge 13 commits into develop from feature/build-parallel-text-corpus
Conversation


@tanhaow tanhaow commented Feb 2, 2026

Associated Issue(s): resolves #1

Changes in this PR

  • Created scripts/build_parallel_corpus.py to build a parallel sentence corpus from Notion export data; each output record contains the fields id, lang, text, en_tr, cite, en_cite, term

Notes

  • Japanese text normalization: removes inter-character spaces from source data artifacts
  • The script has fallbacks for entries that lack a proper [Language] Letter label

Reviewer Checklist

  • Verify JSONL output is valid
  • Confirm all required fields are present: id, lang, text, en_tr, cite, en_cite, term

@tanhaow tanhaow requested a review from laurejt February 2, 2026 21:27
@tanhaow tanhaow self-assigned this Feb 2, 2026

@laurejt laurejt left a comment


The PR fails ruff checks, I'll start my review once these checks pass.

@laurejt laurejt self-requested a review February 3, 2026 21:16
Comment on lines 28 to 30
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)

As we discussed yesterday:

  • This regex is too rigid because it hard-codes the supported languages; it should leverage the "LABEL_TO_CODE" global variable.

from pathlib import Path

# Language label to code mapping
LABEL_TO_CODE = {

This name is not descriptive. This structure defines the language codes of the supported languages, and that should be clear from the name. This is not some generic label; it is the ISO 639-1 code.
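One way to address both points (an illustrative sketch only; the name LANGUAGE_TO_ISO639_1 and the exact pattern are assumptions, not the PR's code) is to rename the mapping and derive the label alternation from its keys, so the regex cannot drift out of sync with the supported languages:

```python
import re

# Hypothetical replacement name for LABEL_TO_CODE; values are ISO 639-1 codes.
LANGUAGE_TO_ISO639_1 = {
    "English": "en",
    "Japanese": "ja",
    "Chinese": "zh",
    "Spanish": "es",
}

# Build the label alternation from the mapping keys.
LABEL_RE = re.compile(
    rf"^(?P<label>{'|'.join(LANGUAGE_TO_ISO639_1)}) (?P<letter>[A-D]):\s*(?P<rest>.*)$"
)
```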

Comment on lines 66 to 83
# Extract quoted text and citation
text = None
cite = None
q = QUOTE_RE.search(rest)
if q:
    text = q.group("quote").strip()
    after = rest[q.end():].strip()
    if after:
        cite = after.strip()
else:
    # Fallback: some entries use emdash separators (em dash, en dash, hyphen) instead of quotes.
    # Split on emdash to extract text and citation. This recovers ~200 pairs
    # that would be missed if we only accepted quoted text.
    parts = EMDASH_SPLIT_RE.split(rest, maxsplit=1)
    if parts:
        text = parts[0].strip()
        if len(parts) == 2:
            cite = parts[1].strip()

Separate this logic into its own function. Instead of describing this as a fallback, consider the real question at hand: what marks may be used for quotation (if any).

Removing these markings is less important than trying to separate the citation. Consider attempting to match on the form the expected citation should have.
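A possible shape for that separate function (a sketch under assumptions; the function name, the simplified QUOTE_RE, and the empty-string return convention are illustrative, not the PR's final code):

```python
import re

# Assumes double quotes are the symbol used to mark quotation; anything
# after the closing quote is treated as the candidate citation.
QUOTE_RE = re.compile(r'"(?P<quote>.+?)"')

def extract_text_and_citation(rest: str) -> tuple[str, str]:
    """Split a paragraph into (text, citation); empty strings when absent."""
    match = QUOTE_RE.search(rest)
    if match:
        return match.group("quote").strip(), rest[match.end():].strip()
    # No quotation marks: treat the whole paragraph as text, no citation.
    return rest.strip(), ""
```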

Comment on lines 195 to 198
args.add_argument("--input", required=True, help="Path to notion_terms.jsonl input")
args.add_argument(
    "--output", required=True, help="Path to write parallel JSONL output"
)

The type of these arguments should also be specified via the optional type argument: type=pathlib.Path
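A minimal sketch of what that change looks like (argument names taken from the hunk above; the parse_args call is just a demonstration):

```python
import argparse
from pathlib import Path

# Passing type=Path makes argparse hand back pathlib.Path objects
# instead of raw strings.
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True, type=Path,
                    help="Path to notion_terms.jsonl input")
parser.add_argument("--output", required=True, type=Path,
                    help="Path to write parallel JSONL output")

ns = parser.parse_args(["--input", "in.jsonl", "--output", "out.jsonl"])
```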


This script should be included within the source code (src/muse).

Also, the name of this script is too generic. This will not be the only program to build parallel corpora. This must specify the nature of the corpora, namely sentence-level.

r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)
# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')

This regular expression is incorrect; it only matches regular double quotes. It is also not general enough for other languages; there's at least one instance that uses single quotes (although that shouldn't impact our language subset).

Perhaps test first whether the string begins with a quote, and then use rsplit to attempt to separate the citation. That said, it may be easier to write a regular expression that matches the possible form of the ending citation.


# Match labels like "English A:", "Japanese B:"
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I

This regular expression also does not match the specification: there must be a space between the language name and the letter label.

Similarly, ignoring casing is overly permissive.

# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')
# Match emdash-like separators (em dash, en dash, hyphen) for text/citation extraction fallback
EMDASH_SPLIT_RE = re.compile(r"\s+[\u2014\u2013-]\s+")

Why not use rsplit with maxsplit=1?
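For reference, this is what the suggested rsplit call does (the example string is made up):

```python
# str.rsplit with maxsplit=1 splits once from the right, so a trailing
# citation can be separated without a precompiled regex.
line = "some source text — Koizumi Fumio (1958)"
text, cite = line.rsplit(" — ", maxsplit=1)
```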

Comment on lines 53 to 64
# Some quotes span multiple paragraphs (e.g., long translations broken up).
# Join paragraphs until we find the closing quote.
if rest.count('"') % 2 == 1:
    parts = [rest]
    j = i + 1
    while j < len(paragraphs):
        parts.append(paragraphs[j].strip())
        if paragraphs[j].count('"') > 0:
            break
        j += 1
    rest = " ".join(parts)
    i = j

This appears to be a special case. How often does this happen? It will also only work for passages that use regular quote symbols.

There should also be no chance of combining with the following paragraph if that paragraph also starts with some kind of prefix (e.g., "Source:", "English:", etc.)

Comment on lines 177 to 178
"text": src_text.replace("\n", " ").replace("\\n", " ").strip(),
"en_tr": en_text.replace("\n", " ").replace("\\n", " ").strip(),

Newlines should not be modified.

# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")
json.dump(obj, outf, ensure_ascii=False)

This file should be UTF-8, not ASCII (presumably this shouldn't come up when using orjsonl).
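A small demonstration of the ensure_ascii flag in the standard-library json module (the sample text echoes the CJK example used elsewhere in this PR):

```python
import io
import json

# ensure_ascii=False writes CJK characters as UTF-8 text instead of
# escaping them to \uXXXX sequences.
buf = io.StringIO()
json.dump({"text": "日本の音楽"}, buf, ensure_ascii=False)
```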

Comment on lines 184 to 186
# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")

For now, let's not modify the contents of the texts.

Comment on lines 136 to 138
# Fallback: pair by position if letter matching is incomplete.
# Some entries have source blocks without matching English letters (or vice versa).
# Order-based pairing recovers these by matching the nth source with the nth English block.

Passages must have the same letter label. This logic must be removed.

Author

tanhaow commented Feb 5, 2026

Note:

I noted that there's a case where an English entry without a letter suffix does match source text.
Example - "复音音乐 fùyīn yīnyuè":
[image]
Here the English: text matches the text content of Chinese A:.

But there are also cases where the English: text does not match any of the letter-suffixed source text.
Example - 乐 yuè:
[image]

Also, the second example has multiple paragraphs for English:. Do we want to combine all of the paragraphs, or keep just the first one following the English: label? @laurejt


laurejt commented Feb 5, 2026

@tanhaow: There is no guarantee that an "unlabeled" language is a parallel text. There may be a few instances where this is true, but there are many cases where it is not. Sometimes, unlabeled language entries correspond to definitions.

Decision: Do not attempt to match unlabeled language entries (e.g., those starting with "English:", "Chinese:"). During tomorrow's meeting, we can bring up the required assumption. This seems like something the research team should fix within Notion, now that we have a way to export this data.


laurejt commented Feb 5, 2026

@tanhaow: For the second issue, I think this problem is too ambiguous to handle in this initial pass. We should investigate the scale of this issue, but that can wait until after we build an initial parallel sentence corpus.

Decision: Do not attempt to combine multi-paragraph parallel texts.

@tanhaow tanhaow requested a review from laurejt February 5, 2026 21:45

@laurejt laurejt left a comment


Thank you for your work, this is getting closer. The following changes must be made:

  • Update pyproject.toml with new dependencies (and therefore update uv.lock). Script currently crashes because of this.
  • Fix the "fallback" logic in extract_text_and_citation so it can capture citations. There are two options here:
    1. Update the citation regex so that it can match the full form of the citation that might occur at the end of the string
    2. Remove the citation logic for now and instead restore the dash-handling case from previous versions of this method
  • Additionally, update extract_text_and_citation so that:
    1. It returns tuple[str, str]. The null cases should be empty strings. This means that the majority of the x if x else None-style statements will be removed.
    2. Document the assumption about the expected form of quoted text (i.e., assumes double quotes are the symbol used to mark quotation)
  • Update pair_blocks so that "Part 2" is removed. This logic goes against the specification we discussed. Language prefixes without letters are not candidate parallel texts.
  • Update build_sentence_parallel_corpus so its inputs are pathlib.Path types
  • Simplify parse_labelled_paragraphs by:
    1. updating LABEL_RE so that the letter suffix is a named group within the regular expression
    2. converting the while loop to a for loop
  • Update main so that the command line arguments are positional
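Two of the items above (the named letter group and the for loop in parse_labelled_paragraphs) might be sketched as follows; the function body is an assumption, not the requested final code:

```python
import re

# Letter suffix as a named group; paragraphs walked with a for loop.
LANGUAGE_LABELS = "English|Japanese|Chinese|Spanish"
LABEL_RE = re.compile(
    rf"^(?P<label>{LANGUAGE_LABELS}) (?P<letter>[A-D]):\s*(?P<rest>.*)$"
)

def parse_labelled_paragraphs(paragraphs: list[str]) -> list[dict]:
    """Collect label/letter/rest groups for paragraphs with a valid prefix."""
    blocks = []
    for paragraph in paragraphs:
        match = LABEL_RE.match(paragraph.strip())
        if match:
            blocks.append(match.groupdict())
    return blocks
```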

# Match labels like "English A:", "Japanese B:", "Chinese:", "English:"
# Labels must have a colon, may or may not have letter suffixes [A-D]
LABEL_RE = re.compile(
    rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"

Follow the specification given. The issue documents that the prefix must have the form "[Language] [A-D]:". There must be a letter, and there must be a space between the language name and this letter.

You're already using named groups, so add one for the letter suffix.

Suggested change
- rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"
+ rf"^(?P<label>{LANGUAGE_LABELS}) (?P<letter>{LETTER_SUFFIX_PATTERN}):\s*(?P<rest>.*)$"


Strategy:
1. First, pair entries WITH letter suffixes (A, B, C...) - matched by letter
2. Then, pair entries WITHOUT letter suffixes (None) - matched by position

This needs to be removed. This does not match the specification outlined in the issues. These cases must be ignored for now.

Comment on lines 168 to 186
# PART 2: Pair entries WITHOUT letter suffixes (None)
# Some entries have "Chinese:" without letter suffix, matched directly with "English:"
# Verified: at most one Chinese: and one English: entry per term (one-to-one)
if None in blocks:
    entry = blocks[None]
    if "English" in entry:
        en = entry["English"]
        for lang_name in SUPPORTED_LANGUAGES:
            if lang_name != "English" and lang_name in entry:
                src = entry[lang_name]
                pairs.append(
                    (
                        lang_name,
                        src["text"],
                        src.get("cite", ""),
                        en["text"],
                        en.get("cite", ""),
                    )
                )

This part must be removed (or at the very least commented out).

en = entry["English"]
for lang_name in SUPPORTED_LANGUAGES:
    if lang_name == "English":
        continue

This should exit the for loop, not continue to the next language.
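A toy illustration of the difference (list contents and ordering are made up for the demonstration):

```python
# break exits the loop entirely when "English" is reached, whereas
# continue would keep scanning the remaining languages.
SUPPORTED_LANGUAGES = ["Japanese", "Chinese", "Spanish", "English"]

visited = []
for lang_name in SUPPORTED_LANGUAGES:
    if lang_name == "English":
        break  # exit the loop, per the review comment
    visited.append(lang_name)
```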

Comment on lines +35 to +45
# CITATION PATTERNS CONFIGURATION
# Citation indicators that appear at the start of citation text after the main content
CITATION_PATTERNS = {
    "p.",  # page reference: p. 268
    "pp.",  # pages reference: pp. 1070-1071
    "[",  # bracket reference: [1.17]
    "(",  # parenthetical reference: (Footnote...), (See also...)
    "Footnote",  # footnote reference: Footnote 28
    "Refer to",  # cross-reference: Refer to Section 1
    "See also",  # cross-reference: See also [2.7]
}

Why include casing if it's always ignored?

Comment on lines +71 to +81
# To handle entries without quotes,
# split on the LAST whitespace to separate text from citation
# rsplit(maxsplit=1) splits from right, so we get [text, citation]
parts = rest.rsplit(None, 1)  # Split on any whitespace, max 1 split

if len(parts) == 2:
    potential_text, potential_cite = parts
    # Validate citation by checking it starts with a known citation pattern
    if potential_cite and re.match(f"^({CITATION_PATTERN})", potential_cite, re.I):
        text = potential_text.strip()
        cite = potential_cite.strip()

This logic won't work in most cases. The only citation patterns that have a chance of matching are ones without whitespace.

In general, it doesn't make sense to split on the final whitespace. Most (if not all) citations will contain whitespace (e.g., "Koizumi Fumio (1958) [p. 3]").
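One hedged sketch of the end-anchored alternative (the helper name and the citation pattern are guesses based on the examples discussed in this review, not a definitive implementation):

```python
import re

# Match the expected citation form anchored at the end of the string,
# instead of splitting on the last whitespace. The pattern is illustrative,
# covering p./pp. page refs, bracketed refs, and footnotes.
CITATION_AT_END_RE = re.compile(
    r"(?P<cite>(?:pp?\.\s*[\d\-]+|\[[^\]]+\]|Footnote\s+\d+))\s*$"
)

def split_trailing_citation(rest: str) -> tuple[str, str]:
    """Return (text, citation); citation is empty when no pattern matches."""
    match = CITATION_AT_END_RE.search(rest)
    if match:
        return rest[: match.start()].strip(), match.group("cite").strip()
    return rest.strip(), ""
```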

tanhaow and others added 3 commits February 5, 2026 18:12
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>

laurejt commented Feb 6, 2026

@tanhaow, one more fix I forgot to mention: the current form does not meet the data specification. The lang field must be the ISO 639-1 language code. I'll update the data-design doc so this is more explicit.
