
Create build_parallel_corpus.py #22

Open
tanhaow wants to merge 13 commits into develop from feature/build-parallel-text-corpus
Conversation


@tanhaow tanhaow commented Feb 2, 2026

Associated Issue(s): resolves #1

Changes in this PR

  • Created scripts/build_parallel_corpus.py to build a parallel sentence corpus from Notion export data; each output record contains the fields id, lang, text, en_tr, cite, en_cite, term

Notes

  • Japanese text normalization: removes inter-character spaces from source data artifacts
  • The script has fallbacks for entries that lack a proper [Language] Letter label

Reviewer Checklist

  • Verify JSONL output is valid
  • Confirm all required fields are present: id, lang, text, en_tr, cite, en_cite, term

@tanhaow tanhaow requested a review from laurejt February 2, 2026 21:27
@tanhaow tanhaow self-assigned this Feb 2, 2026

@laurejt laurejt left a comment


The PR fails ruff checks, I'll start my review once these checks pass.

@laurejt laurejt self-requested a review February 3, 2026 21:16
Comment on lines 28 to 30
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)

As we discussed yesterday:

  • This regex is too rigid because it hard-codes the supported languages; it should leverage the "LABEL_TO_CODE" global variable.

from pathlib import Path

# Language label to code mapping
LABEL_TO_CODE = {

This name is not descriptive. This structure defines the language codes of the supported languages, and that should be clear from the name. This is not some generic label; it is the ISO 639-1 code.
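One way to address both points (an illustrative sketch only; the name LANGUAGE_TO_ISO639_1 and the exact pattern are assumptions, not the PR's code) is to rename the mapping and derive the label alternation from its keys, so the regex cannot drift out of sync with the supported languages:

```python
import re

# Hypothetical replacement name for LABEL_TO_CODE; values are ISO 639-1 codes.
LANGUAGE_TO_ISO639_1 = {
    "English": "en",
    "Japanese": "ja",
    "Chinese": "zh",
    "Spanish": "es",
}

# Build the label alternation from the mapping keys.
LABEL_RE = re.compile(
    rf"^(?P<label>{'|'.join(LANGUAGE_TO_ISO639_1)}) (?P<letter>[A-D]):\s*(?P<rest>.*)$"
)
```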

Comment on lines 66 to 83
# Extract quoted text and citation
text = None
cite = None
q = QUOTE_RE.search(rest)
if q:
    text = q.group("quote").strip()
    after = rest[q.end():].strip()
    if after:
        cite = after.strip()
else:
    # Fallback: some entries use emdash separators (em dash, en dash, hyphen) instead of quotes.
    # Split on emdash to extract text and citation. This recovers ~200 pairs
    # that would be missed if we only accepted quoted text.
    parts = EMDASH_SPLIT_RE.split(rest, maxsplit=1)
    if parts:
        text = parts[0].strip()
        if len(parts) == 2:
            cite = parts[1].strip()

Separate this logic into its own function. Instead of describing this as a fallback, consider the real question at hand: what marks may be used for quotation (if any).

Removing these markings is less important than trying to separate the citation. Consider attempting to match on the form the expected citation should have.
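A possible shape for that separate function (a sketch under assumptions; the function name, the simplified QUOTE_RE, and the empty-string return convention are illustrative, not the PR's final code):

```python
import re

# Assumes double quotes are the symbol used to mark quotation; anything
# after the closing quote is treated as the candidate citation.
QUOTE_RE = re.compile(r'"(?P<quote>.+?)"')

def extract_text_and_citation(rest: str) -> tuple[str, str]:
    """Split a paragraph into (text, citation); empty strings when absent."""
    match = QUOTE_RE.search(rest)
    if match:
        return match.group("quote").strip(), rest[match.end():].strip()
    # No quotation marks: treat the whole paragraph as text, no citation.
    return rest.strip(), ""
```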

Comment on lines 195 to 198
args.add_argument("--input", required=True, help="Path to notion_terms.jsonl input")
args.add_argument(
    "--output", required=True, help="Path to write parallel JSONL output"
)

The type of these arguments should also be specified via the optional type argument: type=pathlib.Path
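A minimal sketch of what that change looks like (argument names taken from the hunk above; the parse_args call is just a demonstration):

```python
import argparse
from pathlib import Path

# Passing type=Path makes argparse hand back pathlib.Path objects
# instead of raw strings.
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True, type=Path,
                    help="Path to notion_terms.jsonl input")
parser.add_argument("--output", required=True, type=Path,
                    help="Path to write parallel JSONL output")

ns = parser.parse_args(["--input", "in.jsonl", "--output", "out.jsonl"])
```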


This script should be included within the source code (src/muse).

Also, the name of this script is too generic. This will not be the only program to build parallel corpora. This must specify the nature of the corpora, namely sentence-level.

r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)
# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')

This regular expression is incorrect; it only matches regular double quotes. It is also not general enough for other languages; there's at least one instance that uses single quotes (although that shouldn't impact our language subset).

Perhaps test first whether the string begins with a quote, and then use rsplit to attempt to separate the citation. That said, it may be easier to write a regular expression that matches the possible form of the ending citation.


# Match labels like "English A:", "Japanese B:"
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I

This regular expression also does not match the specification: there must be a space between the language name and the letter label.

Similarly, ignoring casing is overly permissive.

# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')
# Match emdash-like separators (em dash, en dash, hyphen) for text/citation extraction fallback
EMDASH_SPLIT_RE = re.compile(r"\s+[\u2014\u2013-]\s+")

Why not use rsplit with maxsplit=1?
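For reference, this is what the suggested rsplit call does (the example string is made up):

```python
# str.rsplit with maxsplit=1 splits once from the right, so a trailing
# citation can be separated without a precompiled regex.
line = "some source text — Koizumi Fumio (1958)"
text, cite = line.rsplit(" — ", maxsplit=1)
```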

Comment on lines 53 to 64
# Some quotes span multiple paragraphs (e.g., long translations broken up).
# Join paragraphs until we find the closing quote.
if rest.count('"') % 2 == 1:
    parts = [rest]
    j = i + 1
    while j < len(paragraphs):
        parts.append(paragraphs[j].strip())
        if paragraphs[j].count('"') > 0:
            break
        j += 1
    rest = " ".join(parts)
    i = j

This appears to be a special case. How often does this happen? It will also only work for passages that use regular quote symbols.

There should also be no chance of combining with the following paragraph if that paragraph also starts with some kind of prefix (e.g., "Source:", "English:", etc.)

Comment on lines 177 to 178
"text": src_text.replace("\n", " ").replace("\\n", " ").strip(),
"en_tr": en_text.replace("\n", " ").replace("\\n", " ").strip(),

Newlines should not be modified.

# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")
json.dump(obj, outf, ensure_ascii=False)

This file should be UTF-8, not ASCII (presumably this shouldn't come up when using orjsonl).
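A small demonstration of the ensure_ascii flag in the standard-library json module (the sample text echoes the CJK example used elsewhere in this PR):

```python
import io
import json

# ensure_ascii=False writes CJK characters as UTF-8 text instead of
# escaping them to \uXXXX sequences.
buf = io.StringIO()
json.dump({"text": "日本の音楽"}, buf, ensure_ascii=False)
```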

Comment on lines 184 to 186
# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")

For now, let's not modify the contents of the texts.

Comment on lines 136 to 138
# Fallback: pair by position if letter matching is incomplete.
# Some entries have source blocks without matching English letters (or vice versa).
# Order-based pairing recovers these by matching the nth source with the nth English block.

Passages must have the same letter label. This logic must be removed.

Author

tanhaow commented Feb 5, 2026

Note:

I noted that there's a case where an English entry without a letter suffix does match source text.
Example - "复音音乐 fùyīn yīnyuè":
[image]
Here the English: text matches the text content of Chinese A:.

But there are also cases where the English: text does not match any of the letter-suffixed source text.
Example - 乐 yuè:
[image]

Also, the second example has multiple paragraphs for English:. Do we want to combine all of the paragraphs, or keep just the first one following the English: label? @laurejt


laurejt commented Feb 5, 2026

@tanhaow: There is no guarantee that an "unlabeled" language is a parallel text. There may be a few instances where this is true, but there are many cases where it is not. Sometimes, unlabeled language entries correspond to definitions.

Decision: Do not attempt to match unlabeled language entries (e.g., those starting with "English:", "Chinese:"). During tomorrow's meeting, we can bring up the required assumption. This seems like something the research team should fix within Notion, now that we have a way to export this data.


laurejt commented Feb 5, 2026

@tanhaow: For the second issue, I think this problem is too ambiguous to handle in this initial pass. We should investigate the scale of this issue, but that can wait until after we build an initial parallel sentence corpus.

Decision: Do not attempt to combine multi-paragraph parallel texts.

@tanhaow tanhaow requested a review from laurejt February 5, 2026 21:45

@laurejt laurejt left a comment


Thank you for your work, this is getting closer. The following changes must be made:

  • Update pyproject.toml with new dependencies (and therefore update uv.lock). Script currently crashes because of this.
  • Fix the "fallback" logic in extract_text_and_citation so it can capture citations. There are two options here:
    1. Update the citation regex so that it can match the full form of the citation that might occur at the end of the string
    2. Remove the citation logic for now and instead restore the dash-handling case from previous versions of this method
  • Additionally, update extract_text_and_citation so that:
    1. It returns tuple[str, str]. The null cases should be empty strings. This means that the majority of the x if x else None-style statements will be removed.
    2. Document the assumption about the expected form of quoted text (i.e., assumes double quotes are the symbol used to mark quotation)
  • Update pair_blocks so that "Part 2" is removed. This logic goes against the specification we discussed. Language prefixes without letters are not candidate parallel texts.
  • Update build_sentence_parallel_corpus so its inputs are pathlib.Path types
  • Simplify parse_labelled_paragraphs by:
    1. updating LABEL_RE so that the letter suffix is a named group within the regular expression
    2. converting the while loop to a for loop
  • Update main so that the command line arguments are positional
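Two of the items above (the named letter group and the for loop in parse_labelled_paragraphs) might be sketched as follows; the function body is an assumption, not the requested final code:

```python
import re

# Letter suffix as a named group; paragraphs walked with a for loop.
LANGUAGE_LABELS = "English|Japanese|Chinese|Spanish"
LABEL_RE = re.compile(
    rf"^(?P<label>{LANGUAGE_LABELS}) (?P<letter>[A-D]):\s*(?P<rest>.*)$"
)

def parse_labelled_paragraphs(paragraphs: list[str]) -> list[dict]:
    """Collect label/letter/rest groups for paragraphs with a valid prefix."""
    blocks = []
    for paragraph in paragraphs:
        match = LABEL_RE.match(paragraph.strip())
        if match:
            blocks.append(match.groupdict())
    return blocks
```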

# Match labels like "English A:", "Japanese B:", "Chinese:", "English:"
# Labels must have a colon, may or may not have letter suffixes [A-D]
LABEL_RE = re.compile(
    rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"

Follow the specification given. The issue documents that the prefix must have the form "[Language] [A-D]:". There must be a letter, and there must be a space between the language name and this letter.

You're already using named groups, so add one for the letter suffix.

Suggested change
- rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"
+ rf"^(?P<label>{LANGUAGE_LABELS}) (?P<letter>{LETTER_SUFFIX_PATTERN}):\s*(?P<rest>.*)$"


Strategy:
1. First, pair entries WITH letter suffixes (A, B, C...) - matched by letter
2. Then, pair entries WITHOUT letter suffixes (None) - matched by position

This needs to be removed. This does not match the specification outlined in the issues. These cases must be ignored for now.

Comment on lines 168 to 186
# PART 2: Pair entries WITHOUT letter suffixes (None)
# Some entries have "Chinese:" without letter suffix, matched directly with "English:"
# Verified: at most one Chinese: and one English: entry per term (one-to-one)
if None in blocks:
    entry = blocks[None]
    if "English" in entry:
        en = entry["English"]
        for lang_name in SUPPORTED_LANGUAGES:
            if lang_name != "English" and lang_name in entry:
                src = entry[lang_name]
                pairs.append(
                    (
                        lang_name,
                        src["text"],
                        src.get("cite", ""),
                        en["text"],
                        en.get("cite", ""),
                    )
                )

This part must be removed (or at the very least commented out).

en = entry["English"]
for lang_name in SUPPORTED_LANGUAGES:
    if lang_name == "English":
        continue

This should exit the for loop, not continue to the next language.
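A toy illustration of the difference (list contents and ordering are made up for the demonstration):

```python
# break exits the loop entirely when "English" is reached, whereas
# continue would keep scanning the remaining languages.
SUPPORTED_LANGUAGES = ["Japanese", "Chinese", "Spanish", "English"]

visited = []
for lang_name in SUPPORTED_LANGUAGES:
    if lang_name == "English":
        break  # exit the loop, per the review comment
    visited.append(lang_name)
```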

Comment on lines +35 to +45
# CITATION PATTERNS CONFIGURATION
# Citation indicators that appear at the start of citation text after the main content
CITATION_PATTERNS = {
    "p.",  # page reference: p. 268
    "pp.",  # pages reference: pp. 1070-1071
    "[",  # bracket reference: [1.17]
    "(",  # parenthetical reference: (Footnote...), (See also...)
    "Footnote",  # footnote reference: Footnote 28
    "Refer to",  # cross-reference: Refer to Section 1
    "See also",  # cross-reference: See also [2.7]
}

Why include casing if it's always ignored?

Comment on lines +71 to +81
# To handle entries without quotes,
# split on the LAST whitespace to separate text from citation
# rsplit(maxsplit=1) splits from right, so we get [text, citation]
parts = rest.rsplit(None, 1)  # Split on any whitespace, max 1 split

if len(parts) == 2:
    potential_text, potential_cite = parts
    # Validate citation by checking it starts with a known citation pattern
    if potential_cite and re.match(f"^({CITATION_PATTERN})", potential_cite, re.I):
        text = potential_text.strip()
        cite = potential_cite.strip()

This logic won't work in most cases. The only citation patterns that have a chance of matching are ones without whitespace.

In general, it doesn't make sense to split on the final whitespace. Most (if not all) citations will contain whitespace (e.g., "Koizumi Fumio (1958) [p. 3]").
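One hedged sketch of the end-anchored alternative (the helper name and the citation pattern are guesses based on the examples discussed in this review, not a definitive implementation):

```python
import re

# Match the expected citation form anchored at the end of the string,
# instead of splitting on the last whitespace. The pattern is illustrative,
# covering p./pp. page refs, bracketed refs, and footnotes.
CITATION_AT_END_RE = re.compile(
    r"(?P<cite>(?:pp?\.\s*[\d\-]+|\[[^\]]+\]|Footnote\s+\d+))\s*$"
)

def split_trailing_citation(rest: str) -> tuple[str, str]:
    """Return (text, citation); citation is empty when no pattern matches."""
    match = CITATION_AT_END_RE.search(rest)
    if match:
        return rest[: match.start()].strip(), match.group("cite").strip()
    return rest.strip(), ""
```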

tanhaow and others added 3 commits February 5, 2026 18:12
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>

laurejt commented Feb 6, 2026

@tanhaow, one more fix I forgot to mention: the current form does not meet the data specification. The lang field must be the ISO 639-1 language code. I'll update the data-design doc so this is more explicit.
