Conversation
laurejt
left a comment
The PR fails ruff checks; I'll start my review once these checks pass.
scripts/build_parallel_corpus.py
Outdated
```python
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)
```
As we discussed yesterday:
- This regex is too rigid because it hard-codes the supported languages; it should leverage the `LABEL_TO_CODE` global variable.
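For illustration, a minimal sketch of deriving the pattern from the mapping so the regex never hard-codes languages (the mapping values here are an assumption, and the letter/space handling follows the spec discussed below):

```python
import re

# Assumed stand-in for the script's LABEL_TO_CODE global; keys are the
# supported language labels, values their codes.
LABEL_TO_CODE = {"English": "en", "Japanese": "ja", "Chinese": "zh", "Spanish": "es"}

# Build the alternation from the mapping keys so adding a language only
# requires editing LABEL_TO_CODE, never the regex itself.
labels = "|".join(re.escape(label) for label in LABEL_TO_CODE)
LABEL_RE = re.compile(rf"^(?P<label>{labels}) (?P<letter>[A-D]):\s*(?P<rest>.*)$")
```

With this, `LABEL_RE.match("Japanese B: some text")` captures the label, letter, and remainder without any language names appearing in the pattern literal.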
scripts/build_parallel_corpus.py
Outdated
```python
from pathlib import Path


# Language label to code mapping
LABEL_TO_CODE = {
```
This name is not descriptive. This structure defines the language codes of the supported languages, and that should be clear from the name. This is not some generic label; it is the ISO 639-1 code.
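For example, a name along these lines would make the contents self-describing (the exact name and values are illustrative, not prescribed):

```python
# ISO 639-1 codes for the supported languages; the name states exactly
# what the keys and values are, unlike a generic "label to code" mapping.
SUPPORTED_LANGUAGE_TO_ISO639_CODE = {
    "English": "en",
    "Japanese": "ja",
    "Chinese": "zh",
    "Spanish": "es",
}
```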
scripts/build_parallel_corpus.py
Outdated
```python
# Extract quoted text and citation
text = None
cite = None
q = QUOTE_RE.search(rest)
if q:
    text = q.group("quote").strip()
    after = rest[q.end() :].strip()
    if after:
        cite = after.strip()
else:
    # Fallback: some entries use emdash separators (em dash, en dash, hyphen) instead of quotes.
    # Split on emdash to extract text and citation. This recovers ~200 pairs
    # that would be missed if we only accepted quoted text.
    parts = EMDASH_SPLIT_RE.split(rest, maxsplit=1)
    if parts:
        text = parts[0].strip()
        if len(parts) == 2:
            cite = parts[1].strip()
```
Separate this logic into its own function. Instead of describing this as a fallback, consider the real question at hand: what marks may be used for quotation (if any)?
Removing these markings is less important than trying to separate the citation. Consider attempting to match on the form the expected citation should have.
scripts/build_parallel_corpus.py
Outdated
```python
args.add_argument("--input", required=True, help="Path to notion_terms.jsonl input")
args.add_argument(
    "--output", required=True, help="Path to write parallel JSONL output"
)
```
The type of these arguments should also be specified via the optional `type` argument: `type=pathlib.Path`.
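A minimal sketch of what that looks like; argparse then hands the script `Path` objects directly instead of raw strings (the argument names mirror the snippet above):

```python
import argparse
from pathlib import Path

# Let argparse construct pathlib.Path objects for file arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--input", type=Path, required=True,
                    help="Path to notion_terms.jsonl input")
parser.add_argument("--output", type=Path, required=True,
                    help="Path to write parallel JSONL output")

# Parsing a sample command line yields Path instances, not str.
args = parser.parse_args(["--input", "in.jsonl", "--output", "out.jsonl"])
```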
scripts/build_parallel_corpus.py
Outdated
This script should be included within the source code (`src/muse`).
Also, the name of this script is too generic: this will not be the only program to build parallel corpora. The name must specify the nature of the corpora, namely sentence-level.
scripts/build_parallel_corpus.py
Outdated
```python
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
)
# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')
```
This regular expression is incorrect: it only matches regular double quotes. It is also not general enough for other languages; there's at least one instance that uses single quotes (although it shouldn't impact our language subset).
Perhaps test first whether the string begins with a quote, and then use `rsplit` to attempt to separate the citation. That said, it may be easier to write a regular expression that matches the possible form of the ending citation.
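A sketch of the first suggestion (leading-quote check plus `rsplit`); this assumes straight double quotes mark quotation, so curly or single quotes would need to be added for full generality:

```python
# Check for a leading quote, then rsplit on the closing quote so any
# trailing citation text survives the split. Assumption: straight double
# quotes are the quotation marks in use.
def split_quoted(rest: str) -> tuple[str, str]:
    if rest.startswith('"') and '"' in rest[1:]:
        quote, after = rest[1:].rsplit('"', 1)
        return quote.strip(), after.strip()
    return rest.strip(), ""
```

Unquoted input passes through unchanged with an empty citation, which keeps the return type uniform.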
scripts/build_parallel_corpus.py
Outdated
```python
# Match labels like "English A:", "Japanese B:"
LABEL_RE = re.compile(
    r"^(?P<label>English|Japanese|Chinese|Spanish)\s*([A-D]):\s*(?P<rest>.*)$", re.I
```
This regular expression also does not match the specification. There must be a space between the language and the letter label.
Similarly, ignoring casing is also overly permissive.
scripts/build_parallel_corpus.py
Outdated
```python
# Match quoted text (handles straight quotes "...", left curly "...", right curly "...")
QUOTE_RE = re.compile(r'[""](?P<quote>.+?)[""]')
# Match emdash-like separators (em dash, en dash, hyphen) for text/citation extraction fallback
EMDASH_SPLIT_RE = re.compile(r"\s+[\u2014\u2013-]\s+")
```
scripts/build_parallel_corpus.py
Outdated
```python
# Some quotes span multiple paragraphs (e.g., long translations broken up).
# Join paragraphs until we find the closing quote.
if rest.count('"') % 2 == 1:
    parts = [rest]
    j = i + 1
    while j < len(paragraphs):
        parts.append(paragraphs[j].strip())
        if paragraphs[j].count('"') > 0:
            break
        j += 1
    rest = " ".join(parts)
    i = j
```
This appears to be a special case. How often does this happen? It will also only work for passages that use regular quote symbols.
There should also be no chance of combining with the following paragraph if it starts with some kind of prefix (e.g., "Source:", "English:", etc.).
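A sketch of that guard, assuming the joining logic lives in a helper; `PREFIX_RE` here is a hypothetical stand-in for the script's actual label patterns:

```python
import re

# Stop joining continuation paragraphs as soon as one begins with a known
# label prefix, so "Source:"/"English:"-style blocks are never swallowed.
# PREFIX_RE is an illustrative assumption, not the script's real pattern.
PREFIX_RE = re.compile(r"^(?:English|Japanese|Chinese|Spanish|Source)\b[^:]*:")

def join_open_quote(paragraphs: list[str], start: int) -> tuple[str, int]:
    """Join paragraphs from `start` until a closing quote, refusing to merge
    any paragraph that starts with a label prefix. Returns (text, next_index)."""
    parts = [paragraphs[start]]
    j = start + 1
    while j < len(paragraphs):
        para = paragraphs[j].strip()
        if PREFIX_RE.match(para):  # never merge a labelled paragraph
            break
        parts.append(para)
        j += 1
        if '"' in para:  # closing quote found
            break
    return " ".join(parts), j
```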
scripts/build_parallel_corpus.py
Outdated
```python
"text": src_text.replace("\n", " ").replace("\\n", " ").strip(),
"en_tr": en_text.replace("\n", " ").replace("\\n", " ").strip(),
```
scripts/build_parallel_corpus.py
Outdated
```python
# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")
json.dump(obj, outf, ensure_ascii=False)
```
This file should be UTF-8, not ASCII (presumably this shouldn't come up when using `orjsonl`).
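Concretely, opening the output with an explicit UTF-8 encoding and `ensure_ascii=False` keeps the CJK text readable in the file (a sketch; the record contents are illustrative):

```python
import json
import tempfile
from pathlib import Path

# Write a record as UTF-8 JSONL: the file encoding is explicit and
# ensure_ascii=False keeps non-ASCII characters unescaped.
record = {"lang": "ja", "text": "日本の音楽"}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "out.jsonl"
    with out_path.open("w", encoding="utf-8") as outf:
        json.dump(record, outf, ensure_ascii=False)
        outf.write("\n")
    written = out_path.read_text(encoding="utf-8")
```

With `ensure_ascii=True` (the default) the same record would be written as `\uXXXX` escapes instead.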
scripts/build_parallel_corpus.py
Outdated
```python
# Source data contains unnormalized spaces between CJK characters (e.g., "日 本 の 音 楽").
if obj["lang"] == "ja":
    obj["text"] = obj["text"].replace(" ", "")
```
For now, let's not modify the contents of the texts.
scripts/build_parallel_corpus.py
Outdated
```python
# Fallback: pair by position if letter matching is incomplete.
# Some entries have source blocks without matching English letters (or vice versa).
# Order-based pairing recovers these by matching the nth source with the nth English block.
```
Passages must have the same letter label. This logic must be removed.
Note: I noted that there's a case where English entries without letter suffixes do match source text, but there are also cases where they do not. Also, the second example has multiple paragraphs.
@tanhaow: There is no guarantee that an "unlabeled" language is a parallel text. There may be a few instances where this is true, but there are many cases where it is not. Sometimes, unlabeled language entries correspond to definitions. Decision: do not attempt to match unlabeled language entries (e.g., starting with "English:", "Chinese:"). During tomorrow's meeting, we can bring up the required assumption. This seems like something the research team should fix within Notion, now that we have a way to export this data.
@tanhaow: For the second issue, I think this problem is too ambiguous to handle in this initial pass. We should investigate the scale of this issue, but that can wait until after we can build an initial parallel sentence corpus. Decision: do not attempt to combine multi-paragraph parallel texts.
laurejt
left a comment
Thank you for your work, this is getting closer. The following changes must be made:

- Update `pyproject.toml` with new dependencies (and therefore update `uv.lock`). The script currently crashes because of this.
- Fix the `extract_text_and_citation` "fallback" logic so it can capture citations. There are two options here:
  - Update the citation regex so that it can match the full form of the citation that might occur at the end of the string
  - Remove the citation logic for now and instead make something for the dashes case that was in previous versions of this method
- Additionally, update `extract_text_citation` so that:
  - It returns `tuple[str, str]`. The null cases should be empty strings. This means that the majority of the `x if x else None`-style statements will be removed.
  - It documents the assumption about the expected form of quoted text (i.e., assumes double quotes are the symbol used to mark quotation)
- Update `pair_blocks` so that "Part 2" is removed. This logic goes against the specification we discussed. Language prefixes without letters are not candidate parallel texts.
- Update `build_sentence_parallel_corpus` so its inputs are `pathlib.Path` types
- Simplify `parse_labelled_paragraphs` by:
  - updating `LABEL_RE` so that the letter suffix is a named group within the regular expression
  - converting the while loop to a for loop
- Update `main` so that the command line arguments are positional
```python
# Match labels like "English A:", "Japanese B:", "Chinese:", "English:"
# Labels must have a colon, may or may not have letter suffixes [A-D]
LABEL_RE = re.compile(
    rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"
```
Follow the specification given. The issue documents that the prefix must have the form "[Language] [A-D]:". There must be a letter, and there must be a space between the language name and this letter.
You're already using named groups, so add one for the letter suffix.
```diff
-    rf"^(?P<label>{LANGUAGE_LABELS})\s*(?:{LETTER_SUFFIX_PATTERN})?:\s*(?P<rest>.*)$"
+    rf"^(?P<label>{LANGUAGE_LABELS}) (?P<letter>{LETTER_SUFFIX_PATTERN}):\s*(?P<rest>.*)$"
```
```python
    Strategy:
    1. First, pair entries WITH letter suffixes (A, B, C...) - matched by letter
    2. Then, pair entries WITHOUT letter suffixes (None) - matched by position
```
This needs to be removed. This does not match the specification outlined in the issues. These cases must be ignored for now.
```python
# PART 2: Pair entries WITHOUT letter suffixes (None)
# Some entries have "Chinese:" without letter suffix, matched directly with "English:"
# Verified: at most one Chinese: and one English: entry per term (one-to-one)
if None in blocks:
    entry = blocks[None]
    if "English" in entry:
        en = entry["English"]
        for lang_name in SUPPORTED_LANGUAGES:
            if lang_name != "English" and lang_name in entry:
                src = entry[lang_name]
                pairs.append(
                    (
                        lang_name,
                        src["text"],
                        src.get("cite", ""),
                        en["text"],
                        en.get("cite", ""),
                    )
                )
```
This part must be removed (or at the very least commented out).
```python
en = entry["English"]
for lang_name in SUPPORTED_LANGUAGES:
    if lang_name == "English":
        continue
```
This should exit the for loop, not continue to the next language.
```python
# CITATION PATTERNS CONFIGURATION
# Citation indicators that appear at the start of citation text after the main content
CITATION_PATTERNS = {
    "p.",  # page reference: p. 268
    "pp.",  # pages reference: pp. 1070-1071
    "[",  # bracket reference: [1.17]
    "(",  # parenthetical reference: (Footnote...), (See also...)
    "Footnote",  # footnote reference: Footnote 28
    "Refer to",  # cross-reference: Refer to Section 1
    "See also",  # cross-reference: See also [2.7]
}
```
Why include casing if it's always ignored?
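For instance, the set could be stored lowercased once and compared case-insensitively, dropping `re.I` entirely (a sketch; the helper name is illustrative):

```python
# Store the citation indicators lowercased once; callers compare against a
# lowercased string, so no re.I flag or mixed-case entries are needed.
CITATION_PATTERNS = {"p.", "pp.", "[", "(", "footnote", "refer to", "see also"}

def starts_with_citation(text: str) -> bool:
    lowered = text.lower()
    return any(lowered.startswith(pattern) for pattern in CITATION_PATTERNS)
```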
```python
# To handle entries without quotes,
# split on the LAST whitespace to separate text from citation
# rsplit(maxsplit=1) splits from right, so we get [text, citation]
parts = rest.rsplit(None, 1)  # Split on any whitespace, max 1 split

if len(parts) == 2:
    potential_text, potential_cite = parts
    # Validate citation by checking it starts with a known citation pattern
    if potential_cite and re.match(f"^({CITATION_PATTERN})", potential_cite, re.I):
        text = potential_text.strip()
        cite = potential_cite.strip()
```
This logic won't work in most cases: the only citation patterns that have a chance of matching are ones without whitespace.
In general, it doesn't make sense to split on the final whitespace; most (if not all) citations will contain whitespace (e.g., "Koizumi Fumio (1958) [p. 3]").
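A sketch of the alternative the review keeps pointing toward: anchor a pattern for the whole expected citation at the end of the string, so citations containing spaces are captured intact. The pattern here is an assumption assembled from the indicators listed above, not the script's actual regex:

```python
import re

# Match an entire trailing citation rather than splitting on whitespace.
# The alternation of citation indicators is an illustrative assumption.
TRAILING_CITATION_RE = re.compile(
    r"\s+(?P<cite>(?:pp?\.|\[|\(|footnote\b|refer to\b|see also\b).*)$",
    re.IGNORECASE,
)

def split_trailing_citation(rest: str) -> tuple[str, str]:
    """Return (text, citation); citation is empty when none is found."""
    m = TRAILING_CITATION_RE.search(rest)
    if m:
        return rest[: m.start()].strip(), m.group("cite").strip()
    return rest.strip(), ""
```

Because the whole tail after the indicator is captured, multi-word citations like "pp. 1070-1071" or "Footnote 28" survive, which the final-whitespace `rsplit` cannot do.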
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
@tanhaow, one more fix I forgot to mention: the current form does not meet the data specification.


Associated Issue(s): resolves #1
Changes in this PR

- `scripts/build_parallel_corpus.py` to build a parallel sentence corpus from Notion export data, which outputs records with fields: `id`, `lang`, `text`, `en_tr`, `cite`, `en_cite`, `term`

Notes
Reviewer Checklist
`id`, `lang`, `text`, `en_tr`, `cite`, `en_cite`, `term`