Restore tag-skipping for non-content HTML tags by Tomotz · Pull Request #11 · Tomotz/IPA_transcript

Tomotz · 2026-02-16T03:38:56Z

Restore tag-skipping for non-content HTML tags

Summary

Re-adds SKIP_TAGS logic that was lost during the paragraph-based refactor in PR #4. The original implementation (commit c2bcff9) skipped text inside script, style, head, and noscript tags. This PR restores that concept adapted to the current PARAGRAPH_PATTERN-based architecture, and expands the skip list to also include svg, nav, and footer.

How it works: Before processing paragraphs, _get_skip_ranges scans the HTML for regions enclosed by skip tags. Any <p> match whose start position falls inside a skip range is excluded from the matches list. Both the stdout and file-output cases share a single unified code path that iterates over this pre-filtered list (stdout uses sys.stdout as the output file).
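The filtering step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the regular expressions, the `get_skip_ranges` and `content_paragraphs` helpers, and the sample HTML are hypothetical, and only the tag list and the "start position inside a skip range" rule come from the description.

```python
import re

# Tag list taken from the PR description.
SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")

# Hypothetical patterns; the real SKIP_TAG_PATTERN / PARAGRAPH_PATTERN may differ.
SKIP_TAG_PATTERN = re.compile(
    r"<(" + "|".join(SKIP_TAGS) + r")\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
PARAGRAPH_PATTERN = re.compile(r"<p\b[^>]*>.*?</p\s*>", re.IGNORECASE | re.DOTALL)


def get_skip_ranges(html):
    """Return (start, end) offsets of every region enclosed by a skip tag."""
    return [m.span() for m in SKIP_TAG_PATTERN.finditer(html)]


def content_paragraphs(html):
    """Return <p> matches whose start offset is outside every skip range."""
    ranges = get_skip_ranges(html)
    return [
        m
        for m in PARAGRAPH_PATTERN.finditer(html)
        if not any(start <= m.start() < end for start, end in ranges)
    ]


sample = (
    "<head><p>meta</p></head>"
    "<body><p>hello</p><nav><p>menu</p></nav><p>world</p></body>"
)
kept = [m.group(0) for m in content_paragraphs(sample)]
print(kept)
```

Only the two body paragraphs survive the filter; the paragraphs inside `<head>` and `<nav>` are left untouched in the HTML rather than sent for processing.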

Updates since last revision

Merged the stdout and file-output branches into a single code path. Previously there was a separate if not output_path branch using PARAGRAPH_PATTERN.sub(); now both cases use the same batched iteration loop, with sys.stdout as the writer when no output file is given. This also means stdout now benefits from parallel flite batching.
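One way to get a single code path for both destinations is to pick the writer up front and hand it to the same loop. The sketch below assumes hypothetical helpers `open_output` and `write_batched`; the actual function names, batch size, and flite invocation in the repository are not shown in this PR text.

```python
import contextlib
import io
import sys


def open_output(output_path=None):
    """Return a context manager that yields a writable text stream."""
    if output_path is None:
        # nullcontext keeps sys.stdout open when the with-block exits.
        return contextlib.nullcontext(sys.stdout)
    return open(output_path, "w", encoding="utf-8")


def write_batched(chunks, out, batch_size=2):
    """Write chunks to `out` in fixed-size batches (stand-in for the real loop)."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # The real code would transcribe the batch here before writing.
        out.write("".join(batch))


# The same loop serves an in-memory buffer, a file, or stdout.
buffer = io.StringIO()
write_batched(["<p>a</p>", "<p>b</p>", "<p>c</p>"], buffer)

with open_output() as out:
    write_batched(["<p>to stdout</p>\n"], out)
```

Because the loop only sees a stream object, any batching improvement applies to both destinations without a separate branch.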

Review & Testing Checklist for Human

Progress messages interleave with stdout output: When using --html without -o, progress lines (paragraph X / Y) from _assemble_paragraph are printed to stdout alongside the HTML content. This was true before this PR as well, but worth verifying it's acceptable — piping stdout to a file will include those progress lines.
Verify the expanded skip tag list is appropriate — the original only had {script, style, head, noscript}. This PR adds svg, nav, footer. Confirm these are all tags you want skipped for your ebook use case, and consider whether any others are missing (e.g., aside, header).
Nested skip tags: SKIP_TAG_PATTERN uses non-greedy .*? matching, so for nested tags of the same type (e.g., <nav><nav>inner</nav></nav>) the match would run from the outer opening tag only to the first closing tag, leaving content between the two </nav> tags unskipped. Unlikely in ebook HTML but worth awareness.
Test with a real ebook HTML file that has <p> tags inside <nav>, <head>, or <footer> sections and verify those paragraphs are passed through without IPA processing, while normal body <p> tags are still processed correctly.
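The nested-tag caveat in the checklist can be reproduced with a few lines. The `NAV_PATTERN` below is a simplified stand-in for SKIP_TAG_PATTERN, and `balanced_ranges` is a hypothetical depth-counting alternative that is not part of this PR; it ignores edge cases such as self-closing tags and tags inside comments.

```python
import re

# Simplified stand-in for the non-greedy skip pattern, limited to <nav>.
NAV_PATTERN = re.compile(r"<nav\b[^>]*>.*?</nav>", re.DOTALL)

html = "<nav><nav><p>inner</p></nav><p>leftover</p></nav><p>body</p>"

match = NAV_PATTERN.search(html)
skipped = match.group(0)        # stops at the first </nav>
remainder = html[match.end():]  # still contains a paragraph from the outer <nav>


def balanced_ranges(html, tag):
    """Return (start, end) spans of outermost <tag>...</tag> pairs by counting depth."""
    token = re.compile(r"<(/?)" + re.escape(tag) + r"\b[^>]*>", re.IGNORECASE)
    ranges, depth, start = [], 0, None
    for m in token.finditer(html):
        if not m.group(1):      # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:         # closing tag that matches an open one
            depth -= 1
            if depth == 0:
                ranges.append((start, m.end()))
    return ranges


outer = balanced_ranges(html, "nav")
print(skipped)
print(remainder)
print(html[outer[0][1]:])
```

With the non-greedy pattern, `<p>leftover</p>` falls outside the skip range and would be processed; the depth-counting variant covers the whole outer element.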

Notes

Requested by: @Tomotz
Link to Devin run

Re-add SKIP_TAGS logic that was removed during paragraph-based refactor. Paragraphs inside script, style, head, noscript, svg, nav, and footer tags are now skipped (no flite processing), reducing unnecessary work for ebook conversion.
Co-Authored-By: tom mottes <tom.mottes@gmail.com>

Removed the separate _in_skip_range check from the stdout path's replace_paragraph callback. Now both code paths iterate over the same pre-filtered matches list, so skip logic lives in one place.
Co-Authored-By: tom mottes <tom.mottes@gmail.com>

devin-ai-integration bot and others added 3 commits February 16, 2026 03:38

Merge stdout and file-output paths into single code path

68479a5

Use sys.stdout when no output_path, eliminating the separate if-not-output_path branch. Both cases now share the exact same iteration and batching logic.
Co-Authored-By: tom mottes <tom.mottes@gmail.com>

Tomotz merged commit 8afddb3 into master Feb 16, 2026
1 check passed

Tomotz mentioned this pull request Feb 16, 2026

Strip non-content HTML tags from output entirely #12

Merged

3 tasks

Tomotz merged 3 commits into master from devin/1771213044-restore-skip-tags
