Restore tag-skipping for non-content HTML tags #11

Merged
Tomotz merged 3 commits into master from devin/1771213044-restore-skip-tags
Feb 16, 2026

@Tomotz Tomotz commented Feb 16, 2026

Restore tag-skipping for non-content HTML tags

Summary

Re-adds SKIP_TAGS logic that was lost during the paragraph-based refactor in PR #4. The original implementation (commit c2bcff9) skipped text inside script, style, head, and noscript tags. This PR restores that concept adapted to the current PARAGRAPH_PATTERN-based architecture, and expands the skip list to also include svg, nav, and footer.

How it works: Before processing paragraphs, _get_skip_ranges scans the HTML for regions enclosed by skip tags. Any <p> match whose start position falls inside a skip range is excluded from the matches list. Both the stdout and file-output cases share a single unified code path that iterates over this pre-filtered list (stdout uses sys.stdout as the output file).

Updates since last revision

  • Merged the stdout and file-output branches into a single code path. Previously there was a separate if not output_path branch using PARAGRAPH_PATTERN.sub(); now both cases use the same batched iteration loop, with sys.stdout as the writer when no output file is given. This also means stdout now benefits from parallel flite batching.
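One way to share a single write path between stdout and a file is sketched below. It is an assumption about the shape of the change, not the PR's actual code: `open_writer` and `write_chunks` are hypothetical names, and the flite batching is left out.

```python
import contextlib
import sys


def open_writer(output_path=None):
    """Hypothetical helper: return a context manager yielding a text stream."""
    if output_path:
        return open(output_path, "w", encoding="utf-8")
    # nullcontext hands back sys.stdout without closing it on exit.
    return contextlib.nullcontext(sys.stdout)


def write_chunks(chunks, output_path=None):
    """Write every chunk through the same loop, whatever the destination."""
    with open_writer(output_path) as out:
        for chunk in chunks:
            out.write(chunk)
```

With this shape, the caller never branches on `output_path`; only the choice of stream differs.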

Review & Testing Checklist for Human

  • Progress messages interleave with stdout output: When using --html without -o, progress lines (paragraph X / Y) from _assemble_paragraph are printed to stdout alongside the HTML content. This was true before this PR as well, but worth verifying it's acceptable — piping stdout to a file will include those progress lines.
  • Verify the expanded skip tag list is appropriate — the original only had {script, style, head, noscript}. This PR adds svg, nav, footer. Confirm these are all tags you want skipped for your ebook use case, and consider whether any others are missing (e.g., aside, header).
  • Nested skip tags: SKIP_TAG_PATTERN uses non-greedy .*? matching, so for nested tags of the same type (e.g., <nav><nav>inner</nav></nav>) the match runs from the outer opening tag to the first closing </nav>, leaving any content between the two </nav> tags unskipped. Unlikely in ebook HTML, but worth being aware of.
  • Test with a real ebook HTML file that has <p> tags inside <nav>, <head>, or <footer> sections and verify those paragraphs are passed through without IPA processing, while normal body <p> tags are still processed correctly.
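The nested-tag caveat in the checklist can be reproduced with a small standalone example. The pattern here is a hypothetical single-tag stand-in written in the same non-greedy style, since the real `SKIP_TAG_PATTERN` is not shown in the PR.

```python
import re

# ASSUMPTION: simplified stand-in for the nav branch of SKIP_TAG_PATTERN.
NAV_PATTERN = re.compile(r"<nav\b[^>]*>.*?</nav>", re.DOTALL | re.IGNORECASE)

html = "<nav><nav>inner</nav><p>tail</p></nav><p>body</p>"

match = NAV_PATTERN.search(html)
skipped = match.group(0)        # region that would be treated as a skip range
unskipped = html[match.end():]  # everything after the first closing </nav>

print(skipped)
print(unskipped)
```

The match stops at the first `</nav>`, so the `<p>tail</p>` paragraph, which is still inside the outer `<nav>`, falls outside the skip range and would be processed like a body paragraph.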


devin-ai-integration bot and others added 3 commits February 16, 2026 03:38
Re-add SKIP_TAGS logic that was removed during paragraph-based refactor.
Paragraphs inside script, style, head, noscript, svg, nav, and footer
tags are now skipped (no flite processing), reducing unnecessary work
for ebook conversion.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>

Removed the separate _in_skip_range check from the stdout path's
replace_paragraph callback. Now both code paths iterate over the
same pre-filtered matches list, so skip logic lives in one place.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>

Use sys.stdout when no output_path, eliminating the separate
if-not-output_path branch. Both cases now share the exact same
iteration and batching logic.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>
@Tomotz Tomotz merged commit 8afddb3 into master Feb 16, 2026
1 check passed