Restore tag-skipping for non-content HTML tags #11
Merged
Re-add SKIP_TAGS logic that was removed during paragraph-based refactor. Paragraphs inside script, style, head, noscript, svg, nav, and footer tags are now skipped (no flite processing), reducing unnecessary work for ebook conversion. Co-Authored-By: tom mottes <tom.mottes@gmail.com>
Removed the separate _in_skip_range check from the stdout path's replace_paragraph callback. Now both code paths iterate over the same pre-filtered matches list, so skip logic lives in one place. Co-Authored-By: tom mottes <tom.mottes@gmail.com>
Use sys.stdout when no output_path, eliminating the separate if-not-output_path branch. Both cases now share the exact same iteration and batching logic. Co-Authored-By: tom mottes <tom.mottes@gmail.com>
Summary
Re-adds SKIP_TAGS logic that was lost during the paragraph-based refactor in PR #4. The original implementation (commit c2bcff9) skipped text inside script, style, head, and noscript tags. This PR restores that concept, adapted to the current PARAGRAPH_PATTERN-based architecture, and expands the skip list to also include svg, nav, and footer.
How it works: Before processing paragraphs, _get_skip_ranges scans the HTML for regions enclosed by skip tags. Any <p> match whose start position falls inside a skip range is excluded from the matches list. Both the stdout and file-output cases share a single unified code path that iterates over this pre-filtered list (stdout uses sys.stdout as the output file).
Updates since last revision
- Removed the separate if-not-output_path branch that used PARAGRAPH_PATTERN.sub(); both cases now use the same batched iteration loop, with sys.stdout as the writer when no output file is given. This also means stdout now benefits from parallel flite batching.
Review & Testing Checklist for Human
- With --html and no -o, progress lines (paragraph X / Y) from _assemble_paragraph are printed to stdout alongside the HTML content. This was true before this PR as well, but it is worth verifying that this is acceptable: piping stdout to a file will include those progress lines.
- The original skip set was {script, style, head, noscript}. This PR adds svg, nav, and footer. Confirm these are all tags you want skipped for your ebook use case, and consider whether any others are missing (e.g., aside, header).
- SKIP_TAG_PATTERN uses non-greedy .*? matching, so for nested tags of the same type (e.g., <nav><nav>inner</nav></nav>) the match runs from the outer opening tag to the first closing tag, leaving content between the two </nav> tags unskipped. Unlikely in ebook HTML but worth awareness.
- Test with HTML that has <p> tags inside <nav>, <head>, or <footer> sections and verify those paragraphs are passed through without IPA processing, while normal body <p> tags are still processed correctly.
Notes
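For reviewers, the skip-range filtering described under "How it works" and the nested-tag caveat from the checklist can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the two regular expressions and the helper names get_skip_ranges, in_skip_range, and filter_paragraphs are assumptions made for the example, since the PR's actual SKIP_TAG_PATTERN and PARAGRAPH_PATTERN definitions are not shown here.

```python
import re

# Assumed reconstructions for illustration only; the real patterns may differ.
SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")
SKIP_TAG_PATTERN = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(SKIP_TAGS),
    re.DOTALL | re.IGNORECASE,
)
PARAGRAPH_PATTERN = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL | re.IGNORECASE)


def get_skip_ranges(html):
    """Return (start, end) spans of regions enclosed by skip tags."""
    return [m.span() for m in SKIP_TAG_PATTERN.finditer(html)]


def in_skip_range(pos, ranges):
    """True if pos falls inside any of the skip spans."""
    return any(start <= pos < end for start, end in ranges)


def filter_paragraphs(html):
    """Return <p> matches whose start position lies outside every skip span."""
    ranges = get_skip_ranges(html)
    return [
        m for m in PARAGRAPH_PATTERN.finditer(html)
        if not in_skip_range(m.start(), ranges)
    ]


# Paragraphs inside head, nav, and footer are dropped; the body one is kept.
html = (
    "<head><p>meta</p></head><body><nav><p>menu</p></nav>"
    "<p>Hello</p><footer><p>fin</p></footer></body>"
)
kept = [m.group(0) for m in filter_paragraphs(html)]
print(kept)  # ['<p>Hello</p>']

# Nested same-type tags: the non-greedy match ends at the FIRST </nav>,
# so the paragraph between the two closing tags is not skipped.
nested = "<nav><nav><p>inner</p></nav><p>between</p></nav><p>body</p>"
kept_nested = [m.group(0) for m in filter_paragraphs(nested)]
print(kept_nested)  # ['<p>between</p>', '<p>body</p>']
```

If nested skip tags ever turn up in real input, a depth-counting scan or an HTML parser would be one way to cover the outer element fully; for typical ebook HTML the regex approach is likely sufficient.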