Strip non-content HTML tags from output entirely #12

Merged
Tomotz merged 2 commits into master from devin/1771214536-strip-skip-tags
Feb 16, 2026

@Tomotz Tomotz commented Feb 16, 2026

Strip non-content HTML tags from output entirely

Summary

Changes process_html_file to strip skip-tag regions (script, style, head, noscript, svg, nav, footer) from the HTML content before paragraph matching, so they are completely removed from the output.

The previous approach (from merged PR #11) only filtered which paragraphs to process but still wrote skip-tag content (head, style, script blocks) verbatim to the output between paragraphs. This meant the output still contained all the CSS, JS, and metadata, which is useless for ebook conversion and bloated the file (stripping takes twig_full.html from 37MB to 14MB).

The fix is a single-line change: content = SKIP_TAG_PATTERN.sub('', content) replaces the range-based filtering, and the now-unused _get_skip_ranges/_in_skip_range helpers are removed.
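A minimal sketch of that substitution, assuming a pattern shaped like the one described in this PR (a named group `tag`, a non-greedy body, and a backreference in the closing tag). The actual definition of SKIP_TAG_PATTERN in the repository may differ; the regex flags and the `strip_skip_tags` helper name are assumptions made for illustration.

```python
import re

# Assumed reconstruction of the pattern; the real SKIP_TAG_PATTERN may differ.
SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")
SKIP_TAG_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(SKIP_TAGS) + r")\b[^>]*>.*?</(?P=tag)>",
    re.DOTALL | re.IGNORECASE,
)


def strip_skip_tags(content: str) -> str:
    """Hypothetical helper: remove every skip-tag region from the HTML."""
    return SKIP_TAG_PATTERN.sub("", content)


html = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><nav>menu</nav><p>Hello</p><script>x()</script></body></html>"
)
print(strip_skip_tags(html))
# prints: <html><body><p>Hello</p></body></html>
```

Because the stripping happens before paragraph matching, anything inside a removed region never reaches the output, including the `<style>` block nested in `<head>`.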

Test update: test_non_paragraph_content_preserved now asserts that <head> and <title> are absent from output (previously it expected <title> to be preserved, which contradicts the skip-tag stripping behavior).
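The updated assertions might look roughly like the sketch below. `fake_process_html` is a hypothetical stand-in for `process_html_file`, whose real signature and IPA processing are not shown in this PR, and the pattern is an assumed reconstruction. Only the shape of the checks is the point here.

```python
import re

# Stand-in pattern and function; both are assumptions, not the project's code.
_SKIP = re.compile(
    r"<(?P<tag>script|style|head|noscript|svg|nav|footer)\b[^>]*>.*?</(?P=tag)>",
    re.DOTALL | re.IGNORECASE,
)


def fake_process_html(content: str) -> str:
    """Hypothetical stand-in for process_html_file: strips skip tags only."""
    return _SKIP.sub("", content)


def test_non_paragraph_content_preserved() -> None:
    source = (
        "<html><head><title>Book</title></head>"
        "<body><h1>Chapter 1</h1><p>Text</p></body></html>"
    )
    output = fake_process_html(source)
    # Skip-tag regions, including <title> inside <head>, are gone.
    assert "<head" not in output
    assert "<title" not in output
    # Assumed: non-paragraph content outside skip tags is still written out.
    assert "<h1>Chapter 1</h1>" in output


test_non_paragraph_content_preserved()
```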

Review & Testing Checklist for Human

  • Nested same-type tags: SKIP_TAG_PATTERN uses non-greedy .*? matching with a backreference (</(?P=tag)>). For nested tags of the same type (e.g., <nav><nav>...</nav></nav>), the match runs from the outer opening tag to the first closing tag, so the remainder of the outer element and an orphaned closing tag would be left in the output. Verify this doesn't happen in your ebook HTML, or consider whether it matters.
  • Verify nav and footer in skip list are appropriate: These were added beyond the original {script, style, head, noscript}. If any ebook has meaningful content in <nav> or <footer>, it would be silently dropped.
  • Test plan: Run python main.py twig_full.html --html -o output.html and confirm:
    • No <head>, <style>, <script>, <nav>, <footer> tags in output (grep -c '<style' output.html should return 0)
    • Book paragraphs are present with IPA + original text
    • Output file size is significantly smaller than input
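The nested-tag concern from the checklist can be reproduced in isolation. The sketch below assumes a pattern of the shape described in this PR, restricted to `<nav>` for brevity; it is not the project's actual SKIP_TAG_PATTERN. A regex scan starts at the leftmost opening tag, and the non-greedy body stops at the first closing tag.

```python
import re

# Assumed pattern shape, per the PR description; limited to <nav> for brevity.
PATTERN = re.compile(r"<(?P<tag>nav)\b[^>]*>.*?</(?P=tag)>", re.DOTALL)

nested = "<nav>outer <nav>inner</nav> tail</nav><p>Book text</p>"
result = PATTERN.sub("", nested)
print(repr(result))
# prints: ' tail</nav><p>Book text</p>'

# Closing tags with no matching opening tag left in the output.
orphans = len(re.findall(r"</nav>", result)) - len(re.findall(r"<nav\b", result))
print(orphans)
# prints: 1
```

A similar count of opening versus closing tags on output.html would be one way to check whether a given ebook is affected.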

Notes

devin-ai-integration bot and others added 2 commits February 16, 2026 04:05
Remove skip-tag regions (script, style, head, noscript, svg, nav,
footer) from the HTML content before paragraph matching. This ensures
they don't appear in the output at all, since they contain no book
content and don't show in the ebook anyway.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>
@Tomotz Tomotz merged commit a5e3e1a into master Feb 16, 2026
1 check passed