Strip non-content HTML tags from output entirely by Tomotz · Pull Request #12 · Tomotz/IPA_transcript

Tomotz · 2026-02-16T04:05:49Z

Strip non-content HTML tags from output entirely

Summary

Changes process_html_file to strip skip-tag regions (script, style, head, noscript, svg, nav, footer) from the HTML content before paragraph matching, so they are completely removed from the output.

The previous approach (from merged PR #11) only filtered which paragraphs to process; skip-tag content (head, style, script blocks) was still written verbatim to the output between paragraphs. The output therefore still contained all the CSS, JS, and metadata, which is useless for ebook conversion and bloats the file (with stripping, 37MB → 14MB on twig_full.html).

The fix is a single-line change: content = SKIP_TAG_PATTERN.sub('', content) replaces the range-based filtering, and the now-unused _get_skip_ranges/_in_skip_range helpers are removed.
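
A minimal sketch of the stripping step, assuming a pattern shaped the way this PR describes it (named group for the tag, non-greedy body, backreference on the closing tag). The actual SKIP_TAG_PATTERN definition is not shown in this PR text, so the regex and the helper name strip_skip_tags below are illustrative assumptions, not the repository's code.

```python
import re

# Tag list taken from the PR summary.
SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")

# Hypothetical reconstruction of SKIP_TAG_PATTERN: opening tag with optional
# attributes, non-greedy body, closing tag matched via a named backreference.
SKIP_TAG_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(SKIP_TAGS) + r")\b[^>]*>.*?</(?P=tag)\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_skip_tags(content: str) -> str:
    """Remove every skip-tag region (hypothetical helper for illustration)."""
    return SKIP_TAG_PATTERN.sub("", content)


sample = (
    "<html><head><title>Book</title><style>p{color:red}</style></head>"
    "<body><nav>TOC</nav><p>First paragraph.</p><footer>fin</footer></body></html>"
)
cleaned = strip_skip_tags(sample)
print(cleaned)
```

Because the substitution runs before paragraph matching, nothing inside the removed regions can reach the output. The \b after the tag name keeps look-alike tags such as <header> from being treated as <head>.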

Test update: test_non_paragraph_content_preserved now asserts that <head> and <title> are absent from output (previously it expected <title> to be preserved, which contradicts the skip-tag stripping behavior).

Review & Testing Checklist for Human

Nested same-type tags: SKIP_TAG_PATTERN uses non-greedy .*? matching with backreference (</(?P=tag)>). With nested tags of the same type (e.g., <nav><nav>...</nav></nav>), a match of this kind would run from the outer opening tag to the first closing tag, so anything after the inner close, including the outer </nav>, is left behind as broken markup. Verify this doesn't happen in your ebook HTML, or consider whether it matters.
Verify nav and footer in skip list are appropriate: These were added beyond the original {script, style, head, noscript}. If any ebook has meaningful content in <nav> or <footer>, it would be silently dropped.
Test plan: Run python main.py twig_full.html --html -o output.html and confirm:
- No <head>, <style>, <script>, <nav>, <footer> tags in output (grep -c '<style' output.html should print 0)
- Book paragraphs are present with IPA + original text
- Output file size is significantly smaller than input
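
The nested-tag caveat in the checklist can be reproduced in isolation. The sketch below assumes a pattern of the shape described in this PR (it is not the repository's actual SKIP_TAG_PATTERN) and adds a hypothetical helper, find_leftover_skip_tags, that reports any skip-tag openers or closers still present after stripping. That is one possible way to automate the grep check from the test plan.

```python
import re

SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")

# Assumed pattern shape: non-greedy body plus backreference on the closing tag.
SKIP_TAG_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(SKIP_TAGS) + r")\b[^>]*>.*?</(?P=tag)\s*>",
    re.DOTALL | re.IGNORECASE,
)

# Matches any opening or closing skip tag that survived stripping.
LEFTOVER_PATTERN = re.compile(
    r"</?(?:" + "|".join(SKIP_TAGS) + r")\b[^>]*>", re.IGNORECASE
)


def find_leftover_skip_tags(html: str) -> list[str]:
    """Return skip-tag openers/closers still present (hypothetical helper)."""
    return LEFTOVER_PATTERN.findall(html)


# Nested same-type tags: the match stops at the first closing tag.
nested = "<p>a</p><nav><nav>inner</nav>outer</nav><p>b</p>"
stripped = SKIP_TAG_PATTERN.sub("", nested)
print(stripped)
print(find_leftover_skip_tags(stripped))
```

An empty list from find_leftover_skip_tags on the real output would suggest that no orphaned skip tags remain. A non-empty list points at nesting or unclosed tags that a regex-based strip cannot handle.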

Notes

Requested by: @Tomotz
Link to Devin run

Remove skip-tag regions (script, style, head, noscript, svg, nav, footer) from the HTML content before paragraph matching. This ensures they don't appear in the output at all, since they contain no book content and don't show in the ebook anyway.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>

devin-ai-integration bot and others added 2 commits February 16, 2026 04:05

Update test to expect head/title stripped from output

c7b6155

Co-Authored-By: tom mottes <tom.mottes@gmail.com>

Tomotz merged commit a5e3e1a into master Feb 16, 2026
1 check passed

Tomotz merged 2 commits into master from devin/1771214536-strip-skip-tags
