Strip non-content HTML tags from output entirely #12
Merged
Remove skip-tag regions (script, style, head, noscript, svg, nav, footer) from the HTML content before paragraph matching. This ensures they don't appear in the output at all, since they contain no book content and don't show in the ebook anyway.
Co-Authored-By: tom mottes <tom.mottes@gmail.com>
Summary
Changes `process_html_file` to strip skip-tag regions (`script`, `style`, `head`, `noscript`, `svg`, `nav`, `footer`) from the HTML content before paragraph matching, so they are completely removed from the output.

The previous approach (from merged PR #11) only filtered which paragraphs to process but still wrote skip-tag content (head, style, script blocks) verbatim to the output between paragraphs. This meant the output still contained all the CSS, JS, and metadata, which is useless for ebook conversion and bloated the file (37MB → 14MB on `twig_full.html`).

The fix is a single-line change: `content = SKIP_TAG_PATTERN.sub('', content)` replaces the range-based filtering, and the now-unused `_get_skip_ranges` / `_in_skip_range` helpers are removed.

Test update: `test_non_paragraph_content_preserved` now asserts that `<head>` and `<title>` are absent from output (previously it expected `<title>` to be preserved, which contradicts the skip-tag stripping behavior).

Review & Testing Checklist for Human

- `SKIP_TAG_PATTERN` uses non-greedy `.*?` matching with a backreference (`</(?P=tag)>`). Nested tags of the same type (e.g., `<nav><nav>...</nav></nav>`) would not be stripped cleanly: the match ends at the first closing tag, so a stray outer `</nav>` is left behind. Verify this doesn't happen in your ebook HTML, or consider whether it matters.
- `nav` and `footer` in skip list are appropriate: These were added beyond the original {`script`, `style`, `head`, `noscript`}. If any ebook has meaningful content in `<nav>` or `<footer>`, it would be silently dropped.
- Run `python main.py twig_full.html --html -o output.html` and confirm there are no `<head>`, `<style>`, `<script>`, `<nav>`, or `<footer>` tags in the output (`grep -c '<style' output.html` should return 0).
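To make the stripping step and the nested-tag caveat concrete, here is a minimal sketch. The PR text does not show the actual definition of `SKIP_TAG_PATTERN`, so the pattern below is an assumed reconstruction from the description (named group, non-greedy `.*?`, backreference), and `strip_skip_tags` is a hypothetical helper name rather than a function from the repository.

```python
import re

# Assumed skip list, taken from the PR description.
SKIP_TAGS = ("script", "style", "head", "noscript", "svg", "nav", "footer")

# Assumed reconstruction of the pattern: an opening skip tag, any content
# (non-greedy, spanning newlines), then the matching closing tag.
# \b keeps "head" from matching "<header>".
SKIP_TAG_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(SKIP_TAGS) + r")\b[^>]*>.*?</(?P=tag)>",
    re.DOTALL | re.IGNORECASE,
)


def strip_skip_tags(content: str) -> str:
    """Remove every skip-tag region from the HTML string (hypothetical helper)."""
    return SKIP_TAG_PATTERN.sub("", content)


html = (
    "<html><head><title>T</title></head>"
    "<body><p>Hello</p><script>x()</script></body></html>"
)
cleaned = strip_skip_tags(html)
print(cleaned)  # <html><body><p>Hello</p></body></html>

# Nested tags of the same type: the non-greedy match stops at the first
# closing tag, so the remainder of the outer element is left behind.
nested = "<nav><nav>inner</nav>tail</nav><p>Body</p>"
leftover = strip_skip_tags(nested)
print(leftover)  # tail</nav><p>Body</p>
```

Under this assumed pattern, the nested case leaves both trailing text and an unmatched `</nav>` in the output. If nested skip tags turn up in real ebook HTML, an HTML parser would be a more robust alternative than a regex, at the cost of a larger change.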