Strip elements by attribute rules (id="secondary", id="actionbar", etc.)#13

Merged
Tomotz merged 3 commits into master from
devin/1771215833-strip-secondary-actionbar
Feb 16, 2026
@Tomotz Tomotz commented Feb 16, 2026

Strip elements by attribute rules

Summary

Extends the existing HTML stripping (which already removes <script>, <style>, <head>, etc.) to also strip elements matching configurable (tag, attr, value) rules — currently <div id="secondary"> and <div id="actionbar">.

The new _strip_tags_by_attr function tracks nesting depth of the specified tag type to find the correct closing tag. It handles the target attribute appearing in any position among other attributes (e.g. <div class="widget-area" id="secondary">). Default rules only match div, but expanding to other tags is a one-line addition:

SKIP_ATTR_RULES = [
    ('div', 'id', 'secondary'),
    ('div', 'id', 'actionbar'),
    # ('section', 'id', 'sidebar'),  # example: other tag types
    # ('div', 'class', 'ads'),       # example: other attributes
]

_strip_tags_by_attr is called in process_html_file, right after the existing SKIP_TAG_PATTERN.sub call.
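To make the depth-tracking idea concrete, here is a minimal sketch of how such a function could work. It is not the code merged in this PR: the name strip_tags_by_attr_sketch and both regexes are assumptions for illustration, and the sketch ignores edge cases such as self-closing tags or ">" inside attribute values.

```python
import re

# Same tuple shape as SKIP_ATTR_RULES in this PR.
SKIP_ATTR_RULES = [
    ('div', 'id', 'secondary'),
    ('div', 'id', 'actionbar'),
]


def strip_tags_by_attr_sketch(html, rules=SKIP_ATTR_RULES):
    """Hypothetical sketch: remove each element matching a (tag, attr, value) rule."""
    for tag, attr, value in rules:
        # Opening tag carrying attr="value" anywhere among its attributes.
        open_re = re.compile(
            r'<%s\b[^>]*?\s%s\s*=\s*(["\'])%s\1[^>]*>'
            % (re.escape(tag), re.escape(attr), re.escape(value)),
            re.IGNORECASE,
        )
        # Any opening or closing tag of the same type, used for depth counting.
        any_re = re.compile(r'<(/?)%s\b[^>]*>' % re.escape(tag), re.IGNORECASE)
        pos = 0
        while True:
            start = open_re.search(html, pos)
            if start is None:
                break
            depth = 1
            end = len(html)  # unclosed element: strip through end of input
            for m in any_re.finditer(html, start.end()):
                depth += -1 if m.group(1) else 1
                if depth == 0:
                    end = m.end()
                    break
            html = html[:start.start()] + html[end:]
            pos = start.start()
    return html


sample = '<p>a</p><div class="widget-area" id="secondary"><div>x</div>y</div><p>b</p>'
print(strip_tags_by_attr_sketch(sample))
```

Counting only tags of the same type keeps the scan simple: a nested <div> raises the depth, so the first </div> does not end the match prematurely.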

Review & Testing Checklist for Human

  • Malformed HTML: If a matched opening tag is never closed, everything from that tag to end-of-file will be silently stripped. Verify that your input HTML has no unclosed instances of these elements.
  • Class attribute caveat: The attribute match is on the exact value: class="ads" matches, but class="ads sidebar" does not. Class-based rules that need to match multi-value class attributes will require adjusting the regex.
  • Content check: Confirm no actual book content lives inside elements matching these rules in your ebook HTML files
  • Test plan: Run python main.py twig_full.html --html -o output.html and verify:
    • grep -c 'id="secondary"' output.html returns 0
    • grep -c 'id="actionbar"' output.html returns 0
    • Book paragraphs are still present and correct
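For the class attribute caveat above, one possible adjustment is to compare class values token by token instead of as a whole string. The sketch below is only an illustration under that assumption: opening_tag_matches, TAG_RE and ATTR_RE are hypothetical names that do not appear in this PR, and the parsing assumes quoted attribute values with no ">" inside them.

```python
import re

# Hypothetical helpers, not part of the merged code.
TAG_RE = re.compile(r'<(\w+)\b([^>]*)>')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*(["\'])(.*?)\2', re.DOTALL)


def opening_tag_matches(tag_text, tag, attr, value):
    """Return True if an opening tag string satisfies a (tag, attr, value) rule."""
    m = TAG_RE.fullmatch(tag_text)
    if m is None or m.group(1).lower() != tag:
        return False
    # Map attribute names to their quoted values.
    attrs = {name.lower(): val for name, _, val in ATTR_RE.findall(m.group(2))}
    if attr not in attrs:
        return False
    if attr == 'class':
        # class holds whitespace-separated tokens, so match any single token.
        return value in attrs[attr].split()
    # Other attributes keep the exact-value comparison.
    return attrs[attr] == value


print(opening_tag_matches('<div class="ads sidebar">', 'div', 'class', 'ads'))
print(opening_tag_matches('<div class="adsense">', 'div', 'class', 'ads'))
```

Splitting on whitespace avoids the substring trap, where a rule for "ads" would also hit class="adsense".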

Notes

devin-ai-integration bot and others added 2 commits February 16, 2026 04:25
…output

Co-Authored-By: tom mottes <tom.mottes@gmail.com>
Co-Authored-By: tom mottes <tom.mottes@gmail.com>
@Tomotz Tomotz changed the title from 'Strip <div id="secondary"> and <div id="actionbar"> blocks from HTML output' to 'Strip elements by attribute rules (id="secondary", id="actionbar", etc.)' Feb 16, 2026
… tuples

Co-Authored-By: tom mottes <tom.mottes@gmail.com>
@Tomotz Tomotz merged commit b88149f into master Feb 16, 2026
1 check passed