Skip to content

Conversation

@Mzack9999
Copy link
Member

@Mzack9999 Mzack9999 commented Nov 20, 2025

Closes #2316

Summary by CodeRabbit

  • Bug Fixes

    • Improved HTML sanitization and conversion to prevent crashes and parser overflow on deeply nested or malformed HTML.
    • Enhanced error handling so conversion failures return safe, empty results instead of causing failures.
  • Tests

    • Added tests verifying recovery from deeply nested HTML and correct handling of normal and problematic HTML inputs.

✏️ Tip: You can customize this high-level summary in your review settings.

@Mzack9999 Mzack9999 self-assigned this Nov 20, 2025
@Mzack9999 Mzack9999 added the Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Nov 20, 2025
@coderabbitai
Copy link

coderabbitai bot commented Nov 20, 2025

Walkthrough

Adds an HTML sanitization pipeline and panic recovery to the page-type classifier. htmlToText now sanitizes input with a lazily-initialized bluemonday policy, recovers from panics, and returns errors for conversion failures; tests validating deep-nesting and normal HTML were added.

Changes

Cohort / File(s) Summary
HTML Sanitizer & Resilient Conversion
common/pagetypeclassifier/pagetypeclassifier.go
Introduces sanitizerPolicy and sanitizerPolicyOnce, adds getSanitizerPolicy(), updates htmlToText(html string) (text string, err error) to include defer-recover, sanitize input via bluemonday before conversion, early-exit on empty sanitized content, and propagate conversion errors. Adds imports sync, github.com/microcosm-cc/bluemonday, and fmt.
Tests for Deep-Nesting & Normal HTML
common/pagetypeclassifier/pagetypeclassifier_test.go
Adds tests: panic recovery with deeply nested HTML, htmlToText behavior for deeply nested HTML (expect empty/no panic), and htmlToText with normal HTML (expect non-empty text).

Sequence Diagram

sequenceDiagram
    autonumber
    participant Caller
    participant htmlToText
    participant Recovery as "defer/recover"
    participant Sanitizer as "bluemonday Sanitizer"
    participant Converter as "HTML->Markdown Converter"
    participant Result

    Caller->>htmlToText: call with HTML payload
    htmlToText->>Recovery: install defer recovery
    htmlToText->>Sanitizer: sanitize HTML input
    alt sanitization yields empty
        Sanitizer-->>htmlToText: "" (empty)
        htmlToText-->>Result: return "" , nil/error (early exit)
    else sanitization produced content
        Sanitizer-->>htmlToText: sanitized HTML
        htmlToText->>Converter: convert sanitized HTML to text
        alt conversion succeeds
            Converter-->>Result: text, nil
        else conversion fails
            Converter-->>Result: "", error
        end
    end
    Recovery-->>htmlToText: (on panic) capture and set err
    htmlToText-->>Caller: return (text, err)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review the sanitizer policy configuration and ensure the aggressive stripping meets expectations in common/pagetypeclassifier/pagetypeclassifier.go.
  • Verify the defer-recover block correctly transforms panics into actionable errors without hiding important failures.
  • Confirm any call sites or tests use the updated htmlToText signature and named return values consistently.
  • Inspect added tests in common/pagetypeclassifier/pagetypeclassifier_test.go for deterministic behavior and adequate coverage of edge cases.

Poem

🐰 I munched through tags and tangled trees,
Bluemonday snipped the wildest leaves,
A saved-from-panic tale I sing,
Clean text now hops from nested rings,
Happy rabbit, happy string. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'Adding panic guard + Tests' is vague and generic, using imprecise terminology that doesn't convey the specific nature of the changes (HTML sanitization, deeply nested HTML handling). Revise title to be more specific, e.g., 'Sanitize HTML and guard against deeply nested HTML stack overflow' or 'Add HTML sanitization and panic recovery for nested HTML'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The pull request fully addresses the panic caused by deeply nested HTML (#2316) by introducing HTML sanitization via bluemonday, a defer-recover block to catch panics, and comprehensive tests validating the fix.
Out of Scope Changes check ✅ Passed All changes are directly related to addressing the panic issue from #2316; no out-of-scope modifications detected beyond the intended HTML sanitization and panic recovery.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bufgix-2316-stack-overflow

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f13beb and 275d63f.

📒 Files selected for processing (2)
  • common/pagetypeclassifier/pagetypeclassifier.go (2 hunks)
  • common/pagetypeclassifier/pagetypeclassifier_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • common/pagetypeclassifier/pagetypeclassifier_test.go
  • common/pagetypeclassifier/pagetypeclassifier.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: release-test
  • GitHub Check: Functional Test (ubuntu-latest)
  • GitHub Check: Functional Test (macOS-latest)
  • GitHub Check: Lint Test
  • GitHub Check: Functional Test (windows-latest)
  • GitHub Check: Analyze (go)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Mzack9999 Mzack9999 marked this pull request as ready for review November 20, 2025 10:57
@auto-assign auto-assign bot requested a review from dogancanbakir November 20, 2025 10:57
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
common/pagetypeclassifier/pagetypeclassifier_test.go (2)

59-79: Deeply nested classification test looks good but HTML builder could be shared

The classification behavior check (Classify returns "other" on pathological HTML) matches the implementation and the intended fallback path, and this subtest should effectively guard against regressions in the panic handling at the classifier level.

If you keep adding similar tests, consider extracting the deeply nested HTML builder into a small helper to avoid duplication with the later htmlToText test and make the intent clearer.


99-104: Consider asserting more specific output for normal HTML

The “normal HTML” case only asserts that the result is non‑empty. If you want stronger protection against regressions in sanitization or markdown conversion, you could assert that the output contains expected substrings (e.g., "Title" and "Some content here"), rather than just non‑emptiness. This is optional but would make the test more robust.

common/pagetypeclassifier/pagetypeclassifier.go (1)

43-59: Sanitizer comments may overstate its ability to prevent parser overflows

The implementation and intent of getSanitizerPolicy are clear, but the comments currently say this aggressively strips HTML “to reduce nesting depth and prevent parser stack overflow”.

In practice, bluemonday still has to parse the original HTML (internally using golang.org/x/net/html), so the 512‑node stack limit is still present during sanitization. The sanitizer is great for reducing noise and simplifying the tree, but the actual protection against the overflow panic is the defer/recover in htmlToText.

I’d suggest softening the comments to something like “helps reduce nesting depth and mitigate the risk, while htmlToText’s recover guarantees we don’t crash on parser panics”.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0658afd and 5f13beb.

📒 Files selected for processing (2)
  • common/pagetypeclassifier/pagetypeclassifier.go (2 hunks)
  • common/pagetypeclassifier/pagetypeclassifier_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
common/pagetypeclassifier/pagetypeclassifier_test.go (2)
common/pagetypeclassifier/pagetypeclassifier.go (1)
  • New (20-26)
runner/runner.go (1)
  • New (117-406)
🪛 GitHub Actions: 🔨 Build Test
common/pagetypeclassifier/pagetypeclassifier_test.go

[error] 95-95: TestPageTypeClassifier/test_htmlToText_with_deeply_nested_HTML failed: html: open stack of elements exceeds 512 nodes.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Functional Test (windows-latest)
🔇 Additional comments (1)
common/pagetypeclassifier/pagetypeclassifier.go (1)

36-41: Global sanitizer policy initialization looks correct

The sanitizerPolicy + sync.Once pattern is concurrency‑safe and avoids recreating the bluemonday policy on every call. This is a reasonable optimization on the hot path.

@dogancanbakir
Copy link
Member

test fails

@Mzack9999 Mzack9999 merged commit 61a791c into dev Nov 28, 2025
13 checks passed
@Mzack9999 Mzack9999 deleted the bufgix-2316-stack-overflow branch November 28, 2025 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Projects

None yet

3 participants