Adding panic guard + Tests #2317

Mzack9999 · 2025-11-20T10:48:20Z

Summary by CodeRabbit

Bug Fixes
- Improved HTML sanitization and conversion to prevent crashes and parser overflow on deeply nested or malformed HTML.
- Enhanced error handling so conversion failures return safe, empty results instead of causing failures.
Tests
- Added tests verifying recovery from deeply nested HTML and correct handling of normal and problematic HTML inputs.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-11-20T10:48:26Z

Walkthrough

Adds an HTML sanitization pipeline and panic recovery to the page-type classifier. htmlToText now sanitizes input with a lazily-initialized bluemonday policy, recovers from panics, and returns errors for conversion failures; tests validating deep-nesting and normal HTML were added.

Changes

Cohort / File(s)	Summary
HTML Sanitizer & Resilient Conversion `common/pagetypeclassifier/pagetypeclassifier.go`	Introduces `sanitizerPolicy` and `sanitizerPolicyOnce`, adds `getSanitizerPolicy()`, updates `htmlToText(html string) (text string, err error)` to include defer-recover, sanitize input via bluemonday before conversion, early-exit on empty sanitized content, and propagate conversion errors. Adds imports `sync`, `github.com/microcosm-cc/bluemonday`, and `fmt`.
Tests for Deep-Nesting & Normal HTML `common/pagetypeclassifier/pagetypeclassifier_test.go`	Adds tests: panic recovery with deeply nested HTML, `htmlToText` behavior for deeply nested HTML (expect empty/no panic), and `htmlToText` with normal HTML (expect non-empty text).

Sequence Diagram

sequenceDiagram
    autonumber
    participant Caller
    participant htmlToText
    participant Recovery as "defer/recover"
    participant Sanitizer as "bluemonday Sanitizer"
    participant Converter as "HTML->Markdown Converter"
    participant Result

    Caller->>htmlToText: call with HTML payload
    htmlToText->>Recovery: install defer recovery
    htmlToText->>Sanitizer: sanitize HTML input
    alt sanitization yields empty
        Sanitizer-->>htmlToText: "" (empty)
        htmlToText-->>Result: return "" , nil/error (early exit)
    else sanitization produced content
        Sanitizer-->>htmlToText: sanitized HTML
        htmlToText->>Converter: convert sanitized HTML to text
        alt conversion succeeds
            Converter-->>Result: text, nil
        else conversion fails
            Converter-->>Result: "", error
        end
    end
    Recovery-->>htmlToText: (on panic) capture and set err
    htmlToText-->>Caller: return (text, err)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Review the sanitizer policy configuration and ensure the aggressive stripping meets expectations in common/pagetypeclassifier/pagetypeclassifier.go.
Verify the defer-recover block correctly transforms panics into actionable errors without hiding important failures.
Confirm any call sites or tests use the updated htmlToText signature and named return values consistently.
Inspect added tests in common/pagetypeclassifier/pagetypeclassifier_test.go for deterministic behavior and adequate coverage of edge cases.

Poem

🐰 I munched through tags and tangled trees,
Bluemonday snipped the wildest leaves,
A saved-from-panic tale I sing,
Clean text now hops from nested rings,
Happy rabbit, happy string. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title check	❓ Inconclusive	The title 'Adding panic guard + Tests' is vague and generic, using imprecise terminology that doesn't convey the specific nature of the changes (HTML sanitization, deeply nested HTML handling).	Revise title to be more specific, e.g., 'Sanitize HTML and guard against deeply nested HTML stack overflow' or 'Add HTML sanitization and panic recovery for nested HTML'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	The pull request fully addresses the panic caused by deeply nested HTML (#2316) by introducing HTML sanitization via bluemonday, a defer-recover block to catch panics, and comprehensive tests validating the fix.
Out of Scope Changes check	✅ Passed	All changes are directly related to addressing the panic issue from #2316; no out-of-scope modifications detected beyond the intended HTML sanitization and panic recovery.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bufgix-2316-stack-overflow

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f13beb and 275d63f.

📒 Files selected for processing (2)

common/pagetypeclassifier/pagetypeclassifier.go (2 hunks)
common/pagetypeclassifier/pagetypeclassifier_test.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

common/pagetypeclassifier/pagetypeclassifier_test.go
common/pagetypeclassifier/pagetypeclassifier.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: release-test
GitHub Check: Functional Test (ubuntu-latest)
GitHub Check: Functional Test (macOS-latest)
GitHub Check: Lint Test
GitHub Check: Functional Test (windows-latest)
GitHub Check: Analyze (go)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

common/pagetypeclassifier/pagetypeclassifier_test.go (2)

59-79: Deeply nested classification test looks good but HTML builder could be shared

The classification behavior check (Classify returns "other" on pathological HTML) matches the implementation and the intended fallback path, and this subtest should effectively guard against regressions in the panic handling at the classifier level.

If you keep adding similar tests, consider extracting the deeply nested HTML builder into a small helper to avoid duplication with the later htmlToText test and make the intent clearer.

99-104: Consider asserting more specific output for normal HTML

The “normal HTML” case only asserts that the result is non‑empty. If you want stronger protection against regressions in sanitization or markdown conversion, you could assert that the output contains expected substrings (e.g., "Title" and "Some content here"), rather than just non‑emptiness. This is optional but would make the test more robust.

common/pagetypeclassifier/pagetypeclassifier.go (1)

43-59: Sanitizer comments may overstate its ability to prevent parser overflows

The implementation and intent of getSanitizerPolicy are clear, but the comments currently say this aggressively strips HTML “to reduce nesting depth and prevent parser stack overflow”.

In practice, bluemonday still has to parse the original HTML (internally using golang.org/x/net/html), so the 512‑node stack limit is still present during sanitization. The sanitizer is great for reducing noise and simplifying the tree, but the actual protection against the overflow panic is the defer/recover in htmlToText.

I’d suggest softening the comments to something like “helps reduce nesting depth and mitigate the risk, while htmlToText’s recover guarantees we don’t crash on parser panics”.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0658afd and 5f13beb.

📒 Files selected for processing (2)

common/pagetypeclassifier/pagetypeclassifier.go (2 hunks)
common/pagetypeclassifier/pagetypeclassifier_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

common/pagetypeclassifier/pagetypeclassifier_test.go (2)

common/pagetypeclassifier/pagetypeclassifier.go (1)

New (20-26)

runner/runner.go (1)

New (117-406)

🪛 GitHub Actions: 🔨 Build Test

common/pagetypeclassifier/pagetypeclassifier_test.go

[error] 95-95: TestPageTypeClassifier/test_htmlToText_with_deeply_nested_HTML failed: html: open stack of elements exceeds 512 nodes.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Functional Test (windows-latest)

🔇 Additional comments (1)

common/pagetypeclassifier/pagetypeclassifier.go (1)

36-41: Global sanitizer policy initialization looks correct

The sanitizerPolicy + sync.Once pattern is concurrency‑safe and avoids recreating the bluemonday policy on every call. This is a reasonable optimization on the hot path.

common/pagetypeclassifier/pagetypeclassifier_test.go

common/pagetypeclassifier/pagetypeclassifier.go

dogancanbakir · 2025-11-23T13:55:13Z

test fails

adding panic guard + tests

0309110

Mzack9999 self-assigned this Nov 20, 2025

Mzack9999 added the Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Nov 20, 2025

lint

5f13beb

Mzack9999 marked this pull request as ready for review November 20, 2025 10:57

auto-assign bot requested a review from dogancanbakir November 20, 2025 10:57

coderabbitai bot reviewed Nov 20, 2025

View reviewed changes

common/pagetypeclassifier/pagetypeclassifier_test.go Show resolved Hide resolved

common/pagetypeclassifier/pagetypeclassifier.go Show resolved Hide resolved

dogancanbakir approved these changes Nov 21, 2025

View reviewed changes

fix test

275d63f

Mzack9999 merged commit 61a791c into dev Nov 28, 2025
13 checks passed

Mzack9999 deleted the bufgix-2316-stack-overflow branch November 28, 2025 15:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding panic guard + Tests #2317

Adding panic guard + Tests #2317

Uh oh!

Mzack9999 commented Nov 20, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 20, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

dogancanbakir commented Nov 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Adding panic guard + Tests #2317

Adding panic guard + Tests #2317

Uh oh!

Conversation

Mzack9999 commented Nov 20, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dogancanbakir commented Nov 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Mzack9999 commented Nov 20, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 20, 2025 •

edited

Loading