-
Notifications
You must be signed in to change notification settings - Fork 1k
Adding panic guard + Tests #2317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds an HTML sanitization pipeline and panic recovery to the page-type classifier. Changes
Sequence DiagramsequenceDiagram
autonumber
participant Caller
participant htmlToText
participant Recovery as "defer/recover"
participant Sanitizer as "bluemonday Sanitizer"
participant Converter as "HTML->Markdown Converter"
participant Result
Caller->>htmlToText: call with HTML payload
htmlToText->>Recovery: install defer recovery
htmlToText->>Sanitizer: sanitize HTML input
alt sanitization yields empty
Sanitizer-->>htmlToText: "" (empty)
htmlToText-->>Result: return "" , nil/error (early exit)
else sanitization produced content
Sanitizer-->>htmlToText: sanitized HTML
htmlToText->>Converter: convert sanitized HTML to text
alt conversion succeeds
Converter-->>Result: text, nil
else conversion fails
Converter-->>Result: "", error
end
end
Recovery-->>htmlToText: (on panic) capture and set err
htmlToText-->>Caller: return (text, err)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
common/pagetypeclassifier/pagetypeclassifier_test.go (2)
59-79: Deeply nested classification test looks good but HTML builder could be sharedThe classification behavior check (
Classifyreturns"other"on pathological HTML) matches the implementation and the intended fallback path, and this subtest should effectively guard against regressions in the panic handling at the classifier level.If you keep adding similar tests, consider extracting the deeply nested HTML builder into a small helper to avoid duplication with the later htmlToText test and make the intent clearer.
99-104: Consider asserting more specific output for normal HTMLThe “normal HTML” case only asserts that the result is non‑empty. If you want stronger protection against regressions in sanitization or markdown conversion, you could assert that the output contains expected substrings (e.g.,
"Title"and"Some content here"), rather than just non‑emptiness. This is optional but would make the test more robust.common/pagetypeclassifier/pagetypeclassifier.go (1)
43-59: Sanitizer comments may overstate its ability to prevent parser overflowsThe implementation and intent of
getSanitizerPolicyare clear, but the comments currently say this aggressively strips HTML “to reduce nesting depth and prevent parser stack overflow”.In practice, bluemonday still has to parse the original HTML (internally using
golang.org/x/net/html), so the 512‑node stack limit is still present during sanitization. The sanitizer is great for reducing noise and simplifying the tree, but the actual protection against the overflow panic is thedefer/recoverinhtmlToText.I’d suggest softening the comments to something like “helps reduce nesting depth and mitigate the risk, while
htmlToText’s recover guarantees we don’t crash on parser panics”.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
common/pagetypeclassifier/pagetypeclassifier.go(2 hunks)common/pagetypeclassifier/pagetypeclassifier_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
common/pagetypeclassifier/pagetypeclassifier_test.go (2)
common/pagetypeclassifier/pagetypeclassifier.go (1)
New(20-26)runner/runner.go (1)
New(117-406)
🪛 GitHub Actions: 🔨 Build Test
common/pagetypeclassifier/pagetypeclassifier_test.go
[error] 95-95: TestPageTypeClassifier/test_htmlToText_with_deeply_nested_HTML failed: html: open stack of elements exceeds 512 nodes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Functional Test (windows-latest)
🔇 Additional comments (1)
common/pagetypeclassifier/pagetypeclassifier.go (1)
36-41: Global sanitizer policy initialization looks correctThe
sanitizerPolicy+sync.Oncepattern is concurrency‑safe and avoids recreating the bluemonday policy on every call. This is a reasonable optimization on the hot path.
|
test fails |
Closes #2316
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.