feat: switch to pdfplumber for improved resume text extraction #26
Walkthrough
The PR replaces pypdf with pdfplumber as the PDF extraction library. The app now passes the uploaded file stream directly to pdfplumber.open(), extracts text page by page, and joins the results with newlines. Dependencies and tests are updated accordingly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- levelup/app.py (2 hunks)
- pyproject.toml (1 hunk)
- tests/test_resume.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_resume.py (1)
levelup/app.py (1)
extract_text_from_pdf(23-31)
🔇 Additional comments (2)
tests/test_resume.py (1)
24-40: LGTM! Well-structured test for pdfplumber integration.
The test correctly mocks the pdfplumber context manager protocol and accurately simulates the `pdf.pages` API. The use of `BytesIO` for the fake file input aligns well with how Streamlit's `file_uploader` works in practice.
levelup/app.py (1)
23-31: LGTM! Clean implementation with proper error handling.
The switch to pdfplumber is well-implemented:
- Context manager ensures proper resource cleanup
- The `or ""` fallback on line 26 safely handles pages where `extract_text()` returns `None`
- Joining pages with newlines preserves document structure
- Error handling is preserved from the previous implementation
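The four points above can be illustrated with a minimal sketch. This is not the PR's actual code: the `open_pdf` parameter is a hypothetical hook added here so the function can be exercised without a real PDF, and returning an empty string on failure is an assumption, since the review does not show how the original handler reports errors.

```python
from typing import Any, Callable, Optional


def extract_text_from_pdf(
    uploaded_file: Any,
    open_pdf: Optional[Callable[..., Any]] = None,  # hypothetical test hook
) -> str:
    """Return the text of every page, joined with newlines."""
    if open_pdf is None:
        # Imported lazily so a fake opener can stand in during tests.
        import pdfplumber  # third-party dependency added by this PR

        open_pdf = pdfplumber.open
    try:
        # pdfplumber.open accepts a path or a file-like object, and the
        # returned object closes its resources when the with-block exits.
        with open_pdf(uploaded_file) as pdf:
            # `or ""` guards against pages that yield no text.
            return "\n".join((page.extract_text() or "") for page in pdf.pages)
    except Exception:
        # Assumption: the real implementation may log or surface the error.
        return ""
```

In a test, a `MagicMock` whose `__enter__` returns an object with a `pages` list can be passed as `open_pdf`, which mirrors the context manager mocking the review describes for tests/test_resume.py.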
  "starlette>=0.49.1",
  "streamlit>=1.51.0",
- "pytest-mock==3.15.1"
+ "pytest-mock==3.15.1",
Remove test dependency from main dependencies.
pytest-mock is a testing library that should not be in the main dependencies list. It's already correctly listed in the dev dependency group at line 50. Including it here will unnecessarily bloat production installations.
Apply this diff to remove it from main dependencies:
  "starlette>=0.49.1",
  "streamlit>=1.51.0",
- "pytest-mock==3.15.1",
  "pdfplumber==0.11.8"
🤖 Prompt for AI Agents
In pyproject.toml around line 32, remove the "pytest-mock==3.15.1" entry from
the main dependencies section (it's a test-only library and already present in
the dev group); delete that line and adjust any trailing commas or list
formatting so the dependencies list remains valid.
  "streamlit>=1.51.0",
- "pytest-mock==3.15.1"
+ "pytest-mock==3.15.1",
+ "pdfplumber==0.11.8"
🧩 Analysis chain
Verify pdfplumber version validity and mypy configuration.
Please check:
- Whether version 0.11.8 is valid and has no known security vulnerabilities
- Whether `pdfplumber` requires a mypy override entry in `tool.mypy.overrides` (lines 162-181), similar to how `pypdf.*` was previously configured
Run the following script to check the pdfplumber version and security advisories:
Additionally, search for whether pdfplumber provides type stubs:
🏁 Script executed:
#!/bin/bash
# Check PyPI for latest pdfplumber version
curl -s https://pypi.org/pypi/pdfplumber/json | jq '.info.version'
# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "pdfplumber") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
Length of output: 58
🌐 Web query:
Does pdfplumber 0.11.8 provide type stubs or require mypy ignore_missing_imports configuration?
💡 Result:
Short answer: No — pdfplumber (0.11.8) does not ship type stubs / a py.typed marker. You will need to provide your own stubs or tell mypy to ignore missing imports (e.g., ignore_missing_imports = True or per-import # type: ignore). Sources: pdfplumber project page / repo and PyPI listing. [1][2]
References
- pdfplumber GitHub repository (no bundled typing files/stubs). [1]
- pdfplumber on PyPI (release files; package has no separate types package). [2]
---
🏁 Script executed:
```shell
#!/bin/bash
# Check current mypy overrides in pyproject.toml
rg -A 20 "tool\.mypy\.overrides" pyproject.toml
# Check for pdfplumber imports in codebase
rg -n "import pdfplumber|from pdfplumber" --type=py
```
Length of output: 511
Add pdfplumber to mypy overrides configuration.
pdfplumber 0.11.8 does not provide type stubs and must be configured in mypy to suppress type-checking errors. Add "pdfplumber.*" to the ignore_missing_imports list in tool.mypy.overrides (lines 162-181 in pyproject.toml), or add # type: ignore to the import in levelup/app.py:7.
🤖 Prompt for AI Agents
pyproject.toml around lines 162-181: pdfplumber lacks type stubs and is causing
mypy import errors; add "pdfplumber.*" to the tool.mypy.overrides
ignore_missing_imports list (or alternatively add a trailing "# type: ignore" to
the pdfplumber import at levelup/app.py:7) to suppress missing-import
type-checking errors.