feat: switch to pdfplumber for improved resume text extraction #26

baranylcn · 2025-11-16T23:04:51Z

Summary by CodeRabbit

Release Notes

Chores
- Updated PDF parsing library for improved reliability and performance in document handling.
- Refreshed dependencies to align with current library versions.

coderabbitai · 2025-11-16T23:05:03Z

Walkthrough

The PR switches the PDF extraction library from pypdf to pdfplumber. The app now streams uploaded files directly into pdfplumber.open(), extracts text page by page, and joins the results with newlines. Dependencies and tests are updated accordingly.

Changes

Cohort / File(s)	Summary
PDF Extraction Implementation `levelup/app.py`	Replaces pypdf with pdfplumber for PDF parsing; switches from the io.BytesIO-based PdfReader approach to streaming the file directly into pdfplumber.open(). Text extraction now iterates through the pages and joins their text with newline separators. Error handling is preserved.
Dependency Management `pyproject.toml`	Removes pypdf>=6.1.3 dependency; adds pdfplumber==0.11.8. Removes pypdf.* from mypy per-module overrides.
Test Updates `tests/test_resume.py`	Updates mocks to simulate pdfplumber.open context manager and its PDF object structure. Replaces PdfReader mocking with pdfplumber-based mocking. Assertions and test expectations remain functionally equivalent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

app.py: Verify pdfplumber API usage is correct; ensure text extraction from pages produces expected output format and handles multi-page documents properly.
pyproject.toml: Confirm pdfplumber version is compatible and no additional transitive dependencies are problematic.
tests/test_resume.py: Validate mock setup properly simulates pdfplumber's context manager behavior and that mocked page extraction returns expected values.

Poem

🐰 A rabbit hops through PDFs with glee,
pypdf was fine, but pdfplumber's the key!
Page after page, we extract with care,
New library magic fills the air ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: switch to pdfplumber for improved resume text extraction' directly summarizes the main change—replacing pypdf with pdfplumber for PDF text extraction across app.py, pyproject.toml, and tests.



coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 382d8f1 and 5399671.

📒 Files selected for processing (3)

levelup/app.py (2 hunks)
pyproject.toml (1 hunks)
tests/test_resume.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_resume.py (1)

levelup/app.py (1)

extract_text_from_pdf (23-31)

🔇 Additional comments (2)

tests/test_resume.py (1)

24-40: LGTM! Well-structured test for pdfplumber integration.

The test correctly mocks the pdfplumber context manager protocol and accurately simulates the pdf.pages API. The use of BytesIO for the fake file input aligns well with how Streamlit's file_uploader works in practice.
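The mocking approach described above can be sketched roughly as follows. This is not the PR's actual test: `extract_text` is a hypothetical stand-in for the function under test, and `build_fake_open` is an illustrative helper. The sketch passes the opener in as an argument so it runs without pdfplumber installed; the real test presumably patches `pdfplumber.open` instead.

```python
import io
from unittest.mock import MagicMock


def extract_text(uploaded_file, open_pdf):
    """Hypothetical stand-in for the function under test.

    `open_pdf` plays the role of pdfplumber.open so the sketch
    does not need pdfplumber to be installed.
    """
    with open_pdf(uploaded_file) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def build_fake_open(page_texts):
    """Build a mock that mimics pdfplumber.open for the given page texts."""
    pages = []
    for text in page_texts:
        page = MagicMock()
        page.extract_text.return_value = text
        pages.append(page)

    fake_pdf = MagicMock()
    fake_pdf.pages = pages

    fake_open = MagicMock()
    # `with fake_open(...) as pdf:` binds `pdf` to the value returned by __enter__.
    fake_open.return_value.__enter__.return_value = fake_pdf
    return fake_open


# BytesIO stands in for the file-like object an upload widget would provide.
fake_file = io.BytesIO(b"%PDF-1.4 fake bytes")
fake_open = build_fake_open(["Jane Doe", None, "Python, SQL"])
result = extract_text(fake_file, fake_open)
```

With pytest-mock, the same mock would typically be installed via `mocker.patch` on the module attribute that the app imports; the exact patch target depends on how `levelup/app.py` imports pdfplumber, which is not shown here.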

levelup/app.py (1)

23-31: LGTM! Clean implementation with proper error handling.

The switch to pdfplumber is well-implemented:

- Context manager ensures proper resource cleanup
- The `or ""` fallback on line 26 safely handles pages where extract_text() returns None
- Joining pages with newlines preserves document structure
- Error handling is preserved from the previous implementation
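To illustrate the fallback and the newline join noted above, here is a minimal sketch using stand-in page objects. `join_page_text`, `PageLike`, and `FakePage` are hypothetical names for this example only; with pdfplumber, the pages would come from the `pages` attribute of the object returned by `pdfplumber.open()`.

```python
from typing import Iterable, Optional, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method that may return None."""

    def extract_text(self) -> Optional[str]: ...


def join_page_text(pages: Iterable[PageLike]) -> str:
    # `or ""` turns a None result into an empty string so str.join
    # never receives a non-string element.
    return "\n".join(page.extract_text() or "" for page in pages)


class FakePage:
    """Stand-in for a PDF page; returns whatever text it was given."""

    def __init__(self, text: Optional[str]) -> None:
        self._text = text

    def extract_text(self) -> Optional[str]:
        return self._text


text = join_page_text([FakePage("Experience"), FakePage(None), FakePage("Education")])
```

A page with no extractable text contributes an empty line rather than raising a `TypeError`, so page boundaries stay visible in the joined output.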

coderabbitai · 2025-11-16T23:09:39Z

pyproject.toml

    "starlette>=0.49.1",
    "streamlit>=1.51.0",
-    "pytest-mock==3.15.1"
+    "pytest-mock==3.15.1",


⚠️ Potential issue | 🔴 Critical

Remove test dependency from main dependencies.

pytest-mock is a testing library that should not be in the main dependencies list. It's already correctly listed in the dev dependency group at line 50. Including it here will unnecessarily bloat production installations.

Apply this diff to remove it from main dependencies:

    "starlette>=0.49.1",
    "streamlit>=1.51.0",
-    "pytest-mock==3.15.1",
    "pdfplumber==0.11.8"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    "pytest-mock==3.15.1",
    "starlette>=0.49.1",
    "streamlit>=1.51.0",
    "pdfplumber==0.11.8"

🤖 Prompt for AI Agents

In pyproject.toml around line 32, remove the "pytest-mock==3.15.1" entry from the main dependencies section (it's a test-only library and already present in the dev group); delete that line and adjust any trailing commas or list formatting so the dependencies list remains valid.

coderabbitai · 2025-11-16T23:09:40Z

pyproject.toml

    "streamlit>=1.51.0",
-    "pytest-mock==3.15.1"
+    "pytest-mock==3.15.1",
+    "pdfplumber==0.11.8"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify pdfplumber version validity and mypy configuration.

Please verify:

- That version 0.11.8 is valid and has no known security vulnerabilities
- Whether pdfplumber requires a mypy override entry in tool.mypy.overrides (lines 162-181), similar to how pypdf.* was previously configured

Run the following script to check the pdfplumber version and security advisories:

Additionally, search for whether pdfplumber provides type stubs:

🏁 Script executed:

    #!/bin/bash
    # Check PyPI for latest pdfplumber version
    curl -s https://pypi.org/pypi/pdfplumber/json | jq '.info.version'

    # Check for security advisories
    gh api graphql -f query='
    {
      securityVulnerabilities(first: 5, ecosystem: PIP, package: "pdfplumber") {
        nodes {
          advisory {
            summary
            severity
            publishedAt
          }
          vulnerableVersionRange
          firstPatchedVersion {
            identifier
          }
        }
      }
    }'

Length of output: 58

🌐 Web query:

Does pdfplumber 0.11.8 provide type stubs or require mypy ignore_missing_imports configuration?

💡 Result:

Short answer: No — pdfplumber (0.11.8) does not ship type stubs / a py.typed marker. You will need to provide your own stubs or tell mypy to ignore missing imports (e.g., ignore_missing_imports = True or per-import # type: ignore). Sources: pdfplumber project page / repo and PyPI listing. [1][2]

References

pdfplumber GitHub repository (no bundled typing files/stubs). [1]

pdfplumber on PyPI (release files; package has no separate types package). [2]

🏁 Script executed:

    #!/bin/bash
    # Check current mypy overrides in pyproject.toml
    rg -A 20 "tool\.mypy\.overrides" pyproject.toml

    # Check for pdfplumber imports in codebase
    rg -n "import pdfplumber|from pdfplumber" --type=py

Length of output: 511

Add pdfplumber to mypy overrides configuration.

pdfplumber 0.11.8 does not provide type stubs and must be configured in mypy to suppress type-checking errors. Add "pdfplumber.*" to the ignore_missing_imports list in tool.mypy.overrides (lines 162-181 in pyproject.toml), or add # type: ignore to the import in levelup/app.py:7.

🤖 Prompt for AI Agents

pyproject.toml around lines 162-181: pdfplumber lacks type stubs and is causing mypy import errors; add "pdfplumber.*" to the tool.mypy.overrides ignore_missing_imports list (or alternatively add a trailing "# type: ignore" to the pdfplumber import at levelup/app.py:7) to suppress missing-import type-checking errors.

feat: switch to pdfplumber for improved resume text extraction

5399671

coderabbitai bot reviewed Nov 16, 2025

baranylcn merged commit a3c8752 into main Nov 16, 2025
6 checks passed

baranylcn deleted the feat/pdf-extraction-improvement branch November 23, 2025 09:29
