@baranylcn baranylcn commented Nov 16, 2025

Summary by CodeRabbit

Release Notes

  • Chores
    • Updated PDF parsing library for improved reliability and performance in document handling.
    • Refreshed dependencies to align with current library versions.

coderabbitai bot commented Nov 16, 2025

Walkthrough

The PR replaces the PDF extraction library from pypdf to pdfplumber. The app now streams uploaded files directly with pdfplumber.open(), extracts text page-by-page, and joins results with newlines. Dependencies and tests are updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| PDF Extraction Implementation: levelup/app.py | Replaces pypdf with pdfplumber for PDF parsing; switches from io.BytesIO-based PdfReader approach to direct file streaming with pdfplumber.open(). Text extraction now iterates through pages and collects text with newline separators. Error handling preserved. |
| Dependency Management: pyproject.toml | Removes pypdf>=6.1.3 dependency; adds pdfplumber==0.11.8. Removes pypdf.* from mypy per-module overrides. |
| Test Updates: tests/test_resume.py | Updates mocks to simulate pdfplumber.open context manager and its PDF object structure. Replaces PdfReader mocking with pdfplumber-based mocking. Assertions and test expectations remain functionally equivalent. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • app.py: Verify pdfplumber API usage is correct; ensure text extraction from pages produces expected output format and handles multi-page documents properly.
  • pyproject.toml: Confirm pdfplumber version is compatible and no additional transitive dependencies are problematic.
  • tests/test_resume.py: Validate mock setup properly simulates pdfplumber's context manager behavior and that mocked page extraction returns expected values.

Poem

🐰 A rabbit hops through PDFs with glee,
pypdf was fine, but pdfplumber's the key!
Page after page, we extract with care,
New library magic fills the air ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: switch to pdfplumber for improved resume text extraction' directly summarizes the main change: replacing pypdf with pdfplumber for PDF text extraction across app.py, pyproject.toml, and tests. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/pdf-extraction-improvement

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 382d8f1 and 5399671.

📒 Files selected for processing (3)
  • levelup/app.py (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/test_resume.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_resume.py (1)
levelup/app.py (1)
  • extract_text_from_pdf (23-31)
🔇 Additional comments (2)
tests/test_resume.py (1)

24-40: LGTM! Well-structured test for pdfplumber integration.

The test correctly mocks the pdfplumber context manager protocol and accurately simulates the pdf.pages API. The use of BytesIO for the fake file input aligns well with how Streamlit's file_uploader works in practice.
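
For readers unfamiliar with the pattern described here, mocking a context manager mostly comes down to configuring `__enter__`. The following is a minimal sketch, not the PR's actual test: `make_fake_pdf` is a hypothetical helper, and the final `with` block only stands in for the code under test.

```python
import io
from unittest.mock import MagicMock


def make_fake_pdf(page_texts):
    """Build a mock shaped like the object pdfplumber.open() returns (hypothetical helper)."""
    pages = []
    for text in page_texts:
        page = MagicMock()
        page.extract_text.return_value = text
        pages.append(page)

    fake_pdf = MagicMock()
    fake_pdf.pages = pages
    # `with pdfplumber.open(...) as pdf` binds `pdf` to the value returned by __enter__
    fake_pdf.__enter__.return_value = fake_pdf
    return fake_pdf


# Stand-in for a patched pdfplumber.open; the second page simulates "no text found"
fake_open = MagicMock(return_value=make_fake_pdf(["Page one", None]))

# BytesIO plays the role of the uploaded file, as in the reviewed test
with fake_open(io.BytesIO(b"%PDF-fake")) as pdf:
    texts = [page.extract_text() or "" for page in pdf.pages]
```

In a real test the fake would typically be installed with `unittest.mock.patch` or pytest-mock's `mocker.patch`. The exact patch target depends on how levelup/app.py imports pdfplumber, which is not shown in this conversation.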

levelup/app.py (1)

23-31: LGTM! Clean implementation with proper error handling.

The switch to pdfplumber is well-implemented:

  • Context manager ensures proper resource cleanup
  • The or "" fallback on line 26 safely handles pages where extract_text() returns None
  • Joining pages with newlines preserves document structure
  • Error handling is preserved from the previous implementation
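
The points above can be illustrated with a rough sketch of what such a function might look like. This is a reconstruction from the review text, not the code in levelup/app.py: the signature, the lazy import, and the error handling (returning an empty string) are assumptions made for illustration.

```python
from typing import BinaryIO


def extract_text_from_pdf(uploaded_file: BinaryIO) -> str:
    """Return the text of every page joined by newlines, or "" on failure."""
    # Imported lazily so this sketch can be read and mocked without the library
    # installed; the real module presumably imports pdfplumber at the top.
    import pdfplumber

    try:
        # pdfplumber.open() accepts a path or a file-like object and is a
        # context manager, so the underlying resources are released on exit.
        with pdfplumber.open(uploaded_file) as pdf:
            # Guard against pages for which extract_text() yields None
            page_texts = [page.extract_text() or "" for page in pdf.pages]
        return "\n".join(page_texts)
    except Exception:
        # Assumption: the PR's actual error handling is not shown in the review
        return ""
```

Joining with `"\n"` keeps page boundaries visible in the output, and a page without extractable text contributes an empty string rather than raising a `TypeError` inside `join`.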

     "starlette>=0.49.1",
     "streamlit>=1.51.0",
-    "pytest-mock==3.15.1"
+    "pytest-mock==3.15.1",
⚠️ Potential issue | 🔴 Critical

Remove test dependency from main dependencies.

pytest-mock is a testing library that should not be in the main dependencies list. It's already correctly listed in the dev dependency group at line 50. Including it here will unnecessarily bloat production installations.

Apply this diff to remove it from main dependencies:

     "starlette>=0.49.1",
     "streamlit>=1.51.0",
-    "pytest-mock==3.15.1",
     "pdfplumber==0.11.8"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"pytest-mock==3.15.1",
"starlette>=0.49.1",
"streamlit>=1.51.0",
"pdfplumber==0.11.8"
🤖 Prompt for AI Agents
In pyproject.toml around line 32, remove the "pytest-mock==3.15.1" entry from
the main dependencies section (it's a test-only library and already present in
the dev group); delete that line and adjust any trailing commas or list
formatting so the dependencies list remains valid.
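
A check like this can also be automated. The sketch below is one possible approach, assuming the runtime and dev requirement lists have already been read from pyproject.toml; the helper names are hypothetical, and the contents of the dev list are an assumption based on the comment above.

```python
import re


def requirement_name(requirement: str) -> str:
    """Extract and normalize the package name from a requirement string."""
    match = re.match(r"[A-Za-z0-9._-]+", requirement.strip())
    if match is None:
        return ""
    # PEP 503 style normalization: runs of -, _ and . collapse to one hyphen
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def dev_packages_in_runtime(runtime, dev):
    """Return the sorted package names that appear in both requirement lists."""
    runtime_names = {requirement_name(r) for r in runtime}
    dev_names = {requirement_name(r) for r in dev}
    return sorted(runtime_names & dev_names)


# Main dependencies as shown in the diff above
runtime_deps = [
    "starlette>=0.49.1",
    "streamlit>=1.51.0",
    "pytest-mock==3.15.1",
    "pdfplumber==0.11.8",
]
# Assumed contents of the dev group mentioned in the review comment
dev_deps = ["pytest-mock==3.15.1"]

leaked = dev_packages_in_runtime(runtime_deps, dev_deps)
```

A package that is legitimately needed in both places would be flagged as well, so the result is a prompt for review rather than a hard failure.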

     "streamlit>=1.51.0",
-    "pytest-mock==3.15.1"
+    "pytest-mock==3.15.1",
+    "pdfplumber==0.11.8"
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify pdfplumber version validity and mypy configuration.

Please verify:

  1. That version 0.11.8 is a valid release with no known security vulnerabilities
  2. Whether pdfplumber requires a mypy override entry in tool.mypy.overrides (lines 162-181), similar to how pypdf.* was previously configured

Run the following script to check the pdfplumber version and security advisories:

Additionally, search for whether pdfplumber provides type stubs:


🏁 Script executed:

#!/bin/bash
# Check PyPI for latest pdfplumber version
curl -s https://pypi.org/pypi/pdfplumber/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "pdfplumber") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 58


🌐 Web query:

Does pdfplumber 0.11.8 provide type stubs or require mypy ignore_missing_imports configuration?

💡 Result:

Short answer: No — pdfplumber (0.11.8) does not ship type stubs / a py.typed marker. You will need to provide your own stubs or tell mypy to ignore missing imports (e.g., ignore_missing_imports = True or per-import # type: ignore). Sources: pdfplumber project page / repo and PyPI listing. [1][2]

References

  • pdfplumber GitHub repository (no bundled typing files/stubs). [1]
  • pdfplumber on PyPI (release files; package has no separate types package). [2]
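
The claim about a missing py.typed marker can be checked locally for any installed package. The following is a minimal sketch using only the standard library; `ships_type_marker` is a hypothetical helper, and it only looks for an inline PEP 561 marker, so a separately installed stub-only package would not be detected.

```python
import importlib.util
from pathlib import Path


def ships_type_marker(package_name: str) -> bool:
    """Return True if the installed package directory contains a py.typed file."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        # Not installed, or a single-file module that cannot carry py.typed
        return False
    return any(
        (Path(location) / "py.typed").is_file()
        for location in spec.submodule_search_locations
    )
```

Calling `ships_type_marker("pdfplumber")` in the project's environment would show whether the installed release carries the marker; when it does not, mypy needs either stubs or an ignore setting for that import.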


---

🏁 Script executed:

```shell
#!/bin/bash
# Check current mypy overrides in pyproject.toml
rg -A 20 "tool\.mypy\.overrides" pyproject.toml

# Check for pdfplumber imports in codebase
rg -n "import pdfplumber|from pdfplumber" --type=py
```

Length of output: 511


Add pdfplumber to mypy overrides configuration.

pdfplumber 0.11.8 does not provide type stubs and must be configured in mypy to suppress type-checking errors. Add "pdfplumber.*" to the ignore_missing_imports list in tool.mypy.overrides (lines 162-181 in pyproject.toml), or add # type: ignore to the import in levelup/app.py:7.

🤖 Prompt for AI Agents
pyproject.toml around lines 162-181: pdfplumber lacks type stubs and is causing
mypy import errors; add "pdfplumber.*" to the tool.mypy.overrides
ignore_missing_imports list (or alternatively add a trailing "# type: ignore" to
the pdfplumber import at levelup/app.py:7) to suppress missing-import
type-checking errors.

@baranylcn baranylcn merged commit a3c8752 into main Nov 16, 2025
6 checks passed
@baranylcn baranylcn deleted the feat/pdf-extraction-improvement branch November 23, 2025 09:29