
Conversation


@hummat hummat commented Feb 8, 2026

Summary

Fix all 5 security vulnerabilities reported in #55 (code audit findings), ordered by severity.

Closes: #55

Changes

  • [CRITICAL] Restricted pickle deserialization: _RestrictedUnpickler + _safe_unpickle replace raw pickle.loads() in paperqa.py and paperqa_mcp_server.py. Only basic Python builtins (dict, str, int, etc.) are allowed, blocking arbitrary code execution from crafted index files.
  • [HIGH] Path traversal validation: _validate_index_name() rejects .., /, \, and null bytes. Applied at 7 path-construction entry points across paperqa.py, paperqa_mcp_server.py, and leann.py.
  • [MEDIUM] Staging boundary check: _paperqa_find_crashing_file() now verifies all returned paths are within paper_directory via is_relative_to(). Additional guard in cli/ask.py before f.unlink().
  • [MEDIUM] API key redaction: _redact_cmd() in leann.py masks --api-key / --embedding-api-key values with *** in 3 debug/error log locations.
  • [LOW] Bounded decompression and download sizes: _safe_zlib_decompress() (100 MB limit), _MAX_DOWNLOAD_SIZE (500 MB) check on Content-Length + body, _MAX_TAR_MEMBER_SIZE / _MAX_TAR_TOTAL_SIZE for tarball extraction.
  • 25 new tests covering all 5 fixes across test_paperqa.py, test_leann.py, test_paper.py.
  • 1 existing test fix — added headers attribute to MockResponse in test_figure_extraction.py (required by new download size check).

Type of Change

  • Bug fix

Testing

  • Ran make check (required) — 470 passed, 0 failed, lint clean, 75.80% coverage
  • Added/updated tests — 25 new security tests
  • Tested CLI manually

Checklist

  • I have read the Contributing Guidelines
  • My code follows the existing style (ruff format + pyright)
  • I have updated documentation if needed (README, AGENT_INTEGRATION.md)
  • CLI changes are reflected in help text and docs

…limits

Address all 5 security findings from issue #55:

1. [CRITICAL] Restricted pickle deserialization via _RestrictedUnpickler
   (blocks arbitrary code execution from crafted index files)
2. [HIGH] Path traversal validation on index_name at all entry points
3. [MEDIUM] Staging boundary check before file deletion in crash handler
4. [MEDIUM] API key redaction in LEANN debug/error log output
5. [LOW] Bounded zlib decompression (100MB), download size (500MB),
   and tarball extraction limits (100MB/file, 500MB total)

Includes 25 new tests covering all fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe427a03c3


Comment on lines 68 to 71:

    for i in range(0, len(data), block_size):
        chunk = dobj.decompress(data[i : i + block_size])
        total += len(chunk)
        if total > max_size:


P1: Enforce zlib size limit before allocating output

The new _safe_zlib_decompress still calls dobj.decompress(...) without a max_length, so a crafted zlib stream that expands massively in a single call will allocate the full output before the total > max_size check runs. That means a decompression bomb can still exhaust memory even with the limit in place when reading untrusted files.zip contents. To actually bound memory, pass a remaining-length cap (e.g., dobj.decompress(chunk, max_size - total)) and handle unconsumed_tail/unused_data accordingly. The same issue appears in paperpipe/paperqa_mcp_server.py’s _safe_zlib_decompress.


- Validate index_name at entry of get_pqa_index_status before path construction
- Use decompressobj max_length to truly bound memory during decompression
- Guard int(Content-Length) against malformed headers (ValueError)
- Use validated index_name in _write_index_metadata path construction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hummat hummat merged commit 9eb7a7a into main Feb 8, 2026
2 checks passed
@hummat hummat deleted the fix/security-hardening branch February 8, 2026 19:18

claude bot commented Feb 8, 2026


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.




Development

Successfully merging this pull request may close these issues.

Security audit: pickle RCE, path traversal, key leakage, and other findings
