Conversation

@CJBshuosi (Collaborator) commented Feb 10, 2026

Closes #26

Summary by CodeRabbit

Release Notes

  • New Features
    • Added paper harvesting from multiple sources: arXiv, Semantic Scholar, and OpenAlex with real-time progress streaming.
    • Introduced paper search and filtering across harvested papers.
    • Added personal paper library management—save and organize papers with library endpoints.
    • Enabled harvest run tracking and history retrieval.
    • Integrated automatic paper deduplication and metadata enrichment across sources.

Closes jerry609#26

Signed-off-by: LIU BOYU <oor2020@163.com>
coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Implements a complete paper harvesting system with multiple sources (arXiv, Semantic Scholar, OpenAlex), including pipeline orchestration, deduplication, storage layer, and REST API endpoints for searching, saving, and managing papers with harvest run tracking.

Changes

Cohort / File(s) and Summary

  • Database Schema & ORM Models
    Files: alembic/versions/0003_paper_harvest_tables.py, src/paperbot/infrastructure/stores/models.py
    Adds migration script creating papers and harvest_runs tables with indexes; introduces PaperModel and HarvestRunModel ORM classes with JSON field helpers for metadata serialization.
  • Domain Models & Ports
    Files: src/paperbot/domain/harvest.py, src/paperbot/application/ports/harvester_port.py
    Defines domain types: HarvestSource enum, HarvestedPaper (with title hashing and serialization), HarvestResult, HarvestRunResult; establishes HarvesterPort protocol contract for harvester implementations.
  • Harvester Implementations
    Files: src/paperbot/infrastructure/harvesters/__init__.py, src/paperbot/infrastructure/harvesters/arxiv_harvester.py, src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py, src/paperbot/infrastructure/harvesters/openalex_harvester.py
    Implements three harvesters with session/rate-limit management, API query building, result parsing, and error handling; exports via package init.
  • Utility Services
    Files: src/paperbot/application/services/__init__.py, src/paperbot/application/services/paper_deduplicator.py, src/paperbot/application/services/query_rewriter.py, src/paperbot/application/services/venue_recommender.py
    Adds deduplication via DOI/IDs and title hash with metadata merging; query rewriter for abbreviation expansion; venue recommender with keyword-to-venue mappings.
  • Orchestration & Storage
    Files: src/paperbot/application/workflows/harvest_pipeline.py, src/paperbot/infrastructure/stores/paper_store.py
    HarvestPipeline orchestrates multi-source harvesting with keyword expansion, venue recommendation, deduplication, and storage; PaperStore provides batch upsert, flexible search, harvest run lifecycle, and user library management.
  • API Routes & Integrations
    Files: src/paperbot/api/main.py, src/paperbot/api/routes/harvest.py, src/paperbot/api/routes/research.py
    Adds comprehensive harvest API (POST /harvest with SSE streaming, harvest run management); paper search and library endpoints (save/unsave/list); extends research route with paper metadata fields and library integration on save action.
  • Frontend Components & Routing
    Files: web/src/app/api/papers/[paperId]/save/route.ts, web/src/app/api/papers/library/route.ts, web/src/components/research/ResearchDashboard.tsx, web/src/components/research/SavedPapersList.tsx, web/src/lib/api.ts
    Adds proxy routes for paper save/library endpoints; updates dashboard to send paper metadata on save; refactors SavedPapersList to consume the new /api/papers/library endpoint with updated response shape and unsave functionality.
  • Logging Enhancements
    Files: src/paperbot/context_engine/engine.py, src/paperbot/infrastructure/stores/research_store.py
    Adds request-scoped tracing and logging to paper search and feedback workflows; logs initialization, configuration, results, and errors with detailed traceback on exceptions.
  • Test Coverage
    Files: tests/integration/test_harvest_pipeline.py, tests/integration/test_harvesters.py, tests/integration/test_paper_store.py, tests/unit/test_harvested_paper.py, tests/unit/test_paper_deduplicator.py, tests/unit/test_query_rewriter.py, tests/unit/test_venue_recommender.py
    Comprehensive test suite covering harvest pipeline phases, deduplication scenarios, harvester API interactions, paper store CRUD/search/filtering, domain model serialization, and utility service behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant Pipeline
    participant Harvesters
    participant Deduplicator
    participant PaperStore
    participant DB

    Client->>API: POST /harvest (HarvestRequest)
    activate API
    
    API->>Pipeline: run(config)
    activate Pipeline
    
    Pipeline->>Pipeline: Expand keywords (optional)
    Pipeline->>Pipeline: Recommend venues (optional)
    Pipeline->>PaperStore: create_harvest_run()
    activate PaperStore
    PaperStore->>DB: INSERT harvest_runs
    PaperStore-->>Pipeline: HarvestRunModel
    deactivate PaperStore
    
    Pipeline->>API: yield HarvestProgress(phase="Initializing")
    API->>Client: Event: progress
    
    par Multi-source Harvest
        Pipeline->>Harvesters: search(query) via ArxivHarvester
        activate Harvesters
        Harvesters-->>Pipeline: HarvestResult[papers]
        deactivate Harvesters
        
        Pipeline->>Harvesters: search(query) via SemanticScholarHarvester
        activate Harvesters
        Harvesters-->>Pipeline: HarvestResult[papers]
        deactivate Harvesters
        
        Pipeline->>Harvesters: search(query) via OpenAlexHarvester
        activate Harvesters
        Harvesters-->>Pipeline: HarvestResult[papers]
        deactivate Harvesters
    end
    
    Pipeline->>API: yield HarvestProgress(phase="Deduplicating")
    API->>Client: Event: progress
    
    Pipeline->>Deduplicator: deduplicate(all_papers)
    activate Deduplicator
    Deduplicator-->>Pipeline: (deduplicated_papers, count)
    deactivate Deduplicator
    
    Pipeline->>API: yield HarvestProgress(phase="Storing")
    API->>Client: Event: progress
    
    Pipeline->>PaperStore: upsert_papers_batch(deduplicated)
    activate PaperStore
    PaperStore->>DB: INSERT/UPDATE papers
    PaperStore-->>Pipeline: (papers_created, papers_updated)
    deactivate PaperStore
    
    Pipeline->>PaperStore: update_harvest_run(final_status, counts)
    activate PaperStore
    PaperStore->>DB: UPDATE harvest_runs
    PaperStore-->>Pipeline: HarvestRunModel
    deactivate PaperStore
    
    Pipeline->>API: yield HarvestFinalResult(status, counts, duration)
    deactivate Pipeline
    
    API->>Client: Event: final_result
    deactivate API

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

This change introduces a substantial, multi-component harvesting system with interconnected services, heterogeneous logic across harvesters, deduplication strategies, API workflows, and frontend integration. Multiple new files with varying complexity (domain models, service implementations, orchestration, storage, API routes, tests) require separate reasoning paths and careful validation of integration points.

Possibly related PRs

  • PR #23: Updates SavedPapersList component and API response shapes; directly related as this PR implements the new /api/papers/library endpoint and backend storage that PR #23 consumes.

Suggested reviewers

  • jerry609

Poem

🐰 The rabbit hops through harvest fields,
Gathering papers, what bounty yields!
ArXiv, Scholar, OpenAlex aligned,
Dedup and store—a system designed,
From query to library, streams flow free,
A pipeline complete, hop-hop-hooray! 🌾

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 78.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat(Harvest): add -- Paper Search and Storage' directly corresponds to the main feature being implemented: paper search and storage functionality with harvesters.
  • Linked Issues check (✅ Passed): The PR implements all core requirements from issue #26: adds ArxivHarvester and OpenAlexHarvester, implements the paper search API, and enables saving papers to the library with full backend/frontend integration.
  • Out of Scope Changes check (✅ Passed): All changes are directly related to the harvest/search/storage objectives; no unrelated modifications detected. Database migrations, services, and API endpoints are all cohesive parts of the paper harvesting and storage feature.


coderabbitai bot left a comment

Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
web/src/lib/api.ts (1)

187-189: ⚠️ Potential issue | 🟡 Minor

fetchScholarDetails now implicitly depends on a live backend.

fetchScholarDetails calls fetchPapers() (Line 188), which now hits the real API instead of returning mock data. This means the scholar details page will fail or show empty publications if the backend is unavailable, even though all other data in this function is still mocked.

🤖 Fix all issues with AI agents
In `@alembic/versions/0003_paper_harvest_tables.py`:
- Around line 54-59: The upgrade() function returns early when _is_offline() is
true, so _upgrade_create_indexes() is never called and generated offline SQL
omits index creation; remove the early return or restructure upgrade() so it
always calls _upgrade_create_indexes() after _upgrade_create_tables() (rely on
_create_index/op.create_index which already handles offline mode) — modify
upgrade() to invoke _upgrade_create_tables() then _upgrade_create_indexes()
regardless of _is_offline() so indexes are emitted in offline-generated SQL.

In `@src/paperbot/api/routes/harvest.py`:
- Around line 323-331: get_user_library currently accepts user_id from the query
string (user_id param in get_user_library) which allows callers to access/modify
arbitrary users' libraries; change this to derive the user ID from the
authenticated session/token (e.g., replace the user_id Query param with a
dependency that extracts the current_user/current_user.id from your auth
middleware or token validation function used elsewhere) and remove the default
"default" value; apply the same change to the other endpoint that accepts
user_id in the request body (lines referenced) so both endpoints use the
authenticated user context; if no authentication exists for this MVP, add a
clear TODO comment inside get_user_library and the other endpoint stating that
user_id is temporary and must be replaced with auth-derived user identification
before production.
- Around line 390-414: save_paper_to_library creates a new
SqlAlchemyResearchStore() per request and never closes it, leaking DB resources;
change this to reuse a shared/lazy-initialized research store or inject one
instead of instantiating inline: add a module-level lazy getter (similar to
_get_paper_store) or accept a dependency and use that shared instance, then call
the store's close/dispose method if it has one after record_paper_feedback (or
rely on the shared instance's lifecycle); update references in
save_paper_to_library and any other handlers that instantiate
SqlAlchemyResearchStore() so they use the new get_research_store() or injected
research_store and remove the per-request constructor.
- Line 52: The Field declaration for keywords in HarvestRequest uses the
deprecated Pydantic v1 arg min_items which causes a TypeError; update the Field
call on the keywords attribute (keywords: List[str] = Field(..., min_items=1,
description="Search keywords")) to use min_length=1 instead (i.e., replace
min_items with min_length) so Pydantic v2 accepts the constraint. Ensure the
change is made on the keywords field in the HarvestRequest/schema where Field is
used.

In `@src/paperbot/api/routes/research.py`:
- Around line 693-706: The feedback currently records mixed IDs because
actual_paper_id is replaced with the internal DB ID on save; change the call to
_research_store.add_paper_feedback to always use the original external paper id
from the request (e.g., req.paper_id or a saved variable representing the
external id) as the paper_id parameter, and move the internal library_paper_id
(library_paper_id) into the metadata argument (meta) before calling
add_paper_feedback; keep returning PaperFeedbackResponse(feedback=fb,
library_paper_id=library_paper_id) so the client still sees the internal id
while the feedback table consistently stores the external identifier.
- Around line 677-688: The route handler directly imports PaperModel and uses
paper_store._provider.session() with a SQLAlchemy select
(PaperModel.semantic_scholar_id ...) which breaks the layering; add or use a
PaperStore method like get_paper_by_source_id(source, source_id) on the existing
paper_store to encapsulate the lookup, remove the direct ORM import and session
usage in the handler, call
paper_store.get_paper_by_source_id("semantic_scholar", req.paper_id) to obtain
the Paper entity (then extract id/actual_paper_id from that returned
domain/model), and update tests or callers accordingly.
- Around line 652-691: The save-to-library flow hardcodes
HarvestSource.SEMANTIC_SCHOLAR and constructs a new PaperStore per request;
update PaperFeedbackRequest to include an optional paper_source and use that
when building the HarvestedPaper (falling back to
HarvestSource.SEMANTIC_SCHOLAR), and replace the per-request PaperStore()
instantiation by using a module-scoped or lazy singleton store (e.g.,
_research_store) so this handler reuses the same PaperStore instance when
calling upsert_papers_batch and accessing _provider.session().

In `@src/paperbot/context_engine/engine.py`:
- Line 506: The Logger.info call on the Paper search config line exceeds the
100-char limit; modify the call in engine.py (the Logger.info invocation
referencing self.config.offline and self.config.paper_limit and
LogFiles.HARVEST) to wrap or split the formatted message so each line stays
under 100 characters (for example build the message into a shorter variable or
use multiple string parts concatenated across lines) without changing the logged
content or the LogFiles.HARVEST target.

In `@src/paperbot/domain/harvest.py`:
- Around line 87-111: from_dict can raise ValueError when constructing
HarvestSource from an empty/invalid string; wrap the conversion in a safe parse
(in from_dict) by attempting HarvestSource(source) in a try/except and falling
back to a safe value (e.g., HarvestSource.UNKNOWN if you add that enum member,
or None/default enum) so deserialization never throws; update the from_dict
logic around the source handling and, if adding HarvestSource.UNKNOWN, add that
enum member to HarvestSource so the fallback is available.
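
Illustrative sketch of the fallback parse (the enum values and the UNKNOWN member shown here are assumptions based on the source names used elsewhere in this PR, not the actual definition in src/paperbot/domain/harvest.py):

from enum import Enum
from typing import Any


class HarvestSource(str, Enum):
    ARXIV = "arxiv"
    SEMANTIC_SCHOLAR = "semantic_scholar"
    OPENALEX = "openalex"
    UNKNOWN = "unknown"  # proposed fallback member, not yet in the PR


def parse_source(raw: Any) -> HarvestSource:
    """Parse a source value without raising on unknown or empty input."""
    try:
        return HarvestSource(raw)
    except ValueError:
        return HarvestSource.UNKNOWN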

In `@src/paperbot/infrastructure/harvesters/openalex_harvester.py`:
- Around line 42-45: Add an HTTP timeout to prevent indefinite hangs by creating
the aiohttp session with a ClientTimeout in _get_session (use
aiohttp.ClientTimeout with an appropriate total/Connect/read values) instead of
a bare aiohttp.ClientSession, and ensure individual calls that use
session.get(...) also pass a timeout (or rely on the session default) so
requests to the OpenAlex API cannot block forever; apply the same changes to the
ArxivHarvester equivalent methods.
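
Illustrative sketch of the timeout pattern, assuming the harvester keeps an aiohttp session on self._session and exposes _get_session() as described above; the timeout values are placeholders:

from typing import Optional

import aiohttp

# Placeholder limits; tune per source. The timeout applies to every request
# made with the session, so individual calls don't need their own timeout.
DEFAULT_TIMEOUT = aiohttp.ClientTimeout(total=30, connect=10)


class TimeoutSessionMixin:
    _session: Optional[aiohttp.ClientSession] = None

    async def _get_session(self) -> aiohttp.ClientSession:
        # Reuse one session; recreate it only if it was never opened or closed.
        if self._session is None or self._session.closed:
            self._session = aiohttp.ClientSession(timeout=DEFAULT_TIMEOUT)
        return self._session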

In `@src/paperbot/infrastructure/stores/paper_store.py`:
- Around line 318-331: The join currently casts PaperFeedbackModel.paper_id to
Integer (in base_stmt) which will raise on PostgreSQL for non-numeric strings;
replace the direct cast with a safe conditional expression (e.g., build a
case/boolean expression that first checks
PaperFeedbackModel.paper_id.op("~")(r"^\d+$") then compares
cast(PaperFeedbackModel.paper_id, Integer) only when the regex matches) and use
that safe_int_match in the or_ alongside PaperModel.semantic_scholar_id ==
PaperFeedbackModel.paper_id; update the join condition in base_stmt (where
cast(...) is referenced) to use the safe case/regex approach so non-numeric
paper_id values don’t trigger a CAST error (alternatively, normalize paper_id at
write time to always store numeric IDs).
- Around line 244-251: The current falsy checks (if year_from:, if year_to:, if
min_citations:) skip valid zero values; change them to explicit None checks
(e.g., use "is not None") so filters on PaperModel.year and
PaperModel.citation_count are applied when year_from, year_to or min_citations
equal 0. Update the conditionals around the stmt.where calls that reference
PaperModel.year and PaperModel.citation_count (the blocks using year_from,
year_to, min_citations) to use "is not None" checks to ensure zero is treated as
a valid filter value.
- Around line 209-278: The search_papers method currently ignores the keywords
parameter and unsafely uses getattr(PaperModel, sort_by...), so remove or
implement keywords filtering and restrict sort_by to a whitelist: update
search_papers to either apply keywords (e.g., build OR ilike conditions on
PaperModel.title/abstract/keywords) or delete the unused keywords argument and
its callers, and replace the dynamic getattr with a mapping/whitelist of allowed
sort keys (e.g., {"citation_count": PaperModel.citation_count, "year":
PaperModel.year, "title": PaperModel.title}) then look up sort_col =
ALLOWED_SORT_COLUMNS.get(sort_by, PaperModel.citation_count) before ordering;
reference function search_papers, parameter keywords, parameter sort_by, and
model PaperModel when making the changes.
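
Illustrative sketch of the sort whitelist (apply_sort is a hypothetical helper; the column names are taken from the fields referenced in this review):

from sqlalchemy import asc, desc

from paperbot.infrastructure.stores.models import PaperModel

# Whitelisted sort columns; unknown keys fall back to citation_count.
ALLOWED_SORT_COLUMNS = {
    "citation_count": PaperModel.citation_count,
    "year": PaperModel.year,
    "title": PaperModel.title,
}


def apply_sort(stmt, sort_by: str, sort_order: str = "desc"):
    """Order a select() statement using only whitelisted columns."""
    sort_col = ALLOWED_SORT_COLUMNS.get(sort_by, PaperModel.citation_count)
    return stmt.order_by(desc(sort_col) if sort_order == "desc" else asc(sort_col))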

In `web/src/app/api/papers/[paperId]/save/route.ts`:
- Around line 7-9: The paperId is being interpolated directly into the upstream
URL in route.ts (upstream and apiBaseUrl()) and in the POST handler, allowing
path-traversal style values; sanitize or encode paperId before use (e.g.,
validate it against an allowed pattern like /^[A-Za-z0-9_-]+$/ or otherwise
reject invalid values, or use a safe encoder such as encodeURIComponent on the
paperId) and then use that sanitized variable for building upstream and any
other requests in the POST handler to prevent crafted path segments from
escaping the intended route.
🟡 Minor comments (9)
web/src/app/api/papers/[paperId]/save/route.ts-13-19 (1)

13-19: ⚠️ Potential issue | 🟡 Minor

Inconsistent query string forwarding between DELETE and POST.

DELETE forwards url.search but POST does not. If this is intentional, a brief comment would clarify the design choice. If not, the POST handler should also forward the query string for consistency.

src/paperbot/domain/harvest.py-57-62 (1)

57-62: ⚠️ Potential issue | 🟡 Minor

Title hash collision on empty/whitespace-only titles.

compute_title_hash() on a paper with an empty title will produce the same hash as any other empty-titled paper, causing false deduplication matches. Consider guarding against this in the deduplicator or returning None/a sentinel for empty titles.
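
Illustrative sketch of the None-for-empty guard; the normalization and hash algorithm here are assumptions, not the actual HarvestedPaper.compute_title_hash implementation:

import hashlib
import re
from typing import Optional


def compute_title_hash(title: str) -> Optional[str]:
    """Hash a normalized title; return None for empty/whitespace-only titles."""
    normalized = re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()
    if not normalized:
        return None  # avoids every empty title colliding on a single hash
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()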

tests/unit/test_harvested_paper.py-95-104 (1)

95-104: ⚠️ Potential issue | 🟡 Minor

paper3 is assigned but never used — either assert on it or remove it.

Ruff flags paper3 (Line 99) as unused (F841). The comment on Lines 103–104 acknowledges the hyphen case may differ, but the test neither asserts equality nor inequality for paper3. Either add an explicit assertion (e.g., assert paper1.compute_title_hash() != paper3.compute_title_hash() to document the known divergence) or remove paper3.

Proposed fix — assert the known divergence
         # All should have same hash after removing punctuation
         assert paper1.compute_title_hash() == paper2.compute_title_hash()
-        # Note: hyphens are removed, making it "deeplearning" vs "deep learning"
-        # This might differ, which is intentional for similar titles
+        # Hyphen removal joins words: "Deep-Learning?" → "deeplearning" vs "deep learning"
+        assert paper1.compute_title_hash() != paper3.compute_title_hash()
src/paperbot/application/services/venue_recommender.py-133-142 (1)

133-142: ⚠️ Potential issue | 🟡 Minor

Exact matches are double-counted by the partial-match loop.

When keyword_lower matches a key exactly, it scores +3 from the exact-match branch (Line 136), then the partial-match loop at Line 140 also fires because keyword_lower in mapped_kw is True when they are equal. This adds an extra +1 per venue for every exact match.

If the intent is that exact matches should score strictly +3, guard the partial loop:

Proposed fix
             # Exact match (highest priority)
             if keyword_lower in self.mappings:
                 for venue in self.mappings[keyword_lower]:
                     venue_scores[venue] = venue_scores.get(venue, 0) + 3

             # Partial match (medium priority)
             for mapped_kw, venues in self.mappings.items():
-                if keyword_lower in mapped_kw or mapped_kw in keyword_lower:
+                if mapped_kw == keyword_lower:
+                    continue  # already scored as exact match
+                if keyword_lower in mapped_kw or mapped_kw in keyword_lower:
                     for venue in venues:
                         venue_scores[venue] = venue_scores.get(venue, 0) + 1
src/paperbot/api/routes/research.py-617-624 (1)

617-624: ⚠️ Potential issue | 🟡 Minor

Missing tests for the new save-to-library flow.

The add_paper_feedback endpoint gained significant new behavior (paper save, library lookup, ID mapping), but no corresponding tests were added. As per coding guidelines, behavior changes should include test updates.

As per coding guidelines: {src,web/src}/**/*.{py,ts,tsx}: If behavior changes, add or update tests.

src/paperbot/infrastructure/stores/paper_store.py-95-133 (1)

95-133: ⚠️ Potential issue | 🟡 Minor

_find_existing doesn't filter out soft-deleted papers.

The lookup queries don't include PaperModel.deleted_at.is_(None), so a soft-deleted paper will be found and updated rather than creating a new record. If soft-delete is meant to logically remove a paper, re-harvesting should probably create a fresh record or un-delete it explicitly.
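
Illustrative sketch of one identifier branch with the soft-delete filter applied (find_by_doi is a hypothetical standalone helper, not the actual _find_existing method):

from sqlalchemy import select

from paperbot.infrastructure.stores.models import PaperModel


def find_by_doi(session, doi: str):
    """Look up a paper by DOI, skipping soft-deleted rows."""
    stmt = (
        select(PaperModel)
        .where(PaperModel.doi == doi)
        .where(PaperModel.deleted_at.is_(None))  # exclude soft-deleted papers
    )
    return session.execute(stmt).scalar_one_or_none()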

src/paperbot/infrastructure/stores/paper_store.py-233-241 (1)

233-241: ⚠️ Potential issue | 🟡 Minor

LIKE pattern does not escape user-supplied wildcards.

The query value is interpolated directly into the %{query}% pattern. If a user searches for 100%, it becomes %100%%, which alters the matching semantics. While this isn't a SQL injection (parameters are bound), it can produce unexpected results.

Consider escaping % and _ in the user's query string before wrapping with wildcards.
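
Illustrative sketch of an escaping helper (escape_like is hypothetical; the usage lines assume SQLAlchemy's ilike with an explicit escape character):

def escape_like(value: str, escape_char: str = "\\") -> str:
    """Escape LIKE wildcards so user input is matched literally."""
    return (
        value.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


# Usage sketch: pass the same escape character to ilike().
# pattern = f"%{escape_like(query)}%"
# stmt = stmt.where(PaperModel.title.ilike(pattern, escape="\\"))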

src/paperbot/api/routes/harvest.py-118-118 (1)

118-118: ⚠️ Potential issue | 🟡 Minor

Remove unused trace_id assignment.

The return value from set_trace_id() is captured but never used.

🧹 Fix
-    trace_id = set_trace_id()
+    set_trace_id()
src/paperbot/application/workflows/harvest_pipeline.py-314-335 (1)

314-335: ⚠️ Potential issue | 🟡 Minor

Pipeline-level error handler may fail if the harvest run was never created.

If an exception occurs before create_harvest_run on Line 208 (e.g., during keyword expansion on Line 187), the update_harvest_run call on Line 317 will silently fail because the run_id doesn't exist in the DB yet. The error won't be recorded.

Consider guarding this with a flag or a try/except around update_harvest_run.

🛡️ Suggested guard
+        run_created = False
         try:
             # ...
             self.paper_store.create_harvest_run(...)
+            run_created = True
             # ...
         except Exception as e:
             logger.exception(f"Harvest pipeline failed: {e}")
-            self.paper_store.update_harvest_run(
-                run_id=run_id,
-                status="failed",
-                errors={"pipeline": str(e)},
-            )
+            if run_created:
+                self.paper_store.update_harvest_run(
+                    run_id=run_id,
+                    status="failed",
+                    errors={"pipeline": str(e)},
+                )
🧹 Nitpick comments (32)
src/paperbot/infrastructure/stores/research_store.py (1)

329-356: Log messages lack contextual identifiers, reducing observability value.

These log statements don't include user_id, track_id, or paper_id, making them hard to correlate in a multi-user system. Consider including key identifiers:

♻️ Suggested improvement
-        Logger.info("Recording paper feedback", file=LogFiles.HARVEST)
+        Logger.info(f"Recording paper feedback user={user_id} track={track_id} paper={paper_id}", file=LogFiles.HARVEST)
         now = _utcnow()
         with self._provider.session() as session:
             ...
             if track is None:
-                Logger.error("Track not found", file=LogFiles.HARVEST)
+                Logger.warning(f"Track not found: track_id={track_id} user={user_id}", file=LogFiles.HARVEST)
                 return None

-            Logger.info("Creating new feedback record", file=LogFiles.HARVEST)
+            Logger.info(f"Creating feedback record: paper={paper_id} action={action}", file=LogFiles.HARVEST)
             ...
-            Logger.info("Feedback record created successfully", file=LogFiles.HARVEST)
+            Logger.info(f"Feedback recorded: id={row.id} paper={paper_id}", file=LogFiles.HARVEST)

Also, Line 338: "Track not found" is a normal client-input condition (not a system error) — warning level would be more appropriate than error.

src/paperbot/application/services/__init__.py (1)

6-12: __all__ is not sorted (RUF022).

Static analysis flags that __all__ entries should be sorted alphabetically.

♻️ Proposed fix
 __all__ = [
     "LLMService",
+    "PaperDeduplicator",
+    "QueryRewriter",
+    "VenueRecommender",
     "get_llm_service",
-    "PaperDeduplicator",
-    "QueryRewriter",
-    "VenueRecommender",
 ]
src/paperbot/infrastructure/harvesters/__init__.py (1)

9-17: Sort imports and __all__ alphabetically.

Both the imports and __all__ entries are out of alphabetical order. The static analysis tool (Ruff RUF022) flags __all__ as unsorted, and the project's coding guidelines require isort with Black profile.

♻️ Proposed fix
-from .arxiv_harvester import ArxivHarvester
-from .semantic_scholar_harvester import SemanticScholarHarvester
-from .openalex_harvester import OpenAlexHarvester
+from .arxiv_harvester import ArxivHarvester
+from .openalex_harvester import OpenAlexHarvester
+from .semantic_scholar_harvester import SemanticScholarHarvester
 
 __all__ = [
     "ArxivHarvester",
-    "SemanticScholarHarvester",
     "OpenAlexHarvester",
+    "SemanticScholarHarvester",
 ]

As per coding guidelines, "Use isort with Black profile for Python import sorting".

tests/integration/test_harvesters.py (1)

436-478: Resource cleanup: harvester instances created in interface tests are never closed.

Each test in TestHarvesterInterface instantiates all three harvesters but never calls close(). While no real HTTP sessions are opened (only hasattr/inspect checks are performed), this could become a problem if harvester constructors ever acquire resources eagerly. Consider adding cleanup or using a fixture with teardown.
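
Illustrative sketch of a fixture with teardown, assuming pytest-asyncio is available and the harvester constructors take no required arguments:

import pytest_asyncio

from paperbot.infrastructure.harvesters import (
    ArxivHarvester,
    OpenAlexHarvester,
    SemanticScholarHarvester,
)


@pytest_asyncio.fixture
async def harvesters():
    """Yield the three harvesters and close them after each test."""
    instances = [ArxivHarvester(), SemanticScholarHarvester(), OpenAlexHarvester()]
    yield instances
    for harvester in instances:
        await harvester.close()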

tests/integration/test_paper_store.py (1)

238-242: Unused unpacked variables flagged by linter.

Ruff flags total (Lines 240, 265, 281) and all_papers (Line 265) as unused. Use _ for discarded tuple elements to satisfy the linter.

Example fix (applied to all occurrences)
-        papers, total = self.store.search_papers(sources=["arxiv"])
+        papers, _ = self.store.search_papers(sources=["arxiv"])
-        all_papers, total = self.store.search_papers(limit=100)
+        _all_papers, _total = self.store.search_papers(limit=100)
-        papers, total = self.store.search_papers(
+        papers, _ = self.store.search_papers(
src/paperbot/infrastructure/harvesters/openalex_harvester.py (1)

47-55: Move import time to the module level.

import time on Line 49 is executed on every _rate_limit() call. While Python caches module imports, the lookup overhead is unnecessary. Move it to the top of the file with other stdlib imports.

Proposed fix

At the top of the file (after import asyncio):

 import asyncio
 import logging
+import time
 from typing import Any, Dict, List, Optional

Then in the method:

     async def _rate_limit(self) -> None:
         """Enforce rate limiting between requests."""
-        import time
-
         now = time.time()
src/paperbot/application/services/venue_recommender.py (1)

25-81: Annotate DEFAULT_MAPPINGS with ClassVar.

Static analysis (RUF012) flags this mutable class attribute. While .copy() in __init__ prevents accidental cross-instance mutation, adding the ClassVar annotation makes the intent explicit.

Proposed fix
-from typing import Dict, List, Optional
+from typing import ClassVar, Dict, List, Optional
 ...
-    DEFAULT_MAPPINGS: Dict[str, List[str]] = {
+    DEFAULT_MAPPINGS: ClassVar[Dict[str, List[str]]] = {
tests/unit/test_query_rewriter.py (2)

51-57: Assertion is overly lenient — both the original and expansion should be present.

expand_all(["ML", "deep learning"]) should include both "ML" (original) and "machine learning" (expansion). The or lets the test pass even if the expansion is missing entirely. Consider asserting both:

-        assert "ML" in expanded or "machine learning" in expanded
+        assert "ML" in expanded
+        assert "machine learning" in expanded

92-96: Line 94 likely exceeds the 100-character line limit.

Consider splitting across multiple lines.

Proposed fix
-        known_abbrevs = ["llm", "ml", "dl", "nlp", "cv", "rl", "gan", "cnn", "rnn", "bert", "gpt", "rag"]
+        known_abbrevs = [
+            "llm", "ml", "dl", "nlp", "cv", "rl",
+            "gan", "cnn", "rnn", "bert", "gpt", "rag",
+        ]

As per coding guidelines, "Use Black formatting with line-length = 100 for Python code".

src/paperbot/application/services/query_rewriter.py (1)

28-68: Annotate DEFAULT_ABBREVIATIONS with ClassVar.

Same as the VenueRecommender class — static analysis (RUF012) flags this mutable class attribute. Adding ClassVar makes the intent explicit.

Proposed fix
-from typing import Dict, List, Optional
+from typing import ClassVar, Dict, List, Optional
 ...
-    DEFAULT_ABBREVIATIONS: Dict[str, str] = {
+    DEFAULT_ABBREVIATIONS: ClassVar[Dict[str, str]] = {
src/paperbot/application/services/paper_deduplicator.py (2)

183-186: _find_index returns -1 as a sentinel — this could silently corrupt index mappings.

If _find_index ever fails to find the paper (e.g., due to a future code change), -1 would be stored in the identifier indexes, pointing at the last element in unique_papers rather than the intended paper. Consider raising an error or using a safer pattern:

Proposed fix
     def _find_index(self, paper: HarvestedPaper) -> int:
         """Find the index of a paper in the title hash index."""
         title_hash = paper.compute_title_hash()
-        return self._title_hash_index.get(title_hash, -1)
+        idx = self._title_hash_index.get(title_hash)
+        if idx is None:
+            raise KeyError(f"Paper not found in title hash index: {paper.title!r}")
+        return idx

179-181: list(set(...)) produces non-deterministic ordering for keywords and fields_of_study.

This won't cause bugs, but it makes deduplication results non-deterministic across runs (set iteration order varies). If stable output matters for tests or downstream consumers, consider using dict.fromkeys() to preserve insertion order:

-        existing.keywords = list(set(existing.keywords + new.keywords))
-        existing.fields_of_study = list(set(existing.fields_of_study + new.fields_of_study))
+        existing.keywords = list(dict.fromkeys(existing.keywords + new.keywords))
+        existing.fields_of_study = list(dict.fromkeys(existing.fields_of_study + new.fields_of_study))
alembic/versions/0003_paper_harvest_tables.py (2)

92-94: created_at and updated_at lack server defaults — NULLs are likely if application code doesn't set them.

Consider adding server_default=sa.func.now() (or sa.text("CURRENT_TIMESTAMP")) so that rows always have a creation timestamp, even if the application layer misses setting one.

Proposed fix
-            sa.Column("created_at", sa.DateTime(timezone=True), nullable=True),
-            sa.Column("updated_at", sa.DateTime(timezone=True), nullable=True),
+            sa.Column("created_at", sa.DateTime(timezone=True), server_default=sa.text("CURRENT_TIMESTAMP"), nullable=False),
+            sa.Column("updated_at", sa.DateTime(timezone=True), server_default=sa.text("CURRENT_TIMESTAMP"), nullable=False),

62-117: No unique constraints on deduplication identifier columns (doi, arxiv_id, etc.).

The papers table relies entirely on the application layer for deduplication. If multiple processes or API calls race, duplicate rows can be inserted. Consider adding unique indexes (at least on doi and title_hash) to enforce deduplication at the database level. Partial unique indexes (e.g., CREATE UNIQUE INDEX ... WHERE doi IS NOT NULL) would handle the nullable columns.
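
Illustrative sketch of partial unique indexes, to be called from an upgrade() step; the index names are hypothetical and the dialect-specific *_where kwargs only apply on their respective backends:

import sqlalchemy as sa
from alembic import op


def add_dedup_unique_indexes() -> None:
    """Add partial unique indexes so the DB also enforces deduplication."""
    op.create_index(
        "uq_papers_doi",
        "papers",
        ["doi"],
        unique=True,
        postgresql_where=sa.text("doi IS NOT NULL"),
        sqlite_where=sa.text("doi IS NOT NULL"),
    )
    op.create_index(
        "uq_papers_title_hash",
        "papers",
        ["title_hash"],
        unique=True,
        postgresql_where=sa.text("title_hash IS NOT NULL"),
        sqlite_where=sa.text("title_hash IS NOT NULL"),
    )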

src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py (2)

28-41: Annotate FIELDS with ClassVar to satisfy the mutable class-attribute lint.

Ruff RUF012 flags mutable class attributes that aren't annotated with ClassVar. A one-line fix silences the warning and communicates intent.

Proposed fix
+from typing import Any, ClassVar, Dict, List, Optional
-from typing import Any, Dict, List, Optional
 ...
-    FIELDS = [
+    FIELDS: ClassVar[List[str]] = [

50-101: Search method is well-structured; one subtle note on year-filter truthiness.

Lines 63-68: if year_from and year_to uses truthiness, so year_from=0 would be skipped. Since publication years are always positive this is fine in practice, but using is not None would be more precise and consistent with how the other harvesters (arXiv) might handle it.

Overall the search→filter→return flow, the cap at 100, and the graceful error fallback look good.

More explicit year checks
-            if year_from and year_to:
+            if year_from is not None and year_to is not None:
                 year_filter = f" year:{year_from}-{year_to}"
-            elif year_from:
+            elif year_from is not None:
                 year_filter = f" year:{year_from}-"
-            elif year_to:
+            elif year_to is not None:
                 year_filter = f" year:-{year_to}"
tests/integration/test_harvest_pipeline.py (2)

94-105: Pipeline fixture lacks teardown — potential resource leak across tests.

The pipeline fixture creates a HarvestPipeline with a SQLite DB but never calls close(). The error-handling tests (Lines 338, 380, 417) close their locally-created pipelines, but the fixture-based tests don't. Consider adding a yield-based cleanup.

Proposed fix
 `@pytest.fixture`
-def pipeline(tmp_path, mock_harvesters):
+async def pipeline(tmp_path, mock_harvesters):
     """Create HarvestPipeline with mocked dependencies."""
     db_url = f"sqlite:///{tmp_path / 'test_harvest.db'}"
     pipeline = HarvestPipeline(db_url=db_url)
 
     # Inject mock harvesters
     def get_mock_harvester(source):
         return mock_harvesters.get(source)
 
     pipeline._get_harvester = get_mock_harvester
-    return pipeline
+    yield pipeline
+    await pipeline.close()

225-226: Rename unused loop variable item to _.

Ruff B007 flags this. Minor fix for lint compliance.

Proposed fix
-        async for item in pipeline.run(config):
-            pass  # Just run to completion
+        async for _ in pipeline.run(config):
+            pass  # Just run to completion

Also applies to: 267-268

web/src/components/research/SavedPapersList.tsx (1)

99-104: Consider extracting the HTML error-response handling into a shared helper.

The same pattern for detecting and sanitizing HTML error responses is duplicated across loadSavedPapers, unsavePaper, and toggleReadStatus. A small utility would reduce repetition.

Example extraction
async function extractErrorMessage(res: Response): Promise<string> {
  const text = await res.text()
  if (text.startsWith("<!DOCTYPE") || text.startsWith("<html")) {
    return `Server error: ${res.status} ${res.statusText}`
  }
  return text
}

Also applies to: 140-146, 176-181

src/paperbot/infrastructure/harvesters/arxiv_harvester.py (3)

133-162: Version suffix stripping could incorrectly split IDs containing "v" in unusual positions.

Line 140-141: arxiv_id.split("v")[0] splits on the first lowercase "v" character, which works for standard arXiv IDs but is fragile. Using a regex to strip only a trailing version suffix is more robust.

Proposed fix
+import re
 ...
         # Remove version suffix (e.g., "2301.12345v1" -> "2301.12345")
-        if "v" in arxiv_id:
-            arxiv_id = arxiv_id.split("v")[0]
+        arxiv_id = re.sub(r"v\d+$", "", arxiv_id)

58-74: Same year-filter truthiness note as the S2 harvester.

Line 69: if year_from or year_to will skip year_from=0. Consistent with the suggestion on the S2 harvester, using is not None would be more precise.

Proposed fix
-        if year_from or year_to:
+        if year_from is not None or year_to is not None:
             start_date = f"{year_from}0101" if year_from else "199101"
             end_date = f"{year_to}1231" if year_to else "209912"

Note: the inner ternaries on lines 70-71 would also need updating to use is not None for full consistency.


48-56: Move import time to the top of the file.

The time module is imported inside _rate_limit() on every call. It's a stdlib module with negligible import cost — it should be at module level.

Proposed fix
 import asyncio
 import logging
+import time
 from typing import List, Optional
 ...
     async def _rate_limit(self) -> None:
         """Enforce rate limiting between requests."""
-        import time
-
         now = time.time()
src/paperbot/application/workflows/harvest_pipeline.py (3)

216-261: Harvesting is sequential, not parallel as documented.

The comment on Line 216 says "Harvest from each source in parallel", but the for loop on Line 223 awaits each harvester one at a time. This means sources are harvested sequentially, which may significantly impact latency with multiple sources.

Consider using asyncio.gather (or asyncio.TaskGroup) for true concurrency. If you intentionally want sequential execution (e.g., to avoid rate-limiting), update the comment to match.

♻️ Sketch for parallel harvesting
-            # Phase 4: Harvest from each source in parallel
-            all_papers: List[HarvestedPaper] = []
-
-            # Build search query from expanded keywords
-            search_query = " ".join(expanded_keywords)
-
-            # Harvest from each source
-            for source in sources:
-                yield HarvestProgress(
-                    phase="Harvesting",
-                    message=f"Fetching from {source}...",
-                    details={"source": source},
-                )
-
-                harvester = self._get_harvester(source)
-                if harvester is None:
-                    errors[source] = f"Unknown source: {source}"
-                    source_results[source] = {"papers": 0, "error": errors[source]}
-                    continue
-
-                try:
-                    result = await harvester.search(
-                        query=search_query,
-                        max_results=config.max_results_per_source,
-                        year_from=config.year_from,
-                        year_to=config.year_to,
-                        venues=venues,
-                    )
-
-                    all_papers.extend(result.papers)
-                    source_results[source] = {
-                        "papers": result.total_found,
-                        "error": result.error,
-                    }
-
-                    if result.error:
-                        errors[source] = result.error
-                        logger.warning(f"Error from {source}: {result.error}")
-                    else:
-                        logger.info(f"Harvested {result.total_found} papers from {source}")
-
-                except Exception as e:
-                    error_msg = str(e)
-                    errors[source] = error_msg
-                    source_results[source] = {"papers": 0, "error": error_msg}
-                    logger.exception(f"Exception harvesting from {source}")
+            # Phase 4: Harvest from each source
+            all_papers: List[HarvestedPaper] = []
+            search_query = " ".join(expanded_keywords)
+
+            yield HarvestProgress(
+                phase="Harvesting",
+                message=f"Fetching from {len(sources)} source(s)...",
+            )
+
+            async def _harvest_one(source: str):
+                harvester = self._get_harvester(source)
+                if harvester is None:
+                    return source, None, f"Unknown source: {source}"
+                try:
+                    result = await harvester.search(
+                        query=search_query,
+                        max_results=config.max_results_per_source,
+                        year_from=config.year_from,
+                        year_to=config.year_to,
+                        venues=venues,
+                    )
+                    return source, result, None
+                except Exception as e:
+                    return source, None, str(e)
+
+            tasks = [_harvest_one(s) for s in sources]
+            results = await asyncio.gather(*tasks)
+
+            for source, result, err in results:
+                if err:
+                    errors[source] = err
+                    source_results[source] = {"papers": 0, "error": err}
+                    logger.warning(f"Error from {source}: {err}")
+                else:
+                    all_papers.extend(result.papers)
+                    source_results[source] = {
+                        "papers": result.total_found,
+                        "error": result.error,
+                    }
+                    if result.error:
+                        errors[source] = result.error
+                    else:
+                        logger.info(
+                            f"Harvested {result.total_found} papers from {source}"
+                        )

337-355: run_sync is an async method — the name is misleading.

The method is declared async def run_sync(...) but "sync" typically implies it blocks without requiring await. Callers still need await pipeline.run_sync(...). Consider renaming to run_to_completion or run_collect to better express the intent (consume the generator and return the final result).


357-370: Silently swallowing exceptions during cleanup.

Lines 361-364 catch all exceptions with a bare pass. If a harvester's close() fails, the error is completely lost, making debugging difficult.

🛡️ Proposed fix: log instead of swallowing
         for harvester in self._harvesters.values():
             try:
                 await harvester.close()
-            except Exception:
-                pass
+            except Exception:
+                logger.debug("Error closing harvester", exc_info=True)
src/paperbot/api/routes/harvest.py (3)

163-181: Duplicated HarvestRunResponse construction logic.

The conversion from HarvestRunModel to HarvestRunResponse is repeated verbatim between list_harvest_runs and get_harvest_run. Extract it into a helper to keep things DRY.

♻️ Suggested helper
def _run_to_response(run: "HarvestRunModel") -> HarvestRunResponse:
    return HarvestRunResponse(
        run_id=run.run_id,
        keywords=run.get_keywords(),
        venues=run.get_venues(),
        sources=run.get_sources(),
        max_results_per_source=run.max_results_per_source or 50,
        status=run.status or "unknown",
        papers_found=run.papers_found or 0,
        papers_new=run.papers_new or 0,
        papers_deduplicated=run.papers_deduplicated or 0,
        errors=run.get_errors(),
        started_at=run.started_at.isoformat() if run.started_at else None,
        ended_at=run.ended_at.isoformat() if run.ended_at else None,
    )

Also applies to: 193-206


230-252: PaperResponse model is defined but never used.

No endpoint uses PaperResponse as a response_model. PaperSearchResponse.papers is typed as List[Dict[str, Any]] and get_paper returns a raw dict. Consider either using PaperResponse as the response model (for schema documentation and validation) or removing it to avoid dead code.


264-290: search_papers uses POST for a query-like operation.

Using POST for a search endpoint means results are not cacheable and the operation isn't idempotent from an HTTP semantics perspective. GET with query parameters is more conventional for search. If the reason for POST is the complexity of the filter body, this is a common pragmatic choice — but worth a brief comment in the docstring for clarity.

src/paperbot/infrastructure/stores/paper_store.py (4)

59-93: N+1 query pattern in upsert_papers_batch — up to 5 SELECTs per paper.

_find_existing issues up to 5 sequential queries per paper (doi, arxiv_id, semantic_scholar_id, openalex_id, title_hash). For a batch of 200 papers, this could mean ~1000 queries within a single transaction.

Consider batching the lookups: pre-fetch all existing papers matching any of the incoming identifiers in a single query (or a small number of queries), then do the upsert loop against the in-memory results.
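
Illustrative sketch of a batched pre-fetch covering two of the five identifiers (prefetch_existing is a hypothetical helper; the remaining identifiers would be handled the same way):

from sqlalchemy import or_, select

from paperbot.infrastructure.stores.models import PaperModel


def prefetch_existing(session, papers):
    """Fetch candidate matches for the whole batch in one query."""
    dois = {p.doi for p in papers if p.doi}
    arxiv_ids = {p.arxiv_id for p in papers if p.arxiv_id}
    conditions = []
    if dois:
        conditions.append(PaperModel.doi.in_(dois))
    if arxiv_ids:
        conditions.append(PaperModel.arxiv_id.in_(arxiv_ids))
    if not conditions:
        return {}, {}
    rows = session.execute(select(PaperModel).where(or_(*conditions))).scalars().all()
    # The upsert loop can now resolve duplicates from these in-memory maps.
    by_doi = {row.doi: row for row in rows if row.doi}
    by_arxiv = {row.arxiv_id: row for row in rows if row.arxiv_id}
    return by_doi, by_arxiv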


290-381: get_user_library loads all matching rows into memory for sorting/pagination.

The method fetches all matching paper-feedback pairs (Line 338), deduplicates in Python, sorts in Python, and then applies offset/limit. For users with large libraries, this won't scale.

The dual-type join (numeric ID vs. Semantic Scholar ID) makes SQL-level pagination difficult, but this should be documented as a known limitation with a TODO for future optimization — e.g., by normalizing paper_id storage to always use the numeric DB ID.
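
Illustrative sketch of the regex-guarded cast suggested earlier for this dual-type join (PostgreSQL-specific because of the ~ operator; import paths and model names are assumed):

from sqlalchemy import Integer, case, cast

from paperbot.infrastructure.stores.models import PaperFeedbackModel, PaperModel

# Only cast paper_id when it is purely numeric; otherwise fall through to the
# Semantic Scholar ID comparison. SQLAlchemy 1.4+/2.0-style case().
numeric_id_match = case(
    (
        PaperFeedbackModel.paper_id.op("~")(r"^\d+$"),
        cast(PaperFeedbackModel.paper_id, Integer) == PaperModel.id,
    ),
    else_=False,
)

join_condition = numeric_id_match | (
    PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id
)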


498-503: close() swallows all exceptions silently.

Same pattern as in the pipeline. Consider logging at debug level.

🧹 Fix
     def close(self) -> None:
         """Close database connections."""
         try:
             self._provider.engine.dispose()
-        except Exception:
-            pass
+        except Exception:
+            logger.debug("Error disposing engine", exc_info=True)

53-57: Base.metadata.create_all in __init__ may conflict with Alembic migrations.

The AI summary mentions Alembic migration 0003_paper_harvest_tables.py for these tables. Having auto_create_schema=True by default means every PaperStore() instantiation (including from API route handlers) will call create_all, which can create tables outside of Alembic's migration tracking. Consider defaulting auto_create_schema to False in production and only using it for tests.

Comment on lines +54 to +59
def upgrade() -> None:
    if _is_offline():
        _upgrade_create_tables()
        return
    _upgrade_create_tables()
    _upgrade_create_indexes()

⚠️ Potential issue | 🟠 Major

Indexes are skipped in offline mode — generated SQL will be incomplete.

When _is_offline() is True, upgrade() returns early after creating tables, never calling _upgrade_create_indexes(). Since _create_index already handles offline mode (Lines 46-48: it calls op.create_index unconditionally), the indexes should be emitted in offline-generated SQL too.

Proposed fix
 def upgrade() -> None:
-    if _is_offline():
-        _upgrade_create_tables()
-        return
     _upgrade_create_tables()
     _upgrade_create_indexes()
🤖 Prompt for AI Agents
In `@alembic/versions/0003_paper_harvest_tables.py` around lines 54 - 59, The
upgrade() function returns early when _is_offline() is true, so
_upgrade_create_indexes() is never called and generated offline SQL omits index
creation; remove the early return or restructure upgrade() so it always calls
_upgrade_create_indexes() after _upgrade_create_tables() (rely on
_create_index/op.create_index which already handles offline mode) — modify
upgrade() to invoke _upgrade_create_tables() then _upgrade_create_indexes()
regardless of _is_offline() so indexes are emitted in offline-generated SQL.

class HarvestRequest(BaseModel):
    """Request body for harvest endpoint."""

    keywords: List[str] = Field(..., min_items=1, description="Search keywords")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find project dependency files
fd -e "pyproject.toml" -e "requirements.txt" -e "setup.py" -e "setup.cfg" -e "poetry.lock" -e "Pipfile" | head -20

Repository: jerry609/PaperBot

Length of output: 43


🏁 Script executed:

# Look for Pydantic version specification
cat pyproject.toml 2>/dev/null || cat setup.py 2>/dev/null || cat requirements.txt 2>/dev/null || echo "No common dependency file found"

Repository: jerry609/PaperBot

Length of output: 1921


🌐 Web query:

Pydantic v2 Field min_items min_length List constraint validation

💡 Result:

In Pydantic v2, min_items / max_items are v1-era names for list-size constraints. For lists (and tuples) you should use:

  • Field(min_length=..., max_length=...) (these map to JSON Schema minItems / maxItems) (docs.pydantic.dev)
  • or conlist(item_type, min_length=..., max_length=...) (docs.pydantic.dev)

Examples (v2)

from typing import Annotated
from pydantic import BaseModel, Field, conlist

class M(BaseModel):
    a: list[int] = Field(min_length=2, max_length=5)
    b: Annotated[list[int], Field(min_length=2)]
    c: conlist(int, min_length=2, max_length=5)

If you try min_items in v2 (e.g., conlist(..., min_items=3)), you’ll get an “unexpected keyword argument” error; the fix is to use min_length / max_length instead. (stackoverflow.com)


🏁 Script executed:

# Check the actual code at line 52 in the file
sed -n '49,55p' src/paperbot/api/routes/harvest.py

Repository: jerry609/PaperBot

Length of output: 477


🏁 Script executed:

# Also check if there are other Field definitions with min_items in the file
rg "min_items" src/paperbot/api/routes/harvest.py

Repository: jerry609/PaperBot

Length of output: 141


min_items is deprecated in Pydantic v2 — use min_length.

With Pydantic v2 (as specified in dependencies), min_items on Field is not recognized and raises TypeError: unexpected keyword argument. This will cause the code to fail at runtime when HarvestRequest is instantiated.

Fix
-    keywords: List[str] = Field(..., min_items=1, description="Search keywords")
+    keywords: List[str] = Field(..., min_length=1, description="Search keywords")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    keywords: List[str] = Field(..., min_items=1, description="Search keywords")
+    keywords: List[str] = Field(..., min_length=1, description="Search keywords")
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` at line 52, The Field declaration for
keywords in HarvestRequest uses the deprecated Pydantic v1 arg min_items which
causes a TypeError; update the Field call on the keywords attribute (keywords:
List[str] = Field(..., min_items=1, description="Search keywords")) to use
min_length=1 instead (i.e., replace min_items with min_length) so Pydantic v2
accepts the constraint. Ensure the change is made on the keywords field in the
HarvestRequest/schema where Field is used.

Comment on lines +323 to +331
@router.get("/papers/library", response_model=LibraryResponse)
def get_user_library(
    user_id: str = Query("default", description="User ID"),
    track_id: Optional[int] = Query(None, description="Filter by track"),
    actions: Optional[str] = Query(None, description="Filter by actions (comma-separated)"),
    sort_by: str = Query("saved_at", description="Sort field"),
    sort_order: str = Query("desc", description="Sort order"),
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),

⚠️ Potential issue | 🟠 Major

user_id accepted from client without authentication — security concern.

user_id is supplied as a query parameter (Line 325) or request body field (Line 386) defaulting to "default". Any caller can read or modify any user's library by passing a different user_id. If this is intentional for an MVP with no auth, consider adding a TODO/comment. Otherwise, user_id should come from an authenticated session or token.

Also applies to: 386-387

🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` around lines 323 - 331, get_user_library
currently accepts user_id from the query string (user_id param in
get_user_library) which allows callers to access/modify arbitrary users'
libraries; change this to derive the user ID from the authenticated
session/token (e.g., replace the user_id Query param with a dependency that
extracts the current_user/current_user.id from your auth middleware or token
validation function used elsewhere) and remove the default "default" value;
apply the same change to the other endpoint that accepts user_id in the request
body (lines referenced) so both endpoints use the authenticated user context; if
no authentication exists for this MVP, add a clear TODO comment inside
get_user_library and the other endpoint stating that user_id is temporary and
must be replaced with auth-derived user identification before production.
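
Illustrative sketch of an auth-derived user_id dependency (this is not the project's auth layer; the header name, token map, and route are placeholders):

from typing import Optional

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Placeholder token map; a real implementation would verify a session or JWT.
_FAKE_TOKENS = {"token-abc": "user-123"}


def get_current_user_id(x_user_token: Optional[str] = Header(None)) -> str:
    """Hypothetical dependency deriving the user ID from a request header."""
    user_id = _FAKE_TOKENS.get(x_user_token or "")
    if user_id is None:
        raise HTTPException(status_code=401, detail="Invalid or missing token")
    return user_id


@app.get("/papers/library")
def get_user_library(user_id: str = Depends(get_current_user_id)):
    # user_id now comes from the authenticated context, not a query parameter.
    return {"user_id": user_id, "papers": []}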

Comment on lines +390 to +414
@router.post("/papers/{paper_id}/save")
def save_paper_to_library(paper_id: int, request: SavePaperRequest):
    """
    Save a paper to user's library.

    Uses paper_feedback table with action='save'.
    """
    from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore

    # Verify paper exists
    store = _get_paper_store()
    paper = store.get_paper_by_id(paper_id)
    if not paper:
        raise HTTPException(status_code=404, detail="Paper not found")

    # Use research store to record feedback
    research_store = SqlAlchemyResearchStore()
    feedback = research_store.record_paper_feedback(
        user_id=request.user_id,
        paper_id=str(paper_id),
        action="save",
        track_id=request.track_id,
    )

    return {"success": True, "feedback": feedback}

⚠️ Potential issue | 🟠 Major

SqlAlchemyResearchStore() instantiated per request and never closed — potential resource leak.

Each call to save_paper_to_library creates a new SqlAlchemyResearchStore() on Line 406. Unlike _paper_store, it is not cached or closed after use. This leaks database connections over time.

Consider using the same lazy-initialization pattern or injecting a shared store.

♻️ Sketch: lazy-initialize research store
+_research_store: Optional["SqlAlchemyResearchStore"] = None
+
+def _get_research_store():
+    global _research_store
+    if _research_store is None:
+        from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore
+        _research_store = SqlAlchemyResearchStore()
+    return _research_store
+
 `@router.post`("/papers/{paper_id}/save")
 def save_paper_to_library(paper_id: int, request: SavePaperRequest):
-    from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore
     # Verify paper exists
     store = _get_paper_store()
     paper = store.get_paper_by_id(paper_id)
     if not paper:
         raise HTTPException(status_code=404, detail="Paper not found")
 
-    research_store = SqlAlchemyResearchStore()
+    research_store = _get_research_store()
     feedback = research_store.record_paper_feedback(
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` around lines 390 - 414,
save_paper_to_library creates a new SqlAlchemyResearchStore() per request and
never closes it, leaking DB resources; change this to reuse a
shared/lazy-initialized research store or inject one instead of instantiating
inline: add a module-level lazy getter (similar to _get_paper_store) or accept a
dependency and use that shared instance, then call the store's close/dispose
method if it has one after record_paper_feedback (or rely on the shared
instance's lifecycle); update references in save_paper_to_library and any other
handlers that instantiate SqlAlchemyResearchStore() so they use the new
get_research_store() or injected research_store and remove the per-request
constructor.

Comment on lines +652 to +691
library_paper_id: Optional[int] = None
actual_paper_id = req.paper_id

# If action is "save" and we have paper metadata, insert into papers table
if req.action == "save" and req.paper_title:
    Logger.info("Save action detected, inserting paper into papers table", file=LogFiles.HARVEST)
    try:
        from paperbot.domain.harvest import HarvestedPaper, HarvestSource
        from paperbot.infrastructure.stores.paper_store import PaperStore

        paper_store = PaperStore()
        paper = HarvestedPaper(
            title=req.paper_title,
            source=HarvestSource.SEMANTIC_SCHOLAR,
            abstract=req.paper_abstract or "",
            authors=req.paper_authors or [],
            semantic_scholar_id=req.paper_id,
            year=req.paper_year,
            venue=req.paper_venue,
            citation_count=req.paper_citation_count or 0,
            url=req.paper_url,
        )
        Logger.info("Calling paper store to upsert paper", file=LogFiles.HARVEST)
        new_count, _ = paper_store.upsert_papers_batch([paper])

        # Get the paper ID from database
        from paperbot.infrastructure.stores.models import PaperModel
        from sqlalchemy import select
        with paper_store._provider.session() as session:
            result = session.execute(
                select(PaperModel).where(
                    PaperModel.semantic_scholar_id == req.paper_id
                )
            ).scalar_one_or_none()
            if result:
                library_paper_id = result.id
                actual_paper_id = str(result.id)  # Use integer ID for feedback
                Logger.info(f"Paper saved to library with id={library_paper_id}", file=LogFiles.HARVEST)
    except Exception as e:
        Logger.warning(f"Failed to save paper to library: {e}", file=LogFiles.HARVEST)

⚠️ Potential issue | 🟠 Major

Hard-coded SEMANTIC_SCHOLAR source and per-request PaperStore instantiation.

Two concerns in the save-to-library flow:

  1. Hard-coded source (Line 665): The paper source is always set to HarvestSource.SEMANTIC_SCHOLAR, but papers could originate from arXiv or OpenAlex. Consider accepting paper_source as an optional field on PaperFeedbackRequest and using it here.

  2. Per-request PaperStore() (Line 662): A new PaperStore is instantiated on every save request. If PaperStore manages a connection pool or session factory, this creates unnecessary overhead. Consider initializing it once at module scope (like _research_store) or using a lazy singleton.

Proposed fix sketch
 # At module scope, alongside other stores:
+from paperbot.infrastructure.stores.paper_store import PaperStore
+_paper_store = PaperStore()

 # In add_paper_feedback, replace:
-            from paperbot.infrastructure.stores.paper_store import PaperStore
-            paper_store = PaperStore()
+            paper_store = _paper_store

Add paper_source to PaperFeedbackRequest:

     paper_url: Optional[str] = None
+    paper_source: Optional[str] = None  # e.g. "arxiv", "semantic_scholar", "openalex"

Then use it:

-                source=HarvestSource.SEMANTIC_SCHOLAR,
+                source=HarvestSource(req.paper_source) if req.paper_source else HarvestSource.SEMANTIC_SCHOLAR,
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 675-675: Unpacked variable new_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 690-690: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/research.py` around lines 652 - 691, The
save-to-library flow hardcodes HarvestSource.SEMANTIC_SCHOLAR and constructs a
new PaperStore per request; update PaperFeedbackRequest to include an optional
paper_source and use that when building the HarvestedPaper (falling back to
HarvestSource.SEMANTIC_SCHOLAR), and replace the per-request PaperStore()
instantiation by using a module-scoped or lazy singleton store (e.g.,
_research_store) so this handler reuses the same PaperStore instance when
calling upsert_papers_batch and accessing _provider.session().

Comment on lines +42 to +45
    async def _get_session(self) -> aiohttp.ClientSession:
        if self._session is None or self._session.closed:
            self._session = aiohttp.ClientSession()
        return self._session

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, read the OpenAlex harvester file around the lines mentioned
cat -n src/paperbot/infrastructure/harvesters/openalex_harvester.py | head -220

Repository: jerry609/PaperBot

Length of output: 8522


🏁 Script executed:

# Search for other harvesters and their timeout patterns
fd -e py "harvester" src/paperbot/infrastructure/harvesters/ | head -20

Repository: jerry609/PaperBot

Length of output: 248


🏁 Script executed:

# Check if ClientTimeout is used anywhere in the codebase
rg "ClientTimeout" --type py -B 2 -A 2

Repository: jerry609/PaperBot

Length of output: 10041


🏁 Script executed:

# Check all ClientSession instantiations
rg "ClientSession\(" --type py -B 2 -A 2

Repository: jerry609/PaperBot

Length of output: 8422


🏁 Script executed:

# Check for test files related to harvesters
find . -name "*test*harvester*" -o -name "*harvester*test*" | head -20

Repository: jerry609/PaperBot

Length of output: 99


🏁 Script executed:

# Check semantic scholar harvester for comparison
cat -n src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py | grep -A 5 "_get_session"

Repository: jerry609/PaperBot

Length of output: 43


🏁 Script executed:

# Check if semantic_scholar_harvester exists
ls -la src/paperbot/infrastructure/harvesters/

Repository: jerry609/PaperBot

Length of output: 457


🏁 Script executed:

# Look at the test file to understand coverage
cat -n tests/integration/test_harvesters.py

Repository: jerry609/PaperBot

Length of output: 20610


Add HTTP request timeout to prevent indefinite hangs.

The aiohttp.ClientSession at line 44 and the session.get() call at line 90 lack timeout configuration. If the OpenAlex API becomes unresponsive, the harvester will block indefinitely, stalling the harvest pipeline. The codebase already establishes timeout patterns elsewhere (see src/paperbot/utils/downloader.py and src/paperbot/infrastructure/services/api_client.py).

Proposed fix — add ClientTimeout to session creation
     async def _get_session(self) -> aiohttp.ClientSession:
         if self._session is None or self._session.closed:
-            self._session = aiohttp.ClientSession()
+            timeout = aiohttp.ClientTimeout(total=30)
+            self._session = aiohttp.ClientSession(timeout=timeout)
         return self._session

Note: ArxivHarvester has the identical pattern and should also be fixed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     async def _get_session(self) -> aiohttp.ClientSession:
         if self._session is None or self._session.closed:
-            self._session = aiohttp.ClientSession()
+            timeout = aiohttp.ClientTimeout(total=30)
+            self._session = aiohttp.ClientSession(timeout=timeout)
         return self._session
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/harvesters/openalex_harvester.py` around lines 42
- 45, Add an HTTP timeout to prevent indefinite hangs by creating the aiohttp
session with a ClientTimeout in _get_session (use aiohttp.ClientTimeout with an
appropriate total/Connect/read values) instead of a bare aiohttp.ClientSession,
and ensure individual calls that use session.get(...) also pass a timeout (or
rely on the session default) so requests to the OpenAlex API cannot block
forever; apply the same changes to the ArxivHarvester equivalent methods.
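
For context, a minimal sketch of that session-level plus per-request timeout pattern (the 30s/10s/60s values are illustrative assumptions, not settings from this PR):

# Illustrative sketch of the timeout pattern described above; the concrete values
# are assumptions, not taken from this PR.
from typing import Optional

import aiohttp


class TimeoutAwareHarvesterSketch:
    def __init__(self) -> None:
        self._session: Optional[aiohttp.ClientSession] = None

    async def _get_session(self) -> aiohttp.ClientSession:
        if self._session is None or self._session.closed:
            timeout = aiohttp.ClientTimeout(total=30, connect=10)
            self._session = aiohttp.ClientSession(timeout=timeout)
        return self._session

    async def _fetch_json(self, url: str, params: Optional[dict] = None) -> dict:
        session = await self._get_session()
        # A per-request timeout can still override the session default when needed.
        async with session.get(url, params=params, timeout=aiohttp.ClientTimeout(total=60)) as resp:
            resp.raise_for_status()
            return await resp.json()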

Comment on lines +209 to +278
    def search_papers(
        self,
        *,
        query: Optional[str] = None,
        keywords: Optional[List[str]] = None,
        venues: Optional[List[str]] = None,
        year_from: Optional[int] = None,
        year_to: Optional[int] = None,
        min_citations: Optional[int] = None,
        sources: Optional[List[str]] = None,
        sort_by: str = "citation_count",
        sort_order: str = "desc",
        limit: int = 50,
        offset: int = 0,
    ) -> Tuple[List[PaperModel], int]:
        """
        Search papers with filters and pagination.

        Returns:
            Tuple of (papers, total_count)
        """
        with self._provider.session() as session:
            stmt = select(PaperModel).where(PaperModel.deleted_at.is_(None))

            # Full-text search (LIKE for v1)
            if query:
                pattern = f"%{query}%"
                stmt = stmt.where(
                    or_(
                        PaperModel.title.ilike(pattern),
                        PaperModel.abstract.ilike(pattern),
                    )
                )

            # Year filters
            if year_from:
                stmt = stmt.where(PaperModel.year >= year_from)
            if year_to:
                stmt = stmt.where(PaperModel.year <= year_to)

            # Citation filter
            if min_citations:
                stmt = stmt.where(PaperModel.citation_count >= min_citations)

            # Venue filter
            if venues:
                venue_conditions = [PaperModel.venue.ilike(f"%{v}%") for v in venues]
                stmt = stmt.where(or_(*venue_conditions))

            # Source filter
            if sources:
                stmt = stmt.where(PaperModel.primary_source.in_(sources))

            # Count total before pagination
            count_stmt = select(func.count()).select_from(stmt.subquery())
            total_count = session.execute(count_stmt).scalar() or 0

            # Sort
            sort_col = getattr(PaperModel, sort_by, PaperModel.citation_count)
            if sort_order.lower() == "desc":
                stmt = stmt.order_by(sort_col.desc())
            else:
                stmt = stmt.order_by(sort_col.asc())

            # Pagination
            stmt = stmt.offset(offset).limit(limit)

            papers = session.execute(stmt).scalars().all()

            return list(papers), total_count

⚠️ Potential issue | 🟠 Major

Unused keywords parameter and unvalidated sort_by open to attribute access.

Two issues in search_papers:

  1. The keywords parameter (Line 213) is accepted but never used in the method body. This is dead code that misleads callers.

  2. sort_by (Line 267) uses getattr(PaperModel, sort_by, PaperModel.citation_count) with a user-supplied string. This allows sorting by any model attribute (including deleted_at, internal fields, or non-column attributes). Whitelist allowed sort columns.

♻️ Proposed fix
+    _ALLOWED_SORT_COLUMNS = {"citation_count", "year", "title", "created_at", "updated_at"}
+
     def search_papers(
         self,
         *,
         query: Optional[str] = None,
-        keywords: Optional[List[str]] = None,
         venues: Optional[List[str]] = None,
         year_from: Optional[int] = None,
         year_to: Optional[int] = None,
         min_citations: Optional[int] = None,
         sources: Optional[List[str]] = None,
         sort_by: str = "citation_count",
         sort_order: str = "desc",
         limit: int = 50,
         offset: int = 0,
     ) -> Tuple[List[PaperModel], int]:
         # ...
-            sort_col = getattr(PaperModel, sort_by, PaperModel.citation_count)
+            if sort_by not in self._ALLOWED_SORT_COLUMNS:
+                sort_by = "citation_count"
+            sort_col = getattr(PaperModel, sort_by)
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 213-213: Unused method argument: keywords

(ARG002)

🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 209 - 278,
The search_papers method currently ignores the keywords parameter and unsafely
uses getattr(PaperModel, sort_by...), so remove or implement keywords filtering
and restrict sort_by to a whitelist: update search_papers to either apply
keywords (e.g., build OR ilike conditions on PaperModel.title/abstract/keywords)
or delete the unused keywords argument and its callers, and replace the dynamic
getattr with a mapping/whitelist of allowed sort keys (e.g., {"citation_count":
PaperModel.citation_count, "year": PaperModel.year, "title": PaperModel.title})
then look up sort_col = ALLOWED_SORT_COLUMNS.get(sort_by,
PaperModel.citation_count) before ordering; reference function search_papers,
parameter keywords, parameter sort_by, and model PaperModel when making the
changes.
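
A compact version of the mapping-based whitelist suggested above could look like this (a sketch against the PaperModel used in this PR, not the PR's actual code):

# Sketch: whitelist sort keys via an explicit mapping instead of getattr().
from paperbot.infrastructure.stores.models import PaperModel

_ALLOWED_SORT_COLUMNS = {
    "citation_count": PaperModel.citation_count,
    "year": PaperModel.year,
    "title": PaperModel.title,
}


def resolve_sort_column(sort_by: str):
    """Fall back to citation_count for unknown or unsafe sort keys."""
    return _ALLOWED_SORT_COLUMNS.get(sort_by, PaperModel.citation_count)

search_papers would then call resolve_sort_column(sort_by) in place of the dynamic getattr lookup.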

Comment on lines +244 to +251
            if year_from:
                stmt = stmt.where(PaperModel.year >= year_from)
            if year_to:
                stmt = stmt.where(PaperModel.year <= year_to)

            # Citation filter
            if min_citations:
                stmt = stmt.where(PaperModel.citation_count >= min_citations)

⚠️ Potential issue | 🟠 Major

Falsy-value filters skip valid zero values.

if year_from: on Line 244 and if min_citations: on Line 250 are falsy checks. While year_from=0 is unrealistic, min_citations=0 is a valid filter meaning "include papers with zero or more citations" — but it would be skipped because 0 is falsy.

🐛 Fix
-            if year_from:
+            if year_from is not None:
                 stmt = stmt.where(PaperModel.year >= year_from)
-            if year_to:
+            if year_to is not None:
                 stmt = stmt.where(PaperModel.year <= year_to)
 
             # Citation filter
-            if min_citations:
+            if min_citations is not None:
                 stmt = stmt.where(PaperModel.citation_count >= min_citations)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            if year_from:
+            if year_from is not None:
                 stmt = stmt.where(PaperModel.year >= year_from)
-            if year_to:
+            if year_to is not None:
                 stmt = stmt.where(PaperModel.year <= year_to)
 
             # Citation filter
-            if min_citations:
+            if min_citations is not None:
                 stmt = stmt.where(PaperModel.citation_count >= min_citations)
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 244 - 251,
The current falsy checks (if year_from:, if year_to:, if min_citations:) skip
valid zero values; change them to explicit None checks (e.g., use "is not None")
so filters on PaperModel.year and PaperModel.citation_count are applied when
year_from, year_to or min_citations equal 0. Update the conditionals around the
stmt.where calls that reference PaperModel.year and PaperModel.citation_count
(the blocks using year_from, year_to, min_citations) to use "is not None" checks
to ensure zero is treated as a valid filter value.

Comment on lines +318 to +331
        base_stmt = (
            select(PaperModel, PaperFeedbackModel)
            .join(
                PaperFeedbackModel,
                or_(
                    PaperModel.id == cast(PaperFeedbackModel.paper_id, Integer),
                    PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
                ),
            )
            .where(
                PaperFeedbackModel.user_id == user_id,
                PaperFeedbackModel.action.in_(actions),
                PaperModel.deleted_at.is_(None),
            )

⚠️ Potential issue | 🔴 Critical

CAST(paper_id AS INTEGER) will fail on PostgreSQL for non-numeric strings.

The OR join on Lines 322-325 uses cast(PaperFeedbackModel.paper_id, Integer) to match numeric paper IDs. However, as the comments on Lines 312-314 acknowledge, paper_id can also be a Semantic Scholar ID (e.g., "abc123"). On PostgreSQL, CAST('abc123' AS INTEGER) raises a data-type error — it does not return NULL or 0.

This means any row in paper_feedback with a non-numeric paper_id will cause the entire query to fail on PostgreSQL.

Consider using a CASE expression or a regex filter to conditionally cast, or normalize paper_id storage to always use the numeric ID.

🐛 Possible approach using a safe cast with CASE
from sqlalchemy import case, text

# Safe cast: only attempt integer comparison when paper_id looks numeric
safe_int_match = case(
    (
        PaperFeedbackModel.paper_id.op("~")(r"^\d+$"),
        PaperModel.id == cast(PaperFeedbackModel.paper_id, Integer),
    ),
    else_=False,
)

base_stmt = (
    select(PaperModel, PaperFeedbackModel)
    .join(
        PaperFeedbackModel,
        or_(
            safe_int_match,
            PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
        ),
    )
    # ...
)

A cleaner long-term fix would be to always store the numeric papers.id in paper_feedback.paper_id at save time, avoiding the dual-type join entirely.

🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 318 - 331,
The join currently casts PaperFeedbackModel.paper_id to Integer (in base_stmt)
which will raise on PostgreSQL for non-numeric strings; replace the direct cast
with a safe conditional expression (e.g., build a case/boolean expression that
first checks PaperFeedbackModel.paper_id.op("~")(r"^\d+$") then compares
cast(PaperFeedbackModel.paper_id, Integer) only when the regex matches) and use
that safe_int_match in the or_ alongside PaperModel.semantic_scholar_id ==
PaperFeedbackModel.paper_id; update the join condition in base_stmt (where
cast(...) is referenced) to use the safe case/regex approach so non-numeric
paper_id values don’t trigger a CAST error (alternatively, normalize paper_id at
write time to always store numeric IDs).

Comment on lines +7 to +9
const { paperId } = await params
const url = new URL(req.url)
const upstream = `${apiBaseUrl()}/api/papers/${paperId}/save${url.search}`

⚠️ Potential issue | 🟠 Major

paperId is interpolated into the upstream URL without sanitization — potential path traversal risk.

A crafted paperId (e.g., ../../admin/delete) could manipulate the upstream request path. Consider validating or encoding paperId before interpolation:

🛡️ Proposed fix
  const { paperId } = await params
+ const safePaperId = encodeURIComponent(paperId)
  const url = new URL(req.url)
- const upstream = `${apiBaseUrl()}/api/papers/${paperId}/save${url.search}`
+ const upstream = `${apiBaseUrl()}/api/papers/${safePaperId}/save${url.search}`
  return proxyJson(req, upstream, "DELETE")

The same applies to the POST handler on Line 18.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  const { paperId } = await params
+ const safePaperId = encodeURIComponent(paperId)
  const url = new URL(req.url)
- const upstream = `${apiBaseUrl()}/api/papers/${paperId}/save${url.search}`
+ const upstream = `${apiBaseUrl()}/api/papers/${safePaperId}/save${url.search}`
  return proxyJson(req, upstream, "DELETE")
🤖 Prompt for AI Agents
In web/src/app/api/papers/[paperId]/save/route.ts around lines 7 - 9, The
paperId is being interpolated directly into the upstream URL in route.ts
(upstream and apiBaseUrl()) and in the POST handler, allowing path-traversal
style values; sanitize or encode paperId before use (e.g., validate it against
an allowed pattern like /^[A-Za-z0-9_-]+$/ or otherwise reject invalid values,
or use a safe encoder such as encodeURIComponent on the paperId) and then use
that sanitized variable for building upstream and any other requests in the POST
handler to prevent crafted path segments from escaping the intended route.

@gemini-code-assist

Summary of Changes

Hello @CJBshuosi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive paper harvesting and management system. It includes new API endpoints for initiating paper harvests from various sources, searching and filtering papers, managing user libraries, and tracking harvest run history. The changes involve database schema updates, new API routes, and the implementation of a multi-source paper harvest pipeline.

Highlights

  • Paper Harvesting: Introduces a new feature to harvest papers from multiple sources based on keywords, venues, and publication years.
  • Paper Search and Retrieval: Enables searching and filtering of harvested papers, enhancing the ability to find relevant research.
  • User Library Management: Allows users to save papers to their personal library and manage their reading status.
  • Harvest Run History: Tracks the execution and audit of harvest runs, providing insights into the harvesting process.


Changelog
  • alembic/versions/0003_paper_harvest_tables.py
    • Created Alembic migration script to add 'papers' and 'harvest_runs' tables to the database.
  • src/paperbot/api/main.py
    • Included the 'harvest' router in the main API, making the new endpoints accessible.
  • src/paperbot/api/routes/harvest.py
    • Added new API routes for paper harvesting, search, and user library management.
  • src/paperbot/api/routes/research.py
    • Modified the research routes to include paper metadata when saving to the library and added logging for paper feedback requests.
  • src/paperbot/application/ports/harvester_port.py
    • Defined the 'HarvesterPort' interface for paper harvesters.
  • src/paperbot/application/services/__init__.py
    • Exported new services: 'PaperDeduplicator', 'QueryRewriter', and 'VenueRecommender'.
  • src/paperbot/application/services/paper_deduplicator.py
    • Implemented 'PaperDeduplicator' service for multi-strategy paper deduplication.
  • src/paperbot/application/services/query_rewriter.py
    • Implemented 'QueryRewriter' service for expanding and rewriting search queries.
  • src/paperbot/application/services/venue_recommender.py
    • Implemented 'VenueRecommender' service for recommending relevant academic venues based on search keywords.
  • src/paperbot/application/workflows/harvest_pipeline.py
    • Implemented 'HarvestPipeline' to orchestrate multi-source paper harvesting with deduplication and storage.
  • src/paperbot/context_engine/engine.py
    • Modified the context engine to include logging for paper searches and error handling.
  • src/paperbot/domain/harvest.py
    • Defined domain models for paper harvesting, including 'HarvestedPaper', 'HarvestSource', 'HarvestResult', and 'HarvestRunResult'.
  • src/paperbot/infrastructure/harvesters/__init__.py
    • Added harvesters for Arxiv, Semantic Scholar, and OpenAlex.
  • src/paperbot/infrastructure/harvesters/arxiv_harvester.py
    • Implemented 'ArxivHarvester' to fetch papers from arXiv.
  • src/paperbot/infrastructure/harvesters/openalex_harvester.py
    • Implemented 'OpenAlexHarvester' to fetch papers from OpenAlex.
  • src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py
    • Implemented 'SemanticScholarHarvester' to fetch papers from Semantic Scholar.
  • src/paperbot/infrastructure/stores/models.py
    • Modified the database models to include 'PaperModel' and 'HarvestRunModel'.
  • src/paperbot/infrastructure/stores/paper_store.py
    • Implemented 'PaperStore' to handle persistence and retrieval of harvested papers.
  • src/paperbot/infrastructure/stores/research_store.py
    • Modified the research store to include logging and paper saving functionality.
  • tests/integration/test_harvest_pipeline.py
    • Added integration tests for the harvest pipeline.
  • tests/integration/test_harvesters.py
    • Added integration tests for the harvesters.
  • tests/integration/test_paper_store.py
    • Added integration tests for the paper store.
  • tests/unit/test_harvested_paper.py
    • Added unit tests for the harvested paper domain model.
  • tests/unit/test_paper_deduplicator.py
    • Added unit tests for the paper deduplicator.
  • tests/unit/test_query_rewriter.py
    • Added unit tests for the query rewriter.
  • tests/unit/test_venue_recommender.py
    • Added unit tests for the venue recommender.
  • web/package-lock.json
    • Updated package lock file.
  • web/src/app/api/papers/[paperId]/save/route.ts
    • Added API route for saving papers.
  • web/src/app/api/papers/library/route.ts
    • Added API route for retrieving the paper library.
  • web/src/components/research/ResearchDashboard.tsx
    • Modified the research dashboard to include paper metadata when saving and to pass paper data to the feedback function.
  • web/src/components/research/SavedPapersList.tsx
    • Modified the saved papers list to include additional paper metadata and to use the new API endpoints.
  • web/src/lib/api.ts
    • Modified the API to fetch papers from the new library endpoint.
Activity
  • The pull request introduces a new feature for paper harvesting and management.
  • Database schema is updated to include new tables for papers and harvest runs.
  • New API endpoints are added for paper harvesting, search, and user library management.
  • Services for paper deduplication, query rewriting, and venue recommendation are implemented.
  • Harvesters for Arxiv, Semantic Scholar, and OpenAlex are added.
  • Unit and integration tests are added to ensure the functionality of the new features.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive paper harvesting and management system. Key changes include the addition of new database tables (papers and harvest_runs) via an Alembic migration to store harvested paper metadata and track harvest execution, plus a new FastAPI router (/api/harvest) exposing endpoints for initiating multi-source paper harvests with streaming progress updates, listing and retrieving harvest run details, searching stored papers with various filters, and managing a user's paper library (saving/unsaving papers).

The backend now includes new application services: PaperDeduplicator for in-memory deduplication of papers from different sources, QueryRewriter for expanding search keywords (e.g., abbreviations), and VenueRecommender for suggesting relevant academic venues. A HarvestPipeline workflow orchestrates these services and the harvesters (Arxiv, Semantic Scholar, OpenAlex) to perform the end-to-end harvesting process. The research API's add_paper_feedback endpoint is enhanced to optionally save paper metadata to the new papers table when a 'save' action is performed, and logging is improved across several API routes.

On the frontend, ResearchDashboard is updated to pass full paper metadata when saving, and SavedPapersList is refactored to use the new /api/papers/library endpoint for fetching and managing saved papers, including new unsavePaper and toggleReadStatus functions. Several integration and unit tests for the new harvesting components and services are also added.

Review comments highlight several security and efficiency concerns:

  • Multiple endpoints (/api/papers/{paper_id}/save, /harvest/runs, /api/papers/library, /research/papers/feedback, /research/context) are identified as vulnerable to Insecure Direct Object Reference (IDOR) or unauthorized data access due to client-supplied user_id parameters without proper authentication.
  • Information exposure is noted in harvest_stream for directly exposing raw exception messages.
  • Efficiency concerns are raised regarding PaperDeduplicator's _merge_paper re-indexing logic, PaperStore._find_existing's sequential database queries, and client-side venue filtering in OpenAlexHarvester and SemanticScholarHarvester, which can lead to inefficient API usage.
  • Other comments suggest improving full-text search performance, optimizing LIKE queries for venue filtering, standardizing paper_id types across frontend/backend, and reviewing column lengths in the papers table.
  • The compute_title_hash method's handling of hyphens and the HarvestPipeline's paper count reporting during pipeline-level failures are also noted for potential refinement.

Comment on lines +143 to +153
            existing.doi = new.doi
            self._doi_index[new.doi.lower().strip()] = self._find_index(existing)
        if not existing.arxiv_id and new.arxiv_id:
            existing.arxiv_id = new.arxiv_id
            self._arxiv_index[new.arxiv_id.lower().strip()] = self._find_index(existing)
        if not existing.semantic_scholar_id and new.semantic_scholar_id:
            existing.semantic_scholar_id = new.semantic_scholar_id
            self._s2_index[new.semantic_scholar_id.lower().strip()] = self._find_index(existing)
        if not existing.openalex_id and new.openalex_id:
            existing.openalex_id = new.openalex_id
            self._openalex_index[new.openalex_id.lower().strip()] = self._find_index(existing)


high

The logic for updating _doi_index, _arxiv_index, _s2_index, and _openalex_index within _merge_paper relies on self._find_index(existing). This _find_index method computes the title hash and looks it up in _title_hash_index. This approach is problematic because if the existing paper was initially indexed by an identifier (e.g., DOI) and not its title hash, _find_index might return -1, leading to incorrect index updates for the canonical identifiers. The indexes should be updated directly using the idx of the existing paper, which is already known from _find_duplicate.

Suggested change
+        # Fill in missing identifiers
+        if not existing.doi and new.doi:
             existing.doi = new.doi
-            self._doi_index[new.doi.lower().strip()] = self._find_index(existing)
+            self._doi_index[new.doi.lower().strip()] = self._find_index(existing)  # This line is problematic
         if not existing.arxiv_id and new.arxiv_id:
             existing.arxiv_id = new.arxiv_id
-            self._arxiv_index[new.arxiv_id.lower().strip()] = self._find_index(existing)
+            self._arxiv_index[new.arxiv_id.lower().strip()] = self._find_index(existing)  # This line is problematic
         if not existing.semantic_scholar_id and new.semantic_scholar_id:
             existing.semantic_scholar_id = new.semantic_scholar_id
-            self._s2_index[new.semantic_scholar_id.lower().strip()] = self._find_index(existing)
+            self._s2_index[new.semantic_scholar_id.lower().strip()] = self._find_index(existing)  # This line is problematic
         if not existing.openalex_id and new.openalex_id:
             existing.openalex_id = new.openalex_id
-            self._openalex_index[new.openalex_id.lower().strip()] = self._find_index(existing)
+            self._openalex_index[new.openalex_id.lower().strip()] = self._find_index(existing)  # This line is problematic
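
For what it's worth, the direct-index approach both reviewers point at could be sketched like this; the method signature and the surrounding class are assumptions about PaperDeduplicator's internals, shown only to illustrate reusing the known index instead of re-deriving it:

# Illustrative skeleton only; the real PaperDeduplicator holds more indexes and logic.
# The point is to pass the idx already found by _find_duplicate straight into the merge.
class PaperDeduplicatorSketch:
    def __init__(self) -> None:
        self._doi_index: dict = {}
        self._arxiv_index: dict = {}

    def _merge_paper(self, idx: int, existing, new) -> None:
        if not existing.doi and new.doi:
            existing.doi = new.doi
            self._doi_index[new.doi.lower().strip()] = idx
        if not existing.arxiv_id and new.arxiv_id:
            existing.arxiv_id = new.arxiv_id
            self._arxiv_index[new.arxiv_id.lower().strip()] = idx
        # The same pattern would apply for semantic_scholar_id and openalex_id.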

Comment on lines +390 to +391
@router.post("/papers/{paper_id}/save")
def save_paper_to_library(paper_id: int, request: SavePaperRequest):


security-high high

The endpoint /api/papers/{paper_id}/save is vulnerable to Broken Access Control (IDOR) because it accepts a user_id in the request body without verification. This allows an attacker to save papers to any user's library, leading to unauthorized modification of user data. Additionally, there's an inconsistency where paper_id is an integer in the path but expected as a string in SqlAlchemyResearchStore, which could lead to confusion if paper_id can also be a Semantic Scholar ID. Ensure the user_id in the request matches the authenticated requester's identity and clarify the paper_id type handling.

Comment on lines +153 to +154
@router.get("/harvest/runs", response_model=HarvestRunListResponse)
def list_harvest_runs(


security-high high

The /harvest/runs endpoint lists all harvest execution records in the system without any user-based filtering or authentication. Since harvest runs contain sensitive research data such as search keywords and targeted venues, this allows any user to monitor the research activities of others. Access to these records should be restricted to the user who initiated the run.

Comment on lines +323 to +325
@router.get("/papers/library", response_model=LibraryResponse)
def get_user_library(
user_id: str = Query("default", description="User ID"),


security-high high

The endpoint /api/papers/library accepts a user_id parameter directly from the client without any authentication. In a multi-user environment, this allows any user to view the saved papers of any other user by simply providing their user_id. This is an Insecure Direct Object Reference (IDOR) vulnerability. Authorization should be enforced based on an authenticated session or token.

Comment on lines +417 to +421
@router.delete("/papers/{paper_id}/save")
def remove_paper_from_library(
paper_id: int,
user_id: str = Query("default", description="User ID"),
):


security-high high

The DELETE endpoint for /api/papers/{paper_id}/save allows removing papers from any user's library by providing their user_id in the query string. Without authentication, this is a critical access control flaw that permits unauthorized deletion of user data.

Comment on lines +234 to +240
            if query:
                pattern = f"%{query}%"
                stmt = stmt.where(
                    or_(
                        PaperModel.title.ilike(pattern),
                        PaperModel.abstract.ilike(pattern),
                    )


medium

The full-text search for query uses ilike(pattern) on both title and abstract. While this works for basic substring matching, ilike can be inefficient for large text fields and complex queries, especially without proper full-text indexing. For more advanced and performant full-text search capabilities, consider integrating with a dedicated full-text search solution (e.g., PostgreSQL's tsvector, SQLite's FTS5, or an external search engine like Elasticsearch).
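
As a rough illustration of the PostgreSQL option mentioned above (assuming the store stays on SQLAlchemy and a PostgreSQL backend; the column names follow the PaperModel used in this PR):

# Rough sketch of a PostgreSQL full-text filter as an alternative to ILIKE.
# Assumes a PostgreSQL backend; a matching GIN expression index would be needed
# for this to actually outperform ILIKE.
from sqlalchemy import func, or_, select

from paperbot.infrastructure.stores.models import PaperModel


def build_fulltext_stmt(query: str):
    ts_query = func.plainto_tsquery("english", query)
    return select(PaperModel).where(
        or_(
            func.to_tsvector("english", PaperModel.title).op("@@")(ts_query),
            func.to_tsvector("english", func.coalesce(PaperModel.abstract, "")).op("@@")(ts_query),
        )
    )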

Comment on lines +255 to +256
venue_conditions = [PaperModel.venue.ilike(f"%{v}%") for v in venues]
stmt = stmt.where(or_(*venue_conditions))


medium

The venue filter uses PaperModel.venue.ilike(f"%{v}%") for each venue. This LIKE operator with leading wildcards (%) can be very slow as it prevents the use of indexes on the venue column. If venue filtering is a common operation, consider normalizing venue names and storing them in a separate table or using a more efficient search mechanism for this field.

Comment on lines +69 to +72
sa.Column("doi", sa.String(length=256), nullable=True),
sa.Column("arxiv_id", sa.String(length=64), nullable=True),
sa.Column("semantic_scholar_id", sa.String(length=64), nullable=True),
sa.Column("openalex_id", sa.String(length=64), nullable=True),


medium

The lengths for doi, arxiv_id, semantic_scholar_id, and openalex_id columns in the papers table seem arbitrary. It's best practice to define these lengths based on the maximum possible length of these identifiers from their respective sources to prevent truncation or unnecessary over-allocation. For example, DOIs can be up to 256 characters, arXiv IDs are typically shorter (e.g., 10-15 chars plus optional version), Semantic Scholar IDs are UUID-like (32-36 chars), and OpenAlex IDs are also UUID-like. Consider reviewing the actual maximum lengths from the sources to set more precise limits.

Comment on lines +338 to +348
            all_results = session.execute(base_stmt).all()
            Logger.info(f"Query returned {len(all_results)} results before deduplication", file=LogFiles.HARVEST)

            # Deduplicate by paper.id, keeping the one with latest timestamp
            Logger.info("Deduplicating results by paper id", file=LogFiles.HARVEST)
            paper_map: Dict[int, Tuple[PaperModel, PaperFeedbackModel]] = {}
            for row in all_results:
                paper = row[0]
                feedback = row[1]
                if paper.id not in paper_map or feedback.ts > paper_map[paper.id][1].ts:
                    paper_map[paper.id] = (paper, feedback)


medium

The deduplication logic for get_user_library is performed in Python (paper_map: Dict[int, Tuple[PaperModel, PaperFeedbackModel]] = {}). While this works for smaller result sets, fetching all results from the database and then deduplicating in Python can be inefficient for large user libraries. It would be more performant to perform this deduplication at the database level, for example, using DISTINCT ON (PostgreSQL) or a subquery with ROW_NUMBER() to select the latest feedback per paper ID.
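
A database-side version of that deduplication could look roughly like the window-function sketch below; the import paths and the join condition mirror the snippets shown in this PR and carry the same CAST caveat raised earlier:

# Sketch: keep only the latest feedback row per paper via ROW_NUMBER(), so the
# deduplication happens in SQL rather than Python. Import paths are assumptions
# based on the snippets in this PR.
from sqlalchemy import Integer, cast, func, or_, select

from paperbot.infrastructure.stores.models import PaperFeedbackModel, PaperModel

ranked = (
    select(
        PaperModel,
        PaperFeedbackModel,
        func.row_number()
        .over(partition_by=PaperModel.id, order_by=PaperFeedbackModel.ts.desc())
        .label("rn"),
    )
    .join(
        PaperFeedbackModel,
        or_(
            PaperModel.id == cast(PaperFeedbackModel.paper_id, Integer),
            PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
        ),
    )
    .subquery()
)

latest_per_paper = select(ranked).where(ranked.c.rn == 1)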

Comment on lines +403 to +411
if (action === "save" && paper) {
body.paper_title = paper.title
body.paper_abstract = paper.abstract || ""
body.paper_authors = paper.authors || []
body.paper_year = paper.year
body.paper_venue = paper.venue
body.paper_citation_count = paper.citation_count
body.paper_url = paper.url
}


medium

The if (action === "save" && paper) block conditionally adds paper metadata to the request body. This is a good improvement for the new save functionality. However, the paper_id sent in the sendFeedback function is p.paper_id, which is a string (e.g., Semantic Scholar ID). The backend's add_paper_feedback expects paper_id to be an integer if it's an internal ID, or a string if it's an external ID. The PaperStore's get_user_library handles both, but the save_paper_to_library in harvest.py explicitly converts paper_id to str(paper_id) before passing it to record_paper_feedback. This inconsistency in paper_id types (string vs. integer) across frontend and backend, and within backend logic, could lead to confusion or subtle bugs. It's best to standardize the paper_id type for feedback actions.

Development

Successfully merging this pull request may close these issues.

[Feature] Paper Search and Storage
