feat(Harvest): add -- Paper Search and Storage #27
Conversation
Closes jerry609#26 Signed-off-by: LIU BOYU <oor2020@163.com>
📝 Walkthrough

Implements a complete paper harvesting system with multiple sources (arXiv, Semantic Scholar, OpenAlex), including pipeline orchestration, deduplication, a storage layer, and REST API endpoints for searching, saving, and managing papers with harvest run tracking.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant API
participant Pipeline
participant Harvesters
participant Deduplicator
participant PaperStore
participant DB
Client->>API: POST /harvest (HarvestRequest)
activate API
API->>Pipeline: run(config)
activate Pipeline
Pipeline->>Pipeline: Expand keywords (optional)
Pipeline->>Pipeline: Recommend venues (optional)
Pipeline->>PaperStore: create_harvest_run()
activate PaperStore
PaperStore->>DB: INSERT harvest_runs
PaperStore-->>Pipeline: HarvestRunModel
deactivate PaperStore
Pipeline->>API: yield HarvestProgress(phase="Initializing")
API->>Client: Event: progress
par Multi-source Harvest
Pipeline->>Harvesters: search(query) via ArxivHarvester
activate Harvesters
Harvesters-->>Pipeline: HarvestResult[papers]
deactivate Harvesters
Pipeline->>Harvesters: search(query) via SemanticScholarHarvester
activate Harvesters
Harvesters-->>Pipeline: HarvestResult[papers]
deactivate Harvesters
Pipeline->>Harvesters: search(query) via OpenAlexHarvester
activate Harvesters
Harvesters-->>Pipeline: HarvestResult[papers]
deactivate Harvesters
end
Pipeline->>API: yield HarvestProgress(phase="Deduplicating")
API->>Client: Event: progress
Pipeline->>Deduplicator: deduplicate(all_papers)
activate Deduplicator
Deduplicator-->>Pipeline: (deduplicated_papers, count)
deactivate Deduplicator
Pipeline->>API: yield HarvestProgress(phase="Storing")
API->>Client: Event: progress
Pipeline->>PaperStore: upsert_papers_batch(deduplicated)
activate PaperStore
PaperStore->>DB: INSERT/UPDATE papers
PaperStore-->>Pipeline: (papers_created, papers_updated)
deactivate PaperStore
Pipeline->>PaperStore: update_harvest_run(final_status, counts)
activate PaperStore
PaperStore->>DB: UPDATE harvest_runs
PaperStore-->>Pipeline: HarvestRunModel
deactivate PaperStore
Pipeline->>API: yield HarvestFinalResult(status, counts, duration)
deactivate Pipeline
API->>Client: Event: final_result
deactivate API
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

This change introduces a substantial, multi-component harvesting system with interconnected services, heterogeneous logic across harvesters, deduplication strategies, API workflows, and frontend integration. Multiple new files with varying complexity (domain models, service implementations, orchestration, storage, API routes, tests) require separate reasoning paths and careful validation of integration points.
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/src/lib/api.ts (1)
187-189: ⚠️ Potential issue | 🟡 Minor
`fetchScholarDetails` now implicitly depends on a live backend.
`fetchScholarDetails` calls `fetchPapers()` (Line 188), which now hits the real API instead of returning mock data. This means the scholar details page will fail or show empty publications if the backend is unavailable, even though all other data in this function is still mocked.
🤖 Fix all issues with AI agents
In `@alembic/versions/0003_paper_harvest_tables.py`:
- Around line 54-59: The upgrade() function returns early when _is_offline() is
true, so _upgrade_create_indexes() is never called and generated offline SQL
omits index creation; remove the early return or restructure upgrade() so it
always calls _upgrade_create_indexes() after _upgrade_create_tables() (rely on
_create_index/op.create_index which already handles offline mode) — modify
upgrade() to invoke _upgrade_create_tables() then _upgrade_create_indexes()
regardless of _is_offline() so indexes are emitted in offline-generated SQL.
In `@src/paperbot/api/routes/harvest.py`:
- Around line 323-331: get_user_library currently accepts user_id from the query
string (user_id param in get_user_library) which allows callers to access/modify
arbitrary users' libraries; change this to derive the user ID from the
authenticated session/token (e.g., replace the user_id Query param with a
dependency that extracts the current_user/current_user.id from your auth
middleware or token validation function used elsewhere) and remove the default
"default" value; apply the same change to the other endpoint that accepts
user_id in the request body (lines referenced) so both endpoints use the
authenticated user context; if no authentication exists for this MVP, add a
clear TODO comment inside get_user_library and the other endpoint stating that
user_id is temporary and must be replaced with auth-derived user identification
before production.
- Around line 390-414: save_paper_to_library creates a new
SqlAlchemyResearchStore() per request and never closes it, leaking DB resources;
change this to reuse a shared/lazy-initialized research store or inject one
instead of instantiating inline: add a module-level lazy getter (similar to
_get_paper_store) or accept a dependency and use that shared instance, then call
the store's close/dispose method if it has one after record_paper_feedback (or
rely on the shared instance's lifecycle); update references in
save_paper_to_library and any other handlers that instantiate
SqlAlchemyResearchStore() so they use the new get_research_store() or injected
research_store and remove the per-request constructor.
- Line 52: The Field declaration for keywords in HarvestRequest uses the
deprecated Pydantic v1 arg min_items which causes a TypeError; update the Field
call on the keywords attribute (keywords: List[str] = Field(..., min_items=1,
description="Search keywords")) to use min_length=1 instead (i.e., replace
min_items with min_length) so Pydantic v2 accepts the constraint. Ensure the
change is made on the keywords field in the HarvestRequest/schema where Field is
used.
In `@src/paperbot/api/routes/research.py`:
- Around line 693-706: The feedback currently records mixed IDs because
actual_paper_id is replaced with the internal DB ID on save; change the call to
_research_store.add_paper_feedback to always use the original external paper id
from the request (e.g., req.paper_id or a saved variable representing the
external id) as the paper_id parameter, and move the internal library_paper_id
(library_paper_id) into the metadata argument (meta) before calling
add_paper_feedback; keep returning PaperFeedbackResponse(feedback=fb,
library_paper_id=library_paper_id) so the client still sees the internal id
while the feedback table consistently stores the external identifier.
- Around line 677-688: The route handler directly imports PaperModel and uses
paper_store._provider.session() with a SQLAlchemy select
(PaperModel.semantic_scholar_id ...) which breaks the layering; add or use a
PaperStore method like get_paper_by_source_id(source, source_id) on the existing
paper_store to encapsulate the lookup, remove the direct ORM import and session
usage in the handler, call
paper_store.get_paper_by_source_id("semantic_scholar", req.paper_id) to obtain
the Paper entity (then extract id/actual_paper_id from that returned
domain/model), and update tests or callers accordingly. (A sketch of such a lookup method follows this file's items.)
- Around line 652-691: The save-to-library flow hardcodes
HarvestSource.SEMANTIC_SCHOLAR and constructs a new PaperStore per request;
update PaperFeedbackRequest to include an optional paper_source and use that
when building the HarvestedPaper (falling back to
HarvestSource.SEMANTIC_SCHOLAR), and replace the per-request PaperStore()
instantiation by using a module-scoped or lazy singleton store (e.g.,
_research_store) so this handler reuses the same PaperStore instance when
calling upsert_papers_batch and accessing _provider.session().
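A minimal sketch of the store-level lookup suggested in the second item above; the `PaperModel` columns, `_provider.session()` helper, and soft-delete column are assumptions taken from elsewhere in this review, and the method/mapping names are illustrative rather than code from the PR:

```python
# Sketch only: a lookup method that would live on the existing PaperStore class,
# so route handlers never import the ORM model or open sessions directly.
from typing import Optional

from sqlalchemy import select

_SOURCE_COLUMNS = {
    "semantic_scholar": "semantic_scholar_id",
    "arxiv": "arxiv_id",
    "openalex": "openalex_id",
    "doi": "doi",
}


def get_paper_by_source_id(self, source: str, source_id: str) -> Optional["PaperModel"]:
    """Return the stored paper matching an external identifier, or None."""
    column_name = _SOURCE_COLUMNS.get(source)
    if column_name is None:
        raise ValueError(f"Unsupported source: {source}")
    column = getattr(PaperModel, column_name)  # whitelist-keyed, not user-supplied
    with self._provider.session() as session:
        return session.execute(
            select(PaperModel).where(column == source_id)
        ).scalar_one_or_none()
```

The route handler would then call something like `paper_store.get_paper_by_source_id("semantic_scholar", req.paper_id)` and read the ID off the returned object, keeping ORM details inside the store layer.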
In `@src/paperbot/context_engine/engine.py`:
- Line 506: The Logger.info call on the Paper search config line exceeds the
100-char limit; modify the call in engine.py (the Logger.info invocation
referencing self.config.offline and self.config.paper_limit and
LogFiles.HARVEST) to wrap or split the formatted message so each line stays
under 100 characters (for example build the message into a shorter variable or
use multiple string parts concatenated across lines) without changing the logged
content or the LogFiles.HARVEST target.
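For illustration only, one way to stay under the limit is to build the message before logging it; the field names below are placeholders, since the exact text of the original `Logger.info` call is not shown in this review:

```python
# Hypothetical sketch — the config fields are stand-ins for whatever the original line logs.
msg = (
    f"Paper search config: offline={self.config.offline}, "
    f"paper_limit={self.config.paper_limit}"
)
Logger.info(msg, file=LogFiles.HARVEST)
```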
In `@src/paperbot/domain/harvest.py`:
- Around line 87-111: from_dict can raise ValueError when constructing
HarvestSource from an empty/invalid string; wrap the conversion in a safe parse
(in from_dict) by attempting HarvestSource(source) in a try/except and falling
back to a safe value (e.g., HarvestSource.UNKNOWN if you add that enum member,
or None/default enum) so deserialization never throws; update the from_dict
logic around the source handling and, if adding HarvestSource.UNKNOWN, add that
enum member to HarvestSource so the fallback is available.
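A minimal sketch of the tolerant parse described above, assuming an `UNKNOWN` member is added to `HarvestSource`; both the helper name and the fallback member are assumptions, not code from the PR:

```python
# Sketch: never let deserialization raise on an empty or unrecognized source string.
def _parse_source(value: object) -> "HarvestSource":
    try:
        return HarvestSource(value)
    except ValueError:
        return HarvestSource.UNKNOWN  # assumed new enum member used as a safe fallback
```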
In `@src/paperbot/infrastructure/harvesters/openalex_harvester.py`:
- Around line 42-45: Add an HTTP timeout to prevent indefinite hangs by creating
the aiohttp session with a ClientTimeout in _get_session (use
aiohttp.ClientTimeout with appropriate total/connect/read values) instead of
a bare aiohttp.ClientSession, and ensure individual calls that use
session.get(...) also pass a timeout (or rely on the session default) so
requests to the OpenAlex API cannot block forever; apply the same changes to the
ArxivHarvester equivalent methods.
In `@src/paperbot/infrastructure/stores/paper_store.py`:
- Around line 318-331: The join currently casts PaperFeedbackModel.paper_id to
Integer (in base_stmt) which will raise on PostgreSQL for non-numeric strings;
replace the direct cast with a safe conditional expression (e.g., build a
case/boolean expression that first checks
PaperFeedbackModel.paper_id.op("~")(r"^\d+$") then compares
cast(PaperFeedbackModel.paper_id, Integer) only when the regex matches) and use
that safe_int_match in the or_ alongside PaperModel.semantic_scholar_id ==
PaperFeedbackModel.paper_id; update the join condition in base_stmt (where
cast(...) is referenced) to use the safe case/regex approach so non-numeric
paper_id values don’t trigger a CAST error (alternatively, normalize paper_id at
write time to always store numeric IDs).
- Around line 244-251: The current falsy checks (if year_from:, if year_to:, if
min_citations:) skip valid zero values; change them to explicit None checks
(e.g., use "is not None") so filters on PaperModel.year and
PaperModel.citation_count are applied when year_from, year_to or min_citations
equal 0. Update the conditionals around the stmt.where calls that reference
PaperModel.year and PaperModel.citation_count (the blocks using year_from,
year_to, min_citations) to use "is not None" checks to ensure zero is treated as
a valid filter value.
- Around line 209-278: The search_papers method currently ignores the keywords
parameter and unsafely uses getattr(PaperModel, sort_by...), so remove or
implement keywords filtering and restrict sort_by to a whitelist: update
search_papers to either apply keywords (e.g., build OR ilike conditions on
PaperModel.title/abstract/keywords) or delete the unused keywords argument and
its callers, and replace the dynamic getattr with a mapping/whitelist of allowed
sort keys (e.g., {"citation_count": PaperModel.citation_count, "year":
PaperModel.year, "title": PaperModel.title}) then look up sort_col =
ALLOWED_SORT_COLUMNS.get(sort_by, PaperModel.citation_count) before ordering;
reference function search_papers, parameter keywords, parameter sort_by, and
model PaperModel when making the changes.
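For the first item above (the join cast), a CASE-guarded condition along the suggested lines might look like the following sketch. The model and column names are taken from this review, and the `~` regex operator is PostgreSQL-specific, so SQLite deployments would still need a dialect check or write-time normalization:

```python
# Sketch of a cast-safe join condition; PaperModel / PaperFeedbackModel are the
# ORM models referenced in this review, not redefined here.
from sqlalchemy import Integer, case, cast, or_

is_numeric = PaperFeedbackModel.paper_id.op("~")(r"^\d+$")  # PostgreSQL regex match

join_condition = or_(
    PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
    case(
        # Only attempt the integer cast when paper_id looks purely numeric.
        (is_numeric, cast(PaperFeedbackModel.paper_id, Integer) == PaperModel.id),
        else_=False,
    ),
)
```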
In `@web/src/app/api/papers/[paperId]/save/route.ts`:
- Around line 7-9: The paperId is being interpolated directly into the upstream
URL in route.ts (upstream and apiBaseUrl()) and in the POST handler, allowing
path-traversal style values; sanitize or encode paperId before use (e.g.,
validate it against an allowed pattern like /^[A-Za-z0-9_-]+$/ or otherwise
reject invalid values, or use a safe encoder such as encodeURIComponent on the
paperId) and then use that sanitized variable for building upstream and any
other requests in the POST handler to prevent crafted path segments from
escaping the intended route.
🟡 Minor comments (9)
web/src/app/api/papers/[paperId]/save/route.ts-13-19 (1)
13-19: ⚠️ Potential issue | 🟡 Minor — Inconsistent query string forwarding between DELETE and POST.
`DELETE` forwards `url.search` but `POST` does not. If this is intentional, a brief comment would clarify the design choice. If not, the POST handler should also forward the query string for consistency.

src/paperbot/domain/harvest.py-57-62 (1)
57-62: ⚠️ Potential issue | 🟡 Minor — Title hash collision on empty/whitespace-only titles.
`compute_title_hash()` on a paper with an empty title will produce the same hash as any other empty-titled paper, causing false deduplication matches. Consider guarding against this in the deduplicator or returning `None`/a sentinel for empty titles.

tests/unit/test_harvested_paper.py-95-104 (1)
95-104: ⚠️ Potential issue | 🟡 Minor — `paper3` is assigned but never used; either assert on it or remove it.
Ruff flags `paper3` (Line 99) as unused (F841). The comment on Lines 103–104 acknowledges the hyphen case may differ, but the test neither asserts equality nor inequality for `paper3`. Either add an explicit assertion (e.g., `assert paper1.compute_title_hash() != paper3.compute_title_hash()` to document the known divergence) or remove `paper3`.
Proposed fix — assert the known divergence
  # All should have same hash after removing punctuation
  assert paper1.compute_title_hash() == paper2.compute_title_hash()
- # Note: hyphens are removed, making it "deeplearning" vs "deep learning"
- # This might differ, which is intentional for similar titles
+ # Hyphen removal joins words: "Deep-Learning?" → "deeplearning" vs "deep learning"
+ assert paper1.compute_title_hash() != paper3.compute_title_hash()

src/paperbot/application/services/venue_recommender.py-133-142 (1)
133-142: ⚠️ Potential issue | 🟡 Minor — Exact matches are double-counted by the partial-match loop.
When `keyword_lower` matches a key exactly, it scores +3 from the exact-match branch (Line 136), then the partial-match loop at Line 140 also fires because `keyword_lower in mapped_kw` is `True` when they are equal. This adds an extra +1 per venue for every exact match.
If the intent is that exact matches should score strictly +3, guard the partial loop:
Proposed fix
  # Exact match (highest priority)
  if keyword_lower in self.mappings:
      for venue in self.mappings[keyword_lower]:
          venue_scores[venue] = venue_scores.get(venue, 0) + 3

  # Partial match (medium priority)
  for mapped_kw, venues in self.mappings.items():
-     if keyword_lower in mapped_kw or mapped_kw in keyword_lower:
+     if mapped_kw == keyword_lower:
+         continue  # already scored as exact match
+     if keyword_lower in mapped_kw or mapped_kw in keyword_lower:
          for venue in venues:
              venue_scores[venue] = venue_scores.get(venue, 0) + 1

src/paperbot/api/routes/research.py-617-624 (1)
617-624: ⚠️ Potential issue | 🟡 Minor — Missing tests for the new save-to-library flow.
The `add_paper_feedback` endpoint gained significant new behavior (paper save, library lookup, ID mapping), but no corresponding tests were added. As per coding guidelines, behavior changes should include test updates.
As per coding guidelines: `{src,web/src}/**/*.{py,ts,tsx}`: If behavior changes, add or update tests.

src/paperbot/infrastructure/stores/paper_store.py-95-133 (1)
95-133: ⚠️ Potential issue | 🟡 Minor — `_find_existing` doesn't filter out soft-deleted papers.
The lookup queries don't include `PaperModel.deleted_at.is_(None)`, so a soft-deleted paper will be found and updated rather than creating a new record. If soft-delete is meant to logically remove a paper, re-harvesting should probably create a fresh record or un-delete it explicitly.

src/paperbot/infrastructure/stores/paper_store.py-233-241 (1)
233-241: ⚠️ Potential issue | 🟡 Minor — LIKE pattern does not escape user-supplied wildcards.
The `query` value is interpolated directly into the `%{query}%` pattern. If a user searches for `100%`, it becomes `%100%%`, which alters the matching semantics. While this isn't a SQL injection (parameters are bound), it can produce unexpected results.
Consider escaping `%` and `_` in the user's query string before wrapping with wildcards.

src/paperbot/api/routes/harvest.py-118-118 (1)
118-118: ⚠️ Potential issue | 🟡 Minor — Remove unused `trace_id` assignment.
The return value from `set_trace_id()` is captured but never used.
🧹 Fix
- trace_id = set_trace_id()
+ set_trace_id()

src/paperbot/application/workflows/harvest_pipeline.py-314-335 (1)
314-335: ⚠️ Potential issue | 🟡 Minor — Pipeline-level error handler may fail if the harvest run was never created.
If an exception occurs before `create_harvest_run` on Line 208 (e.g., during keyword expansion on Line 187), the `update_harvest_run` call on Line 317 will silently fail because the `run_id` doesn't exist in the DB yet. The error won't be recorded.
Consider guarding this with a flag or a try/except around `update_harvest_run`.
🛡️ Suggested guard
+ run_created = False
  try:
      # ...
      self.paper_store.create_harvest_run(...)
+     run_created = True
      # ...
  except Exception as e:
      logger.exception(f"Harvest pipeline failed: {e}")
-     self.paper_store.update_harvest_run(
-         run_id=run_id,
-         status="failed",
-         errors={"pipeline": str(e)},
-     )
+     if run_created:
+         self.paper_store.update_harvest_run(
+             run_id=run_id,
+             status="failed",
+             errors={"pipeline": str(e)},
+         )
🧹 Nitpick comments (32)
src/paperbot/infrastructure/stores/research_store.py (1)
329-356: Log messages lack contextual identifiers, reducing observability value.
These log statements don't include `user_id`, `track_id`, or `paper_id`, making them hard to correlate in a multi-user system. Consider including key identifiers:
♻️ Suggested improvement
- Logger.info("Recording paper feedback", file=LogFiles.HARVEST)
+ Logger.info(f"Recording paper feedback user={user_id} track={track_id} paper={paper_id}", file=LogFiles.HARVEST)
  now = _utcnow()
  with self._provider.session() as session:
      ...
      if track is None:
-         Logger.error("Track not found", file=LogFiles.HARVEST)
+         Logger.warning(f"Track not found: track_id={track_id} user={user_id}", file=LogFiles.HARVEST)
          return None
-     Logger.info("Creating new feedback record", file=LogFiles.HARVEST)
+     Logger.info(f"Creating feedback record: paper={paper_id} action={action}", file=LogFiles.HARVEST)
      ...
-     Logger.info("Feedback record created successfully", file=LogFiles.HARVEST)
+     Logger.info(f"Feedback recorded: id={row.id} paper={paper_id}", file=LogFiles.HARVEST)
Also, Line 338: "Track not found" is a normal client-input condition (not a system error) — `warning` level would be more appropriate than `error`.

src/paperbot/application/services/__init__.py (1)
6-12: `__all__` is not sorted (RUF022).
Static analysis flags that `__all__` entries should be sorted alphabetically.
♻️ Proposed fix
  __all__ = [
      "LLMService",
+     "PaperDeduplicator",
+     "QueryRewriter",
+     "VenueRecommender",
      "get_llm_service",
-     "PaperDeduplicator",
-     "QueryRewriter",
-     "VenueRecommender",
  ]

src/paperbot/infrastructure/harvesters/__init__.py (1)
9-17: Sort imports and `__all__` alphabetically.
Both the imports and `__all__` entries are out of alphabetical order. The static analysis tool (Ruff RUF022) flags `__all__` as unsorted, and the project's coding guidelines require isort with Black profile.
♻️ Proposed fix
-from .arxiv_harvester import ArxivHarvester
-from .semantic_scholar_harvester import SemanticScholarHarvester
-from .openalex_harvester import OpenAlexHarvester
+from .arxiv_harvester import ArxivHarvester
+from .openalex_harvester import OpenAlexHarvester
+from .semantic_scholar_harvester import SemanticScholarHarvester
 __all__ = [
     "ArxivHarvester",
-    "SemanticScholarHarvester",
     "OpenAlexHarvester",
+    "SemanticScholarHarvester",
 ]
As per coding guidelines, "Use isort with Black profile for Python import sorting".
tests/integration/test_harvesters.py (1)
436-478: Resource cleanup: harvester instances created in interface tests are never closed.
Each test in `TestHarvesterInterface` instantiates all three harvesters but never calls `close()`. While no real HTTP sessions are opened (only `hasattr`/`inspect` checks are performed), this could become a problem if harvester constructors ever acquire resources eagerly. Consider adding cleanup or using a fixture with teardown.

tests/integration/test_paper_store.py (1)
238-242: Unused unpacked variables flagged by linter.
Ruff flags `total` (Lines 240, 265, 281) and `all_papers` (Line 265) as unused. Use `_` for discarded tuple elements to satisfy the linter.
Example fix (applied to all occurrences)
- papers, total = self.store.search_papers(sources=["arxiv"])
+ papers, _ = self.store.search_papers(sources=["arxiv"])

- all_papers, total = self.store.search_papers(limit=100)
+ _all_papers, _total = self.store.search_papers(limit=100)

- papers, total = self.store.search_papers(
+ papers, _ = self.store.search_papers(

src/paperbot/infrastructure/harvesters/openalex_harvester.py (1)
47-55: Move `import time` to the module level.
`import time` on Line 49 is executed on every `_rate_limit()` call. While Python caches module imports, the lookup overhead is unnecessary. Move it to the top of the file with other stdlib imports.
Proposed fix
At the top of the file (after `import asyncio`):
  import asyncio
  import logging
+ import time
  from typing import Any, Dict, List, Optional
Then in the method:
async def _rate_limit(self) -> None: """Enforce rate limiting between requests.""" - import time - now = time.time()src/paperbot/application/services/venue_recommender.py (1)
25-81: Annotate `DEFAULT_MAPPINGS` with `ClassVar`.
Static analysis (RUF012) flags this mutable class attribute. While `.copy()` in `__init__` prevents accidental cross-instance mutation, adding the `ClassVar` annotation makes the intent explicit.
Proposed fix
-from typing import Dict, List, Optional
+from typing import ClassVar, Dict, List, Optional
 ...
-    DEFAULT_MAPPINGS: Dict[str, List[str]] = {
+    DEFAULT_MAPPINGS: ClassVar[Dict[str, List[str]]] = {

tests/unit/test_query_rewriter.py (2)
51-57: Assertion is overly lenient — both the original and expansion should be present.
`expand_all(["ML", "deep learning"])` should include both `"ML"` (original) and `"machine learning"` (expansion). The `or` lets the test pass even if the expansion is missing entirely. Consider asserting both:
- assert "ML" in expanded or "machine learning" in expanded
+ assert "ML" in expanded
+ assert "machine learning" in expanded

92-96: Line 94 likely exceeds the 100-character line limit.
Consider splitting across multiple lines.
Proposed fix
- known_abbrevs = ["llm", "ml", "dl", "nlp", "cv", "rl", "gan", "cnn", "rnn", "bert", "gpt", "rag"]
+ known_abbrevs = [
+     "llm", "ml", "dl", "nlp", "cv", "rl",
+     "gan", "cnn", "rnn", "bert", "gpt", "rag",
+ ]
As per coding guidelines, "Use Black formatting with `line-length = 100` for Python code".

src/paperbot/application/services/query_rewriter.py (1)
28-68: Annotate `DEFAULT_ABBREVIATIONS` with `ClassVar`.
Same as the `VenueRecommender` class — static analysis (RUF012) flags this mutable class attribute. Adding `ClassVar` makes the intent explicit.
Proposed fix
-from typing import Dict, List, Optional
+from typing import ClassVar, Dict, List, Optional
 ...
-    DEFAULT_ABBREVIATIONS: Dict[str, str] = {
+    DEFAULT_ABBREVIATIONS: ClassVar[Dict[str, str]] = {

src/paperbot/application/services/paper_deduplicator.py (2)
183-186: `_find_index` returns `-1` as a sentinel — this could silently corrupt index mappings.
If `_find_index` ever fails to find the paper (e.g., due to a future code change), `-1` would be stored in the identifier indexes, pointing at the last element in `unique_papers` rather than the intended paper. Consider raising an error or using a safer pattern:
Proposed fix
 def _find_index(self, paper: HarvestedPaper) -> int:
     """Find the index of a paper in the title hash index."""
     title_hash = paper.compute_title_hash()
-    return self._title_hash_index.get(title_hash, -1)
+    idx = self._title_hash_index.get(title_hash)
+    if idx is None:
+        raise KeyError(f"Paper not found in title hash index: {paper.title!r}")
+    return idx

179-181: `list(set(...))` produces non-deterministic ordering for keywords and fields_of_study.
This won't cause bugs, but it makes deduplication results non-deterministic across runs (set iteration order varies). If stable output matters for tests or downstream consumers, consider using `dict.fromkeys()` to preserve insertion order:
- existing.keywords = list(set(existing.keywords + new.keywords))
- existing.fields_of_study = list(set(existing.fields_of_study + new.fields_of_study))
+ existing.keywords = list(dict.fromkeys(existing.keywords + new.keywords))
+ existing.fields_of_study = list(dict.fromkeys(existing.fields_of_study + new.fields_of_study))

alembic/versions/0003_paper_harvest_tables.py (2)
92-94: `created_at` and `updated_at` lack server defaults — NULLs are likely if application code doesn't set them.
Consider adding `server_default=sa.func.now()` (or `sa.text("CURRENT_TIMESTAMP")`) so that rows always have a creation timestamp, even if the application layer misses setting one.
Proposed fix
- sa.Column("created_at", sa.DateTime(timezone=True), nullable=True),
- sa.Column("updated_at", sa.DateTime(timezone=True), nullable=True),
+ sa.Column("created_at", sa.DateTime(timezone=True), server_default=sa.text("CURRENT_TIMESTAMP"), nullable=False),
+ sa.Column("updated_at", sa.DateTime(timezone=True), server_default=sa.text("CURRENT_TIMESTAMP"), nullable=False),

62-117: No unique constraints on deduplication identifier columns (doi, arxiv_id, etc.).
The `papers` table relies entirely on the application layer for deduplication. If multiple processes or API calls race, duplicate rows can be inserted. Consider adding unique indexes (at least on `doi` and `title_hash`) to enforce deduplication at the database level. Partial unique indexes (e.g., `CREATE UNIQUE INDEX ... WHERE doi IS NOT NULL`) would handle the nullable columns.

src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py (2)
28-41: Annotate `FIELDS` with `ClassVar` to satisfy the mutable class-attribute lint.
Ruff RUF012 flags mutable class attributes that aren't annotated with `ClassVar`. A one-line fix silences the warning and communicates intent.
Proposed fix
+from typing import Any, ClassVar, Dict, List, Optional
-from typing import Any, Dict, List, Optional
 ...
-    FIELDS = [
+    FIELDS: ClassVar[List[str]] = [

50-101: Search method is well-structured; one subtle note on year-filter truthiness.
Lines 63-68: `if year_from and year_to` uses truthiness, so `year_from=0` would be skipped. Since publication years are always positive this is fine in practice, but using `is not None` would be more precise and consistent with how the other harvesters (arXiv) might handle it.
Overall the search→filter→return flow, the cap at 100, and the graceful error fallback look good.
More explicit year checks
- if year_from and year_to: + if year_from is not None and year_to is not None: year_filter = f" year:{year_from}-{year_to}" - elif year_from: + elif year_from is not None: year_filter = f" year:{year_from}-" - elif year_to: + elif year_to is not None: year_filter = f" year:-{year_to}"tests/integration/test_harvest_pipeline.py (2)
94-105: Pipeline fixture lacks teardown — potential resource leak across tests.
The `pipeline` fixture creates a `HarvestPipeline` with a SQLite DB but never calls `close()`. The error-handling tests (Lines 338, 380, 417) close their locally-created pipelines, but the fixture-based tests don't. Consider adding a yield-based cleanup.
Proposed fix
`@pytest.fixture` -def pipeline(tmp_path, mock_harvesters): +async def pipeline(tmp_path, mock_harvesters): """Create HarvestPipeline with mocked dependencies.""" db_url = f"sqlite:///{tmp_path / 'test_harvest.db'}" pipeline = HarvestPipeline(db_url=db_url) # Inject mock harvesters def get_mock_harvester(source): return mock_harvesters.get(source) pipeline._get_harvester = get_mock_harvester - return pipeline + yield pipeline + await pipeline.close()
225-226: Rename unused loop variable `item` to `_`.
Ruff B007 flags this. Minor fix for lint compliance.
Proposed fix
- async for item in pipeline.run(config):
-     pass  # Just run to completion
+ async for _ in pipeline.run(config):
+     pass  # Just run to completion
Also applies to: 267-268
web/src/components/research/SavedPapersList.tsx (1)
99-104: Consider extracting the HTML error-response handling into a shared helper.
The same pattern for detecting and sanitizing HTML error responses is duplicated across `loadSavedPapers`, `unsavePaper`, and `toggleReadStatus`. A small utility would reduce repetition.
Example extraction
async function extractErrorMessage(res: Response): Promise<string> {
  const text = await res.text()
  if (text.startsWith("<!DOCTYPE") || text.startsWith("<html")) {
    return `Server error: ${res.status} ${res.statusText}`
  }
  return text
}
Also applies to: 140-146, 176-181
src/paperbot/infrastructure/harvesters/arxiv_harvester.py (3)
133-162: Version suffix stripping could incorrectly split IDs containing "v" in unusual positions.
Lines 140-141: `arxiv_id.split("v")[0]` splits on the first lowercase `"v"` character, which works for standard arXiv IDs but is fragile. Using a regex to strip only a trailing version suffix is more robust.
Proposed fix
+import re ... # Remove version suffix (e.g., "2301.12345v1" -> "2301.12345") - if "v" in arxiv_id: - arxiv_id = arxiv_id.split("v")[0] + arxiv_id = re.sub(r"v\d+$", "", arxiv_id)
58-74: Same year-filter truthiness note as the S2 harvester.
Line 69: `if year_from or year_to` will skip `year_from=0`. Consistent with the suggestion on the S2 harvester, using `is not None` would be more precise.
Proposed fix
- if year_from or year_to:
+ if year_from is not None or year_to is not None:
      start_date = f"{year_from}0101" if year_from else "199101"
      end_date = f"{year_to}1231" if year_to else "209912"
Note: the inner ternaries on lines 70-71 would also need updating to use `is not None` for full consistency.
48-56: Move `import time` to the top of the file.
The `time` module is imported inside `_rate_limit()` on every call. It's a stdlib module with negligible import cost — it should be at module level.
Proposed fix
import asyncio import logging +import time from typing import List, Optional ... async def _rate_limit(self) -> None: """Enforce rate limiting between requests.""" - import time - now = time.time()src/paperbot/application/workflows/harvest_pipeline.py (3)
216-261: Harvesting is sequential, not parallel as documented.
The comment on Line 216 says "Harvest from each source in parallel", but the `for` loop on Line 223 awaits each harvester one at a time. This means sources are harvested sequentially, which may significantly impact latency with multiple sources.
Consider using `asyncio.gather` (or `asyncio.TaskGroup`) for true concurrency. If you intentionally want sequential execution (e.g., to avoid rate-limiting), update the comment to match.
♻️ Sketch for parallel harvesting
- # Phase 4: Harvest from each source in parallel - all_papers: List[HarvestedPaper] = [] - - # Build search query from expanded keywords - search_query = " ".join(expanded_keywords) - - # Harvest from each source - for source in sources: - yield HarvestProgress( - phase="Harvesting", - message=f"Fetching from {source}...", - details={"source": source}, - ) - - harvester = self._get_harvester(source) - if harvester is None: - errors[source] = f"Unknown source: {source}" - source_results[source] = {"papers": 0, "error": errors[source]} - continue - - try: - result = await harvester.search( - query=search_query, - max_results=config.max_results_per_source, - year_from=config.year_from, - year_to=config.year_to, - venues=venues, - ) - - all_papers.extend(result.papers) - source_results[source] = { - "papers": result.total_found, - "error": result.error, - } - - if result.error: - errors[source] = result.error - logger.warning(f"Error from {source}: {result.error}") - else: - logger.info(f"Harvested {result.total_found} papers from {source}") - - except Exception as e: - error_msg = str(e) - errors[source] = error_msg - source_results[source] = {"papers": 0, "error": error_msg} - logger.exception(f"Exception harvesting from {source}") + # Phase 4: Harvest from each source + all_papers: List[HarvestedPaper] = [] + search_query = " ".join(expanded_keywords) + + yield HarvestProgress( + phase="Harvesting", + message=f"Fetching from {len(sources)} source(s)...", + ) + + async def _harvest_one(source: str): + harvester = self._get_harvester(source) + if harvester is None: + return source, None, f"Unknown source: {source}" + try: + result = await harvester.search( + query=search_query, + max_results=config.max_results_per_source, + year_from=config.year_from, + year_to=config.year_to, + venues=venues, + ) + return source, result, None + except Exception as e: + return source, None, str(e) + + tasks = [_harvest_one(s) for s in sources] + results = await asyncio.gather(*tasks) + + for source, result, err in results: + if err: + errors[source] = err + source_results[source] = {"papers": 0, "error": err} + logger.warning(f"Error from {source}: {err}") + else: + all_papers.extend(result.papers) + source_results[source] = { + "papers": result.total_found, + "error": result.error, + } + if result.error: + errors[source] = result.error + else: + logger.info( + f"Harvested {result.total_found} papers from {source}" + )
337-355: `run_sync` is an `async` method — the name is misleading.
The method is declared `async def run_sync(...)` but "sync" typically implies it blocks without requiring `await`. Callers still need `await pipeline.run_sync(...)`. Consider renaming to `run_to_completion` or `run_collect` to better express the intent (consume the generator and return the final result).
357-370: Silently swallowing exceptions during cleanup.
Lines 361-364 catch all exceptions with a bare `pass`. If a harvester's `close()` fails, the error is completely lost, making debugging difficult.
🛡️ Proposed fix: log instead of swallowing
for harvester in self._harvesters.values(): try: await harvester.close() - except Exception: - pass + except Exception: + logger.debug("Error closing harvester", exc_info=True)src/paperbot/api/routes/harvest.py (3)
163-181: Duplicated `HarvestRunResponse` construction logic.
The conversion from `HarvestRunModel` to `HarvestRunResponse` is repeated verbatim between `list_harvest_runs` and `get_harvest_run`. Extract it into a helper to keep things DRY.
♻️ Suggested helper
def _run_to_response(run: "HarvestRunModel") -> HarvestRunResponse: return HarvestRunResponse( run_id=run.run_id, keywords=run.get_keywords(), venues=run.get_venues(), sources=run.get_sources(), max_results_per_source=run.max_results_per_source or 50, status=run.status or "unknown", papers_found=run.papers_found or 0, papers_new=run.papers_new or 0, papers_deduplicated=run.papers_deduplicated or 0, errors=run.get_errors(), started_at=run.started_at.isoformat() if run.started_at else None, ended_at=run.ended_at.isoformat() if run.ended_at else None, )Also applies to: 193-206
230-252: `PaperResponse` model is defined but never used.
No endpoint uses `PaperResponse` as a `response_model`. `PaperSearchResponse.papers` is typed as `List[Dict[str, Any]]` and `get_paper` returns a raw dict. Consider either using `PaperResponse` as the response model (for schema documentation and validation) or removing it to avoid dead code.
264-290: `search_papers` uses POST for a query-like operation.
Using POST for a search endpoint means results are not cacheable and the operation isn't idempotent from an HTTP semantics perspective. GET with query parameters is more conventional for search. If the reason for POST is the complexity of the filter body, this is a common pragmatic choice — but worth a brief comment in the docstring for clarity.
src/paperbot/infrastructure/stores/paper_store.py (4)
59-93: N+1 query pattern in `upsert_papers_batch` — up to 5 SELECTs per paper.
`_find_existing` issues up to 5 sequential queries per paper (doi, arxiv_id, semantic_scholar_id, openalex_id, title_hash). For a batch of 200 papers, this could mean ~1000 queries within a single transaction.
Consider batching the lookups: pre-fetch all existing papers matching any of the incoming identifiers in a single query (or a small number of queries), then do the upsert loop against the in-memory results.
290-381: `get_user_library` loads all matching rows into memory for sorting/pagination.
The method fetches all matching paper-feedback pairs (Line 338), deduplicates in Python, sorts in Python, and then applies offset/limit. For users with large libraries, this won't scale.
The dual-type join (numeric ID vs. Semantic Scholar ID) makes SQL-level pagination difficult, but this should be documented as a known limitation with a TODO for future optimization — e.g., by normalizing `paper_id` storage to always use the numeric DB ID.
498-503: `close()` swallows all exceptions silently.
Same pattern as in the pipeline. Consider logging at debug level.
🧹 Fix
 def close(self) -> None:
     """Close database connections."""
     try:
         self._provider.engine.dispose()
-    except Exception:
-        pass
+    except Exception:
+        logger.debug("Error disposing engine", exc_info=True)
53-57: `Base.metadata.create_all` in `__init__` may conflict with Alembic migrations.
The AI summary mentions Alembic migration `0003_paper_harvest_tables.py` for these tables. Having `auto_create_schema=True` by default means every `PaperStore()` instantiation (including from API route handlers) will call `create_all`, which can create tables outside of Alembic's migration tracking. Consider defaulting `auto_create_schema` to `False` in production and only using it for tests.
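A minimal sketch of that opt-in behaviour; the provider wrapper name below is an assumption and the real constructor signature may differ:

```python
# Sketch: schema creation becomes opt-in so Alembic owns the production schema.
class PaperStore:
    def __init__(self, db_url: str = "sqlite:///paperbot.db", auto_create_schema: bool = False):
        self._provider = SqlAlchemyProvider(db_url)  # assumed engine/session wrapper
        if auto_create_schema:
            # Only for tests or ad-hoc scripts; migrations create production tables.
            Base.metadata.create_all(self._provider.engine)
```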
| def upgrade() -> None: | ||
| if _is_offline(): | ||
| _upgrade_create_tables() | ||
| return | ||
| _upgrade_create_tables() | ||
| _upgrade_create_indexes() |
Indexes are skipped in offline mode — generated SQL will be incomplete.
When _is_offline() is True, upgrade() returns early after creating tables, never calling _upgrade_create_indexes(). Since _create_index already handles offline mode (Line 46-48, it calls op.create_index unconditionally), the indexes should be emitted in offline-generated SQL too.
Proposed fix
def upgrade() -> None:
- if _is_offline():
- _upgrade_create_tables()
- return
_upgrade_create_tables()
_upgrade_create_indexes()

🤖 Prompt for AI Agents
In `@alembic/versions/0003_paper_harvest_tables.py` around lines 54 - 59, The
upgrade() function returns early when _is_offline() is true, so
_upgrade_create_indexes() is never called and generated offline SQL omits index
creation; remove the early return or restructure upgrade() so it always calls
_upgrade_create_indexes() after _upgrade_create_tables() (rely on
_create_index/op.create_index which already handles offline mode) — modify
upgrade() to invoke _upgrade_create_tables() then _upgrade_create_indexes()
regardless of _is_offline() so indexes are emitted in offline-generated SQL.
| class HarvestRequest(BaseModel): | ||
| """Request body for harvest endpoint.""" | ||
|
|
||
| keywords: List[str] = Field(..., min_items=1, description="Search keywords") |
🧩 Analysis chain
🏁 Script executed:
# Find project dependency files
fd -e "pyproject.toml" -e "requirements.txt" -e "setup.py" -e "setup.cfg" -e "poetry.lock" -e "Pipfile" | head -20Repository: jerry609/PaperBot
Length of output: 43
🏁 Script executed:
# Look for Pydantic version specification
cat pyproject.toml 2>/dev/null || cat setup.py 2>/dev/null || cat requirements.txt 2>/dev/null || echo "No common dependency file found"Repository: jerry609/PaperBot
Length of output: 1921
🌐 Web query:
Pydantic v2 Field min_items min_length List constraint validation
💡 Result:
In Pydantic v2, min_items / max_items are v1-era names for list-size constraints. For lists (and tuples) you should use:
- `Field(min_length=..., max_length=...)` (these map to JSON Schema minItems/maxItems) (docs.pydantic.dev)
- or `conlist(item_type, min_length=..., max_length=...)` (docs.pydantic.dev)

Examples (v2)

from typing import Annotated
from pydantic import BaseModel, Field, conlist

class M(BaseModel):
    a: list[int] = Field(min_length=2, max_length=5)
    b: Annotated[list[int], Field(min_length=2)]
    c: conlist(int, min_length=2, max_length=5)

If you try min_items in v2 (e.g., conlist(..., min_items=3)), you’ll get an “unexpected keyword argument” error; the fix is to use min_length / max_length instead. (stackoverflow.com)
Citations:
- 1: https://docs.pydantic.dev/latest/api/standard_library_types/?utm_source=openai
- 2: https://docs.pydantic.dev/latest/api/types/?utm_source=openai
- 3: https://stackoverflow.com/questions/79147028/typeerror-conlist-got-an-unexpected-keyword-argument-min-items?utm_source=openai
🏁 Script executed:
# Check the actual code at line 52 in the file
sed -n '49,55p' src/paperbot/api/routes/harvest.pyRepository: jerry609/PaperBot
Length of output: 477
🏁 Script executed:
# Also check if there are other Field definitions with min_items in the file
rg "min_items" src/paperbot/api/routes/harvest.pyRepository: jerry609/PaperBot
Length of output: 141
min_items is deprecated in Pydantic v2 — use min_length.
With Pydantic v2 (as specified in dependencies), min_items on Field is not recognized and raises TypeError: unexpected keyword argument. This will cause the code to fail at runtime when HarvestRequest is instantiated.
Fix
- keywords: List[str] = Field(..., min_items=1, description="Search keywords")
+ keywords: List[str] = Field(..., min_length=1, description="Search keywords")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| keywords: List[str] = Field(..., min_items=1, description="Search keywords") | |
| keywords: List[str] = Field(..., min_length=1, description="Search keywords") |
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` at line 52, The Field declaration for
keywords in HarvestRequest uses the deprecated Pydantic v1 arg min_items which
causes a TypeError; update the Field call on the keywords attribute (keywords:
List[str] = Field(..., min_items=1, description="Search keywords")) to use
min_length=1 instead (i.e., replace min_items with min_length) so Pydantic v2
accepts the constraint. Ensure the change is made on the keywords field in the
HarvestRequest/schema where Field is used.
| @router.get("/papers/library", response_model=LibraryResponse) | ||
| def get_user_library( | ||
| user_id: str = Query("default", description="User ID"), | ||
| track_id: Optional[int] = Query(None, description="Filter by track"), | ||
| actions: Optional[str] = Query(None, description="Filter by actions (comma-separated)"), | ||
| sort_by: str = Query("saved_at", description="Sort field"), | ||
| sort_order: str = Query("desc", description="Sort order"), | ||
| limit: int = Query(50, ge=1, le=500), | ||
| offset: int = Query(0, ge=0), |
user_id accepted from client without authentication — security concern.
user_id is supplied as a query parameter (Line 325) or request body field (Line 386) defaulting to "default". Any caller can read or modify any user's library by passing a different user_id. If this is intentional for an MVP with no auth, consider adding a TODO/comment. Otherwise, user_id should come from an authenticated session or token.
Also applies to: 386-387
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` around lines 323 - 331, get_user_library
currently accepts user_id from the query string (user_id param in
get_user_library) which allows callers to access/modify arbitrary users'
libraries; change this to derive the user ID from the authenticated
session/token (e.g., replace the user_id Query param with a dependency that
extracts the current_user/current_user.id from your auth middleware or token
validation function used elsewhere) and remove the default "default" value;
apply the same change to the other endpoint that accepts user_id in the request
body (lines referenced) so both endpoints use the authenticated user context; if
no authentication exists for this MVP, add a clear TODO comment inside
get_user_library and the other endpoint stating that user_id is temporary and
must be replaced with auth-derived user identification before production.
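If and when authentication is added, a dependency along the following lines could replace the query parameter. Everything here is a hypothetical sketch: `get_current_user_id` and the token-decoding helper do not exist in this PR, and the real auth scheme may differ.

```python
# Hypothetical sketch — derive user_id from credentials instead of trusting the client.
from typing import Optional

from fastapi import Depends, Header, HTTPException, Query


def get_current_user_id(authorization: Optional[str] = Header(default=None)) -> str:
    """Resolve the caller's user ID from the request's credentials."""
    if not authorization:
        raise HTTPException(status_code=401, detail="Missing credentials")
    return decode_user_id_from_token(authorization)  # placeholder for real validation


@router.get("/papers/library", response_model=LibraryResponse)
def get_user_library(
    user_id: str = Depends(get_current_user_id),
    track_id: Optional[int] = Query(None, description="Filter by track"),
    # ... remaining filters and pagination parameters unchanged ...
):
    ...
```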
| @router.post("/papers/{paper_id}/save") | ||
| def save_paper_to_library(paper_id: int, request: SavePaperRequest): | ||
| """ | ||
| Save a paper to user's library. | ||
|
|
||
| Uses paper_feedback table with action='save'. | ||
| """ | ||
| from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore | ||
|
|
||
| # Verify paper exists | ||
| store = _get_paper_store() | ||
| paper = store.get_paper_by_id(paper_id) | ||
| if not paper: | ||
| raise HTTPException(status_code=404, detail="Paper not found") | ||
|
|
||
| # Use research store to record feedback | ||
| research_store = SqlAlchemyResearchStore() | ||
| feedback = research_store.record_paper_feedback( | ||
| user_id=request.user_id, | ||
| paper_id=str(paper_id), | ||
| action="save", | ||
| track_id=request.track_id, | ||
| ) | ||
|
|
||
| return {"success": True, "feedback": feedback} |
SqlAlchemyResearchStore() instantiated per request and never closed — potential resource leak.
Each call to save_paper_to_library creates a new SqlAlchemyResearchStore() on Line 406. Unlike _paper_store, it is not cached or closed after use. This leaks database connections over time.
Consider using the same lazy-initialization pattern or injecting a shared store.
♻️ Sketch: lazy-initialize research store
+_research_store: Optional["SqlAlchemyResearchStore"] = None
+
+def _get_research_store():
+ global _research_store
+ if _research_store is None:
+ from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore
+ _research_store = SqlAlchemyResearchStore()
+ return _research_store
+
`@router.post`("/papers/{paper_id}/save")
def save_paper_to_library(paper_id: int, request: SavePaperRequest):
- from paperbot.infrastructure.stores.research_store import SqlAlchemyResearchStore
# Verify paper exists
store = _get_paper_store()
paper = store.get_paper_by_id(paper_id)
if not paper:
raise HTTPException(status_code=404, detail="Paper not found")
- research_store = SqlAlchemyResearchStore()
+ research_store = _get_research_store()
feedback = research_store.record_paper_feedback(🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/harvest.py` around lines 390 - 414,
save_paper_to_library creates a new SqlAlchemyResearchStore() per request and
never closes it, leaking DB resources; change this to reuse a
shared/lazy-initialized research store or inject one instead of instantiating
inline: add a module-level lazy getter (similar to _get_paper_store) or accept a
dependency and use that shared instance, then call the store's close/dispose
method if it has one after record_paper_feedback (or rely on the shared
instance's lifecycle); update references in save_paper_to_library and any other
handlers that instantiate SqlAlchemyResearchStore() so they use the new
get_research_store() or injected research_store and remove the per-request
constructor.
| library_paper_id: Optional[int] = None | ||
| actual_paper_id = req.paper_id | ||
|
|
||
| # If action is "save" and we have paper metadata, insert into papers table | ||
| if req.action == "save" and req.paper_title: | ||
| Logger.info("Save action detected, inserting paper into papers table", file=LogFiles.HARVEST) | ||
| try: | ||
| from paperbot.domain.harvest import HarvestedPaper, HarvestSource | ||
| from paperbot.infrastructure.stores.paper_store import PaperStore | ||
|
|
||
| paper_store = PaperStore() | ||
| paper = HarvestedPaper( | ||
| title=req.paper_title, | ||
| source=HarvestSource.SEMANTIC_SCHOLAR, | ||
| abstract=req.paper_abstract or "", | ||
| authors=req.paper_authors or [], | ||
| semantic_scholar_id=req.paper_id, | ||
| year=req.paper_year, | ||
| venue=req.paper_venue, | ||
| citation_count=req.paper_citation_count or 0, | ||
| url=req.paper_url, | ||
| ) | ||
| Logger.info("Calling paper store to upsert paper", file=LogFiles.HARVEST) | ||
| new_count, _ = paper_store.upsert_papers_batch([paper]) | ||
|
|
||
| # Get the paper ID from database | ||
| from paperbot.infrastructure.stores.models import PaperModel | ||
| from sqlalchemy import select | ||
| with paper_store._provider.session() as session: | ||
| result = session.execute( | ||
| select(PaperModel).where( | ||
| PaperModel.semantic_scholar_id == req.paper_id | ||
| ) | ||
| ).scalar_one_or_none() | ||
| if result: | ||
| library_paper_id = result.id | ||
| actual_paper_id = str(result.id) # Use integer ID for feedback | ||
| Logger.info(f"Paper saved to library with id={library_paper_id}", file=LogFiles.HARVEST) | ||
| except Exception as e: | ||
| Logger.warning(f"Failed to save paper to library: {e}", file=LogFiles.HARVEST) |
Hard-coded SEMANTIC_SCHOLAR source and per-request PaperStore instantiation.
Two concerns in the save-to-library flow:
- Hard-coded source (Line 665): The paper source is always set to `HarvestSource.SEMANTIC_SCHOLAR`, but papers could originate from arXiv or OpenAlex. Consider accepting `paper_source` as an optional field on `PaperFeedbackRequest` and using it here.
- Per-request `PaperStore()` (Line 662): A new `PaperStore` is instantiated on every save request. If `PaperStore` manages a connection pool or session factory, this creates unnecessary overhead. Consider initializing it once at module scope (like `_research_store`) or using a lazy singleton.
# At module scope, alongside other stores:
+from paperbot.infrastructure.stores.paper_store import PaperStore
+_paper_store = PaperStore()
# In add_paper_feedback, replace:
- from paperbot.infrastructure.stores.paper_store import PaperStore
- paper_store = PaperStore()
+ paper_store = _paper_store

Add paper_source to PaperFeedbackRequest:
  paper_url: Optional[str] = None
+ paper_source: Optional[str] = None  # e.g. "arxiv", "semantic_scholar", "openalex"

Then use it:
- source=HarvestSource.SEMANTIC_SCHOLAR,
+ source=HarvestSource(req.paper_source) if req.paper_source else HarvestSource.SEMANTIC_SCHOLAR,

🧰 Tools
🪛 Ruff (0.15.0)
[warning] 675-675: Unpacked variable new_count is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 690-690: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/research.py` around lines 652 - 691, The
save-to-library flow hardcodes HarvestSource.SEMANTIC_SCHOLAR and constructs a
new PaperStore per request; update PaperFeedbackRequest to include an optional
paper_source and use that when building the HarvestedPaper (falling back to
HarvestSource.SEMANTIC_SCHOLAR), and replace the per-request PaperStore()
instantiation by using a module-scoped or lazy singleton store (e.g.,
_research_store) so this handler reuses the same PaperStore instance when
calling upsert_papers_batch and accessing _provider.session().
| async def _get_session(self) -> aiohttp.ClientSession: | ||
| if self._session is None or self._session.closed: | ||
| self._session = aiohttp.ClientSession() | ||
| return self._session |
🧩 Analysis chain
🏁 Script executed:
# First, read the OpenAlex harvester file around the lines mentioned
cat -n src/paperbot/infrastructure/harvesters/openalex_harvester.py | head -220Repository: jerry609/PaperBot
Length of output: 8522
🏁 Script executed:
# Search for other harvesters and their timeout patterns
fd -e py "harvester" src/paperbot/infrastructure/harvesters/ | head -20Repository: jerry609/PaperBot
Length of output: 248
🏁 Script executed:
# Check if ClientTimeout is used anywhere in the codebase
rg "ClientTimeout" --type py -B 2 -A 2Repository: jerry609/PaperBot
Length of output: 10041
🏁 Script executed:
# Check all ClientSession instantiations
rg "ClientSession\(" --type py -B 2 -A 2Repository: jerry609/PaperBot
Length of output: 8422
🏁 Script executed:
# Check for test files related to harvesters
find . -name "*test*harvester*" -o -name "*harvester*test*" | head -20Repository: jerry609/PaperBot
Length of output: 99
🏁 Script executed:
# Check semantic scholar harvester for comparison
cat -n src/paperbot/infrastructure/harvesters/semantic_scholar_harvester.py | grep -A 5 "_get_session"Repository: jerry609/PaperBot
Length of output: 43
🏁 Script executed:
# Check if semantic_scholar_harvester exists
ls -la src/paperbot/infrastructure/harvesters/Repository: jerry609/PaperBot
Length of output: 457
🏁 Script executed:
# Look at the test file to understand coverage
cat -n tests/integration/test_harvesters.pyRepository: jerry609/PaperBot
Length of output: 20610
Add HTTP request timeout to prevent indefinite hangs.
The aiohttp.ClientSession at line 44 and the session.get() call at line 90 lack timeout configuration. If the OpenAlex API becomes unresponsive, the harvester will block indefinitely, stalling the harvest pipeline. The codebase already establishes timeout patterns elsewhere (see src/paperbot/utils/downloader.py and src/paperbot/infrastructure/services/api_client.py).
Proposed fix — add ClientTimeout to session creation
async def _get_session(self) -> aiohttp.ClientSession:
if self._session is None or self._session.closed:
- self._session = aiohttp.ClientSession()
+ timeout = aiohttp.ClientTimeout(total=30)
+ self._session = aiohttp.ClientSession(timeout=timeout)
return self._session

Note: ArxivHarvester has the identical pattern and should also be fixed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def _get_session(self) -> aiohttp.ClientSession: | |
| if self._session is None or self._session.closed: | |
| self._session = aiohttp.ClientSession() | |
| return self._session | |
| async def _get_session(self) -> aiohttp.ClientSession: | |
| if self._session is None or self._session.closed: | |
| timeout = aiohttp.ClientTimeout(total=30) | |
| self._session = aiohttp.ClientSession(timeout=timeout) | |
| return self._session |
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/harvesters/openalex_harvester.py` around lines 42
- 45, Add an HTTP timeout to prevent indefinite hangs by creating the aiohttp
session with a ClientTimeout in _get_session (use aiohttp.ClientTimeout with
appropriate total/connect/read values) instead of a bare aiohttp.ClientSession,
and ensure individual calls that use session.get(...) also pass a timeout (or
rely on the session default) so requests to the OpenAlex API cannot block
forever; apply the same changes to the ArxivHarvester equivalent methods.
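For concreteness, here is a minimal sketch of the pattern this review asks for, using only standard aiohttp APIs. The class name, the `_fetch` helper, and the 30s/10s/20s budgets below are illustrative assumptions, not the PR's actual harvester code.

```python
from typing import Optional

import aiohttp


class OpenAlexHarvesterSketch:  # hypothetical stand-in for the real harvester
    def __init__(self) -> None:
        self._session: Optional[aiohttp.ClientSession] = None

    async def _get_session(self) -> aiohttp.ClientSession:
        if self._session is None or self._session.closed:
            # Session-wide default: give up after 30s total, 10s to connect.
            timeout = aiohttp.ClientTimeout(total=30, connect=10)
            self._session = aiohttp.ClientSession(timeout=timeout)
        return self._session

    async def _fetch(self, url: str, params: dict) -> dict:
        session = await self._get_session()
        # A stricter per-request timeout can still override the session default.
        async with session.get(
            url, params=params, timeout=aiohttp.ClientTimeout(total=20)
        ) as resp:
            resp.raise_for_status()
            return await resp.json()
```

A per-request `timeout=` argument overrides the session default, which is handy if search and detail endpoints deserve different budgets.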
```python
def search_papers(
    self,
    *,
    query: Optional[str] = None,
    keywords: Optional[List[str]] = None,
    venues: Optional[List[str]] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    min_citations: Optional[int] = None,
    sources: Optional[List[str]] = None,
    sort_by: str = "citation_count",
    sort_order: str = "desc",
    limit: int = 50,
    offset: int = 0,
) -> Tuple[List[PaperModel], int]:
    """
    Search papers with filters and pagination.

    Returns:
        Tuple of (papers, total_count)
    """
    with self._provider.session() as session:
        stmt = select(PaperModel).where(PaperModel.deleted_at.is_(None))

        # Full-text search (LIKE for v1)
        if query:
            pattern = f"%{query}%"
            stmt = stmt.where(
                or_(
                    PaperModel.title.ilike(pattern),
                    PaperModel.abstract.ilike(pattern),
                )
            )

        # Year filters
        if year_from:
            stmt = stmt.where(PaperModel.year >= year_from)
        if year_to:
            stmt = stmt.where(PaperModel.year <= year_to)

        # Citation filter
        if min_citations:
            stmt = stmt.where(PaperModel.citation_count >= min_citations)

        # Venue filter
        if venues:
            venue_conditions = [PaperModel.venue.ilike(f"%{v}%") for v in venues]
            stmt = stmt.where(or_(*venue_conditions))

        # Source filter
        if sources:
            stmt = stmt.where(PaperModel.primary_source.in_(sources))

        # Count total before pagination
        count_stmt = select(func.count()).select_from(stmt.subquery())
        total_count = session.execute(count_stmt).scalar() or 0

        # Sort
        sort_col = getattr(PaperModel, sort_by, PaperModel.citation_count)
        if sort_order.lower() == "desc":
            stmt = stmt.order_by(sort_col.desc())
        else:
            stmt = stmt.order_by(sort_col.asc())

        # Pagination
        stmt = stmt.offset(offset).limit(limit)

        papers = session.execute(stmt).scalars().all()

        return list(papers), total_count
```
Unused keywords parameter and unvalidated sort_by open to attribute access.
Two issues in search_papers:
- The `keywords` parameter (Line 213) is accepted but never used in the method body. This is dead code that misleads callers.
- `sort_by` (Line 267) uses `getattr(PaperModel, sort_by, PaperModel.citation_count)` with a user-supplied string. This allows sorting by any model attribute (including `deleted_at`, internal fields, or non-column attributes). Whitelist the allowed sort columns.
♻️ Proposed fix
```diff
+    _ALLOWED_SORT_COLUMNS = {"citation_count", "year", "title", "created_at", "updated_at"}
+
     def search_papers(
         self,
         *,
         query: Optional[str] = None,
-        keywords: Optional[List[str]] = None,
         venues: Optional[List[str]] = None,
         year_from: Optional[int] = None,
         year_to: Optional[int] = None,
         min_citations: Optional[int] = None,
         sources: Optional[List[str]] = None,
         sort_by: str = "citation_count",
         sort_order: str = "desc",
         limit: int = 50,
         offset: int = 0,
     ) -> Tuple[List[PaperModel], int]:
         # ...
-        sort_col = getattr(PaperModel, sort_by, PaperModel.citation_count)
+        if sort_by not in self._ALLOWED_SORT_COLUMNS:
+            sort_by = "citation_count"
+        sort_col = getattr(PaperModel, sort_by)
```

🧰 Tools
🪛 Ruff (0.15.0)
[warning] 213-213: Unused method argument: keywords
(ARG002)
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 209 - 278,
The search_papers method currently ignores the keywords parameter and unsafely
uses getattr(PaperModel, sort_by...), so remove or implement keywords filtering
and restrict sort_by to a whitelist: update search_papers to either apply
keywords (e.g., build OR ilike conditions on PaperModel.title/abstract/keywords)
or delete the unused keywords argument and its callers, and replace the dynamic
getattr with a mapping/whitelist of allowed sort keys (e.g., {"citation_count":
PaperModel.citation_count, "year": PaperModel.year, "title": PaperModel.title})
then look up sort_col = ALLOWED_SORT_COLUMNS.get(sort_by,
PaperModel.citation_count) before ordering; reference function search_papers,
parameter keywords, parameter sort_by, and model PaperModel when making the
changes.
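As a sketch of the whitelist approach, and of what keeping `keywords` would look like, assuming the SQLAlchemy 2.x `select()` style and the `PaperModel` columns shown above; the helper name `apply_sort_and_keywords` is illustrative only, and `PaperModel` is assumed to be imported from the store module.

```python
from sqlalchemy import or_

# Map user-facing sort keys to concrete columns; anything else falls back.
_ALLOWED_SORT_COLUMNS = {
    "citation_count": PaperModel.citation_count,
    "year": PaperModel.year,
    "title": PaperModel.title,
    "created_at": PaperModel.created_at,
    "updated_at": PaperModel.updated_at,
}


def apply_sort_and_keywords(stmt, sort_by: str, sort_order: str, keywords):
    # Unknown sort keys silently fall back to citation_count instead of
    # reaching arbitrary model attributes via getattr.
    sort_col = _ALLOWED_SORT_COLUMNS.get(sort_by, PaperModel.citation_count)
    stmt = stmt.order_by(
        sort_col.desc() if sort_order.lower() == "desc" else sort_col.asc()
    )

    # If keywords are kept, OR them into a title/abstract match; otherwise the
    # parameter should simply be dropped from the signature.
    if keywords:
        conditions = []
        for kw in keywords:
            pattern = f"%{kw}%"
            conditions.append(PaperModel.title.ilike(pattern))
            conditions.append(PaperModel.abstract.ilike(pattern))
        stmt = stmt.where(or_(*conditions))
    return stmt
```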
```python
if year_from:
    stmt = stmt.where(PaperModel.year >= year_from)
if year_to:
    stmt = stmt.where(PaperModel.year <= year_to)

# Citation filter
if min_citations:
    stmt = stmt.where(PaperModel.citation_count >= min_citations)
```
Falsy-value filters skip valid zero values.
if year_from: on Line 244 and if min_citations: on Line 250 are falsy checks. While year_from=0 is unrealistic, min_citations=0 is a valid filter meaning "include papers with zero or more citations" — but it would be skipped because 0 is falsy.
🐛 Fix
```diff
-        if year_from:
+        if year_from is not None:
             stmt = stmt.where(PaperModel.year >= year_from)
-        if year_to:
+        if year_to is not None:
             stmt = stmt.where(PaperModel.year <= year_to)

         # Citation filter
-        if min_citations:
+        if min_citations is not None:
             stmt = stmt.where(PaperModel.citation_count >= min_citations)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if year_from is not None:
    stmt = stmt.where(PaperModel.year >= year_from)
if year_to is not None:
    stmt = stmt.where(PaperModel.year <= year_to)

# Citation filter
if min_citations is not None:
    stmt = stmt.where(PaperModel.citation_count >= min_citations)
```
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 244 - 251,
The current falsy checks (if year_from:, if year_to:, if min_citations:) skip
valid zero values; change them to explicit None checks (e.g., use "is not None")
so filters on PaperModel.year and PaperModel.citation_count are applied when
year_from, year_to or min_citations equal 0. Update the conditionals around the
stmt.where calls that reference PaperModel.year and PaperModel.citation_count
(the blocks using year_from, year_to, min_citations) to use "is not None" checks
to ensure zero is treated as a valid filter value.
```python
base_stmt = (
    select(PaperModel, PaperFeedbackModel)
    .join(
        PaperFeedbackModel,
        or_(
            PaperModel.id == cast(PaperFeedbackModel.paper_id, Integer),
            PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
        ),
    )
    .where(
        PaperFeedbackModel.user_id == user_id,
        PaperFeedbackModel.action.in_(actions),
        PaperModel.deleted_at.is_(None),
    )
```
CAST(paper_id AS INTEGER) will fail on PostgreSQL for non-numeric strings.
The OR join on Lines 322-325 uses cast(PaperFeedbackModel.paper_id, Integer) to match numeric paper IDs. However, as the comments on Lines 312-314 acknowledge, paper_id can also be a Semantic Scholar ID (e.g., "abc123"). On PostgreSQL, CAST('abc123' AS INTEGER) raises a data-type error — it does not return NULL or 0.
This means any row in paper_feedback with a non-numeric paper_id will cause the entire query to fail on PostgreSQL.
Consider using a CASE expression or a regex filter to conditionally cast, or normalize paper_id storage to always use the numeric ID.
🐛 Possible approach using a safe cast with CASE
```python
from sqlalchemy import case, text

# Safe cast: only attempt integer comparison when paper_id looks numeric
safe_int_match = case(
    (
        PaperFeedbackModel.paper_id.op("~")(r"^\d+$"),
        PaperModel.id == cast(PaperFeedbackModel.paper_id, Integer),
    ),
    else_=False,
)

base_stmt = (
    select(PaperModel, PaperFeedbackModel)
    .join(
        PaperFeedbackModel,
        or_(
            safe_int_match,
            PaperModel.semantic_scholar_id == PaperFeedbackModel.paper_id,
        ),
    )
    # ...
)
```

A cleaner long-term fix would be to always store the numeric papers.id in paper_feedback.paper_id at save time, avoiding the dual-type join entirely.
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 318 - 331,
The join currently casts PaperFeedbackModel.paper_id to Integer (in base_stmt)
which will raise on PostgreSQL for non-numeric strings; replace the direct cast
with a safe conditional expression (e.g., build a case/boolean expression that
first checks PaperFeedbackModel.paper_id.op("~")(r"^\d+$") then compares
cast(PaperFeedbackModel.paper_id, Integer) only when the regex matches) and use
that safe_int_match in the or_ alongside PaperModel.semantic_scholar_id ==
PaperFeedbackModel.paper_id; update the join condition in base_stmt (where
cast(...) is referenced) to use the safe case/regex approach so non-numeric
paper_id values don’t trigger a CAST error (alternatively, normalize paper_id at
write time to always store numeric IDs).
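To illustrate the "normalize at write time" alternative mentioned above, here is a small sketch. `get_by_semantic_scholar_id` is a hypothetical lookup helper, not an existing method in this PR, and the fallback behaviour is an assumption about how unsaved papers should be handled.

```python
def normalize_feedback_paper_id(paper_store, raw_paper_id: str) -> str:
    """Resolve an external Semantic Scholar ID to the numeric papers.id."""
    if raw_paper_id.isdigit():
        return raw_paper_id  # already the internal numeric id
    paper = paper_store.get_by_semantic_scholar_id(raw_paper_id)  # hypothetical helper
    if paper is not None:
        return str(paper.id)
    # Fall back to the external id if the paper has not been stored yet.
    return raw_paper_id
```

Storing only numeric IDs keeps the feedback join a plain integer equality and removes the need for the regex-guarded cast.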
```typescript
const { paperId } = await params
const url = new URL(req.url)
const upstream = `${apiBaseUrl()}/api/papers/${paperId}/save${url.search}`
```
paperId is interpolated into the upstream URL without sanitization — potential path traversal risk.
A crafted paperId (e.g., ../../admin/delete) could manipulate the upstream request path. Consider validating or encoding paperId before interpolation:
🛡️ Proposed fix
```diff
  const { paperId } = await params
+ const safePaperId = encodeURIComponent(paperId)
  const url = new URL(req.url)
- const upstream = `${apiBaseUrl()}/api/papers/${paperId}/save${url.search}`
+ const upstream = `${apiBaseUrl()}/api/papers/${safePaperId}/save${url.search}`
  return proxyJson(req, upstream, "DELETE")
```

The same applies to the POST handler on Line 18.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
const { paperId } = await params
const safePaperId = encodeURIComponent(paperId)
const url = new URL(req.url)
const upstream = `${apiBaseUrl()}/api/papers/${safePaperId}/save${url.search}`
return proxyJson(req, upstream, "DELETE")
```
🤖 Prompt for AI Agents
In `@web/src/app/api/papers/`[paperId]/save/route.ts around lines 7 - 9, The
paperId is being interpolated directly into the upstream URL in route.ts
(upstream and apiBaseUrl()) and in the POST handler, allowing path-traversal
style values; sanitize or encode paperId before use (e.g., validate it against
an allowed pattern like /^[A-Za-z0-9_-]+$/ or otherwise reject invalid values,
or use a safe encoder such as encodeURIComponent on the paperId) and then use
that sanitized variable for building upstream and any other requests in the POST
handler to prevent crafted path segments from escaping the intended route.
Summary of Changes

Hello @CJBshuosi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive paper harvesting and management system. It includes new API endpoints for initiating paper harvests from various sources, searching and filtering papers, managing user libraries, and tracking harvest run history. The changes involve database schema updates, new API routes, and the implementation of a multi-source paper harvest pipeline.

Highlights
Code Review
This pull request introduces a comprehensive paper harvesting and management system. Key changes include the addition of new database tables (papers and harvest_runs) via an Alembic migration to store harvested paper metadata and track harvest execution. A new FastAPI router (/api/harvest) is added, exposing endpoints for initiating multi-source paper harvests with streaming progress updates, listing and retrieving harvest run details, searching stored papers with various filters, and managing a user's paper library (saving/unsaving papers).

The backend now includes new application services: PaperDeduplicator for in-memory deduplication of papers from different sources, QueryRewriter for expanding search keywords (e.g., abbreviations), and VenueRecommender for suggesting relevant academic venues. A HarvestPipeline workflow orchestrates these services and the harvesters (Arxiv, Semantic Scholar, OpenAlex) to perform the end-to-end harvesting process. The research API's add_paper_feedback endpoint is enhanced to optionally save paper metadata to the new papers table when a 'save' action is performed, and logging is improved across several API routes.

Frontend changes in web/ update the ResearchDashboard to pass full paper metadata when saving, and SavedPapersList is refactored to use the new /api/papers/library endpoint for fetching and managing saved papers, including new unsavePaper and toggleReadStatus functions. Additionally, several integration and unit tests for the new harvesting components and services are added.

Review comments highlight several security and efficiency concerns:

- Access control: multiple endpoints (/api/papers/{paper_id}/save, /harvest/runs, /api/papers/library, /research/papers/feedback, /research/context) are identified as vulnerable to Insecure Direct Object Reference (IDOR) or unauthorized data access due to client-supplied user_id parameters without proper authentication.
- Information exposure: harvest_stream directly exposes raw exception messages.
- Efficiency: concerns are raised about PaperDeduplicator's _merge_paper re-indexing logic, PaperStore._find_existing's sequential database queries, and client-side venue filtering in OpenAlexHarvester and SemanticScholarHarvester, which can lead to inefficient API usage.
- Other comments suggest improving full-text search performance, optimizing LIKE queries for venue filtering, standardizing paper_id types across frontend and backend, and reviewing column lengths in the papers table. The compute_title_hash method's handling of hyphens and the HarvestPipeline's paper count reporting during pipeline-level failures are also noted for potential refinement.
```python
existing.doi = new.doi
self._doi_index[new.doi.lower().strip()] = self._find_index(existing)
if not existing.arxiv_id and new.arxiv_id:
    existing.arxiv_id = new.arxiv_id
    self._arxiv_index[new.arxiv_id.lower().strip()] = self._find_index(existing)
if not existing.semantic_scholar_id and new.semantic_scholar_id:
    existing.semantic_scholar_id = new.semantic_scholar_id
    self._s2_index[new.semantic_scholar_id.lower().strip()] = self._find_index(existing)
if not existing.openalex_id and new.openalex_id:
    existing.openalex_id = new.openalex_id
    self._openalex_index[new.openalex_id.lower().strip()] = self._find_index(existing)
```
The logic for updating _doi_index, _arxiv_index, _s2_index, and _openalex_index within _merge_paper relies on self._find_index(existing). This _find_index method computes the title hash and looks it up in _title_hash_index. This approach is problematic because if the existing paper was initially indexed by an identifier (e.g., DOI) and not its title hash, _find_index might return -1, leading to incorrect index updates for the canonical identifiers. The indexes should be updated directly using the idx of the existing paper, which is already known from _find_duplicate.
```python
# Fill in missing identifiers
if not existing.doi and new.doi:
    existing.doi = new.doi
    self._doi_index[new.doi.lower().strip()] = self._find_index(existing)  # This line is problematic
if not existing.arxiv_id and new.arxiv_id:
    existing.arxiv_id = new.arxiv_id
    self._arxiv_index[new.arxiv_id.lower().strip()] = self._find_index(existing)  # This line is problematic
if not existing.semantic_scholar_id and new.semantic_scholar_id:
    existing.semantic_scholar_id = new.semantic_scholar_id
    self._s2_index[new.semantic_scholar_id.lower().strip()] = self._find_index(existing)  # This line is problematic
if not existing.openalex_id and new.openalex_id:
    existing.openalex_id = new.openalex_id
    self._openalex_index[new.openalex_id.lower().strip()] = self._find_index(existing)  # This line is problematic
```
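A sketch of the direction the comment suggests: pass the index of `existing` (already known from `_find_duplicate`) into the merge step instead of re-deriving it via `_find_index`. Changing the `_merge_paper` signature to accept `existing_idx` is an assumption about how the surrounding code could be adapted, not the PR's current API.

```python
def _merge_paper(self, existing, new, existing_idx: int) -> None:
    # Fill in missing identifiers and point each id index at existing_idx,
    # which the caller already resolved in _find_duplicate.
    if not existing.doi and new.doi:
        existing.doi = new.doi
        self._doi_index[new.doi.lower().strip()] = existing_idx
    if not existing.arxiv_id and new.arxiv_id:
        existing.arxiv_id = new.arxiv_id
        self._arxiv_index[new.arxiv_id.lower().strip()] = existing_idx
    if not existing.semantic_scholar_id and new.semantic_scholar_id:
        existing.semantic_scholar_id = new.semantic_scholar_id
        self._s2_index[new.semantic_scholar_id.lower().strip()] = existing_idx
    if not existing.openalex_id and new.openalex_id:
        existing.openalex_id = new.openalex_id
        self._openalex_index[new.openalex_id.lower().strip()] = existing_idx
```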
```python
@router.post("/papers/{paper_id}/save")
def save_paper_to_library(paper_id: int, request: SavePaperRequest):
```
The endpoint /api/papers/{paper_id}/save is vulnerable to Broken Access Control (IDOR) because it accepts a user_id in the request body without verification. This allows an attacker to save papers to any user's library, leading to unauthorized modification of user data. Additionally, there's an inconsistency where paper_id is an integer in the path but expected as a string in SqlAlchemyResearchStore, which could lead to confusion if paper_id can also be a Semantic Scholar ID. Ensure the user_id in the request matches the authenticated requester's identity and clarify the paper_id type handling.
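One way to address the client-supplied `user_id` issue is a FastAPI dependency that resolves the acting user from the request's credentials. This is only a sketch: `_decode_user_from_token` is a placeholder for whatever auth scheme the project adopts, and `router`/`SavePaperRequest` are assumed to be the objects defined in this PR.

```python
from typing import Optional

from fastapi import Depends, Header, HTTPException


def _decode_user_from_token(token: str) -> Optional[str]:
    # Placeholder: verify a session/JWT here and return the user id, or None.
    return None


def get_current_user_id(authorization: str = Header(...)) -> str:
    user_id = _decode_user_from_token(authorization)
    if user_id is None:
        raise HTTPException(status_code=401, detail="Not authenticated")
    return user_id


@router.post("/papers/{paper_id}/save")
def save_paper_to_library(
    paper_id: int,
    request: SavePaperRequest,
    user_id: str = Depends(get_current_user_id),  # ignore any user_id in the body
):
    ...
```

The same dependency could guard the library, feedback, and harvest-run endpoints so that records are always filtered by the authenticated user.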
| @router.get("/harvest/runs", response_model=HarvestRunListResponse) | ||
| def list_harvest_runs( |
The /harvest/runs endpoint lists all harvest execution records in the system without any user-based filtering or authentication. Since harvest runs contain sensitive research data such as search keywords and targeted venues, this allows any user to monitor the research activities of others. Access to these records should be restricted to the user who initiated the run.
| @router.get("/papers/library", response_model=LibraryResponse) | ||
| def get_user_library( | ||
| user_id: str = Query("default", description="User ID"), |
The endpoint /api/papers/library accepts a user_id parameter directly from the client without any authentication. In a multi-user environment, this allows any user to view the saved papers of any other user by simply providing their user_id. This is an Insecure Direct Object Reference (IDOR) vulnerability. Authorization should be enforced based on an authenticated session or token.
```python
@router.delete("/papers/{paper_id}/save")
def remove_paper_from_library(
    paper_id: int,
    user_id: str = Query("default", description="User ID"),
):
```
```python
if query:
    pattern = f"%{query}%"
    stmt = stmt.where(
        or_(
            PaperModel.title.ilike(pattern),
            PaperModel.abstract.ilike(pattern),
        )
```
The full-text search for query uses ilike(pattern) on both title and abstract. While this works for basic substring matching, ilike can be inefficient for large text fields and complex queries, especially without proper full-text indexing. For more advanced and performant full-text search capabilities, consider integrating with a dedicated full-text search solution (e.g., PostgreSQL's tsvector, SQLite's FTS5, or an external search engine like Elasticsearch).
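As an illustration of the PostgreSQL route, a hedged sketch of a tsvector-based filter built with SQLAlchemy; it assumes the papers table runs on PostgreSQL (SQLite deployments would need FTS5 instead), and `fulltext_filter` is an illustrative helper name with `PaperModel` assumed to be imported from the store module.

```python
from sqlalchemy import func, or_


def fulltext_filter(stmt, query: str):
    # Match the query against title and abstract using PostgreSQL full-text search.
    ts_query = func.plainto_tsquery("english", query)
    return stmt.where(
        or_(
            func.to_tsvector("english", func.coalesce(PaperModel.title, "")).op("@@")(ts_query),
            func.to_tsvector("english", func.coalesce(PaperModel.abstract, "")).op("@@")(ts_query),
        )
    )
```

Pairing this with GIN indexes on the same `to_tsvector` expressions keeps the filter index-backed rather than scanning every row.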
```python
venue_conditions = [PaperModel.venue.ilike(f"%{v}%") for v in venues]
stmt = stmt.where(or_(*venue_conditions))
```
The venue filter uses PaperModel.venue.ilike(f"%{v}%") for each venue. This LIKE operator with leading wildcards (%) can be very slow as it prevents the use of indexes on the venue column. If venue filtering is a common operation, consider normalizing venue names and storing them in a separate table or using a more efficient search mechanism for this field.
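A small sketch of the normalization idea: match exactly against a lowercased venue value instead of a leading-wildcard ILIKE. Note this trades substring matching for exact matching, so it fits best once venue names are normalized at ingest time; `apply_venue_filter` is an illustrative helper and `PaperModel` is assumed imported.

```python
from typing import List, Optional

from sqlalchemy import func


def apply_venue_filter(stmt, venues: Optional[List[str]]):
    # Exact match on a lowercased venue value; assumes venue names are
    # trimmed/lowercased consistently when papers are stored.
    if venues:
        normalized = [v.strip().lower() for v in venues]
        stmt = stmt.where(func.lower(PaperModel.venue).in_(normalized))
    return stmt
```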
| sa.Column("doi", sa.String(length=256), nullable=True), | ||
| sa.Column("arxiv_id", sa.String(length=64), nullable=True), | ||
| sa.Column("semantic_scholar_id", sa.String(length=64), nullable=True), | ||
| sa.Column("openalex_id", sa.String(length=64), nullable=True), |
The lengths for doi, arxiv_id, semantic_scholar_id, and openalex_id columns in the papers table seem arbitrary. It's best practice to define these lengths based on the maximum possible length of these identifiers from their respective sources to prevent truncation or unnecessary over-allocation. For example, DOIs can be up to 256 characters, arXiv IDs are typically shorter (e.g., 10-15 chars plus optional version), Semantic Scholar IDs are UUID-like (32-36 chars), and OpenAlex IDs are also UUID-like. Consider reviewing the actual maximum lengths from the sources to set more precise limits.
```python
all_results = session.execute(base_stmt).all()
Logger.info(f"Query returned {len(all_results)} results before deduplication", file=LogFiles.HARVEST)

# Deduplicate by paper.id, keeping the one with latest timestamp
Logger.info("Deduplicating results by paper id", file=LogFiles.HARVEST)
paper_map: Dict[int, Tuple[PaperModel, PaperFeedbackModel]] = {}
for row in all_results:
    paper = row[0]
    feedback = row[1]
    if paper.id not in paper_map or feedback.ts > paper_map[paper.id][1].ts:
        paper_map[paper.id] = (paper, feedback)
```
The deduplication logic for get_user_library is performed in Python (paper_map: Dict[int, Tuple[PaperModel, PaperFeedbackModel]] = {}). While this works for smaller result sets, fetching all results from the database and then deduplicating in Python can be inefficient for large user libraries. It would be more performant to perform this deduplication at the database level, for example, using DISTINCT ON (PostgreSQL) or a subquery with ROW_NUMBER() to select the latest feedback per paper ID.
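For reference, a sketch of pushing the "latest feedback per paper" selection into SQL with a window function, assuming the same `base_stmt` join and `session` as in the excerpt above; note that selecting from the subquery yields plain rows rather than ORM entities, so the calling code would need to adapt.

```python
from sqlalchemy import func, select

# Rank each paper's feedback rows by timestamp, newest first.
row_num = (
    func.row_number()
    .over(partition_by=PaperModel.id, order_by=PaperFeedbackModel.ts.desc())
    .label("rn")
)
ranked = base_stmt.add_columns(row_num).subquery()

# Keep only the newest feedback row for each paper id.
latest_stmt = select(ranked).where(ranked.c.rn == 1)
rows = session.execute(latest_stmt).all()
```

On PostgreSQL, `DISTINCT ON (papers.id) ... ORDER BY papers.id, paper_feedback.ts DESC` would achieve the same result with a slightly simpler query.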
| if (action === "save" && paper) { | ||
| body.paper_title = paper.title | ||
| body.paper_abstract = paper.abstract || "" | ||
| body.paper_authors = paper.authors || [] | ||
| body.paper_year = paper.year | ||
| body.paper_venue = paper.venue | ||
| body.paper_citation_count = paper.citation_count | ||
| body.paper_url = paper.url | ||
| } |
The if (action === "save" && paper) block conditionally adds paper metadata to the request body. This is a good improvement for the new save functionality. However, the paper_id sent in the sendFeedback function is p.paper_id, which is a string (e.g., Semantic Scholar ID). The backend's add_paper_feedback expects paper_id to be an integer if it's an internal ID, or a string if it's an external ID. The PaperStore's get_user_library handles both, but the save_paper_to_library in harvest.py explicitly converts paper_id to str(paper_id) before passing it to record_paper_feedback. This inconsistency in paper_id types (string vs. integer) across frontend and backend, and within backend logic, could lead to confusion or subtle bugs. It's best to standardize the paper_id type for feedback actions.
Closes #26
Summary by CodeRabbit
Release Notes