
Conversation


@jerry609 jerry609 commented Feb 10, 2026

Summary

Closes #24

Convert the DailyPaper pipeline from a synchronous JSON response to SSE streaming, add post-judge paper filtering and email push notifications, and persist all config settings across page refreshes.

What changed

Backend (paperscool.py, daily_push_service.py)

  • SSE streaming: New _dailypaper_stream() async generator yields events for each phase: search → build → LLM → judge → filter → save → notify → result
  • Post-judge filtering: Papers with recommendation “skip” or “skim” are removed after judge scoring; only “must_read” and “worth_reading” papers are kept (see the sketch after this list). Complete judge logs are preserved in filter.log.
  • Email push: notify_email_to field on DailyPaperRequest lets the frontend pass a recipient email, overriding the env config via the email_to_override param.
  • Judge cap: judge_max_items_per_query limit raised from 20 → 200.
  • Sync JSON fallback preserved for fast requests (no LLM/Judge).
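
A minimal sketch of the post-judge filter described above, assuming the report shape used elsewhere in this PR (queries[].top_items[] with an optional judge dict carrying a recommendation); the helper name and log fields are illustrative:

    from typing import Any, Dict, List

    KEEP_RECOMMENDATIONS = {"must_read", "worth_reading"}

    def filter_report_by_judge(report: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Drop 'skip'/'skim' papers after judging; keep unjudged papers. Returns a removal log."""
        filter_log: List[Dict[str, Any]] = []
        for query in report.get("queries") or []:
            kept: List[Dict[str, Any]] = []
            for item in query.get("top_items") or []:
                judge = item.get("judge")
                if not isinstance(judge, dict) or judge.get("recommendation") in KEEP_RECOMMENDATIONS:
                    kept.append(item)  # papers without a judge score are kept by default
                else:
                    filter_log.append({
                        "query": query.get("normalized_query") or query.get("raw_query") or "",
                        "title": item.get("title") or "Untitled",
                        "recommendation": judge.get("recommendation"),
                        "action": "removed",
                    })
            query["top_items"] = kept
        return filter_log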

Frontend (TopicWorkflowDashboard.tsx, workflow-store.ts, daily/route.ts)

  • SSE-aware proxy: Detects upstream content-type, pipes SSE streams through without buffering.
  • StreamProgressCard: Real-time progress with phase pipeline indicator, progress bar, elapsed time, and scrollable log showing per-paper judge scores ([must_read 4.50] PaperTitle) and filter decisions.
  • runDailyPaperStream(): Replaces the sync fetch and progressively updates the store as SSE events arrive (papers appear one by one).
  • Persisted config: All feature toggles moved from useState to the zustand persist store; all features default to enabled and no longer need to be re-checked after a page refresh.
  • Email notification UI: Email input + enable checkbox in settings sheet.

Bug fixes (errors.py, executable.py)

  • Fixed @dataclass inheritance: LLMError, APIError, ValidationError now properly override parent defaults.
  • Fixed ensure_execution_result: raw dicts without result keys (success/status/data/error) are now treated as data rather than parsed as result dicts. Both fixes are sketched after this list.
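
A minimal sketch of the dataclass-default pattern behind the first fix; the base class and its message field are assumptions for illustration, while the subclass names and the code/severity fields come from this PR:

    from dataclasses import dataclass

    @dataclass
    class PaperBotError(Exception):  # hypothetical base class, for illustration only
        message: str = ""
        code: str = "PAPERBOT_ERROR"
        severity: str = "error"

    @dataclass
    class LLMError(PaperBotError):
        # A dataclass subclass overrides parent defaults by redeclaring the fields
        # with new defaults; without @dataclass on the subclass, the parent-generated
        # __init__ would keep assigning the parent defaults to instances.
        code: str = "LLM_ERROR"
        severity: str = "warning"

And a sketch of the ensure_execution_result behavior, assuming a simplified ExecutionResult container (the real class and signature may differ):

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class ExecutionResult:  # simplified stand-in for the real class
        success: bool = True
        data: Any = None
        error: Optional[str] = None

    RESULT_KEYS = {"success", "status", "data", "error"}

    def ensure_execution_result(value: Any) -> ExecutionResult:
        if isinstance(value, ExecutionResult):
            return value
        if isinstance(value, dict) and RESULT_KEYS & value.keys():
            # Looks like a serialized result dict: map its fields through.
            return ExecutionResult(
                success=bool(value.get("success", True)),
                data=value.get("data"),
                error=value.get("error"),
            )
        # Raw dicts without result keys are treated as payload data.
        return ExecutionResult(success=True, data=value)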

Testing

  • 215 passed, 3 skipped, 0 failures (was 5 failures before bug fixes)
  • 3 new end-to-end tests, exercised through an SSE-parsing test helper (sketched after this list):
    • test_dailypaper_sse_filter_removes_low_papers — "skip"/"skim" removed, "must_read" kept
    • test_dailypaper_sse_full_pipeline_llm_judge_filter — full LLM+Judge+Filter pipeline
    • test_dailypaper_sync_path_no_llm_no_judge — sync JSON fallback still works
  • npx next build passes cleanly
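
A minimal sketch of the kind of SSE-parsing helper these tests rely on (the real _parse_sse_events may differ; here malformed payloads are surfaced instead of silently dropped, in line with a reviewer suggestion further down):

    import json
    from typing import Any, Dict, List

    def _parse_sse_events(body: str) -> List[Dict[str, Any]]:
        """Collect the JSON payload of every `data:` line in an SSE response body."""
        events: List[Dict[str, Any]] = []
        for line in body.splitlines():
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            try:
                events.append(json.loads(payload))
            except json.JSONDecodeError:
                events.append({"_parse_error": payload})  # keep failures visible and deterministic
        return events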

Summary by CodeRabbit

  • New Features

    • Streaming DailyPaper generation with real-time progress, per-item updates, and fast synchronous fallback
    • LLM enrichment (summaries, relevance, trends, insights) and judge-based filtering (must-read/worth-reading)
    • GitHub repository enrichment showing repo metadata (stars, language) for papers
    • UI controls for HF Daily and email notification overrides
  • Improvements

    • Persisted notification/config settings in the frontend; enhanced error reporting and pipeline telemetry
  • Documentation

    • README updated with SSE pipeline and usage examples

…ted config

- Convert /daily endpoint to SSE streaming when LLM/Judge enabled
  (search → build → LLM → judge → filter → save → notify → result)
- Add post-judge paper filtering: remove "skip"/"skim", keep "worth_reading"+"must_read"
- Add email push with UI-provided recipient (notify_email_to field)
- Persist all config toggles in zustand store (defaults all enabled)
- Add StreamProgressCard with real-time phase/progress/log display
- SSE-aware Next.js proxy for daily route (detect content-type, stream or JSON)
- Raise judge_max_items_per_query cap from 20 to 200
- Fix @dataclass inheritance in errors.py (subclass defaults)
- Fix ensure_execution_result treating raw dicts as result dicts
- Add 3 end-to-end tests for SSE pipeline (filter, full pipeline, sync fallback)

Closes #24
Copilot AI review requested due to automatic review settings February 10, 2026 16:46

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough


Implements an SSE-capable, multi‑phase DailyPaper pipeline (search → build → LLM enrichment → judge → filter → persist/notify) with a fast synchronous fallback when LLM/Judge are disabled; adds GitHub repo enrichment, registry ingestion of papers/judge scores, email override support, DB models/migrations, and frontend streaming UI/state.

Changes

  • DailyPaper API & pipeline (src/paperbot/api/routes/paperscool.py, src/paperbot/infrastructure/queue/arq_worker.py, src/paperbot/presentation/cli/main.py): Adds SSE stream generator (_dailypaper_stream), sync fast-path, LLM enrichment events, judge/filter phases, registry ingestion calls (ingest_daily_report_to_registry, persist_judge_scores_to_registry), notify_email_to passthrough, and error capture in the report.
  • Workflow helpers & persistence (src/paperbot/application/workflows/dailypaper.py, src/paperbot/application/services/daily_push_service.py): New ingest/persist helper functions to upsert papers and judge scores to the SQL registry; DailyPushService supports email_to_override with safe restore on completion.
  • SQL models, store, and research APIs (src/paperbot/infrastructure/stores/models.py, src/paperbot/infrastructure/stores/paper_store.py, src/paperbot/infrastructure/stores/research_store.py): Adds canonical PaperModel, PaperJudgeScoreModel, PaperReadingStatusModel, and SqlAlchemyPaperStore and SqlAlchemyResearchStore methods for upsert, judge-score persistence, reading status, saved lists, and detail views.
  • Migrations (alembic/versions/0003_paper_registry.py, alembic/versions/0004_paper_feedback_judge_links.py, alembic/versions/0005_paper_reading_status.py): New Alembic revisions to create papers, paper_judge_scores, and paper_reading_status tables with indices and guarded/idempotent index/column creation (see the sketch after this list).
  • PapersCool repo enrichment & utilities (src/paperbot/api/routes/paperscool.py (additional endpoints), src/paperbot/application/workflows/dailypaper.py (helpers), src/paperbot/domain/paper_identity.py): Adds PapersCoolReposRequest/Response, enrich_papers_with_repo_data endpoint, GitHub URL normalization/extraction, flatten/dedup helpers, and paper identity normalization (arXiv/DOI utilities).
  • Core fixes (src/paperbot/core/abstractions/executable.py, src/paperbot/core/errors/errors.py): Fix ensure_execution_result dict handling to avoid misclassification of raw dicts; convert several error classes to dataclasses with typed fields.
  • Tests (tests/unit/test_paperscool_route.py, tests/unit/test_paper_store.py, tests/unit/test_paper_judge_persistence.py, tests/unit/test_paper_identity.py, tests/unit/test_research_*): Extensive tests and mocks for the SSE pipeline, LLM/judge behaviors, repo enrichment, persistence/idempotency, research routes, and identity normalization; added SSE parsing helpers and multi-paper workflow tests.
  • Frontend streaming proxy & routes (web/src/app/api/research/paperscool/daily/route.ts, web/src/app/api/research/*): Replaces proxyJson usage with explicit fetch/proxying; detects and pipes text/event-stream upstream; adds multiple new nodejs App Router proxy endpoints for research routes.
  • Frontend UI & state (web/src/components/research/TopicWorkflowDashboard.tsx, web/src/lib/stores/workflow-store.ts, web/src/lib/api.ts): Adds StreamProgressCard, runDailyPaperStream and runRepoEnrichment flows, streaming state and persisted WorkflowConfig/notifyEmail, repo enrichment UI, HF Daily and email toggles, and API changes to fetch scholar/network/trends.
  • Docs & README (README.md): Documented SSE pipeline, filtering behavior, notify/email override, persisted config, and usage examples.
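
A minimal sketch of what the guarded/idempotent index creation in these revisions can look like; the papers table comes from this PR, while the index and column names are illustrative:

    from alembic import op
    import sqlalchemy as sa

    def _has_index(bind, table: str, name: str) -> bool:
        inspector = sa.inspect(bind)
        return any(ix.get("name") == name for ix in inspector.get_indexes(table))

    def upgrade() -> None:
        bind = op.get_bind()
        # Create the index only if an earlier run (or a hand-applied schema) has not already.
        if not _has_index(bind, "papers", "ix_papers_arxiv_id"):
            op.create_index("ix_papers_arxiv_id", "papers", ["arxiv_id"])

    def downgrade() -> None:
        bind = op.get_bind()
        if _has_index(bind, "papers", "ix_papers_arxiv_id"):
            op.drop_index("ix_papers_arxiv_id", table_name="papers")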

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as Daily API
    participant Search as Search Engine
    participant Builder as Report Builder
    participant LLM as LLM Service
    participant Judge as Judge Service
    participant Registry as Paper Registry
    participant Notifier as Notifier

    Client->>API: POST /api/research/paperscool/daily (request)
    API->>Client: HTTP 200 + text/event-stream (SSE start)

    API->>Search: run queries
    Search-->>API: candidate papers
    API-->>Client: event: search_done / progress

    API->>Builder: build report
    Builder-->>API: report
    API-->>Client: event: report_built

    alt enable_llm_analysis
        API->>LLM: enrich per-paper (summary, relevance, trends, insight)
        LLM-->>API: per-item enrichment
        API-->>Client: events: llm_summary / trend / insight / llm_done
    end

    alt enable_judge
        API->>Judge: score papers (per-query)
        Judge-->>API: judgments
        API-->>Client: events: judge (per-item), judge_done
        API->>API: filter by recommendation (must_read/worth_reading)
        API-->>Client: event: filter_done
    end

    API->>Registry: ingest report & persist judge scores
    Registry-->>API: ingestion result
    API-->>Client: event: result (final report)
    API-->>Client: connection closed

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰✨ I hopped through streams and wrapped each phase,
Summarized, judged, and filtered the maze.
Repos adorned with stars so bright,
Emails sent as day turns night—
Hop, SSE, hooray for pipeline days! 🥕📡

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 9.36%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main changes: SSE streaming for DailyPaper, post-judge filtering, email push notifications, and persisted frontend configuration.
  • Linked Issues check (✅ Passed): The PR fully implements all coded requirements from issue #24: SSE streaming endpoint, post-judge filtering, email override, judge cap increase, SSE-aware proxy, StreamProgressCard, persistent config, and bug fixes.
  • Out of Scope Changes check (✅ Passed): All changes directly support the PR objectives: database migrations and models for the paper registry, scholar routes, research store enhancements, and frontend pages are all required for the saved-papers and scholar features mentioned in the PR scope.




@gemini-code-assist

Summary of Changes

Hello @jerry609, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the DailyPaper workflow by introducing real-time streaming for long-running operations, improving the quality of generated reports through intelligent filtering based on LLM judgments, and adding user-requested features like email notifications and persistent configuration. These changes aim to provide a more interactive, efficient, and user-friendly experience for generating and consuming daily paper digests.

Highlights

  • SSE Streaming for DailyPaper Pipeline: The DailyPaper generation process has been converted from a synchronous JSON response to a Server-Sent Events (SSE) streaming model. This allows for real-time progress updates on the frontend through various phases like search, build, LLM enrichment, judging, filtering, saving, and notification.
  • Post-Judge Paper Filtering: A new filtering step has been introduced after the LLM judging phase. Papers recommended as 'skip' or 'skim' are now automatically removed from the final report, ensuring only 'must_read' and 'worth_reading' papers are presented. Complete judge logs are preserved for auditing.
  • Email Push Notifications: The DailyPaper pipeline now supports email push notifications. The frontend can specify recipient email addresses, overriding environment configurations, to send the generated report via email.
  • Persisted Frontend Configuration: All feature toggles and configuration settings in the frontend (e.g., LLM analysis, judge settings, data sources) are now persisted across page refreshes using a Zustand persist store, significantly improving user experience by retaining preferences.
  • Increased Judge Item Limit: The judge_max_items_per_query limit has been substantially increased from 20 to 200, allowing the LLM judge to process a larger number of papers per query.
  • New GitHub Repository Enrichment: A new API endpoint and corresponding frontend functionality have been added to extract GitHub repository URLs from papers and enrich them with metadata (stars, language, etc.) fetched from the GitHub API.
  • Bug Fixes and Improved Error Handling: Fixed @dataclass inheritance for custom error types (LLMError, APIError, ValidationError) and improved ensure_execution_result to correctly handle raw dictionaries as data rather than parsing them as execution result objects.


Changelog
  • src/paperbot/api/routes/paperscool.py
    • Added imports for os, re, urllib.parse, and requests for new functionalities.
    • Imported ingest_daily_report_to_registry and persist_judge_scores_to_registry for data persistence.
    • Imported extract_github_url for repository URL extraction.
    • Updated DailyPaperRequest to include top_n and judge_max_items_per_query limits up to 200, and added notify_email_to field for email notifications.
    • Increased judge_max_items_per_query in PapersCoolAnalyzeRequest to 200.
    • Introduced new PapersCoolReposRequest and PapersCoolReposResponse models for GitHub repository enrichment.
    • Refactored generate_daily_report into an asynchronous _dailypaper_stream generator for SSE, handling various phases (search, build, LLM, judge, filter, save, notify).
    • Implemented post-judge filtering within _dailypaper_stream to remove 'skip' and 'skim' recommendations.
    • Added logic to persist judge scores and report summaries to the registry within the streaming pipeline.
    • Created _sync_daily_report function to serve as a fast, synchronous fallback for requests without LLM analysis or judging.
    • Added _normalize_github_repo_url, _extract_repo_url_from_paper, _flatten_report_papers, and _fetch_github_repo_metadata helper functions for GitHub repository data retrieval.
    • Implemented enrich_papers_with_repo_data endpoint to provide GitHub repository information for papers.
  • src/paperbot/application/services/daily_push_service.py
    • Added email_to_override parameter to push_dailypaper method.
    • Implemented logic to temporarily override the self.config.email_to with provided email_to_override recipients before sending emails, ensuring the original configuration is restored afterwards.
  • src/paperbot/core/abstractions/executable.py
    • Modified ensure_execution_result to check for specific result keys (success, status, data, error) before treating a dictionary as an ExecutionResult object.
    • If a dictionary lacks these keys, it is now treated as raw data and wrapped in a successful ExecutionResult.
  • src/paperbot/core/errors/errors.py
    • Added @dataclass decorator to LLMError, APIError, and ValidationError classes.
    • Converted code and severity attributes in these error classes to fields with default values, allowing proper inheritance and instantiation with dataclass.
  • tests/unit/test_paperscool_route.py
    • Added _parse_sse_events helper function to parse Server-Sent Events from test responses.
    • Updated _FakeWorkflow to accept min_score parameter.
    • Replaced direct mocking of enrich_daily_paper_report and apply_judge_scores_to_report with mock _FakeLLMService and _FakeJudge classes to simulate SSE streaming behavior.
    • Modified existing tests to assert against SSE stream events rather than direct JSON responses for LLM enrichment and judge functionalities.
    • Added test_paperscool_repos_route_extracts_and_enriches to verify GitHub repository enrichment.
    • Added test_paperscool_daily_route_persists_judge_scores to confirm judge score persistence.
    • Introduced _FakeWorkflowMultiPaper to simulate multiple papers for filtering tests.
    • Added test_dailypaper_sse_filter_removes_low_papers to test the post-judge filtering logic.
    • Added test_dailypaper_sse_full_pipeline_llm_judge_filter for an end-to-end test of the full SSE pipeline.
    • Added test_dailypaper_sync_path_no_llm_no_judge to ensure the synchronous JSON fallback works correctly.
  • web/src/app/api/research/paperscool/daily/route.ts
    • Modified the POST handler to act as an SSE-aware proxy.
    • It now detects the Content-Type from the upstream API and pipes SSE streams directly to the client without buffering.
    • Maintains a JSON fallback path for synchronous responses when LLM/Judge features are disabled.
  • web/src/components/research/TopicWorkflowDashboard.tsx
    • Added useCallback, useRef, and MailIcon imports.
    • Defined RepoRow type for repository enrichment data.
    • Introduced StreamProgressCard component for real-time visualization of the DailyPaper streaming pipeline progress, including phase indicators, progress bar, elapsed time, and a scrollable log.
    • Updated ConfigSheetBody to include new configuration options for useHFDaily, notifyEmail, and notifyEnabled.
    • Refactored configuration state management to use the useWorkflowStore (Zustand persist store) for most settings, ensuring persistence across sessions.
    • Replaced runDailyPaper with runDailyPaperStream to handle SSE events, progressively updating the UI with search results, LLM summaries, judge scores, and filter decisions.
    • Implemented runRepoEnrichment function to fetch and display GitHub repository data for papers.
    • Updated loading indicators and hints to reflect the streaming nature of the DailyPaper process.
    • Adjusted paper display to show the data source (DailyPaper or Search) and updated placeholder text.
  • web/src/lib/stores/workflow-store.ts
    • Defined WorkflowConfig interface to centralize all configurable workflow parameters.
    • Created DEFAULT_CONFIG object with initial values for all workflow settings, with most features enabled by default.
    • Added notifyEmail, notifyEnabled, and config (of type WorkflowConfig) to the WorkflowState interface.
    • Implemented setNotifyEmail, setNotifyEnabled, and updateConfig actions to manage these new state properties.
    • Updated clearAll and the Zustand persist configuration to include the new notifyEmail, notifyEnabled, and config states, ensuring they are properly cleared and persisted.
Activity
  • The core DailyPaper pipeline was refactored to use SSE streaming, enabling real-time updates for long-running operations.
  • Post-judge filtering was implemented to refine the DailyPaper report by removing less relevant papers.
  • Email push notification functionality was added, allowing users to receive reports directly.
  • Frontend configuration settings were migrated to a persistent store, enhancing user experience by remembering preferences.
  • Bug fixes were applied to @dataclass inheritance for error classes and the ensure_execution_result utility function.
  • Three new end-to-end tests were added to cover the new SSE filter, full LLM+Judge+Filter pipeline, and the synchronous fallback path.
  • All code changes passed npx next build cleanly, indicating no new build issues were introduced.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant new functionality, including SSE streaming for the DailyPaper pipeline, post-judging paper filtering, repository enrichment, and refactoring to persist UI configuration. However, it also introduces security vulnerabilities. Specifically, the new email notification feature is vulnerable to header injection, and the report saving logic is vulnerable to path traversal. Additionally, verbose error messages are returned to the user, which could leak sensitive internal information. I've also added a couple of suggestions for minor refactoring to improve efficiency and code clarity in some complex areas.

top_n: int = Field(10, ge=1, le=200)
formats: List[str] = Field(default_factory=lambda: ["both"])
save: bool = False
output_dir: str = "./reports/dailypaper"


Severity: medium (security)

The output_dir parameter in DailyPaperRequest is taken directly from user input and used to construct file paths for saving reports. An attacker can provide an absolute path or a relative path containing .. sequences to write report files to arbitrary locations on the server's filesystem, potentially overwriting files or cluttering sensitive directories.

ingest_summary = ingest_daily_report_to_registry(report)
report["registry_ingest"] = ingest_summary
except Exception as exc:
report["registry_ingest"] = {"error": str(exc)}


Severity: medium (security)

The application catches generic exceptions during the report generation and registry ingestion phases and returns the raw exception string (str(exc)) to the user in the API response. This can expose sensitive internal details such as server file paths, database schema information, or third-party service configurations.

query_index = int(row.get("query_index") or 0)
item_index = int(row.get("item_index") or 0)

queries = list(report.get("queries") or [])


Severity: medium

For efficiency, this list creation can be moved outside the for loop (before line 276). Creating the queries list on every iteration is unnecessary and can be optimized.

Comment on lines +946 to +971
store.updateDailyResult((prev) => {
const sourceQueries = prev.report.queries || []
let matched = false
const nextQueries = sourceQueries.map((query) => {
const queryName = query.normalized_query || query.raw_query || ""
if (queryName !== d.query) return query
const nextItems = (query.top_items || []).map((item) => {
if (item.title === d.title) { matched = true; return { ...item, judge: d.judge } }
return item
})
return { ...query, top_items: nextItems }
})
if (!matched) {
const fallbackQueries = nextQueries.map((query) => {
if (matched) return query
const nextItems = (query.top_items || []).map((item) => {
if (!matched && item.title === d.title) { matched = true; return { ...item, judge: d.judge } }
return item
})
return { ...query, top_items: nextItems }
})
return { ...prev, report: { ...prev.report, queries: fallbackQueries } }
}
return { ...prev, report: { ...prev.report, queries: nextQueries } }
})
}


Severity: medium

This logic to find and update a paper in the store is complex and inefficient due to multiple passes over the data. It can be refactored for better readability and performance by first finding the target paper's indices and then applying an immutable update. This avoids nested maps and confusing flag logic.

          if (d.query && d.title && d.judge) {
            store.updateDailyResult((prev) => {
              const { report } = prev;
              const queries = report.queries || [];
              let targetQueryIndex = -1;
              let targetItemIndex = -1;

              // Find by query name first, then fallback to searching all queries
              const primaryQueryIndex = queries.findIndex(q => (q.normalized_query || q.raw_query || "") === d.query);
              if (primaryQueryIndex !== -1) {
                const itemIdx = (queries[primaryQueryIndex].top_items || []).findIndex(it => it.title === d.title);
                if (itemIdx !== -1) {
                  targetQueryIndex = primaryQueryIndex;
                  targetItemIndex = itemIdx;
                }
              }
              if (targetItemIndex === -1) {
                for (let i = 0; i < queries.length; i++) {
                  const itemIdx = (queries[i].top_items || []).findIndex(it => it.title === d.title);
                  if (itemIdx !== -1) {
                    targetQueryIndex = i;
                    targetItemIndex = itemIdx;
                    break;
                  }
                }
              }

              // If found, apply immutable update
              if (targetQueryIndex !== -1 && targetItemIndex !== -1) {
                const newQueries = queries.map((query, qIdx) => {
                  if (qIdx !== targetQueryIndex) return query;
                  const newItems = (query.top_items || []).map((item, iIdx) => {
                    if (iIdx !== targetItemIndex) return item;
                    return { ...item, judge: d.judge };
                  });
                  return { ...query, top_items: newItems };
                });
                return { ...prev, report: { ...report, queries: newQueries } };
              }

              return prev; // No change
            });
          }


Copilot AI left a comment


Pull request overview

This PR upgrades the DailyPaper workflow to support SSE streaming for long-running LLM/Judge steps, adds post-judge filtering and email notifications, persists UI configuration via zustand, and introduces a repos-enrichment endpoint/UI.

Changes:

  • Implement SSE streaming for /api/research/paperscool/daily (while keeping a sync JSON fast path) and update the Next.js proxy + UI to consume and display live progress/events.
  • Add post-judge filtering (keep only must_read / worth_reading) and email notification overrides from the UI.
  • Persist workflow configuration in zustand and add repo-enrichment UI + backend endpoint.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

Summary per file:
  • web/src/lib/stores/workflow-store.ts: Adds persisted workflow config + email notification settings to the zustand store.
  • web/src/components/research/TopicWorkflowDashboard.tsx: Consumes the DailyPaper SSE stream, displays live progress/logs, persists config via the store, adds email notification UI and repo enrichment UI.
  • web/src/app/api/research/paperscool/daily/route.ts: Updates the API proxy to pass through SSE without buffering, with JSON fallback.
  • tests/unit/test_paperscool_route.py: Adds/updates tests for the SSE daily pipeline (LLM/Judge/filter), sync fallback, and repos enrichment endpoint.
  • src/paperbot/core/errors/errors.py: Fixes dataclass inheritance/default overriding for custom error types.
  • src/paperbot/core/abstractions/executable.py: Fixes ensure_execution_result to treat “plain dict” outputs as data unless result keys are present.
  • src/paperbot/application/services/daily_push_service.py: Adds per-request email recipient override for notifications.
  • src/paperbot/api/routes/paperscool.py: Adds DailyPaper SSE generator, judge filtering, registry persistence hooks, email recipient override, and a new repos enrichment endpoint.


Comment on lines +225 to +226
type StreamPhase = "idle" | "search" | "build" | "llm" | "judge" | "filter" | "save" | "notify" | "done" | "error"


Copilot AI Feb 10, 2026


StreamPhase does not include an "insight" phase, but the backend stream can emit progress events with phase: "insight" (see paperscool daily SSE). When that happens the phase indicator/progress math falls back to indexOf(...) === -1, so the progress bar and step dots can behave incorrectly. Align the phase contract by either adding "insight" to StreamPhase/PHASE_LABELS/PHASE_ORDER (as part of LLM), or by mapping unknown phases to a known bucket (e.g., treat "insight" as "llm").

Comment on lines +1290 to +1294
body: JSON.stringify({
report: dailyResult.report,
max_items: 500,
include_github_api: true,
}),

Copilot AI Feb 10, 2026


The repo enrichment request asks the backend for max_items: 500 with include_github_api: true, which can translate into hundreds of outbound GitHub API calls for a single click (slow UI and easy to hit rate limits/timeouts). Consider lowering the default max_items, making GitHub API enrichment opt-in, and/or surfacing a smaller configurable limit in the UI.

return None

owner, repo = match.group(1), match.group(2)
repo = repo.removesuffix(".git")

Copilot AI Feb 10, 2026


repo.removesuffix(".git") requires Python 3.9+, but pyproject.toml declares requires-python = ">=3.8". This will crash at runtime under Python 3.8. Replace with a 3.8-compatible implementation (e.g., an endswith check + slicing) or bump the minimum supported Python version.

Suggested change
repo = repo.removesuffix(".git")
if repo.endswith(".git"):
repo = repo[: -len(".git")]

Comment on lines +97 to +102
class PapersCoolReposRequest(BaseModel):
report: Optional[Dict[str, Any]] = None
papers: List[Dict[str, Any]] = Field(default_factory=list)
max_items: int = Field(100, ge=1, le=1000)
include_github_api: bool = True


Copilot AI Feb 10, 2026


PapersCoolReposRequest defaults include_github_api to True and allows max_items up to 1000. Combined with the implementation making a GitHub API call per candidate, this endpoint can trigger a very large number of outbound requests from a single unauthenticated call (rate limiting/DoS risk and very long request times). Consider defaulting include_github_api to False, enforcing a much smaller server-side cap when it is True, and/or adding authentication/rate limiting.

Comment on lines +686 to +703
selected = deduped[: max(1, int(req.max_items))]
token = os.getenv("GITHUB_TOKEN") or os.getenv("GH_TOKEN")

repos: List[Dict[str, Any]] = []
for item in selected:
repo_url = _extract_repo_url_from_paper(item)
if not repo_url:
continue

row: Dict[str, Any] = {
"title": item.get("title") or "Untitled",
"query": item.get("_query") or ", ".join(item.get("matched_queries") or []),
"paper_url": item.get("url") or item.get("external_url") or "",
"repo_url": repo_url,
}
if req.include_github_api:
row["github"] = _fetch_github_repo_metadata(repo_url=repo_url, token=token)
repos.append(row)

Copilot AI Feb 10, 2026


The repos enrichment loop performs GitHub API requests sequentially for every selected paper. With the current limits this can lead to multi-minute requests and ties up worker threads. Consider adding a hard cap (especially when include_github_api is enabled), caching per repo URL, and/or parallelizing requests with bounded concurrency (or switching to async HTTP client).
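
One possible shape for the bounded-concurrency suggestion, reusing the helper names visible in the snippet above (a sketch under stated assumptions, not the PR's implementation; the cap and worker count are arbitrary):

    from concurrent.futures import ThreadPoolExecutor
    from typing import Any, Dict, List, Optional

    MAX_GITHUB_LOOKUPS = 50  # hard cap when include_github_api is enabled

    def _fetch_repo_rows(selected: List[Dict[str, Any]], token: Optional[str]) -> List[Dict[str, Any]]:
        candidates = []
        for item in selected[:MAX_GITHUB_LOOKUPS]:
            repo_url = _extract_repo_url_from_paper(item)
            if repo_url:
                candidates.append((item, repo_url))
        # Fetch GitHub metadata with bounded concurrency instead of one call at a time.
        with ThreadPoolExecutor(max_workers=8) as pool:
            metadata = list(pool.map(
                lambda pair: _fetch_github_repo_metadata(repo_url=pair[1], token=token),
                candidates,
            ))
        return [
            {"title": item.get("title") or "Untitled", "repo_url": repo_url, "github": meta}
            for (item, repo_url), meta in zip(candidates, metadata)
        ]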

}
} catch (err) { setError(String(err)); store.setPhase("error") } finally { setLoadingAnalyze(false) }
} catch (err) {
streamFailed = true

Copilot AI Feb 10, 2026


The value assigned to streamFailed here is unused.

Suggested change
streamFailed = true

def judge_with_calibration(self, *, paper, query, n_runs=1):
return _FakeJudgment()

monkeypatch.setattr(paperscool_route, "get_llm_service", lambda: object())

Copilot AI Feb 10, 2026


This 'lambda' is just a simple wrapper around a callable object. Use that object directly.

def judge_with_calibration(self, *, paper, query, n_runs=1):
return _FakeJudgment()

monkeypatch.setattr(paperscool_route, "get_llm_service", lambda: object())

Copilot AI Feb 10, 2026


This 'lambda' is just a simple wrapper around a callable object. Use that object directly.

Suggested change
monkeypatch.setattr(paperscool_route, "get_llm_service", lambda: object())
monkeypatch.setattr(paperscool_route, "get_llm_service", object)

def judge_with_calibration(self, *, paper, query, n_runs=1):
return _VaryingJudgment(paper.get("title", ""))

monkeypatch.setattr(paperscool_route, "get_llm_service", lambda: object())

Copilot AI Feb 10, 2026


This 'lambda' is just a simple wrapper around a callable object. Use that object directly.

Suggested change
monkeypatch.setattr(paperscool_route, "get_llm_service", lambda: object())
monkeypatch.setattr(paperscool_route, "get_llm_service", object)

Comment on lines +18 to +19
except Exception:
pass

Copilot AI Feb 10, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
pass
except json.JSONDecodeError:
# Ignore malformed JSON payloads in the SSE stream.
continue


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the DailyPaper pipeline by converting it to SSE streaming, adding post-judge paper filtering, email push notifications, and persisting configuration settings, alongside backend API and frontend UI updates, and a comprehensive test suite. However, it introduces critical security vulnerabilities including a Path Traversal vulnerability in the report saving logic, an Email Header Injection vulnerability in the notification service, and a Cross-Site Scripting (XSS) vulnerability in the frontend dashboard, stemming from insecure handling of user-supplied paths, email addresses, and URLs. These issues require immediate attention through proper input validation and sanitization.

json_path = None
notify_result: Optional[Dict[str, Any]] = None
if req.save:
reporter = DailyPaperReporter(output_dir=req.output_dir)


Severity: medium (security)

The output_dir parameter in DailyPaperRequest is used directly to construct file system paths without sanitization. An attacker could provide a path like /etc/ or ../../ to create directories and write report files in arbitrary locations on the server.

Recommendation: Sanitize the output_dir input to ensure it is a relative path and does not contain directory traversal sequences. Alternatively, enforce a restricted base directory for all file operations.
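
One way to implement that recommendation is to pin saves under a fixed base directory; a sketch, where REPORTS_ROOT and the error handling are assumptions rather than code from this PR:

    from pathlib import Path

    REPORTS_ROOT = Path("./reports").resolve()  # assumed server-side base directory

    def resolve_output_dir(user_output_dir: str) -> Path:
        """Reject absolute paths and '..' traversal by requiring the resolved path to stay under REPORTS_ROOT."""
        candidate = (REPORTS_ROOT / user_output_dir).resolve()
        try:
            candidate.relative_to(REPORTS_ROOT)
        except ValueError:
            raise ValueError(f"output_dir must resolve inside {REPORTS_ROOT}")
        return candidate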

Comment on lines +108 to +111
if email_to_override:
cleaned = [e.strip() for e in email_to_override if (e or "").strip()]
if cleaned:
self.config.email_to = cleaned


Severity: medium (security)

The email_to_override parameter allows users to specify arbitrary email recipients. These addresses are joined into the To header of the outgoing email without validating for newline characters. An attacker could inject additional headers (e.g., Bcc:) by including \r\n in the email address string.

Recommendation: Validate that each email address in the list conforms to a standard email format and does not contain control characters or newlines.
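
A minimal validation pass matching that recommendation; the regex is deliberately simple and illustrative rather than a full RFC 5322 check:

    import re
    from typing import List

    _EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def clean_recipients(email_to_override: List[str]) -> List[str]:
        """Keep only well-formed addresses; drop anything containing control characters."""
        cleaned = []
        for raw in email_to_override or []:
            addr = (raw or "").strip()
            if not addr or any(ord(ch) < 32 for ch in addr):
                continue  # rejects header-injection attempts such as "a@b.com\r\nBcc: ..."
            if _EMAIL_RE.match(addr):
                cleaned.append(addr)
        return cleaned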

{repoRows.map((row, idx) => (
<tr key={`${row.repo_url}-${idx}`} className="odd:bg-muted/20">
<td className="border-b px-3 py-2">
{row.paper_url ? <a href={row.paper_url} target="_blank" rel="noreferrer" className="hover:underline text-primary">{row.title}</a> : row.title}


Severity: medium (security)

The application renders paper URLs directly into the href attribute of anchor tags in the repository enrichment table. Since these URLs can be controlled by user input or external sources and are not sanitized, an attacker could inject javascript: URLs to execute arbitrary code in the context of the user's session when the link is clicked.

Recommendation: Sanitize all URLs before rendering them in the frontend. Ensure that only safe protocols (e.g., http:, https:) are allowed.

Comment on lines +344 to +371
total_before = 0
total_after = 0
for query in report.get("queries") or []:
query_name = query.get("normalized_query") or query.get("raw_query") or ""
items_before = list(query.get("top_items") or [])
total_before += len(items_before)
kept: List[Dict[str, Any]] = []
removed: List[Dict[str, Any]] = []
for item in items_before:
j = item.get("judge")
if isinstance(j, dict):
rec = j.get("recommendation", "")
if rec in KEEP_RECOMMENDATIONS:
kept.append(item)
else:
removed.append(item)
filter_log.append({
"query": query_name,
"title": item.get("title") or "Untitled",
"recommendation": rec,
"overall": j.get("overall"),
"action": "removed",
})
else:
# No judge score — keep by default (unjudged papers)
kept.append(item)
total_after += len(kept)
query["top_items"] = kept


Severity: medium

The filter_log is created and populated within the _dailypaper_stream function, and then included in the report["filter"] dictionary. This filter block is then yielded in the filter_done event. However, the report object itself is modified in place to reflect the filtered items (query["top_items"] = kept). While this works, it means the report yielded in the final result event already contains the filtered data, making the filter_done event's log potentially redundant if the frontend only cares about the final report state. Consider if the filter_done event needs to explicitly send the report object or if the log is sufficient, and clarify the expected state of report at each yield point.

<div className="ml-6 grid grid-cols-3 gap-2">
<div className="space-y-1"><Label className="text-xs">Runs</Label><Input type="number" min={1} max={5} value={judgeRuns} onChange={(e) => setJudgeRuns(Number(e.target.value || 1))} className="h-8 text-sm" /></div>
<div className="space-y-1"><Label className="text-xs">Max Items</Label><Input type="number" min={1} max={20} value={judgeMaxItems} onChange={(e) => setJudgeMaxItems(Number(e.target.value || 5))} className="h-8 text-sm" /></div>
<div className="space-y-1"><Label className="text-xs">Max Items</Label><Input type="number" min={1} value={judgeMaxItems} onChange={(e) => setJudgeMaxItems(Number(e.target.value || 20))} className="h-8 text-sm" /></div>


Severity: medium

The judgeMaxItems input field is missing a max attribute. The backend judge_max_items_per_query was updated to allow up to 200 items, but the frontend UI still implicitly limits it or doesn't provide the correct upper bound to the user. Please update the max attribute to 200 to align with the backend's new limit.

Suggested change
<div className="space-y-1"><Label className="text-xs">Max Items</Label><Input type="number" min={1} value={judgeMaxItems} onChange={(e) => setJudgeMaxItems(Number(e.target.value || 20))} className="h-8 text-sm" /></div>
<div className="space-y-1"><Label className="text-xs">Max Items</Label><Input type="number" min={1} max={200} value={judgeMaxItems} onChange={(e) => setJudgeMaxItems(Number(e.target.value || 20))} className="h-8 text-sm" /></div>

Comment on lines +947 to +968
const sourceQueries = prev.report.queries || []
let matched = false
const nextQueries = sourceQueries.map((query) => {
const queryName = query.normalized_query || query.raw_query || ""
if (queryName !== d.query) return query
const nextItems = (query.top_items || []).map((item) => {
if (item.title === d.title) { matched = true; return { ...item, judge: d.judge } }
return item
})
return { ...query, top_items: nextItems }
})
if (!matched) {
const fallbackQueries = nextQueries.map((query) => {
if (matched) return query
const nextItems = (query.top_items || []).map((item) => {
if (!matched && item.title === d.title) { matched = true; return { ...item, judge: d.judge } }
return item
})
return { ...query, top_items: nextItems }
})
return { ...prev, report: { ...prev.report, queries: fallbackQueries } }
}


Severity: medium

The logic for updating dailyResult in the judge event handler uses a matched flag as a side effect within map callbacks, and then has a fallbackQueries block that re-iterates. This pattern can be confusing and is generally discouraged in functional programming contexts like React state updates. A cleaner approach would be to first find the specific item to update, and then construct the new immutable state based on that finding. This improves readability and maintainability.

            store.updateDailyResult((prev) => {
              const nextReport = { ...prev.report };
              let updated = false;

              // Try to find and update in the specific query first
              nextReport.queries = (nextReport.queries || []).map((query) => {
                  const queryName = query.normalized_query || query.raw_query || "";
                  if (queryName === d.query) {
                      const nextItems = (query.top_items || []).map((item) => {
                          if (item.title === d.title) {
                              updated = true;
                              return { ...item, judge: d.judge };
                          }
                          return item;
                      });
                      return { ...query, top_items: nextItems };
                  }
                  return query;
              });

              // If not updated in the specific query, try to find and update in any other query
              if (!updated) {
                  nextReport.queries = (nextReport.queries || []).map((query) => {
                      const nextItems = (query.top_items || []).map((item) => {
                          if (item.title === d.title && !item.judge) { // Only update if not already judged
                              updated = true;
                              return { ...item, judge: d.judge };
                          }
                          return item;
                      });
                      return { ...query, top_items: nextItems };
                  });
              }

              return { ...prev, report: nextReport };
            })


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
web/src/lib/stores/workflow-store.ts (1)

159-169: ⚠️ Potential issue | 🟡 Minor

Persisted config won't merge with new defaults on upgrade.

Zustand's persist middleware replaces the stored value as-is. Users with a previously persisted config object will be missing any newly added keys (e.g., useHFDaily, judgeTokenBudget). Their UI toggles for new features will be undefined instead of the intended defaults.

Consider adding a version + migrate to the persist options, or use a merge strategy that fills in missing keys from DEFAULT_CONFIG:

Example merge approach
     {
       name: "paperbot-workflow",
+      merge: (persisted, current) => ({
+        ...current,
+        ...(persisted as object),
+        config: { ...DEFAULT_CONFIG, ...((persisted as any)?.config || {}) },
+      }),
       partialize: (state) => ({
🤖 Fix all issues with AI agents
In `@src/paperbot/api/routes/paperscool.py`:
- Around line 689-703: The loop that calls _fetch_github_repo_metadata for each
selected item can be very slow and hit rate limits; when req.include_github_api
is true enforce a sensible default cap (e.g., max_concurrent_items =
min(requested_max_items, 50)) and only iterate the first N items, run the
fetches concurrently with a bounded ThreadPoolExecutor (or asyncio.gather
equivalent) to parallelize requests and set per-request timeouts, and modify
_fetch_github_repo_metadata to return a distinguishable rate-limit result or
raise a RateLimit exception on HTTP 403/429 so the caller can short-circuit the
remaining fetches when a rate-limit response is observed. Ensure you reference
the selected loop that builds row and the repo_url variable, replace the
synchronous per-item call in that loop with concurrent submission/collection and
stop further processing on rate-limit signals.
- Around line 373-384: The global_top filtering currently builds global_kept
from global_before but doesn't record removed papers in filter_log; update the
block that processes report["global_top"] (variables global_before, global_kept)
to append each removed item into report["filter_log"] with the same
structure/reason used for per-query top_items (include judge/recommendation and
a clear reason like "filtered from global_top: recommendation not in
KEEP_RECOMMENDATIONS"), so any item skipped (i.e., not added to global_kept) is
logged before setting report["global_top"] = global_kept.
- Around line 132-147: The async generator _dailypaper_stream currently calls
blocking synchronous functions (e.g., PapersCoolTopicSearchWorkflow.run,
llm_service.summarize_paper, judge.judge_single, registry.ingest,
notification.push and any other heavy sync methods) directly which will stall
the event loop; change each blocking call inside _dailypaper_stream to run off
the event loop via asyncio.to_thread(...) or loop.run_in_executor(...), e.g.,
call await asyncio.to_thread(PapersCoolTopicSearchWorkflow().run, ...) and
similarly wrap llm_service.summarize_paper, judge.judge_single, registry.ingest,
notification.push (and any helper methods invoked within the generator) so the
SSE generator yields progress events while the work runs in background threads.
- Around line 24-27: The build fails because paperscool.py imports
ingest_daily_report_to_registry and persist_judge_scores_to_registry from
paperbot.application.workflows.dailypaper but those functions are missing; fix
by either adding matching functions to the dailypaper module (implementing the
expected signatures and behavior so paperscool.py can call
ingest_daily_report_to_registry and persist_judge_scores_to_registry) or remove
the imports and refactor the callers in paperscool.py to use the correct
existing helpers (replace calls to ingest_daily_report_to_registry and
persist_judge_scores_to_registry with the appropriate implemented functions like
normalize_llm_features/normalize_output_formats or other registry persistence
utilities) and update any tests/usages accordingly.

In `@src/paperbot/application/services/daily_push_service.py`:
- Around line 106-137: The code mutates shared self.config.email_to causing
concurrency hazards; instead compute a local effective_recipient list from
email_to_override (e.g., cleaned or fallback to self.config.email_to) before the
channels loop, remove the temporary assignment/restoration of
self.config.email_to, and pass that local variable into _send_email; update
_send_email signature to accept an optional email_to parameter (e.g., def
_send_email(..., email_to: Optional[List[str]] = None) and use recipients =
email_to or self.config.email_to internally) and update the call site in the
channels loop to _send_email(subject=subject, body=text,
email_to=effective_recipient).
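
A minimal sketch of the local-recipients approach the daily_push_service prompt above describes (a simplified stand-in; only email_to_override, _send_email, and config.email_to come from this PR):

    from typing import List, Optional

    class DailyPushService:  # simplified stand-in for the real service
        def __init__(self, config):
            self.config = config  # assumed to expose .email_to (list of addresses)

        def push_dailypaper(self, *, subject: str, text: str,
                            email_to_override: Optional[List[str]] = None) -> None:
            # Compute the effective recipients locally instead of mutating shared
            # config, so concurrent pushes cannot observe each other's override.
            cleaned = [e.strip() for e in (email_to_override or []) if (e or "").strip()]
            effective_recipients = cleaned or list(self.config.email_to)
            self._send_email(subject=subject, body=text, email_to=effective_recipients)

        def _send_email(self, *, subject: str, body: str,
                        email_to: Optional[List[str]] = None) -> None:
            recipients = email_to or self.config.email_to
            print(f"sending {subject!r} to {recipients}")  # placeholder for the real email channel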
🧹 Nitpick comments (11)
web/src/app/api/research/paperscool/daily/route.ts (1)

5-49: Clean SSE proxy implementation.

The explicit fetch + conditional streaming logic is well-structured. A couple of notes:

  1. Connection: keep-alive (Line 36) is a hop-by-hop header that HTTP/2 forbids. If Next.js serves over HTTP/2 in production, this is silently dropped or may trigger warnings in strict proxies. Consider removing it — the SSE behavior works fine without it.

  2. Consider adding cache: "no-store" to the fetch options to be explicit about bypassing any Next.js data cache on this mutable endpoint.

Suggested tweaks
     upstream = await fetch(`${apiBaseUrl()}/api/research/paperscool/daily`, {
       method: "POST",
       headers: {
         "Content-Type": contentType,
         Accept: "text/event-stream, application/json",
       },
       body,
+      cache: "no-store",
     })
     return new Response(upstream.body, {
       status: upstream.status,
       headers: {
         "Content-Type": "text/event-stream",
         "Cache-Control": "no-cache",
-        Connection: "keep-alive",
       },
     })
web/src/lib/stores/workflow-store.ts (1)

198-205: clearAll doesn't reset notifyEmail, notifyEnabled, or config.

If "clearAll" is meant to be a full reset, these new fields should be included. If preserving user preferences is intentional, consider renaming to clearResults for clarity.

If full reset is desired
       clearAll: () =>
         set({
           searchResult: null,
           dailyResult: null,
           phase: "idle",
           analyzeLog: [],
           lastUpdated: null,
+          notifyEmail: "",
+          notifyEnabled: false,
+          config: { ...DEFAULT_CONFIG },
         }),
web/src/components/research/TopicWorkflowDashboard.tsx (3)

253-253: Elapsed timer won't auto-update between SSE events.

elapsed is computed once per render (Date.now() - startTime). Between SSE events, there's no re-render trigger, so the timer appears frozen until the next event arrives. For long phases (e.g., judge scoring many papers), the displayed time will lag.

A useEffect with a 1-second setInterval updating a now state would provide a smooth ticking timer.


835-1035: SSE event handlers are largely duplicated between runDailyPaperStream and runAnalyzeStream.

The judge, trend, insight, judge_done, and result event handlers in runDailyPaperStream (Lines 835–1035) are near-identical copies of those in runAnalyzeStream (Lines 1079–1263). Consider extracting a shared handler function (e.g., handleSSEEvent(event, store, ...)) to reduce duplication and keep the two code paths in sync.


600-647: Config accessors are verbose but functional.

Each config property creates a getter + setter wrapping uc(...). Since every setter is a new closure per render, consider using useMemo or extracting a thin hook (e.g., useConfigField("topK")) to reduce boilerplate if this pattern grows. Not urgent given current scope.

tests/unit/test_paperscool_route.py (2)

7-20: Consider raising on parse errors in _parse_sse_events.

The bare except Exception: pass on Line 18 silently drops malformed SSE payloads. In tests, this can mask real issues — a test might pass with fewer events than expected without any indication that events were lost. At minimum, appending a sentinel (e.g., {"_parse_error": payload}) would surface problems during assertions.

Suggestion
             try:
                 events.append(json.loads(payload))
-            except Exception:
-                pass
+            except json.JSONDecodeError:
+                events.append({"_parse_error": payload})

As per coding guidelines, "Prefer deterministic tests and mock external APIs when possible" — surfacing parse errors ensures test determinism.


169-192: Extract duplicated fake classes to module level.

_FakeJudgment, _FakeJudge, and _FakeLLMService are defined nearly identically inside 4–5 separate test functions. Extracting them to module-level (like _FakeWorkflow) would cut ~150 lines of duplication and ensure consistency if the fakes need updating.

The _VaryingJudgment variants could remain local since they have test-specific behavior, but the standard fakes should be shared.

Also applies to: 340-363, 446-478, 534-585

src/paperbot/api/routes/paperscool.py (4)

476-479: Sync function called from an async endpoint blocks the event loop.

_sync_daily_report is a regular synchronous function called directly from the async def generate_daily_report handler (line 466). This blocks the asyncio event loop for the duration of workflow.run(...) and all subsequent processing. Since this is the "fast path" (no LLM/Judge), the impact is smaller than the streaming case, but it still blocks concurrent request handling.

Consider wrapping the call in asyncio.to_thread():

-        return _sync_daily_report(req, cleaned_queries)
+        return await asyncio.to_thread(_sync_daily_report, req, cleaned_queries)

677-684: Deduplication logic is duplicated from _flatten_report_papers (lines 603–611).

Both use the same url|title composite key. Extract a small helper to avoid drift.

Example helper
def _dedupe_papers(papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    seen: set[str] = set()
    result: List[Dict[str, Any]] = []
    for item in papers:
        key = f"{item.get('url') or ''}|{item.get('title') or ''}"
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

870-873: Consistent with the streaming pattern — same event-loop blocking concern applies.

The persist_judge_scores_to_registry call here has the same async-blocking issue noted for _dailypaper_stream. When you address that, apply the same asyncio.to_thread() treatment here.


161-168: Emitting the full report in the report_built SSE event may be unnecessarily large.

The complete report dict is sent at line 166, then the same report (now enriched) is sent again in the final result event (line 449). With up to 200 papers, this can be a significant payload duplicated over the SSE stream. Consider sending only summary metadata here and reserving the full report for the final result event.
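A sketch of the slimmer event under that approach (the count field names are chosen for illustration; only the StreamEvent shape is taken from the code above):

# Hypothetical slimmer payload: counts only; the enriched report still ships once, in the final "result" event.
yield StreamEvent(
    type="report_built",
    data={
        "query_count": len(report.get("queries") or []),
        "paper_count": sum(len(q.get("top_items") or []) for q in (report.get("queries") or [])),
        "global_top_count": len(report.get("global_top") or []),
    },
)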

Comment on lines +132 to +147
async def _dailypaper_stream(req: DailyPaperRequest):
    """SSE generator for the full DailyPaper pipeline."""
    cleaned_queries = [q.strip() for q in req.queries if (q or "").strip()]

    # Phase 1 — Search
    yield StreamEvent(type="progress", data={"phase": "search", "message": "Searching papers..."})
    workflow = PapersCoolTopicSearchWorkflow()
    effective_top_k = max(int(req.top_k_per_query), int(req.top_n), 1)
    search_result = workflow.run(
        queries=cleaned_queries,
        sources=req.sources,
        branches=req.branches,
        top_k_per_query=effective_top_k,
        show_per_branch=req.show_per_branch,
        min_score=req.min_score,
    )

⚠️ Potential issue | 🟠 Major

Blocking synchronous calls inside an async generator will stall the event loop.

_dailypaper_stream is an async def generator, but every heavy operation it performs is synchronous (workflow.run, llm_service.summarize_paper, judge.judge_single, etc.), so nothing actually yields control. While the generator is executing a blocking call, the entire asyncio event loop is frozen — no other requests can be served, and keep-alive pings cannot be sent.

Wrap blocking calls with asyncio.to_thread() (or loop.run_in_executor()), e.g.:

Example fix for the search phase
+import asyncio
 ...
-    search_result = workflow.run(
-        queries=cleaned_queries,
+    search_result = await asyncio.to_thread(
+        workflow.run,
+        queries=cleaned_queries,
         sources=req.sources,
         branches=req.branches,
         top_k_per_query=effective_top_k,
         show_per_branch=req.show_per_branch,
         min_score=req.min_score,
     )

The same pattern should be applied to every synchronous call in this generator (LLM summarize/relevance/trends/insight, judge calls, registry ingestion, notification push, etc.).

🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/paperscool.py` around lines 132 - 147, The async
generator _dailypaper_stream currently calls blocking synchronous functions
(e.g., PapersCoolTopicSearchWorkflow.run, llm_service.summarize_paper,
judge.judge_single, registry.ingest, notification.push and any other heavy sync
methods) directly which will stall the event loop; change each blocking call
inside _dailypaper_stream to run off the event loop via asyncio.to_thread(...)
or loop.run_in_executor(...), e.g., call await
asyncio.to_thread(PapersCoolTopicSearchWorkflow().run, ...) and similarly wrap
llm_service.summarize_paper, judge.judge_single, registry.ingest,
notification.push (and any helper methods invoked within the generator) so the
SSE generator yields progress events while the work runs in background threads.

Comment on lines +373 to +384
# Also filter global_top
global_before = list(report.get("global_top") or [])
global_kept = []
for item in global_before:
    j = item.get("judge")
    if isinstance(j, dict):
        rec = j.get("recommendation", "")
        if rec in KEEP_RECOMMENDATIONS:
            global_kept.append(item)
    else:
        global_kept.append(item)
report["global_top"] = global_kept

⚠️ Potential issue | 🟡 Minor

Papers removed from global_top are not recorded in filter_log.

The filter log (lines 360-366) only captures removals from per-query top_items. Papers filtered out of global_top here are silently dropped, making it harder to trace why a paper disappeared.

Proposed fix
     global_kept = []
     for item in global_before:
         j = item.get("judge")
         if isinstance(j, dict):
             rec = j.get("recommendation", "")
             if rec in KEEP_RECOMMENDATIONS:
                 global_kept.append(item)
+            else:
+                filter_log.append({
+                    "query": "_global_top",
+                    "title": item.get("title") or "Untitled",
+                    "recommendation": rec,
+                    "overall": j.get("overall"),
+                    "action": "removed",
+                })
         else:
             global_kept.append(item)
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/paperscool.py` around lines 373 - 384, The global_top
filtering currently builds global_kept from global_before but doesn't record
removed papers in filter_log; update the block that processes
report["global_top"] (variables global_before, global_kept) to append each
removed item into report["filter_log"] with the same structure/reason used for
per-query top_items (include judge/recommendation and a clear reason like
"filtered from global_top: recommendation not in KEEP_RECOMMENDATIONS"), so any
item skipped (i.e., not added to global_kept) is logged before setting
report["global_top"] = global_kept.

Comment on lines +689 to +703
repos: List[Dict[str, Any]] = []
for item in selected:
    repo_url = _extract_repo_url_from_paper(item)
    if not repo_url:
        continue

    row: Dict[str, Any] = {
        "title": item.get("title") or "Untitled",
        "query": item.get("_query") or ", ".join(item.get("matched_queries") or []),
        "paper_url": item.get("url") or item.get("external_url") or "",
        "repo_url": repo_url,
    }
    if req.include_github_api:
        row["github"] = _fetch_github_repo_metadata(repo_url=repo_url, token=token)
    repos.append(row)

⚠️ Potential issue | 🟠 Major

Sequential GitHub API calls can be extremely slow and will likely hit rate limits.

With max_items up to 1000, each _fetch_github_repo_metadata call makes a synchronous HTTP request with a 15s timeout. GitHub's unauthenticated rate limit is 60 req/hour (5,000 authenticated). This endpoint could easily exceed limits or take many minutes to complete, causing request timeouts.

Consider:

  1. Adding a reasonable default/cap for max_items when include_github_api is true (e.g., 50).
  2. Short-circuiting on rate-limit responses (HTTP 403/429) to avoid burning through remaining items.
  3. Using concurrent requests (e.g., concurrent.futures.ThreadPoolExecutor) to speed up fetches.
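Putting those three ideas together, a rough sketch (the 50-item cap, the worker count, and the rate_limited flag are illustrative assumptions, not existing project API):

from concurrent.futures import ThreadPoolExecutor

GITHUB_FETCH_CAP = 50  # assumed default cap when include_github_api is true


def fetch_repo_metadata_bounded(repo_urls, token, fetch_one):
    """fetch_one(repo_url, token) is expected to return a dict and set {"rate_limited": True} on HTTP 403/429."""
    results = {}
    urls = list(repo_urls)[:GITHUB_FETCH_CAP]
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(fetch_one, url, token): url for url in urls}
        for future, url in futures.items():
            meta = future.result()
            if meta.get("rate_limited"):
                # Stop collecting further results; already-submitted fetches finish in the background.
                break
            results[url] = meta
    return results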
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/paperscool.py` around lines 689 - 703, The loop that
calls _fetch_github_repo_metadata for each selected item can be very slow and
hit rate limits; when req.include_github_api is true enforce a sensible default
cap (e.g., max_concurrent_items = min(requested_max_items, 50)) and only iterate
the first N items, run the fetches concurrently with a bounded
ThreadPoolExecutor (or asyncio.gather equivalent) to parallelize requests and
set per-request timeouts, and modify _fetch_github_repo_metadata to return a
distinguishable rate-limit result or raise a RateLimit exception on HTTP 403/429
so the caller can short-circuit the remaining fetches when a rate-limit response
is observed. Ensure you reference the selected loop that builds row and the
repo_url variable, replace the synchronous per-item call in that loop with
concurrent submission/collection and stop further processing on rate-limit
signals.

Comment on lines +106 to +137
# Allow UI-provided email recipients to override env config
original_email_to = self.config.email_to
if email_to_override:
    cleaned = [e.strip() for e in email_to_override if (e or "").strip()]
    if cleaned:
        self.config.email_to = cleaned

subject = self._build_subject(report)
text = self._build_text(
    report, markdown=markdown, markdown_path=markdown_path, json_path=json_path
)

results: Dict[str, Any] = {"sent": False, "channels": channels, "results": {}}
any_success = False
try:
    for channel in channels:
        try:
            if channel == "email":
                self._send_email(subject=subject, body=text)
            elif channel == "slack":
                self._send_slack(subject=subject, body=text)
            elif channel in {"dingtalk", "dingding"}:
                self._send_dingtalk(subject=subject, body=text)
            else:
                raise ValueError(f"unsupported channel: {channel}")
            results["results"][channel] = {"ok": True}
            any_success = True
        except Exception as exc:  # pragma: no cover - runtime specific
            logger.warning("Daily push failed channel=%s err=%s", channel, exc)
            results["results"][channel] = {"ok": False, "error": str(exc)}
finally:
    self.config.email_to = original_email_to

⚠️ Potential issue | 🟠 Major

Mutating shared self.config.email_to creates a concurrency hazard.

The override temporarily replaces self.config.email_to on the shared config object and restores it in finally. If multiple requests invoke push_dailypaper concurrently (common in async FastAPI), one request's override can leak into another's _send_email call.

A safer approach is to pass the effective recipient list as a local variable:

🔒 Proposed fix — avoid shared state mutation
     def push_dailypaper(
         self,
         *,
         report: Dict[str, Any],
         markdown: str = "",
         markdown_path: Optional[str] = None,
         json_path: Optional[str] = None,
         channels_override: Optional[List[str]] = None,
         email_to_override: Optional[List[str]] = None,
     ) -> Dict[str, Any]:
         channels = channels_override or self.config.channels
         channels = [c.strip().lower() for c in channels if c and c.strip()]
 
         if not self.config.enabled:
             return {"sent": False, "reason": "notify disabled", "channels": channels}
         if not channels:
             return {"sent": False, "reason": "no channels configured", "channels": channels}
 
-        # Allow UI-provided email recipients to override env config
-        original_email_to = self.config.email_to
-        if email_to_override:
-            cleaned = [e.strip() for e in email_to_override if (e or "").strip()]
-            if cleaned:
-                self.config.email_to = cleaned
+        # Resolve effective email recipients without mutating shared config
+        effective_email_to = self.config.email_to
+        if email_to_override:
+            cleaned = [e.strip() for e in email_to_override if (e or "").strip()]
+            if cleaned:
+                effective_email_to = cleaned
 
         subject = self._build_subject(report)
         text = self._build_text(
             report, markdown=markdown, markdown_path=markdown_path, json_path=json_path
         )
 
         results: Dict[str, Any] = {"sent": False, "channels": channels, "results": {}}
         any_success = False
-        try:
-            for channel in channels:
-                try:
-                    if channel == "email":
-                        self._send_email(subject=subject, body=text)
-                    ...
-                except Exception as exc:
-                    ...
-        finally:
-            self.config.email_to = original_email_to
+        for channel in channels:
+            try:
+                if channel == "email":
+                    self._send_email(subject=subject, body=text, email_to=effective_email_to)
+                ...
+            except Exception as exc:
+                ...

Then update _send_email to accept an optional email_to parameter:

def _send_email(self, *, subject: str, body: str, email_to: Optional[List[str]] = None) -> None:
    recipients = email_to or self.config.email_to
    ...
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 130-130: Abstract raise to an inner function

(TRY301)


[warning] 130-130: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 133-133: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@src/paperbot/application/services/daily_push_service.py` around lines 106 -
137, The code mutates shared self.config.email_to causing concurrency hazards;
instead compute a local effective_recipient list from email_to_override (e.g.,
cleaned or fallback to self.config.email_to) before the channels loop, remove
the temporary assignment/restoration of self.config.email_to, and pass that
local variable into _send_email; update _send_email signature to accept an
optional email_to parameter (e.g., def _send_email(..., email_to:
Optional[List[str]] = None) and use recipients = email_to or
self.config.email_to internally) and update the call site in the channels loop
to _send_email(subject=subject, body=text, email_to=effective_recipient).

Add paper_store, paper_identity, alembic migrations, research routes,
and other files from the saved-papers feature that the SSE streaming
code depends on. Fixes CI ImportError for ingest_daily_report_to_registry.
…ted settings

- Expand push config section with QQ/Gmail SMTP examples
- Document UI email override and localStorage persistence
- Add SSE pipeline diagram and post-judge filter explanation
- Update module descriptions for DailyPaper/Judge/Analyze
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@src/paperbot/api/routes/research.py`:
- Around line 918-931: After calling client.get_author(...) in the handler (the
block that also calls client.get_author_papers(...)), add a guard that if author
is None you close the client and return/raise a 404 error (e.g.,
HTTPException(status_code=404, detail="author not found")) instead of proceeding
to build zeroed stats and an empty graph; apply the same check in the
scholar_trends handler (the code around the existing scholar_trends function) to
ensure non-existent scholar_ids surface as 404s rather than silent degraded
responses.

In `@src/paperbot/application/workflows/dailypaper.py`:
- Around line 58-79: The dedup key in _iter_report_papers currently uses only
url and title so items with both empty collapse to the same key ("|"); change
the key-generation logic where key = f"{item.get('url') or
''}|{item.get('title') or ''}" to incorporate an additional unique fallback
(e.g., item.get('paper_id') or item.get('arxiv_id') or the report source plus
the item's index) or skip deduping when both url and title are missing; update
the seen set usage accordingly so distinct items with empty url/title are not
silently dropped.

In `@src/paperbot/infrastructure/stores/paper_store.py`:
- Around line 134-140: The fallback call to
session.execute(...).scalar_one_or_none() can raise MultipleResultsFound when
url is empty; update the condition and/or query to avoid that: either tighten
the guard to if row is None and title and url: to skip the fallback when url is
falsy, or replace scalar_one_or_none() with a defensive retrieval such as
.scalars().first() (or .first()) on the executed select to return a single row
when duplicates exist; adjust the code around PaperModel/title/url,
session.execute and the select(func.lower(PaperModel.title) == title.lower(),
PaperModel.url == url) to implement one of these fixes.
- Around line 271-275: The code assigns row.judge_cost_tier using
int(judge.get("judge_cost_tier")) which will raise ValueError for non-numeric
strings; change this to follow the file's safe-parsing pattern by attempting to
coerce to int inside a try/except (or call the existing helper used elsewhere)
and set row.judge_cost_tier to None on failure or when the value is None.
Specifically, replace the direct int(...) with a safe parse that checks
judge.get("judge_cost_tier") for truthiness, tries int(value) in a try/except
ValueError/TypeError, and assigns None on exception; reference the same parsing
helper or approach used for other fields in this module to keep behavior
consistent with the rest of the file.
- Around line 170-172: The current upsert unconditionally calls
row.set_authors(_safe_list(authors)), row.set_keywords(_safe_list(keywords)),
and row.set_metadata(metadata) which erases existing data when incoming values
are empty; change this to preserve existing values by only setting when the
incoming value is non-empty (e.g., authors = _safe_list(authors); if authors:
row.set_authors(authors)) and do the same for keywords and metadata (mirror the
merge pattern used for title: `title or row.title or ""`) so sparse re-ingests
don't wipe richer stored data.

In `@src/paperbot/infrastructure/stores/research_store.py`:
- Around line 1095-1114: Queries that use scalar_one_or_none() for URL and title
lookups (the select(...).where(...) blocks that search PaperModel by
url_candidates and by func.lower(PaperModel.title)) can raise
MultipleResultsFound when duplicates exist; change these to return a single row
safely by using .limit(1).scalar_one_or_none() or replace with .first() on the
query result so the URL/title lookup returns at most one PaperModel row (check
both the url_candidates block and the title lookup block).

In `@web/src/lib/api.ts`:
- Around line 210-331: fetchScholarDetails was changed to call
postJson/fetchPapers and implement mapping logic but lacks tests; add unit tests
for fetchScholarDetails that (1) stub postJson to return both network and trends
and assert returned ScholarDetails fields (name, affiliation, stats,
expertise_radar, publications length and tags), (2) stub postJson to return only
network or only trends and assert partial-mapping behavior (co_authors from
network.nodes or publications from trends.recent_papers), and (3) stub postJson
to return null/undefined for both and stub fetchPapers to return mock papers and
assert the fallback mock data is returned. In tests, mock or spy on postJson and
fetchPapers, exercise fetchScholarDetails(id), and assert key outputs (id, name,
h_index, papers_tracked, co_authors, publications, stats) to cover
ScholarNetworkResponse/ScholarTrendsResponse mapping and the fallback path.
- Around line 240-272: The fallback return currently contains Dawn Song–specific
hardcoded fields; update the conditional branch that runs when (!network &&
!trends) to produce generic/interpolated data: use scholarName (and id) in the
name and bio (e.g., `bio: `${scholarName} is a researcher...``), replace
hardcoded affiliation/location/website with generic or null values, and replace
specific co_authors with generic placeholders or an empty array; keep
publications = await fetchPapers() and stats but remove Dawn Song–specific
numbers or derive them from fetched papers if possible. Ensure you modify the
returned object in that block (references: scholarName, id, fetchPapers,
publications, co_authors) so the fallback is correct for any requested scholar.
🧹 Nitpick comments (15)
web/src/lib/api.ts (3)

13-26: postJson has no request timeout and silently swallows errors.

  1. No timeout: fetch without an AbortSignal can hang indefinitely. In a Next.js server context this blocks rendering.
  2. Silent catch {} — errors vanish with no logging, making production debugging difficult.

Consider adding a timeout via AbortSignal.timeout() and at minimum a console.warn in the catch block.

Suggested improvement
 async function postJson<T>(path: string, payload: Record<string, unknown>): Promise<T | null> {
     try {
         const res = await fetch(`${API_BASE_URL}${path}`, {
             method: "POST",
             headers: { "Content-Type": "application/json" },
             body: JSON.stringify(payload),
             cache: "no-store",
+            signal: AbortSignal.timeout(15_000),
         })
         if (!res.ok) return null
         return await res.json() as T
-    } catch {
+    } catch (err) {
+        console.warn(`[postJson] ${path} failed:`, err)
         return null
     }
 }

278-286: All publications receive the same set of tags from the global topic distribution.

Each paper in publications gets identical tags derived from the scholar's overall topic_distribution (line 285), rather than per-paper topics. This produces a homogeneous display — every paper row shows the same tag chips.

If the backend doesn't provide per-paper topics this is an acceptable approximation, but it's worth noting and perhaps limiting to the top 2 tags to reduce visual noise.


284-284: Hardcoded status: "analyzing" for all publications.

Every mapped publication gets "analyzing" status. The Paper type allows "pending" | "analyzing" | "Reproduced". If there's no backend signal for status, this is fine, but consider "pending" as a more accurate default since these papers haven't actually been analyzed.

alembic/versions/0003_paper_registry.py (2)

69-73: created_at and updated_at lack server_default.

Unlike other columns in this table, created_at and updated_at are nullable=False but have no server_default. If any insert bypasses the ORM (e.g., raw SQL, a different service, or a future migration backfill), it will fail with a NOT NULL constraint violation. Consider adding server_default=sa.func.now() for safety.

Suggested fix
-            sa.Column("created_at", sa.DateTime(timezone=True), nullable=False),
-            sa.Column("updated_at", sa.DateTime(timezone=True), nullable=False),
+            sa.Column("created_at", sa.DateTime(timezone=True), nullable=False, server_default=sa.func.now()),
+            sa.Column("updated_at", sa.DateTime(timezone=True), nullable=False, server_default=sa.func.now()),

22-50: Helper functions are duplicated across migrations 0003, 0004, and 0005.

_is_offline, _insp, _has_table, _get_indexes, and _create_index are copy-pasted identically in all three migration files. This is a common pattern in Alembic (since each migration should be self-contained and runnable independently), so it's acceptable — just noting it for awareness.

Regarding the Ruff BLE001 warning on line 25: the bare except Exception in _is_offline() is reasonable here as a defensive fallback for Alembic context detection edge cases.

alembic/versions/0004_paper_feedback_judge_links.py (1)

101-112: Downgrade silently swallows all exceptions — consider logging.

The try/except Exception: pass blocks on lines 103-110 will mask not just "doesn't exist" errors but also genuine failures like connection drops or permission issues. At minimum, log the exception so operators can diagnose downgrade issues.

Suggested fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 def downgrade() -> None:
     with op.batch_alter_table("paper_feedback") as batch_op:
         try:
             batch_op.drop_index("ix_paper_feedback_paper_ref_id")
         except Exception:
-            pass
+            logger.warning("Could not drop index ix_paper_feedback_paper_ref_id, may not exist")
         try:
             batch_op.drop_column("paper_ref_id")
         except Exception:
-            pass
+            logger.warning("Could not drop column paper_ref_id, may not exist")
 
     op.drop_table("paper_judge_scores")
README.md (1)

89-97: Add a language specifier to the fenced code block.

This ASCII diagram block lacks a language identifier, which triggers a markdownlint warning (MD040). Use ```text to silence it.

src/paperbot/domain/paper_identity.py (1)

7-11: DOI regex may be too restrictive for some valid DOIs.

The character class [-._;()/:A-Z0-9]+ (with IGNORECASE) excludes characters that can appear in real DOIs such as <, >, [, ], and #. For example, DOIs from Wiley or Springer occasionally contain angle brackets or square brackets. This won't cause runtime errors (it'll return None for unmatched DOIs), but it may cause missed matches during paper deduplication.

Fine for now if the primary use case is arXiv-adjacent papers; worth revisiting if broader DOI coverage is needed.
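For comparison, a looser pattern along the lines of the common Crossref guidance (the extract_doi_loose name is illustrative, not the module's actual API):

import re
from typing import Optional

# Looser DOI pattern: any non-whitespace suffix, then strip punctuation that
# usually belongs to the surrounding sentence rather than the DOI itself.
LOOSE_DOI_RE = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)


def extract_doi_loose(text: str) -> Optional[str]:
    match = LOOSE_DOI_RE.search(text or "")
    if not match:
        return None
    return match.group(0).rstrip(").,;]")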

tests/unit/test_paper_identity.py (1)

4-17: Consider adding edge-case coverage.

The happy-path cases are good but thin. Consider adding tests for:

  • None / empty-string inputs → should return None
  • Old-style arXiv IDs (hep-th/0101001)
  • dx.doi.org/ prefix
  • Input containing both arXiv-like and DOI-like substrings (verifying arxiv-first priority)
  • Malformed strings that should yield None

These are pure functions, so expanding coverage is low-effort and high-value for a normalization utility that underpins dedup across the entire paper registry.

As per coding guidelines, "If behavior changes, add or update tests".
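A parametrized sketch of those cases; the imported names extract_arxiv_id and extract_doi are assumptions about paper_identity's API and should be swapped for the real ones:

import pytest

# Hypothetical function names — adjust to the real paper_identity API.
from paperbot.domain.paper_identity import extract_arxiv_id, extract_doi


@pytest.mark.parametrize("raw", [None, "", "   ", "no identifiers here"])
def test_returns_none_for_empty_or_malformed_input(raw):
    assert extract_arxiv_id(raw) is None
    assert extract_doi(raw) is None


def test_old_style_arxiv_id_is_recognized():
    assert extract_arxiv_id("https://arxiv.org/abs/hep-th/0101001") == "hep-th/0101001"


def test_dx_doi_org_prefix_is_recognized():
    assert extract_doi("http://dx.doi.org/10.1000/xyz123") == "10.1000/xyz123"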

src/paperbot/infrastructure/stores/models.py (1)

500-534: Silent except Exception: pass in JSON accessors.

Ruff flags S110/BLE001 here. While returning safe defaults from data accessors is a reasonable pattern, silently swallowing exceptions can mask data corruption (e.g., a metadata_json column containing garbled bytes). Consider logging at DEBUG level so corruption is visible during troubleshooting without affecting normal operation.

This is low-priority and applies equally to the existing AgentEventModel accessors.
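One way the accessor pattern could log without changing its return contract (a sketch; the helper name and field argument are illustrative):

import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _load_json_list(raw: Optional[str], field: str) -> list:
    """Parse a JSON list column, returning [] but logging at DEBUG when the payload is corrupt."""
    if not raw:
        return []
    try:
        value = json.loads(raw)
        return value if isinstance(value, list) else []
    except Exception:
        logger.debug("Unparseable JSON in %s: %r", field, raw[:80])
        return []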

src/paperbot/infrastructure/stores/paper_store.py (1)

182-203: upsert_many opens a new DB session per paper — consider batching.

Each call to upsert_paper opens and commits its own session. For large ingestion runs (the PR raises judge_max_items_per_query to 200), this means hundreds of individual transactions. A single-session batch would be significantly faster.

Not blocking, but worth a follow-up for larger-scale ingestion.
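A batched variant might look roughly like this, assuming upsert_paper's body can be refactored into a session-level helper (_upsert_in_session and _session_factory are hypothetical names):

from typing import Any, Dict, List


def upsert_many_batched(self, papers: List[Dict[str, Any]]) -> List[int]:
    """Hypothetical sketch: one session/commit for the whole batch instead of one per paper."""
    ids: List[int] = []
    with self._session_factory() as session:  # assumed sessionmaker-style factory
        for paper in papers:
            row = self._upsert_in_session(session, paper)  # assumed refactor of upsert_paper
            ids.append(int(row.id))
        session.commit()
    return ids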

src/paperbot/application/workflows/dailypaper.py (1)

82-105: Duplicated datetime parsing logic — consider reusing _parse_datetime from paper_store.

Lines 92–101 manually parse an ISO timestamp with "Z" suffix handling and UTC fallback. This is essentially the same logic as paper_store._parse_datetime. Extracting it to a shared utility (e.g., in paper_identity.py or a new utils.py) would eliminate the duplication.
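The shared helper would only need a few lines (a sketch of the behavior described above; the function name and home module are up for grabs):

from datetime import datetime, timezone
from typing import Optional


def parse_iso_datetime(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z' and defaulting naive values to UTC."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)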

src/paperbot/presentation/cli/main.py (1)

321-331: Registry ingestion is appropriately guarded against failures.

The try/except blocks ensure the CLI doesn't crash if the DB is unavailable or misconfigured. The error is captured in the report dict for visibility.

One minor note: ingest_daily_report_to_registry (line 323) is called unconditionally — even when no DB is configured. This is fine since the try/except handles it, but consider logging a warning (not just embedding the error in the report dict) so CLI users are aware of the failure when not outputting --json.

src/paperbot/infrastructure/stores/research_store.py (1)

518-527: N+1 query: fetching judge scores one-by-one in a loop.

With limit=200, this issues up to 200 individual SELECT statements. Consider a single query with a window function or a subquery to fetch the latest judge score per paper in one round-trip.

♻️ Suggested approach using a subquery
-            latest_judge_by_paper: Dict[int, PaperJudgeScoreModel] = {}
-            for pid in paper_ids:
-                judge = session.execute(
-                    select(PaperJudgeScoreModel)
-                    .where(PaperJudgeScoreModel.paper_id == pid)
-                    .order_by(desc(PaperJudgeScoreModel.scored_at), desc(PaperJudgeScoreModel.id))
-                    .limit(1)
-                ).scalar_one_or_none()
-                if judge is not None:
-                    latest_judge_by_paper[pid] = judge
+            latest_judge_by_paper: Dict[int, PaperJudgeScoreModel] = {}
+            if paper_ids:
+                latest_id_sub = (
+                    select(
+                        PaperJudgeScoreModel.paper_id,
+                        func.max(PaperJudgeScoreModel.id).label("max_id"),
+                    )
+                    .where(PaperJudgeScoreModel.paper_id.in_(paper_ids))
+                    .group_by(PaperJudgeScoreModel.paper_id)
+                    .subquery()
+                )
+                judges = (
+                    session.execute(
+                        select(PaperJudgeScoreModel).join(
+                            latest_id_sub,
+                            PaperJudgeScoreModel.id == latest_id_sub.c.max_id,
+                        )
+                    )
+                    .scalars()
+                    .all()
+                )
+                for j in judges:
+                    latest_judge_by_paper[int(j.paper_id)] = j

Note: Using max(id) is a simplification that assumes higher IDs are more recent. If scored_at ordering is strictly required and doesn't correlate with id, a more complex window function (ROW_NUMBER() OVER (PARTITION BY paper_id ORDER BY scored_at DESC, id DESC)) would be needed. Adjust based on your data guarantees.

src/paperbot/api/routes/research.py (1)

909-929: Extract shared Semantic Scholar client lifecycle and payload construction.

Both scholar_network and scholar_trends duplicate:

  1. API-key lookup + client creation + try/finally close() (lines 916-929 vs 1068-1089)
  2. scholar_payload dict construction (lines 1039-1046 vs 1161-1168)

A small context-manager helper (e.g. _semantic_scholar_client()) and a _build_scholar_payload(author, ...) function would remove this duplication and make future changes (e.g. adding rate-limit headers or caching the client) single-point.

♻️ Proposed helper for client lifecycle
from contextlib import asynccontextmanager

@asynccontextmanager
async def _semantic_scholar_client():
    api_key = os.getenv("SEMANTIC_SCHOLAR_API_KEY") or os.getenv("S2_API_KEY")
    client = SemanticScholarClient(api_key=api_key)
    try:
        yield client
    finally:
        await client.close()


def _build_scholar_payload(
    scholar_id: str,
    author: Optional[Dict[str, Any]],
    resolved_name: Optional[str],
    fallback_name: Optional[str],
) -> Dict[str, Any]:
    return {
        "scholar_id": str(scholar_id),
        "name": (author or {}).get("name") or resolved_name or fallback_name or str(scholar_id),
        "affiliations": (author or {}).get("affiliations") or [],
        "paper_count": _safe_int((author or {}).get("paperCount"), 0),
        "citation_count": _safe_int((author or {}).get("citationCount"), 0),
        "h_index": _safe_int((author or {}).get("hIndex"), 0),
    }

Usage in endpoints:

async def scholar_network(req: ScholarNetworkRequest):
    scholar_id, resolved_name = _resolve_scholar_identity(...)
    async with _semantic_scholar_client() as client:
        author = await client.get_author(scholar_id, fields=[...])
        paper_rows = await client.get_author_papers(scholar_id, ...)
    # ... rest of logic
    return ScholarNetworkResponse(
        scholar=_build_scholar_payload(scholar_id, author, resolved_name, req.scholar_name),
        ...
    )

Also applies to: 1061-1089

Comment on lines +918 to +931
try:
    author = await client.get_author(
        scholar_id,
        fields=["name", "affiliations", "paperCount", "citationCount", "hIndex"],
    )
    paper_rows = await client.get_author_papers(
        scholar_id,
        limit=max(1, int(req.max_papers)),
        fields=["title", "year", "citationCount", "authors", "url", "publicationDate"],
    )
finally:
    await client.close()

target_name = (author or {}).get("name") or resolved_name or req.scholar_name or scholar_id

⚠️ Potential issue | 🟡 Minor

Missing author should surface as an error, not a silently degraded response.

If client.get_author(scholar_id) returns None (unknown author on Semantic Scholar), the endpoint silently falls through with zeroed-out stats and an empty graph. The caller has no way to distinguish "valid scholar with no collaborators" from "scholar_id doesn't exist." Consider returning a 404 when author is None.

The same applies to scholar_trends at line 1071.

Proposed guard
     finally:
         await client.close()
 
+    if author is None:
+        raise HTTPException(status_code=404, detail=f"Author {scholar_id} not found on Semantic Scholar")
+
     target_name = (author or {}).get("name") or resolved_name or req.scholar_name or scholar_id
📝 Committable suggestion


Suggested change
     try:
         author = await client.get_author(
             scholar_id,
             fields=["name", "affiliations", "paperCount", "citationCount", "hIndex"],
         )
         paper_rows = await client.get_author_papers(
             scholar_id,
             limit=max(1, int(req.max_papers)),
             fields=["title", "year", "citationCount", "authors", "url", "publicationDate"],
         )
     finally:
         await client.close()

+    if author is None:
+        raise HTTPException(status_code=404, detail=f"Author {scholar_id} not found on Semantic Scholar")
+
     target_name = (author or {}).get("name") or resolved_name or req.scholar_name or scholar_id
🤖 Prompt for AI Agents
In `@src/paperbot/api/routes/research.py` around lines 918 - 931, After calling
client.get_author(...) in the handler (the block that also calls
client.get_author_papers(...)), add a guard that if author is None you close the
client and return/raise a 404 error (e.g., HTTPException(status_code=404,
detail="author not found")) instead of proceeding to build zeroed stats and an
empty graph; apply the same check in the scholar_trends handler (the code around
the existing scholar_trends function) to ensure non-existent scholar_ids surface
as 404s rather than silent degraded responses.

Comment on lines +58 to +79
def _iter_report_papers(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    for query in report.get("queries") or []:
        for item in query.get("top_items") or []:
            row = dict(item)
            row.setdefault("source", (report.get("sources") or [report.get("source")])[0])
            rows.append(row)

    for item in report.get("global_top") or []:
        row = dict(item)
        row.setdefault("source", (report.get("sources") or [report.get("source")])[0])
        rows.append(row)

    deduped: List[Dict[str, Any]] = []
    seen: set[str] = set()
    for item in rows:
        key = f"{item.get('url') or ''}|{item.get('title') or ''}"
        if key in seen:
            continue
        seen.add(key)
        deduped.append(item)
    return deduped

⚠️ Potential issue | 🟡 Minor

Dedup key collision when both url and title are empty.

If multiple report items lack both url and title, they all produce the dedup key "|" and only the first survives. This silently drops distinct papers with incomplete metadata. Consider including an additional field (e.g., paper_id or arxiv_id) in the key, or skipping dedup for items that lack both fields.
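A sketch of that key function; the paper_id / arxiv_id fallbacks follow the suggestion above, and the index fallback is illustrative:

from typing import Any, Dict


def _dedupe_key(item: Dict[str, Any], index: int) -> str:
    """Composite dedup key that never collapses items lacking both url and title."""
    url = item.get("url") or ""
    title = item.get("title") or ""
    if url or title:
        return f"{url}|{title}"
    fallback = item.get("paper_id") or item.get("arxiv_id") or f"idx:{index}"
    return f"__nokey__|{fallback}"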

🤖 Prompt for AI Agents
In `@src/paperbot/application/workflows/dailypaper.py` around lines 58 - 79, The
dedup key in _iter_report_papers currently uses only url and title so items with
both empty collapse to the same key ("|"); change the key-generation logic where
key = f"{item.get('url') or ''}|{item.get('title') or ''}" to incorporate an
additional unique fallback (e.g., item.get('paper_id') or item.get('arxiv_id')
or the report source plus the item's index) or skip deduping when both url and
title are missing; update the seen set usage accordingly so distinct items with
empty url/title are not silently dropped.

Comment on lines +134 to +140
if row is None and title:
    row = session.execute(
        select(PaperModel).where(
            func.lower(PaperModel.title) == title.lower(),
            PaperModel.url == url,
        )
    ).scalar_one_or_none()

⚠️ Potential issue | 🟡 Minor

scalar_one_or_none() may raise MultipleResultsFound on the title+url fallback.

When url is empty ("") and title is non-empty, the query matches all rows with an empty URL and the same case-insensitive title. If duplicates exist, SQLAlchemy raises MultipleResultsFound. Consider using .first() instead, or adding a guard if row is None and title and url:.

Proposed fix
-            if row is None and title:
+            if row is None and title and url:
                 row = session.execute(
                     select(PaperModel).where(
                         func.lower(PaperModel.title) == title.lower(),

Alternatively, switch to .scalars().first() for a defensive fallback.

📝 Committable suggestion


Suggested change
-            if row is None and title:
+            if row is None and title and url:
                 row = session.execute(
                     select(PaperModel).where(
                         func.lower(PaperModel.title) == title.lower(),
                         PaperModel.url == url,
                     )
                 ).scalar_one_or_none()
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 134 - 140,
The fallback call to session.execute(...).scalar_one_or_none() can raise
MultipleResultsFound when url is empty; update the condition and/or query to
avoid that: either tighten the guard to if row is None and title and url: to
skip the fallback when url is falsy, or replace scalar_one_or_none() with a
defensive retrieval such as .scalars().first() (or .first()) on the executed
select to return a single row when duplicates exist; adjust the code around
PaperModel/title/url, session.execute and the
select(func.lower(PaperModel.title) == title.lower(), PaperModel.url == url) to
implement one of these fixes.

Comment on lines +170 to +172
row.set_authors(authors)
row.set_keywords(keywords)
row.set_metadata(metadata)

⚠️ Potential issue | 🟠 Major

Unconditional overwrite of authors, keywords, and metadata may cause data loss on upsert.

When an incoming paper dict omits authors (or provides []), _safe_list returns [] and row.set_authors([]) wipes any previously stored authors. This differs from the merge-with-existing pattern used for scalar fields on lines 162–168 (title or row.title or "").

If a paper is first ingested from a rich source (with authors) and later re-ingested from a sparser source (without authors), the existing data is silently erased.

🛡️ Proposed fix: preserve existing values when new data is empty
-        row.set_authors(authors)
-        row.set_keywords(keywords)
-        row.set_metadata(metadata)
+        if authors:
+            row.set_authors(authors)
+        if keywords:
+            row.set_keywords(keywords)
+        if metadata and any(v for v in metadata.values() if v is not None):
+            existing_meta = row.get_metadata()
+            existing_meta.update({k: v for k, v in metadata.items() if v is not None})
+            row.set_metadata(existing_meta)
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 170 - 172,
The current upsert unconditionally calls row.set_authors(_safe_list(authors)),
row.set_keywords(_safe_list(keywords)), and row.set_metadata(metadata) which
erases existing data when incoming values are empty; change this to preserve
existing values by only setting when the incoming value is non-empty (e.g.,
authors = _safe_list(authors); if authors: row.set_authors(authors)) and do the
same for keywords and metadata (mirror the merge pattern used for title: `title
or row.title or ""`) so sparse re-ingests don't wipe richer stored data.

Comment on lines +271 to +275
row.judge_cost_tier = (
    int(judge.get("judge_cost_tier"))
    if judge.get("judge_cost_tier") is not None
    else None
)

⚠️ Potential issue | 🟡 Minor

int() conversion of judge_cost_tier may raise ValueError for non-numeric values.

If the JSON contains a non-int-coercible value (e.g., "high"), this will crash. Use a pattern consistent with the other safe parsers in this file.

Proposed fix
                     row.judge_cost_tier = (
-                        int(judge.get("judge_cost_tier"))
-                        if judge.get("judge_cost_tier") is not None
-                        else None
+                        int(judge["judge_cost_tier"])
+                        if isinstance(judge.get("judge_cost_tier"), (int, float))
+                        else None
                     )
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/paper_store.py` around lines 271 - 275,
The code assigns row.judge_cost_tier using int(judge.get("judge_cost_tier"))
which will raise ValueError for non-numeric strings; change this to follow the
file's safe-parsing pattern by attempting to coerce to int inside a try/except
(or call the existing helper used elsewhere) and set row.judge_cost_tier to None
on failure or when the value is None. Specifically, replace the direct int(...)
with a safe parse that checks judge.get("judge_cost_tier") for truthiness, tries
int(value) in a try/except ValueError/TypeError, and assigns None on exception;
reference the same parsing helper or approach used for other fields in this
module to keep behavior consistent with the rest of the file.

Comment on lines +1095 to +1114
if url_candidates:
    row = session.execute(
        select(PaperModel).where(
            or_(
                PaperModel.url.in_(url_candidates),
                PaperModel.external_url.in_(url_candidates),
                PaperModel.pdf_url.in_(url_candidates),
            )
        )
    ).scalar_one_or_none()
    if row is not None:
        return int(row.id)

title = str(metadata.get("title") or "").strip()
if title:
    row = session.execute(
        select(PaperModel).where(func.lower(PaperModel.title) == title.lower())
    ).scalar_one_or_none()
    if row is not None:
        return int(row.id)

⚠️ Potential issue | 🟡 Minor

scalar_one_or_none() can raise MultipleResultsFound for URL and title lookups.

If multiple PaperModel rows share the same URL or title (plausible for academic papers), these queries will throw an unhandled exception. Use .limit(1).scalar_one_or_none() or .first() instead.

🐛 Proposed fix
         if url_candidates:
             row = session.execute(
                 select(PaperModel).where(
                     or_(
                         PaperModel.url.in_(url_candidates),
                         PaperModel.external_url.in_(url_candidates),
                         PaperModel.pdf_url.in_(url_candidates),
                     )
                 )
-            ).scalar_one_or_none()
+                .limit(1)
+            ).scalar_one_or_none()
             if row is not None:
                 return int(row.id)

         title = str(metadata.get("title") or "").strip()
         if title:
             row = session.execute(
-                select(PaperModel).where(func.lower(PaperModel.title) == title.lower())
-            ).scalar_one_or_none()
+                select(PaperModel).where(func.lower(PaperModel.title) == title.lower()).limit(1)
+            ).scalar_one_or_none()
             if row is not None:
                 return int(row.id)
📝 Committable suggestion


Suggested change
         if url_candidates:
             row = session.execute(
                 select(PaperModel).where(
                     or_(
                         PaperModel.url.in_(url_candidates),
                         PaperModel.external_url.in_(url_candidates),
                         PaperModel.pdf_url.in_(url_candidates),
                     )
                 )
-            ).scalar_one_or_none()
+                .limit(1)
+            ).scalar_one_or_none()
             if row is not None:
                 return int(row.id)

         title = str(metadata.get("title") or "").strip()
         if title:
             row = session.execute(
-                select(PaperModel).where(func.lower(PaperModel.title) == title.lower())
-            ).scalar_one_or_none()
+                select(PaperModel).where(func.lower(PaperModel.title) == title.lower()).limit(1)
+            ).scalar_one_or_none()
             if row is not None:
                 return int(row.id)
🤖 Prompt for AI Agents
In `@src/paperbot/infrastructure/stores/research_store.py` around lines 1095 -
1114, Queries that use scalar_one_or_none() for URL and title lookups (the
select(...).where(...) blocks that search PaperModel by url_candidates and by
func.lower(PaperModel.title)) can raise MultipleResultsFound when duplicates
exist; change these to return a single row safely by using
.limit(1).scalar_one_or_none() or replace with .first() on the query result so
the URL/title lookup returns at most one PaperModel row (check both the
url_candidates block and the title lookup block).

Comment on lines 210 to 331
export async function fetchScholarDetails(id: string): Promise<ScholarDetails> {
const papers = await fetchPapers()
const scholarName = slugToName(id)

type ScholarNetworkResponse = {
scholar?: { name?: string; affiliations?: string[]; citation_count?: number; paper_count?: number; h_index?: number }
stats?: { papers_used?: number }
nodes?: Array<{ name?: string; type?: string; collab_papers?: number }>
}

type ScholarTrendsResponse = {
scholar?: { name?: string; affiliations?: string[]; citation_count?: number; paper_count?: number; h_index?: number }
trend_summary?: { publication_trend?: "up" | "down" | "flat"; citation_trend?: "up" | "down" | "flat" }
topic_distribution?: Array<{ topic?: string; count?: number }>
recent_papers?: Array<{ title?: string; year?: number; citation_count?: number; venue?: string; url?: string }>
}

const [network, trends] = await Promise.all([
postJson<ScholarNetworkResponse>("/research/scholar/network", {
scholar_name: scholarName,
max_papers: 120,
recent_years: 5,
max_nodes: 30,
}),
postJson<ScholarTrendsResponse>("/research/scholar/trends", {
scholar_name: scholarName,
max_papers: 200,
year_window: 10,
}),
])

// Fallback to mock data if the scholar is not configured in subscriptions yet.
if (!network && !trends) {
const papers = await fetchPapers()
return {
id,
name: scholarName,
affiliation: "University of California, Berkeley",
h_index: 120,
papers_tracked: 45,
recent_activity: "Published 2 days ago",
status: "active",
bio: "Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. Her research interest lies in deep learning, security, and blockchain.",
location: "Berkeley, CA",
website: "https://dawnsong.io",
expertise_radar: [
{ subject: "Security", A: 100, fullMark: 100 },
{ subject: "Deep Learning", A: 90, fullMark: 100 },
{ subject: "Blockchain", A: 80, fullMark: 100 },
{ subject: "Systems", A: 85, fullMark: 100 },
{ subject: "Privacy", A: 95, fullMark: 100 },
],
publications: papers,
co_authors: [
{ name: "Dan Hendrycks", avatar: "https://avatar.vercel.sh/dan.png" },
{ name: "Kevin Eykholt", avatar: "https://avatar.vercel.sh/kevin.png" },
],
stats: {
total_citations: 54321,
papers_count: 230,
h_index: 120,
},
}
}

const scholar = network?.scholar || trends?.scholar || {}
const topicDist = (trends?.topic_distribution || []).slice(0, 5)
const maxTopicCount = Math.max(1, ...topicDist.map((t) => Number(t.count || 0)))

const publications: Paper[] = (trends?.recent_papers || []).slice(0, 15).map((paper, idx) => ({
id: `sch-${id}-paper-${idx}`,
title: String(paper.title || "Untitled"),
venue: String(paper.venue || "Unknown venue"),
authors: String(scholar.name || scholarName),
citations: Number(paper.citation_count || 0),
status: "analyzing",
tags: topicDist.map((t) => String(t.topic || "")).filter(Boolean).slice(0, 3),
}))

const coauthors = (network?.nodes || [])
.filter((n) => n.type === "coauthor")
.slice(0, 12)
.map((n) => {
const name = String(n.name || "Unknown")
const collab = Number(n.collab_papers || 0)
return {
name: collab > 0 ? `${name} (${collab})` : name,
avatar: `https://avatar.vercel.sh/${encodeURIComponent(name)}.png`,
}
})

const publicationTrend = trends?.trend_summary?.publication_trend || "flat"
const recentActivity = publicationTrend === "up"
? "Publication trend up"
: publicationTrend === "down"
? "Publication trend down"
: "Publication trend stable"

     return {
         id,
-        name: id.split('-').map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' '),
-        affiliation: "University of California, Berkeley",
-        h_index: 120,
-        papers_tracked: 45,
-        recent_activity: "Published 2 days ago",
-        status: "active",
-        bio: "Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. Her research interest lies in deep learning, security, and blockchain.",
-        location: "Berkeley, CA",
-        website: "https://dawnsong.io",
-        expertise_radar: [
-            { subject: 'Security', A: 100, fullMark: 100 },
-            { subject: 'Deep Learning', A: 90, fullMark: 100 },
-            { subject: 'Blockchain', A: 80, fullMark: 100 },
-            { subject: 'Systems', A: 85, fullMark: 100 },
-            { subject: 'Privacy', A: 95, fullMark: 100 },
-        ],
-        publications: papers,
-        co_authors: [
-            { name: "Dan Hendrycks", avatar: "https://avatar.vercel.sh/dan.png" },
-            { name: "Kevin Eykholt", avatar: "https://avatar.vercel.sh/kevin.png" }
-        ],
+        name: String(scholar.name || scholarName),
+        affiliation: String((scholar.affiliations || ["Unknown affiliation"])[0] || "Unknown affiliation"),
+        h_index: Number(scholar.h_index || 0),
+        papers_tracked: Number(scholar.paper_count || 0),
+        recent_activity: recentActivity,
+        status: publicationTrend === "up" ? "active" : "idle",
+        bio: `Trend snapshot: ${trends?.trend_summary?.citation_trend || "flat"} citation trend over the recent analysis window.`,
+        location: "N/A",
+        website: "",
+        expertise_radar: topicDist.map((t) => ({
+            subject: String(t.topic || "Topic"),
+            A: Math.round((Number(t.count || 0) / maxTopicCount) * 100),
+            fullMark: 100,
+        })),
+        publications,
+        co_authors: coauthors,
         stats: {
-            total_citations: 54321,
-            papers_count: 230,
-            h_index: 120
-        }
+            total_citations: Number(scholar.citation_count || 0),
+            papers_count: Number(scholar.paper_count || 0),
+            h_index: Number(scholar.h_index || 0),
+        },
     }
 }

🛠️ Refactor suggestion | 🟠 Major

Behavior change in fetchScholarDetails lacks test coverage.

fetchScholarDetails was substantially rewritten — from returning static mock data to making live API calls with a fallback path. As per coding guidelines, if behavior changes, tests should be added or updated. Consider adding tests that cover:

  • Successful network + trends response mapping
  • Partial responses (only one of network/trends available)
  • Both null → mock fallback path

As per coding guidelines, {src,web/src}/**/*.{py,ts,tsx}: "If behavior changes, add or update tests".
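A sketch of the fallback-path case; since postJson is module-internal, this stubs global fetch instead (vitest APIs assumed, and it presumes fetchPapers also degrades gracefully when the backend is unreachable):

import { describe, expect, it, vi } from "vitest";
import { fetchScholarDetails } from "./api";

describe("fetchScholarDetails", () => {
  it("returns fallback details when both network and trends calls fail", async () => {
    // Every backend call 500s, so postJson returns null and the mock fallback path is taken.
    vi.stubGlobal("fetch", vi.fn(async () => new Response(null, { status: 500 })));

    const details = await fetchScholarDetails("dawn-song");

    expect(details.id).toBe("dawn-song");
    expect(details.stats.h_index).toBeGreaterThan(0);

    vi.unstubAllGlobals();
  });
});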

🤖 Prompt for AI Agents
In `@web/src/lib/api.ts` around lines 210 - 331, fetchScholarDetails was changed
to call postJson/fetchPapers and implement mapping logic but lacks tests; add
unit tests for fetchScholarDetails that (1) stub postJson to return both network
and trends and assert returned ScholarDetails fields (name, affiliation, stats,
expertise_radar, publications length and tags), (2) stub postJson to return only
network or only trends and assert partial-mapping behavior (co_authors from
network.nodes or publications from trends.recent_papers), and (3) stub postJson
to return null/undefined for both and stub fetchPapers to return mock papers and
assert the fallback mock data is returned. In tests, mock or spy on postJson and
fetchPapers, exercise fetchScholarDetails(id), and assert key outputs (id, name,
h_index, papers_tracked, co_authors, publications, stats) to cover
ScholarNetworkResponse/ScholarTrendsResponse mapping and the fallback path.

Comment on lines +240 to +272
// Fallback to mock data if the scholar is not configured in subscriptions yet.
if (!network && !trends) {
const papers = await fetchPapers()
return {
id,
name: scholarName,
affiliation: "University of California, Berkeley",
h_index: 120,
papers_tracked: 45,
recent_activity: "Published 2 days ago",
status: "active",
bio: "Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. Her research interest lies in deep learning, security, and blockchain.",
location: "Berkeley, CA",
website: "https://dawnsong.io",
expertise_radar: [
{ subject: "Security", A: 100, fullMark: 100 },
{ subject: "Deep Learning", A: 90, fullMark: 100 },
{ subject: "Blockchain", A: 80, fullMark: 100 },
{ subject: "Systems", A: 85, fullMark: 100 },
{ subject: "Privacy", A: 95, fullMark: 100 },
],
publications: papers,
co_authors: [
{ name: "Dan Hendrycks", avatar: "https://avatar.vercel.sh/dan.png" },
{ name: "Kevin Eykholt", avatar: "https://avatar.vercel.sh/kevin.png" },
],
stats: {
total_citations: 54321,
papers_count: 230,
h_index: 120,
},
}
}

⚠️ Potential issue | 🟡 Minor

Mock fallback is hardcoded to a single scholar ("Dawn Song").

When both API calls return null, the fallback bio, affiliation, location, and co-authors are all Dawn Song–specific, regardless of which scholar id was requested. This will display incorrect information for any other scholar.

At minimum, make the bio generic or interpolate scholarName:

Suggested fix for the bio and co-authors
-            bio: "Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. Her research interest lies in deep learning, security, and blockchain.",
+            bio: `No detailed profile available for ${scholarName}. Data will appear once the scholar is configured in subscriptions.`,
-            location: "Berkeley, CA",
-            website: "https://dawnsong.io",
+            location: "N/A",
+            website: "",
🤖 Prompt for AI Agents
In `@web/src/lib/api.ts` around lines 240 - 272, The fallback return currently
contains Dawn Song–specific hardcoded fields; update the conditional branch that
runs when (!network && !trends) to produce generic/interpolated data: use
scholarName (and id) in the name and bio (e.g., `bio: `${scholarName} is a
researcher...``), replace hardcoded affiliation/location/website with generic or
null values, and replace specific co_authors with generic placeholders or an
empty array; keep publications = await fetchPapers() and stats but remove Dawn
Song–specific numbers or derive them from fetched papers if possible. Ensure you
modify the returned object in that block (references: scholarName, id,
fetchPapers, publications, co_authors) so the fallback is correct for any
requested scholar.
