Skip to content

Add first-class Reddit support with canonicalization, adapter, and backfill#8

Merged
gdiab merged 5 commits intomainfrom
codex/reddit-first-class
Feb 24, 2026
Merged

Add first-class Reddit support with canonicalization, adapter, and backfill#8
gdiab merged 5 commits intomainfrom
codex/reddit-first-class

Conversation

@gdiab
Copy link
Owner

@gdiab gdiab commented Feb 24, 2026

Summary

  • add reddit as a first-class source_type
  • normalize Reddit URLs (redd.it, old/new/m.reddit.com) to https://www.reddit.com/comments/post-id
  • add RedditAdapter that ingests post metadata/body and top comments from Reddit listing JSON
  • wire Reddit into worker adapter chain with article fallback
  • add legacy backfill service + script (npm run backfill:reddit with optional --dry-run) for reclassifying old article Reddit rows
  • update CLI/docs/test plans for --type reddit and Reddit manual validation

Tests

  • npm run typecheck
  • npm test
  • npm run build

Manual validation

  • backfill dry run executed on real DB (no legacy rows found)
  • manual Reddit validation completed with real links (reported good)

gdiab and others added 2 commits February 24, 2026 12:08
Fixes OIDC token error by creating a dedicated workflow with
id-token: write permission required by claude-code-action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gdiab gdiab force-pushed the codex/reddit-first-class branch from d119c38 to 0b26794 Compare February 24, 2026 20:09
@gdiab
Copy link
Owner Author

gdiab commented Feb 24, 2026

@claude can you review?

@gdiab
Copy link
Owner Author

gdiab commented Feb 24, 2026

Follow-up on Claude workflow behavior:

  • We aligned .github/workflows/claude-review.yml to match main exactly so OIDC workflow validation passes.
  • This workflow triggers on pull_request opened/ready_for_review and on @claude issue comments.
  • It does not trigger on synchronize (new pushes), so after the fix push you will not see an automatic Claude run on this PR unless someone comments @claude.
  • New PRs should get Claude review automatically on open, and this PR can still be reviewed now via an @claude comment.

@gdiab
Copy link
Owner Author

gdiab commented Feb 24, 2026

@claude please review this PR

@claude
Copy link

claude bot commented Feb 24, 2026

Claude finished @gdiab's task in 3m 11s —— View job


PR Review: Add first-class Reddit support

Tasks:

  • Gather context
  • Read changed files
  • Analyze implementation
  • Post review feedback

Overall this is a well-structured addition. The URL normalization is thorough, the adapter follows existing conventions, and the test coverage is solid. A few issues worth addressing:


Issues

1. SQL Parameterization Convention — reddit-backfill-service.ts:120

The UPDATE statement uses positional ? parameters, but the project convention (per CLAUDE.md and all existing repositories) is named @param bindings:

// Current — positional (inconsistent)
.run(canonical, now, row.id);

// Expected — named (matches project convention)
.prepare(`UPDATE items SET source_type = 'reddit', canonical_url = @canonical, updated_at = @updatedAt WHERE id = @id`)
.run({ canonical, updatedAt: now, id: row.id });

This is a correctness/maintainability concern: positional parameters are fragile when the SQL statement changes and are inconsistent with how every other write in this codebase is done. Fix this →

2. Backfill Updates Not Wrapped in a Single Transaction — reddit-backfill-service.ts:112-123

The loop issues individual UPDATE calls outside any transaction. If the process is interrupted mid-run, some rows will be updated with source_type = 'reddit' but none of the syncItem calls will have happened (those run after the loop finishes at line 126-128). This leaves the search index inconsistent with the items table.

Wrapping all updates and syncs in a single db.transaction() would make the backfill atomic:

const tx = this.context.db.transaction(() => {
  for (const itemId of updatedItemIds) {
    // apply updates here
    this.indexService.syncItem(itemId);
  }
});
tx();

Since this is a one-time migration script the risk is low, but it's worth fixing for correctness. Fix this →

3. isRedditPageUrl Duplicates isRedditHostreddit-adapter.ts:62-69

There is already an exported isRedditHost in src/lib/reddit.ts. The local isRedditPageUrl in the adapter reimplements it:

// reddit-adapter.ts (lines 62-69) — wraps new URL() + isRedditHost
const isRedditPageUrl = (value: string): boolean => {
  try {
    const parsed = new URL(value);
    return isRedditHost(parsed.hostname.toLowerCase());
  } catch {
    return false;
  }
};

This is minor but could be replaced with a direct call to isRedditHost after parsing, or the helper can be kept in src/lib/reddit.ts and exported. Fix this →


Minor / Non-blocking Observations

Adapter test coverage gap: reddit-adapter.test.ts covers successful parse and 5xx retryable failure, but has no test for a 404 response (which should be non-retryable per the response.status >= 500 || response.status === 429 check at line 96-97). Adding a 404 case would close that gap.

MAX_COMMENTS = 5 vs limit=8 in API request (reddit-adapter.ts:87): The API requests 8 comments but the loop breaks at 5 (MAX_COMMENTS). This works correctly (3 extras are silently dropped) but the intent isn't obvious. A comment explaining the buffer or aligning the values would help.


What's Good

  • URL normalization is thorough: redd.it short links, old/new/m.reddit.com variants, gallery URLs, and noise params (context, depth, sort, share_id) are all handled correctly and the canonicalization tests are well-specified.
  • Correct retryability classification: 5xx and 429 are retryable; 4xx are not.
  • Article fallback in adapter chain is consistent with other adapters (x, youtube, bluesky, linkedin all follow the same pattern).
  • Conflict detection in backfill via plannedCanonicalOwners correctly handles both in-batch and pre-existing DB conflicts without leaving orphaned rows.
  • Tests restore global state in finally blocks throughout — consistent with project convention.
  • Non-retryable PARSE_FAILED when no post ID can be extracted is the right behavior for non-post Reddit URLs (subreddits, profiles).
  • Dry-run mode in the backfill script is a nice safety net and is tested.

@gdiab
Copy link
Owner Author

gdiab commented Feb 24, 2026

@claude changes made, re-review?

@claude
Copy link

claude bot commented Feb 24, 2026

Claude finished @gdiab's task in 1m 44s —— View job


Re-Review: Add first-class Reddit support

Tasks:

  • Gather context
  • Read changed files to verify fixes
  • Analyze implementation against previous review findings
  • Post updated review feedback

Two of the three issues from the previous review are resolved. One minor issue remains, along with the two minor observations that were not addressed.


Fixed

Issue #1 — Named SQL parameters (reddit-backfill-service.ts:116-118): ✅ Fixed.
The UPDATE statement now correctly uses @canonical, @updatedAt, @id with an object passed to .run(). This is consistent with the rest of the codebase.

Issue #2 — Transaction atomicity (reddit-backfill-service.ts:112-131): ✅ Fixed.
All updates and syncItem calls are now wrapped in a single db.transaction(). The syncItem reads updated item data from within the transaction (since items was already updated), so the FTS index entries will reflect the new canonical_url and source_type values — correct behavior.


Still Present

Issue #3isRedditPageUrl duplicates isRedditHost (reddit-adapter.ts:62-69): minor, not fixed.

const isRedditPageUrl = (value: string): boolean => {
  try {
    const parsed = new URL(value);
    return isRedditHost(parsed.hostname.toLowerCase());
  } catch {
    return false;
  }
};

This is still a local wrapper around isRedditHost from src/lib/reddit.ts. It could be inlined at the usage site (line 118) or the helper could be exported from reddit.ts. Not blocking, but the duplication is unnecessary. Fix this →


Minor Observations (not addressed, still non-blocking)

404 not testedreddit-adapter.test.ts has 2 tests (success + 5xx retryable). A 404 case would confirm the non-retryable path (response.status >= 500 || response.status === 429 evaluates to false for 404, so retryable: false). Low risk to skip since the logic is straightforward.

MAX_COMMENTS = 5 vs limit=8 in API request (reddit-adapter.ts:87) — Still unannotated. A brief comment explaining the intent (requesting extra to account for deleted/removed comments that get skipped by the kind !== 't1' filter) would make this less surprising to future readers.


Overall

The two substantive issues are fixed. The remaining items are all minor. This is ready to merge if you're comfortable leaving the isRedditPageUrl duplication in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant