Description
Problem Statement
The code-review agent test document lists every test case as untested and marked "To test (dogfooding)". The agent created in issue #38 has therefore not been validated through real-world usage.
Current Status
All test cases (TC-1 through TC-7) have unchecked validation boxes:
- TC-1: Agent Configuration - NOT verified
- TC-2: Agent Invocation - NOT verified
- TC-3: Isolated Context Execution - NOT verified
- TC-4: Review Execution - NOT verified
- TC-5: Error Handling - NOT verified
- TC-6: Long Context Handling - NOT verified
- TC-7: Comparison with Command - NOT verified
Dogfooding Validation Fields (All TBD)
- First Use Date: TBD
- PR Tested: TBD
- Agent initialization: TBD
- Review execution: TBD
- Report quality: TBD
- Context handling: TBD
- Issues Found: TBD
- Validation Notes: TBD
Proposed Solution
Goal
Complete dogfooding validation of the agent by running actual code reviews on real PRs and documenting the findings.
Success Criteria
- TC-1: Agent YAML frontmatter correctly configured (name, description, tools, model, skills)
- TC-2: Agent successfully invoked via Task tool
- TC-3: Agent operates in isolated context (no parent conversation access)
- TC-4: Agent performs complete code review (all 3 phases of review-standard skill)
- TC-5: Agent provides helpful error messages for edge cases
- TC-6: Agent handles large diffs (>10 files, >500 lines) within the 600s timeout
- TC-7: Agent review quality comparable to command review
- All validation fields in docs/test/code-review-agent.md populated with actual findings
Implementation Steps
Step 1: Select PR for Testing (Estimated: 5 min)
- Choose a recent merged PR that has meaningful code changes
- Example candidates: PR #489, "[refactor][#488] Fix permission evaluation ordering with clear rule priority" (permission ordering); PR #488, "[plan][refactor]: Fix permission evaluation ordering with clear rule priority and unified Telegram escalation" (plan for permission fix); or a similar PR
Step 2: Run Agent Review (Estimated: 15 min)
- Invoke agent via Task tool
- Provide the merged PR as context
- Capture full review output
Step 3: Validate TC-1 (Configuration) (Estimated: 5 min)
- Verify agent file
- Check YAML frontmatter: name, description, tools, model, skills
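The Step 3 check can be sketched as a small script. This is a minimal sketch, assuming the agent file is a Markdown file whose frontmatter sits between two `---` lines with one top-level key per line. The function names and the sample values are hypothetical, and a real check would more likely use a proper YAML parser.

```python
REQUIRED_KEYS = ("name", "description", "tools", "model", "skills")


def frontmatter_keys(markdown_text):
    """Return the top-level keys found in a leading '---' delimited block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys start in column 0; indented lines, list items,
        # and comments belong to nested values or are ignored.
        if line and not line[0].isspace() and line[0] not in "-#" and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter: treat the frontmatter as missing


def missing_frontmatter_keys(markdown_text):
    """List the required keys (in TC-1 order) that the frontmatter lacks."""
    present = frontmatter_keys(markdown_text)
    return [key for key in REQUIRED_KEYS if key not in present]


# Hypothetical sample; the real agent file's values are not known here.
sample = """---
name: code-review
description: Reviews the current branch
tools: Read, Grep
model: opus
---
Agent instructions go here.
"""

print(missing_frontmatter_keys(sample))  # ['skills']
```

An empty result means all five fields named in TC-1 are present; it says nothing about whether their values are correct.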
Step 4: Validate TC-2 (Invocation) (Estimated: 5 min)
- Confirm agent starts successfully
- Verify Opus model is used
- Confirm review-standard skill is loaded
Step 5: Validate TC-3 (Isolated Context) (Estimated: 5 min)
- Verify agent workspace is clean
- Confirm no parent conversation history leaks into the agent
- Confirm the agent returns only the final report
Step 6: Validate TC-4 (Review Execution) (Estimated: 10 min)
- Confirm branch validation (not main)
- Check changed files retrieved
- Verify full diff obtained
- Confirm all 3 review phases executed
- Review structured report quality
Step 7: Validate TC-5 (Error Handling) (Estimated: 10 min)
- Test main branch detection
- Test no-changes detection
- Test non-git-repo detection
- Verify error messages are clear
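The three edge cases above can be reproduced outside the agent to see what a clear message might look like. This is a sketch under stated assumptions: the git commands are standard, but the message wording, the function names, and the choice of `main...HEAD` as the comparison range are illustrative and not taken from the agent itself.

```python
import subprocess


def precondition_error(inside_repo, branch, changed_files):
    """Return a message for the first failed precondition, or None if all pass."""
    if not inside_repo:
        return "Not a git repository: run the review from inside a repository checkout."
    if branch == "main":
        return "On main: switch to a feature branch before requesting a review."
    if not changed_files:
        return f"No changes found on '{branch}' relative to main."
    return None


def _git(*args):
    # Raises FileNotFoundError if git itself is not installed.
    return subprocess.run(["git", *args], capture_output=True, text=True)


def check_current_directory():
    """Gather the three facts from git and classify them."""
    inside = _git("rev-parse", "--is-inside-work-tree")
    if inside.returncode != 0 or inside.stdout.strip() != "true":
        return precondition_error(False, "", [])
    branch = _git("branch", "--show-current").stdout.strip()
    changed = _git("diff", "--name-only", "main...HEAD").stdout.splitlines()
    return precondition_error(True, branch, changed)
```

Keeping the classification in `precondition_error` separate from the git calls makes each edge case checkable without preparing a repository in that state.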
Step 8: Validate TC-6 (Long Context) (Estimated: 15 min)
- Use large PR (>10 files or >500 lines)
- Verify completion within 600s timeout
- Assess thoroughness of analysis
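Whether a candidate PR meets the Step 8 size threshold can be checked from `git diff --numstat` output before running the agent. This is a minimal sketch; the function names are hypothetical, and it uses the "or" reading of the threshold given in Step 8.

```python
def diff_size(numstat_output):
    """Return (files_changed, lines_changed) from `git diff --numstat` output."""
    files = 0
    lines = 0
    for row in numstat_output.splitlines():
        parts = row.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        added, deleted, _path = parts
        files += 1
        # Binary files are reported as "-<TAB>-<TAB>path": count the file, no lines.
        if added.isdigit():
            lines += int(added)
        if deleted.isdigit():
            lines += int(deleted)
    return files, lines


def is_large_diff(numstat_output, min_files=10, min_lines=500):
    """True when the diff exceeds either threshold."""
    files, lines = diff_size(numstat_output)
    return files > min_files or lines > min_lines


# Hypothetical numstat rows for illustration.
example = "12\t3\tsrc/a.py\n-\t-\tassets/logo.png\n400\t120\tsrc/b.py\n"
print(diff_size(example))      # (3, 535)
print(is_large_diff(example))  # True
```

Recording the measured file and line counts alongside the observed duration gives the TC-6 result a concrete reference point.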
Step 9: Validate TC-7 (Comparison) (Estimated: 15 min)
- Run command-based review on same PR
- Compare Phase 1, 2, 3 findings
- Assess agent's context handling advantage
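One way to make the phase-by-phase comparison concrete is to record each review's findings as short strings per phase and diff the two sets. This is only a sketch: the data shape, the function name, and the sample findings are hypothetical, and matching findings by exact string is a simplification.

```python
def compare_findings(agent, command):
    """Split findings per phase into shared, agent-only, and command-only."""
    report = {}
    for phase in sorted(set(agent) | set(command)):
        agent_set = set(agent.get(phase, ()))
        command_set = set(command.get(phase, ()))
        report[phase] = {
            "both": sorted(agent_set & command_set),
            "agent_only": sorted(agent_set - command_set),
            "command_only": sorted(command_set - agent_set),
        }
    return report


# Hypothetical findings, keyed by review phase.
agent_findings = {
    "phase-1": ["missing docs"],
    "phase-2": ["duplicate helper", "unused import"],
}
command_findings = {
    "phase-1": ["missing docs"],
    "phase-2": ["unused import"],
    "phase-3": ["long function"],
}

for phase, diff in compare_findings(agent_findings, command_findings).items():
    print(phase, diff)
```

Findings that appear only on one side are the ones worth reading closely when judging whether the agent's review quality is comparable to the command's.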
Step 10: Document Findings (Estimated: 15 min)
- Update docs/test/code-review-agent.md with:
  - First Use Date (actual date)
  - PR Tested (actual PR number)
  - Agent initialization findings
  - Review execution findings
  - Report quality assessment
  - Context handling assessment
  - Actual issues found (if any)
  - Validation notes
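Once the findings are written up, a quick scan can confirm that no validation field is still left as TBD. This is a minimal sketch, assuming the fields appear as `- Name: value` bullets as in the "Dogfooding Validation Fields" list; the function name and the sample text are hypothetical.

```python
import re

# Matches bullets such as "- First Use Date: TBD" or "- **PR Tested**: TBD".
FIELD_PATTERN = re.compile(r"^\s*[-*]\s*(?P<name>[^:]+):\s*(?P<value>.*)$")


def unfilled_fields(doc_text):
    """Return the names of bullet fields whose value is still 'TBD'."""
    pending = []
    for line in doc_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match and match.group("value").strip().upper() == "TBD":
            # Drop surrounding spaces and any markdown bold markers.
            pending.append(match.group("name").strip(" *"))
    return pending


# Hypothetical excerpt of a partially completed test document.
excerpt = """- First Use Date: TBD
- PR Tested: (filled in)
- Report quality: TBD
"""

print(unfilled_fields(excerpt))  # ['First Use Date', 'Report quality']
```

An empty list is a necessary condition for the last success criterion, not proof that the recorded findings are meaningful.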
Test Execution Commands
./tests/test-all.sh
Running all Agentize SDK tests
Shell: bash
✓ test-agentize-cli-invalid-agentize-home
✓ test-agentize-cli-missing-agentize-home
✓ test-agentize-server-feat-request
✓ test-agentize-server-filtering
✗ test-agentize-server-module-exports FAILED
✓ test-agentize-server-pr-discovery
✓ test-agentize-server-refinement
✓ test-agentize-server-session-lookup
✓ test-agentize-server-telegram-notify
✗ test-agentize-server-workers FAILED
✓ test-handsoff-session-path
✗ test-hook-permission-matching FAILED
✓ test-install-script
✓ test-lol-claude-clean
✓ test-lol-command-functions-loaded
✓ test-lol-complete-commands
✓ test-lol-complete-flags
✓ test-lol-help-text
✓ test-lol-python-cli
✓ test-lol-serve-graphql
✓ test-lol-serve
✓ test-lol-usage
✓ test-lol-version
✓ test-test-all-strict-shells
✓ test-wt-bare-repo-required
✓ test-wt-clone-basic
✓ test-wt-complete-commands
✓ test-wt-complete-flags
✓ test-wt-complete-goto-targets
✓ test-wt-goto
✓ test-wt-pathto
✓ test-wt-purge
✓ test-wt-rebase
✓ test-wt-spawn-claims-status
✓ test-wt-spawn-headless-nonblocking
✓ test-wt-zsh-completion-crash
✓ test-external-consensus-doc-planning
✓ test-lol-zsh-completion-file
✓ test-makefile-setup-zsh-completion
✓ test-plugin-manifest-valid
✓ test-wt-zsh-completion-file
✗ test-ccr-logs-permission FAILED
✓ test-external-consensus-issue-interface
✗ test-gh-credential FAILED
✓ test-lint-documentation
✓ test-lol-project-associate
✓ test-lol-project-auto-field
✓ test-lol-project-automation-write
✓ test-lol-project-automation
✓ test-lol-project-create-user
✓ test-lol-project-create
✓ test-lol-project-help
✓ test-lol-project-metadata-preservation
✓ test-lol-project-missing-metadata
✓ test-open-issue-draft-non-plan
✓ test-open-issue-update-maintains-format
✓ test-open-issue-update-mode
✓ test-open-issue-with-draft
✓ test-open-issue-without-draft
✗ test-sandbox-build FAILED
✗ test-sandbox-run-cmd-option FAILED
✓ test-sandbox-run
✗ test-sandbox-session-management FAILED
✗ test-sandbox-volume-permissions FAILED
✓ test-worktree-flag-order-after-issue
✓ test-worktree-reject-description-arg
✓ test-worktree-spawn-yolo-no-agent
✓ test-wt-cross-init-creates-main
✓ test-wt-cross-invalid-agentize-home
✓ test-wt-cross-missing-agentize-home
✓ test-wt-cross-spawn-from-linked
======================================
Test Summary for bash
Total: 71
Passed: 62
Failed: 9
Some tests failed in bash!
======================================
Running all Agentize SDK tests
Shell: zsh
✓ test-agentize-cli-invalid-agentize-home
✓ test-agentize-cli-missing-agentize-home
✓ test-agentize-server-feat-request
✓ test-agentize-server-filtering
✗ test-agentize-server-module-exports FAILED
✓ test-agentize-server-pr-discovery
✓ test-agentize-server-refinement
✓ test-agentize-server-session-lookup
✓ test-agentize-server-telegram-notify
✗ test-agentize-server-workers FAILED
✓ test-handsoff-session-path
✓ test-hook-permission-matching (skipped: hook tests only run in bash)
✓ test-install-script
✓ test-lol-claude-clean
✓ test-lol-command-functions-loaded
✓ test-lol-complete-commands
✓ test-lol-complete-flags
✓ test-lol-help-text
✓ test-lol-python-cli
✓ test-lol-serve-graphql
✓ test-lol-serve
✓ test-lol-usage
✓ test-lol-version
✓ test-test-all-strict-shells
✓ test-wt-bare-repo-required
✓ test-wt-clone-basic
✓ test-wt-complete-commands
✓ test-wt-complete-flags
✓ test-wt-complete-goto-targets
✓ test-wt-goto
✓ test-wt-pathto
✓ test-wt-purge
✓ test-wt-rebase
✓ test-wt-spawn-claims-status
✓ test-wt-spawn-headless-nonblocking
✓ test-wt-zsh-completion-crash
✓ test-external-consensus-doc-planning
✓ test-lol-zsh-completion-file
✓ test-makefile-setup-zsh-completion
✓ test-plugin-manifest-valid
✓ test-wt-zsh-completion-file
✗ test-ccr-logs-permission FAILED
✓ test-external-consensus-issue-interface
✗ test-gh-credential FAILED
✓ test-lint-documentation
✓ test-lol-project-associate
✓ test-lol-project-auto-field
✓ test-lol-project-automation-write
✓ test-lol-project-automation
✓ test-lol-project-create-user
✓ test-lol-project-create
✓ test-lol-project-help
✓ test-lol-project-metadata-preservation
✓ test-lol-project-missing-metadata
✓ test-open-issue-draft-non-plan
✓ test-open-issue-update-maintains-format
✓ test-open-issue-update-mode
✓ test-open-issue-with-draft
✓ test-open-issue-without-draft
✗ test-sandbox-build FAILED
✗ test-sandbox-run-cmd-option FAILED
✓ test-sandbox-run
✗ test-sandbox-session-management FAILED
✗ test-sandbox-volume-permissions FAILED
✓ test-worktree-flag-order-after-issue
✓ test-worktree-reject-description-arg
✓ test-worktree-spawn-yolo-no-agent
✓ test-wt-cross-init-creates-main
✓ test-wt-cross-invalid-agentize-home
✓ test-wt-cross-missing-agentize-home
✓ test-wt-cross-spawn-from-linked
======================================
Test Summary for zsh
Total: 71
Passed: 63
Failed: 8
Some tests failed in zsh!
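The two runs above differ by one failure, which is easier to see by diffing the failed test names per shell. This is a sketch that assumes the `✗ <name> FAILED` line format shown in the output; the function name is hypothetical and the sample lines are excerpts of the log above.

```python
def failed_tests(output):
    """Collect test names from lines of the form '✗ <name> FAILED'."""
    failed = set()
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("✗"):
            parts = line[1:].split()
            if parts:
                failed.add(parts[0])
    return failed


# Excerpts copied from the bash and zsh runs.
bash_excerpt = """✗ test-agentize-server-workers FAILED
✗ test-hook-permission-matching FAILED
✓ test-install-script
"""
zsh_excerpt = """✗ test-agentize-server-workers FAILED
✓ test-hook-permission-matching (skipped: hook tests only run in bash)
✓ test-install-script
"""

bash_only = failed_tests(bash_excerpt) - failed_tests(zsh_excerpt)
print(sorted(bash_only))  # ['test-hook-permission-matching']
```

Applied to the full logs, the same difference accounts for the 9 failures in bash versus 8 in zsh: the hook test is skipped under zsh and counted as passed there.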
Estimated Total Time: 80-100 minutes
Files to Modify
| File | Purpose |
|---|---|
| docs/test/code-review-agent.md | Update status from "To test" to validated, populate all TBD fields |
Related Issues
- #38, "[plan][agent.skill] Enhanced code review with advanced quality standards and agent infrastructure": original code-review agent creation
- #489, "[refactor][#488] Fix permission evaluation ordering with clear rule priority": recent PR that can serve as a dogfooding target
Test Strategy
- Dogfooding-first validation: Use agentize to validate itself
- Real PR testing: Test on actual merged PRs with meaningful changes
- Structured validation: Follow TC-1 through TC-7 systematically
- Documentation update: Populate all validation fields for future reference
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Agent fails to start | Low | Medium | Verify Task tool configuration before testing |
| Review takes too long | Medium | Low | Use timeout and document observed duration |
| Review quality poor | Medium | Medium | Compare with command-based review (TC-7) |
| Edge cases not covered | Low | Low | Document as "known limitations" in findings |