feat: add LLM fixture suite for regression testing #60

Merged
rcsheets merged 2 commits from feat/fixture-suite into main 2026-05-05 08:13:34 +00:00
Owner

Seven obvious-bug Go diffs in internal/reviewer/testdata/fixtures, one per
category (logic, security/creds, error handling, concurrency, resource leak,
performance, security/injection). TestFixtures runs each through the live
Anthropic backend's quick-scan path and asserts Result.Verdict matches
want.json.

Gated on ANTHROPIC_API_KEY so go test ./... and per-PR CI skip cleanly.
A dedicated .forgejo/workflows/fixtures.yaml runs the suite on
workflow_dispatch + nightly cron with the API key wired from secrets.

This is v1; controls (clean-refactor, docs-only) and harder difficulty
tiers are tracked as follow-ups so we can also measure false-positive rate.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Seven obvious-bug Go diffs in internal/reviewer/testdata/fixtures, one per category (logic, security/creds, error handling, concurrency, resource leak, performance, security/injection). TestFixtures runs each through the live Anthropic backend's quick-scan path and asserts Result.Verdict matches want.json. Gated on ANTHROPIC_API_KEY so `go test ./...` and per-PR CI skip cleanly. A dedicated .forgejo/workflows/fixtures.yaml runs the suite on workflow_dispatch + nightly cron with the API key wired from secrets. This is v1; controls (clean-refactor, docs-only) and harder difficulty tiers are tracked as follow-ups so we can also measure false-positive rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat: add LLM fixture suite for regression testing
All checks were successful
ci / check (pull_request) Successful in 44s
481f60fdf5
Seven obvious-bug Go diffs in internal/reviewer/testdata/fixtures, one per
category (logic, security/creds, error handling, concurrency, resource leak,
performance, security/injection). TestFixtures runs each through the live
Anthropic backend's quick-scan path and asserts Result.Verdict matches
want.json.

Gated on ANTHROPIC_API_KEY so `go test ./...` and per-PR CI skip cleanly.
A dedicated .forgejo/workflows/fixtures.yaml runs the suite on
workflow_dispatch + nightly cron with the API key wired from secrets.

This is v1; controls (clean-refactor, docs-only) and harder difficulty
tiers are tracked as follow-ups so we can also measure false-positive rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

Automated review by pr-reviewer v0.27.1 | Full Review | Claude Sonnet 4 | tracking id r-f9a59a-1d418d
This is an AI-generated reveiw and may contain mistakes.

Status: Completed


Verdict: Approve — Well-designed regression test suite with comprehensive fixtures covering key bug categories, proper CI integration, and clean implementation that validates the core LLM review functionality.
Complexity: trivial

LLM Fixture Suite for Regression Testing

This PR introduces a comprehensive test suite for validating the LLM reviewer's ability to detect obvious bugs across seven key categories. Let me work through the implementation:

Overall Approach

The design is well-thought-out:

  • Regression testing focus: Seven carefully crafted fixtures covering logic errors, security issues, error handling, concurrency bugs, resource leaks, performance regressions, and injection vulnerabilities
  • Live LLM integration: Tests against the actual Anthropic backend using quick-scan tier, providing real validation of the reviewer's capabilities
  • CI integration: Smart gating on ANTHROPIC_API_KEY keeps regular CI free while enabling dedicated fixture runs via workflow dispatch and nightly cron
  • Structured validation: Each fixture includes a want.json with expected verdict and explanatory notes

Test Implementation (internal/reviewer/fixtures_test.go)

The test logic is solid:

  • Proper environment-based skipping when API key unavailable
  • Sensible defaults (uses DefaultAnthropicModel when ANTHROPIC_MODEL unset)
  • Realistic request construction using TierQuick with appropriate context
  • Clear error reporting that includes both expected/actual verdicts and the LLM's reasoning
  • 60-second timeout per fixture is reasonable for LLM calls

Fixture Quality

Each of the seven fixtures represents a genuinely obvious bug that should be caught:

  1. Logic (inverted boolean): !user.IsAdminuser.IsAdmin - classic auth bypass
  2. Security (hardcoded token): Replacing env var with literal API token - clear credential exposure
  3. Error handling (discarded unmarshal): Silent JSON parsing failure - data corruption risk
  4. Concurrency (removed mutex): Concurrent map access without synchronization - panic-inducing race condition
  5. Resource leak (missing close): File descriptor leak on every call
  6. Performance (nested loop): O(n+m) → O(n*m) with documented large input sizes
  7. Security (command injection): Safe exec args → shell interpolation of untrusted input

The diffs are minimal and focused, making them good test cases for the reviewer's ability to spot critical issues without noise.

CI Workflow (.forgejo/workflows/fixtures.yaml)

The workflow design addresses the key requirements:

  • Triggered appropriately: Manual dispatch for ad-hoc testing, daily cron for regression detection
  • Environment isolation: Uses dedicated runner and Go 1.26 container
  • Secret handling: Properly wires ANTHROPIC_API_KEY from secrets
  • Targeted execution: Runs only the fixture tests, not the full suite

The checkout logic properly handles the Forgejo token for private repo access.

Architecture Alignment

This fits well with the project's testing philosophy:

  • Leverages the existing reviewer.Request and backend interfaces
  • Tests the actual production code path (quick-scan tier)
  • Uses the structured output validation that the real system depends on
  • Follows the project's approach of practical, focused testing

Edge Cases Considered

  • API key availability check prevents failures in contributor environments
  • Model override capability for testing different versions
  • Proper timeout handling for network calls
  • Structured error messages that aid debugging when fixtures fail

Potential Concerns

None significant. The automated observations flagged "possible secrets" in the fixture files, but these are intentionally crafted test data (the hardcoded token fixture contains fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy which is clearly a test token pattern).

Minor suggestion: The fixtures could benefit from a README explaining their purpose and how to add new ones, but this isn't blocking.

Value Proposition

This suite provides:

  • Regression detection: Ensures the LLM continues catching obvious bugs as models/prompts evolve
  • Confidence in deployments: Nightly validation that the system works end-to-end
  • Development workflow: Manual testing during LLM backend changes
  • Future extensibility: Clear framework for adding harder fixtures or false-positive controls

The implementation is clean, focused, and addresses a real need for validating the core value proposition of the PR reviewer system.

📚 Context used in this review

Project guidance files:

  • CLAUDE.md

Repository fragments:

  • internal/reviewer/reviewer.go (internal/reviewer/reviewer.go:16-77) — extracted in answer to: "The definitions of TierQuick constant, Request struct, PRContext struct, and any verdict constants or expected values"
  • internal/reviewer/backend_anthropic.go (internal/reviewer/backend_anthropic.go:12-28) — extracted in answer to: "The DefaultAnthropicModel constant value and NewAnthropic function signature"
*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.27.1 | Full Review | Claude Sonnet 4 | tracking id `r-f9a59a-1d418d`* *This is an AI-generated reveiw and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: Approve** — Well-designed regression test suite with comprehensive fixtures covering key bug categories, proper CI integration, and clean implementation that validates the core LLM review functionality. **Complexity:** trivial # LLM Fixture Suite for Regression Testing This PR introduces a comprehensive test suite for validating the LLM reviewer's ability to detect obvious bugs across seven key categories. Let me work through the implementation: ## Overall Approach ✅ The design is well-thought-out: - **Regression testing focus**: Seven carefully crafted fixtures covering logic errors, security issues, error handling, concurrency bugs, resource leaks, performance regressions, and injection vulnerabilities - **Live LLM integration**: Tests against the actual Anthropic backend using quick-scan tier, providing real validation of the reviewer's capabilities - **CI integration**: Smart gating on `ANTHROPIC_API_KEY` keeps regular CI free while enabling dedicated fixture runs via workflow dispatch and nightly cron - **Structured validation**: Each fixture includes a `want.json` with expected verdict and explanatory notes ## Test Implementation (`internal/reviewer/fixtures_test.go`) ✅ The test logic is solid: - Proper environment-based skipping when API key unavailable - Sensible defaults (uses `DefaultAnthropicModel` when `ANTHROPIC_MODEL` unset) - Realistic request construction using `TierQuick` with appropriate context - Clear error reporting that includes both expected/actual verdicts and the LLM's reasoning - 60-second timeout per fixture is reasonable for LLM calls ## Fixture Quality ✅ Each of the seven fixtures represents a genuinely obvious bug that should be caught: 1. **Logic (inverted boolean)**: `!user.IsAdmin` → `user.IsAdmin` - classic auth bypass 2. **Security (hardcoded token)**: Replacing env var with literal API token - clear credential exposure 3. **Error handling (discarded unmarshal)**: Silent JSON parsing failure - data corruption risk 4. **Concurrency (removed mutex)**: Concurrent map access without synchronization - panic-inducing race condition 5. **Resource leak (missing close)**: File descriptor leak on every call 6. **Performance (nested loop)**: O(n+m) → O(n*m) with documented large input sizes 7. **Security (command injection)**: Safe exec args → shell interpolation of untrusted input The diffs are minimal and focused, making them good test cases for the reviewer's ability to spot critical issues without noise. ## CI Workflow (`.forgejo/workflows/fixtures.yaml`) ✅ The workflow design addresses the key requirements: - **Triggered appropriately**: Manual dispatch for ad-hoc testing, daily cron for regression detection - **Environment isolation**: Uses dedicated runner and Go 1.26 container - **Secret handling**: Properly wires `ANTHROPIC_API_KEY` from secrets - **Targeted execution**: Runs only the fixture tests, not the full suite The checkout logic properly handles the Forgejo token for private repo access. ## Architecture Alignment ✅ This fits well with the project's testing philosophy: - Leverages the existing `reviewer.Request` and backend interfaces - Tests the actual production code path (quick-scan tier) - Uses the structured output validation that the real system depends on - Follows the project's approach of practical, focused testing ## Edge Cases Considered ✅ - API key availability check prevents failures in contributor environments - Model override capability for testing different versions - Proper timeout handling for network calls - Structured error messages that aid debugging when fixtures fail ## Potential Concerns **None significant.** The automated observations flagged "possible secrets" in the fixture files, but these are intentionally crafted test data (the hardcoded token fixture contains `fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy` which is clearly a test token pattern). **Minor suggestion**: The fixtures could benefit from a README explaining their purpose and how to add new ones, but this isn't blocking. ## Value Proposition This suite provides: - **Regression detection**: Ensures the LLM continues catching obvious bugs as models/prompts evolve - **Confidence in deployments**: Nightly validation that the system works end-to-end - **Development workflow**: Manual testing during LLM backend changes - **Future extensibility**: Clear framework for adding harder fixtures or false-positive controls The implementation is clean, focused, and addresses a real need for validating the core value proposition of the PR reviewer system. <details> <summary>📚 Context used in this review</summary> **Project guidance files:** - `CLAUDE.md` **Repository fragments:** - `internal/reviewer/reviewer.go` (internal/reviewer/reviewer.go:16-77) — extracted in answer to: "The definitions of TierQuick constant, Request struct, PRContext struct, and any verdict constants or expected values" - `internal/reviewer/backend_anthropic.go` (internal/reviewer/backend_anthropic.go:12-28) — extracted in answer to: "The DefaultAnthropicModel constant value and NewAnthropic function signature" </details>
doc: add README for the reviewer fixture suite
All checks were successful
ci / check (pull_request) Successful in 43s
74bd8522a2
Explains layout, want.json schema, how to add a new fixture, how to run
locally vs in CI, and what's tracked for future iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rcsheets deleted branch feat/fixture-suite 2026-05-05 08:13:34 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
brooktrails/pr-reviewer!60
No description provided.