brooktrails/pr-reviewer

Fork 0

feat: add LLM fixture suite for regression testing #60

Merged

rcsheets merged 2 commits from feat/fixture-suite into main

2026-05-05 08:13:34 +00:00

rcsheets commented

2026-05-05 08:08:57 +00:00

Owner

Gated on ANTHROPIC_API_KEY so go test ./... and per-PR CI skip cleanly.
A dedicated .forgejo/workflows/fixtures.yaml runs the suite on
workflow_dispatch + nightly cron with the API key wired from secrets.

This is v1; controls (clean-refactor, docs-only) and harder difficulty
tiers are tracked as follow-ups so we can also measure false-positive rate.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Seven obvious-bug Go diffs in internal/reviewer/testdata/fixtures, one per category (logic, security/creds, error handling, concurrency, resource leak, performance, security/injection). TestFixtures runs each through the live Anthropic backend's quick-scan path and asserts Result.Verdict matches want.json. Gated on ANTHROPIC_API_KEY so `go test ./...` and per-PR CI skip cleanly. A dedicated .forgejo/workflows/fixtures.yaml runs the suite on workflow_dispatch + nightly cron with the API key wired from secrets. This is v1; controls (clean-refactor, docs-only) and harder difficulty tiers are tracked as follow-ups so we can also measure false-positive rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rcsheets added 1 commit

2026-05-05 08:08:57 +00:00

feat: add LLM fixture suite for regression testing

ci / check (pull_request) Successful in 44s

Details

481f60fdf5

Seven obvious-bug Go diffs in internal/reviewer/testdata/fixtures, one per
category (logic, security/creds, error handling, concurrency, resource leak,
performance, security/injection). TestFixtures runs each through the live
Anthropic backend's quick-scan path and asserts Result.Verdict matches
want.json.

Gated on ANTHROPIC_API_KEY so `go test ./...` and per-PR CI skip cleanly.
A dedicated .forgejo/workflows/fixtures.yaml runs the suite on
workflow_dispatch + nightly cron with the API key wired from secrets.

This is v1; controls (clean-refactor, docs-only) and harder difficulty
tiers are tracked as follow-ups so we can also measure false-positive rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pr-reviewer-bot commented

2026-05-05 08:08:58 +00:00

Collaborator

Automated review by pr-reviewer v0.27.1 | Full Review | Claude Sonnet 4 | tracking id r-f9a59a-1d418d
This is an AI-generated reveiw and may contain mistakes.

Status: ✅ Completed

✅ Verdict: Approve — Well-designed regression test suite with comprehensive fixtures covering key bug categories, proper CI integration, and clean implementation that validates the core LLM review functionality.
Complexity: trivial

LLM Fixture Suite for Regression Testing

This PR introduces a comprehensive test suite for validating the LLM reviewer's ability to detect obvious bugs across seven key categories. Let me work through the implementation:

Overall Approach ✅

The design is well-thought-out:

Regression testing focus: Seven carefully crafted fixtures covering logic errors, security issues, error handling, concurrency bugs, resource leaks, performance regressions, and injection vulnerabilities
Live LLM integration: Tests against the actual Anthropic backend using quick-scan tier, providing real validation of the reviewer's capabilities
CI integration: Smart gating on ANTHROPIC_API_KEY keeps regular CI free while enabling dedicated fixture runs via workflow dispatch and nightly cron
Structured validation: Each fixture includes a want.json with expected verdict and explanatory notes

Test Implementation (`internal/reviewer/fixtures_test.go`) ✅

The test logic is solid:

Proper environment-based skipping when API key unavailable
Sensible defaults (uses DefaultAnthropicModel when ANTHROPIC_MODEL unset)
Realistic request construction using TierQuick with appropriate context
Clear error reporting that includes both expected/actual verdicts and the LLM's reasoning
60-second timeout per fixture is reasonable for LLM calls

Fixture Quality ✅

Each of the seven fixtures represents a genuinely obvious bug that should be caught:

Logic (inverted boolean): !user.IsAdmin → user.IsAdmin - classic auth bypass
Security (hardcoded token): Replacing env var with literal API token - clear credential exposure
Error handling (discarded unmarshal): Silent JSON parsing failure - data corruption risk
Concurrency (removed mutex): Concurrent map access without synchronization - panic-inducing race condition
Resource leak (missing close): File descriptor leak on every call
Performance (nested loop): O(n+m) → O(n*m) with documented large input sizes
Security (command injection): Safe exec args → shell interpolation of untrusted input

The diffs are minimal and focused, making them good test cases for the reviewer's ability to spot critical issues without noise.

CI Workflow (`.forgejo/workflows/fixtures.yaml`) ✅

The workflow design addresses the key requirements:

Triggered appropriately: Manual dispatch for ad-hoc testing, daily cron for regression detection
Environment isolation: Uses dedicated runner and Go 1.26 container
Secret handling: Properly wires ANTHROPIC_API_KEY from secrets
Targeted execution: Runs only the fixture tests, not the full suite

The checkout logic properly handles the Forgejo token for private repo access.

Architecture Alignment ✅

This fits well with the project's testing philosophy:

Leverages the existing reviewer.Request and backend interfaces
Tests the actual production code path (quick-scan tier)
Uses the structured output validation that the real system depends on
Follows the project's approach of practical, focused testing

Edge Cases Considered ✅

API key availability check prevents failures in contributor environments
Model override capability for testing different versions
Proper timeout handling for network calls
Structured error messages that aid debugging when fixtures fail

Potential Concerns

None significant. The automated observations flagged "possible secrets" in the fixture files, but these are intentionally crafted test data (the hardcoded token fixture contains fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy which is clearly a test token pattern).

Minor suggestion: The fixtures could benefit from a README explaining their purpose and how to add new ones, but this isn't blocking.

Value Proposition

This suite provides:

Regression detection: Ensures the LLM continues catching obvious bugs as models/prompts evolve
Confidence in deployments: Nightly validation that the system works end-to-end
Development workflow: Manual testing during LLM backend changes
Future extensibility: Clear framework for adding harder fixtures or false-positive controls

The implementation is clean, focused, and addresses a real need for validating the core value proposition of the PR reviewer system.

📚 Context used in this review

Project guidance files:

CLAUDE.md

Repository fragments:

internal/reviewer/reviewer.go (internal/reviewer/reviewer.go:16-77) — extracted in answer to: "The definitions of TierQuick constant, Request struct, PRContext struct, and any verdict constants or expected values"
internal/reviewer/backend_anthropic.go (internal/reviewer/backend_anthropic.go:12-28) — extracted in answer to: "The DefaultAnthropicModel constant value and NewAnthropic function signature"

*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.27.1 | Full Review | Claude Sonnet 4 | tracking id `r-f9a59a-1d418d`* *This is an AI-generated reveiw and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: Approve** — Well-designed regression test suite with comprehensive fixtures covering key bug categories, proper CI integration, and clean implementation that validates the core LLM review functionality. **Complexity:** trivial # LLM Fixture Suite for Regression Testing This PR introduces a comprehensive test suite for validating the LLM reviewer's ability to detect obvious bugs across seven key categories. Let me work through the implementation: ## Overall Approach ✅ The design is well-thought-out: - **Regression testing focus**: Seven carefully crafted fixtures covering logic errors, security issues, error handling, concurrency bugs, resource leaks, performance regressions, and injection vulnerabilities - **Live LLM integration**: Tests against the actual Anthropic backend using quick-scan tier, providing real validation of the reviewer's capabilities - **CI integration**: Smart gating on `ANTHROPIC_API_KEY` keeps regular CI free while enabling dedicated fixture runs via workflow dispatch and nightly cron - **Structured validation**: Each fixture includes a `want.json` with expected verdict and explanatory notes ## Test Implementation (`internal/reviewer/fixtures_test.go`) ✅ The test logic is solid: - Proper environment-based skipping when API key unavailable - Sensible defaults (uses `DefaultAnthropicModel` when `ANTHROPIC_MODEL` unset) - Realistic request construction using `TierQuick` with appropriate context - Clear error reporting that includes both expected/actual verdicts and the LLM's reasoning - 60-second timeout per fixture is reasonable for LLM calls ## Fixture Quality ✅ Each of the seven fixtures represents a genuinely obvious bug that should be caught: 1. **Logic (inverted boolean)**: `!user.IsAdmin` → `user.IsAdmin` - classic auth bypass 2. **Security (hardcoded token)**: Replacing env var with literal API token - clear credential exposure 3. **Error handling (discarded unmarshal)**: Silent JSON parsing failure - data corruption risk 4. **Concurrency (removed mutex)**: Concurrent map access without synchronization - panic-inducing race condition 5. **Resource leak (missing close)**: File descriptor leak on every call 6. **Performance (nested loop)**: O(n+m) → O(n*m) with documented large input sizes 7. **Security (command injection)**: Safe exec args → shell interpolation of untrusted input The diffs are minimal and focused, making them good test cases for the reviewer's ability to spot critical issues without noise. ## CI Workflow (`.forgejo/workflows/fixtures.yaml`) ✅ The workflow design addresses the key requirements: - **Triggered appropriately**: Manual dispatch for ad-hoc testing, daily cron for regression detection - **Environment isolation**: Uses dedicated runner and Go 1.26 container - **Secret handling**: Properly wires `ANTHROPIC_API_KEY` from secrets - **Targeted execution**: Runs only the fixture tests, not the full suite The checkout logic properly handles the Forgejo token for private repo access. ## Architecture Alignment ✅ This fits well with the project's testing philosophy: - Leverages the existing `reviewer.Request` and backend interfaces - Tests the actual production code path (quick-scan tier) - Uses the structured output validation that the real system depends on - Follows the project's approach of practical, focused testing ## Edge Cases Considered ✅ - API key availability check prevents failures in contributor environments - Model override capability for testing different versions - Proper timeout handling for network calls - Structured error messages that aid debugging when fixtures fail ## Potential Concerns **None significant.** The automated observations flagged "possible secrets" in the fixture files, but these are intentionally crafted test data (the hardcoded token fixture contains `fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy` which is clearly a test token pattern). **Minor suggestion**: The fixtures could benefit from a README explaining their purpose and how to add new ones, but this isn't blocking. ## Value Proposition This suite provides: - **Regression detection**: Ensures the LLM continues catching obvious bugs as models/prompts evolve - **Confidence in deployments**: Nightly validation that the system works end-to-end - **Development workflow**: Manual testing during LLM backend changes - **Future extensibility**: Clear framework for adding harder fixtures or false-positive controls The implementation is clean, focused, and addresses a real need for validating the core value proposition of the PR reviewer system. <details> <summary>📚 Context used in this review</summary> **Project guidance files:** - `CLAUDE.md` **Repository fragments:** - `internal/reviewer/reviewer.go` (internal/reviewer/reviewer.go:16-77) — extracted in answer to: "The definitions of TierQuick constant, Request struct, PRContext struct, and any verdict constants or expected values" - `internal/reviewer/backend_anthropic.go` (internal/reviewer/backend_anthropic.go:12-28) — extracted in answer to: "The DefaultAnthropicModel constant value and NewAnthropic function signature" </details>

rcsheets requested review from pr-reviewer-bot

2026-05-05 08:09:01 +00:00

rcsheets added 1 commit

2026-05-05 08:13:05 +00:00

doc: add README for the reviewer fixture suite

ci / check (pull_request) Successful in 43s

Details

74bd8522a2

Explains layout, want.json schema, how to add a new fixture, how to run
locally vs in CI, and what's tracked for future iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rcsheets merged commit bfce38b212 into main

2026-05-05 08:13:34 +00:00

rcsheets deleted branch feat/fixture-suite

2026-05-05 08:13:34 +00:00

rcsheets referenced this pull request from a commit

2026-05-05 08:13:34 +00:00

Merge pull request 'feat: add LLM fixture suite for regression testing' (#60) from feat/fixture-suite into main

No reviewers

No labels

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

brooktrails/pr-reviewer!60

No description provided.

Rows
Columns

feat: add LLM fixture suite for regression testing #60

LLM Fixture Suite for Regression Testing

Overall Approach ✅

Test Implementation (internal/reviewer/fixtures_test.go) ✅

Fixture Quality ✅

CI Workflow (.forgejo/workflows/fixtures.yaml) ✅

Architecture Alignment ✅

Edge Cases Considered ✅

Potential Concerns

Value Proposition

Test Implementation (`internal/reviewer/fixtures_test.go`) ✅

CI Workflow (`.forgejo/workflows/fixtures.yaml`) ✅