fix: order submit_review fields so verdict is committed last #61
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "fix/tool-schema-verdict-last"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The submit_review tool's required-field order was [verdict,
verdict_descriptive, complexity, body], which means tool-call streaming
emitted the verdict first and the body last. The model effectively
committed to a verdict before working through the diff, then wrote
the body as post-hoc justification. We saw this in real reviews
where Opus's body walked back its headline concern mid-paragraph but
couldn't update the already-streamed Flag verdict.
Reorder required to [body, complexity, verdict_descriptive, verdict].
The body becomes the model's scratchpad, complexity is an
intermediate observation, verdict_descriptive is a one-sentence
synthesis, and verdict is the final commitment based on what the
model just wrote. Field descriptions are tightened to encode the
expectation: body says "treat as scratchpad", verdict says "set this
only after writing the body and verdict_descriptive above".
A test pins the required-array order so future refactors don't
silently regress.
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id
r-f9a7ae-f692e1This is an AI-generated review and may contain mistakes.
Status: ✅ Completed
✅ Verdict: LGTM — The changes reorder tool schema fields to improve AI reasoning flow and add protective tests - no security or logic issues detected.
Complexity: moderate
Reviewing this diff that reorders the submit_review tool's required fields to fix a calibration issue...
The core change is moving
bodyfirst andverdictlast in the required field order. The PR description explains this addresses a streaming issue where the model would commit to a verdict before reasoning through the problem, then write post-hoc justification that sometimes contradicted the already-emitted verdict.Looking at the changes:
Field reordering:
["verdict", "verdict_descriptive", "complexity", "body"]→["body", "complexity", "verdict_descriptive", "verdict"]- This puts reasoning before conclusion, which makes sense for AI tool calling.Enhanced descriptions: The field descriptions are updated to be more explicit about the intended workflow:
bodynow says "Write this first" and "Treat this field as your scratchpad"verdictnow says "Final commitment — set this only after writing the body and verdict_descriptive above"verdict_descriptiveclarifies it should track what was written in the bodyTest coverage: A new test locks in the required field order to prevent silent regression during refactoring.
The change is logical - it aligns the tool schema with the intended reasoning flow. The descriptions provide clear guidance to the model about the expected sequence. The test ensures this critical ordering won't be accidentally broken.
No security issues, logic errors, or breaking changes. This is a thoughtful fix to improve AI model calibration by matching tool field order to reasoning flow.
Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id
r-f9a92a-dbf783This is an AI-generated review and may contain mistakes.
Status: ✅ Completed
✅ Verdict: LGTM — The field reordering fix addresses a real streaming issue, test fixtures contain intentional fake credentials for testing, and the architectural changes look sound.
Complexity: complex
Tool Schema Reordering Review
This PR reorganizes the
submit_reviewtool schema to putbodyfirst andverdictlast, addressing the streaming problem where the model was committing to verdicts before reasoning through the diff. I'll work through the key changes:Field Reordering Changes
Looking at
internal/reviewer/tools.go, the required fields order changed from:["verdict", "verdict_descriptive", "complexity", "body"]["body", "verdict_descriptive", "verdict"]This is a sensible fix - putting
bodyfirst lets the model write out its reasoning before committing to a verdict. The test intools_test.golocks in this ordering to prevent future regressions.Complexity Field Migration
The
complexityfield has been moved entirely to a new preflight pass system:ToolArgsstruct no longer hasComplexity)PreflightRequest/PreflightResulttypessubmit_preflighttool handles complexity + observationsThis architectural change makes sense - complexity estimation can happen on a cheaper model before the full review.
Database Schema Addition
New migration adds an
observationsJSONB column to store preflight findings. The migration looks correct with proper up/down scripts.Security Observations
The automated scanner flagged potential secrets in several test fixture files:
internal/reviewer/testdata/fixtures/02-security-hardcoded-token/diff.patchcontains what looks like a Forgejo tokenHowever, examining these files shows they're intentionally crafted test fixtures designed to test the reviewer's ability to detect hardcoded secrets. The token
fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRyin the security fixture is clearly a test case, not a real credential.Architecture Changes
The diff shows a significant refactor from Python to Go, but this appears to be part of the broader system rewrite. The new fixture test system in
fixtures_test.goprovides good regression testing for the LLM's ability to catch obvious bugs.Configuration Changes
New environment variables added (
PREFLIGHT_MODEL,DASHBOARD_BASE_URL) with sensible defaults. The preflight model defaults to a Haiku-class model for cost efficiency.Potential Issues
tools.go(500+ chars) is just a detailed comment explaining the tool schema - not a code issueThe core change - reordering tool fields to put reasoning before verdict - directly addresses the stated streaming problem. The model will now write its analysis in the
bodyfield first, then synthesize that intoverdict_descriptive, then commit to the finalverdict. This should eliminate the issue where verdicts were locked in before the reasoning was complete.Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id
r-f9aab3-e11a6dThis is an AI-genrated review and may contain mistakes.
Status: ✅ Completed
✅ Verdict: LGTM — Internal tool schema reorder with test coverage to improve model reasoning flow - no safety concerns detected.
Complexity: trivial
Safety Check Review
This PR reorders the fields in a tool schema to improve model behavior during streaming tool calls. The changes are:
Tool schema reordering: The
requiredfield order intoolArgsSchemais changed from[verdict, verdict_descriptive, complexity, body]to[body, complexity, verdict_descriptive, verdict]. This ensures the model writes its reasoning (body) before committing to a verdict.Description cleanup: Removes "Set this last, after working through observations" from the complexity field description since it's no longer the last field.
Test coverage: Adds a test that locks in the required field order to prevent regression.
The change addresses a real problem where streaming tool calls would emit the verdict first, causing the model to commit to a decision before fully analyzing the diff. This is a logical improvement to the tool design.
Security/Safety considerations:
Code correctness:
This is a sensible improvement to model behavior with appropriate test coverage.
Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id
r-f9ab5a-3d53faThis is an AI-generated review and may contain mistakes.
Status: ✅ Completed
✅ Verdict: LGTM — The changes logically reorder tool parameters to improve AI review quality with no security concerns or breaking changes.
Complexity: trivial
This PR addresses a tool design issue where the submit_review function's field ordering was causing the AI model to commit to a verdict before fully analyzing the code. The change reorders required fields from [verdict, verdict_descriptive, complexity, body] to [body, complexity, verdict_descriptive, verdict].
Looking at the diff:
tools.go: Removes instruction text from complexity field description that said "Set this last, after working through observations" - this makes sense since complexity is no longer last in the order.
tools_test.go: Adds a new test that locks in the required field order as [body, verdict_descriptive, verdict]. The test checks both TierQuick and TierFull schemas.
The PR description explains this is fixing a real problem where streaming tool calls were emitting verdicts before the analysis body, leading to inconsistent reviews where the body would contradict the already-committed verdict.
This is a logical fix that should improve the quality of automated reviews by ensuring the AI works through its analysis before committing to a decision. No security issues, no accidental commits, no breaking changes - just a sensible reordering of tool parameters.