fix: order submit_review fields so verdict is committed last #61

Merged
rcsheets merged 5 commits from fix/tool-schema-verdict-last into main 2026-05-05 08:34:46 +00:00
Owner

The submit_review tool's required-field order was [verdict,
verdict_descriptive, complexity, body], which means tool-call streaming
emitted the verdict first and the body last. The model effectively
committed to a verdict before working through the diff, then wrote
the body as post-hoc justification. We saw this in real reviews
where Opus's body walked back its headline concern mid-paragraph but
couldn't update the already-streamed Flag verdict.

Reorder required to [body, complexity, verdict_descriptive, verdict].
The body becomes the model's scratchpad, complexity is an
intermediate observation, verdict_descriptive is a one-sentence
synthesis, and verdict is the final commitment based on what the
model just wrote. Field descriptions are tightened to encode the
expectation: body says "treat as scratchpad", verdict says "set this
only after writing the body and verdict_descriptive above".

A test pins the required-array order so future refactors don't
silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

The submit_review tool's required-field order was [verdict, verdict_descriptive, complexity, body], which means tool-call streaming emitted the verdict first and the body last. The model effectively committed to a verdict before working through the diff, then wrote the body as post-hoc justification. We saw this in real reviews where Opus's body walked back its headline concern mid-paragraph but couldn't update the already-streamed Flag verdict. Reorder required to [body, complexity, verdict_descriptive, verdict]. The body becomes the model's scratchpad, complexity is an intermediate observation, verdict_descriptive is a one-sentence synthesis, and verdict is the final commitment based on what the model just wrote. Field descriptions are tightened to encode the expectation: body says "treat as scratchpad", verdict says "set this only after writing the body and verdict_descriptive above". A test pins the required-array order so future refactors don't silently regress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: order submit_review fields so verdict is committed last
All checks were successful
ci / check (pull_request) Successful in 44s
318bf9a2b5
The submit_review tool's required-field order was [verdict,
verdict_descriptive, complexity, body], which means tool-call streaming
emitted the verdict first and the body last. The model effectively
committed to a verdict before working through the diff, then wrote
the body as post-hoc justification. We saw this in real reviews
where Opus's body walked back its headline concern mid-paragraph but
couldn't update the already-streamed Flag verdict.

Reorder required to [body, complexity, verdict_descriptive, verdict].
The body becomes the model's scratchpad, complexity is an
intermediate observation, verdict_descriptive is a one-sentence
synthesis, and verdict is the final commitment based on what the
model just wrote. Field descriptions are tightened to encode the
expectation: body says "treat as scratchpad", verdict says "set this
only after writing the body and verdict_descriptive above".

A test pins the required-array order so future refactors don't
silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id r-f9a7ae-f692e1
This is an AI-generated review and may contain mistakes.

Status: Completed


Verdict: LGTM — The changes reorder tool schema fields to improve AI reasoning flow and add protective tests - no security or logic issues detected.
Complexity: moderate

Reviewing this diff that reorders the submit_review tool's required fields to fix a calibration issue...

The core change is moving body first and verdict last in the required field order. The PR description explains this addresses a streaming issue where the model would commit to a verdict before reasoning through the problem, then write post-hoc justification that sometimes contradicted the already-emitted verdict.

Looking at the changes:

  1. Field reordering: ["verdict", "verdict_descriptive", "complexity", "body"]["body", "complexity", "verdict_descriptive", "verdict"] - This puts reasoning before conclusion, which makes sense for AI tool calling.

  2. Enhanced descriptions: The field descriptions are updated to be more explicit about the intended workflow:

    • body now says "Write this first" and "Treat this field as your scratchpad"
    • verdict now says "Final commitment — set this only after writing the body and verdict_descriptive above"
    • verdict_descriptive clarifies it should track what was written in the body
  3. Test coverage: A new test locks in the required field order to prevent silent regression during refactoring.

The change is logical - it aligns the tool schema with the intended reasoning flow. The descriptions provide clear guidance to the model about the expected sequence. The test ensures this critical ordering won't be accidentally broken.

No security issues, logic errors, or breaking changes. This is a thoughtful fix to improve AI model calibration by matching tool field order to reasoning flow.

*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id `r-f9a7ae-f692e1`* *This is an AI-generated review and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: LGTM** — The changes reorder tool schema fields to improve AI reasoning flow and add protective tests - no security or logic issues detected. **Complexity:** moderate Reviewing this diff that reorders the submit_review tool's required fields to fix a calibration issue... The core change is moving `body` first and `verdict` last in the required field order. The PR description explains this addresses a streaming issue where the model would commit to a verdict before reasoning through the problem, then write post-hoc justification that sometimes contradicted the already-emitted verdict. Looking at the changes: 1. **Field reordering**: `["verdict", "verdict_descriptive", "complexity", "body"]` → `["body", "complexity", "verdict_descriptive", "verdict"]` - This puts reasoning before conclusion, which makes sense for AI tool calling. 2. **Enhanced descriptions**: The field descriptions are updated to be more explicit about the intended workflow: - `body` now says "Write this first" and "Treat this field as your scratchpad" - `verdict` now says "Final commitment — set this only after writing the body and verdict_descriptive above" - `verdict_descriptive` clarifies it should track what was written in the body 3. **Test coverage**: A new test locks in the required field order to prevent silent regression during refactoring. The change is logical - it aligns the tool schema with the intended reasoning flow. The descriptions provide clear guidance to the model about the expected sequence. The test ensures this critical ordering won't be accidentally broken. No security issues, logic errors, or breaking changes. This is a thoughtful fix to improve AI model calibration by matching tool field order to reasoning flow.
The submit_review tool's required-field order was [verdict,
verdict_descriptive, complexity, body], which means tool-call streaming
emitted the verdict first and the body last. The model effectively
committed to a verdict before working through the diff, then wrote
the body as post-hoc justification. We saw this in real reviews
where Opus's body walked back its headline concern mid-paragraph but
couldn't update the already-streamed Flag verdict.

Reorder required to [body, complexity, verdict_descriptive, verdict].
The body becomes the model's scratchpad, complexity is an
intermediate observation, verdict_descriptive is a one-sentence
synthesis, and verdict is the final commitment based on what the
model just wrote. Field descriptions are tightened to encode the
expectation: body says "treat as scratchpad", verdict says "set this
only after writing the body and verdict_descriptive above".

A test pins the required-array order so future refactors don't
silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id r-f9a92a-dbf783
This is an AI-generated review and may contain mistakes.

Status: Completed


Verdict: LGTM — The field reordering fix addresses a real streaming issue, test fixtures contain intentional fake credentials for testing, and the architectural changes look sound.
Complexity: complex

Tool Schema Reordering Review

This PR reorganizes the submit_review tool schema to put body first and verdict last, addressing the streaming problem where the model was committing to verdicts before reasoning through the diff. I'll work through the key changes:

Field Reordering Changes

Looking at internal/reviewer/tools.go, the required fields order changed from:

  • OLD: ["verdict", "verdict_descriptive", "complexity", "body"]
  • NEW: ["body", "verdict_descriptive", "verdict"]

This is a sensible fix - putting body first lets the model write out its reasoning before committing to a verdict. The test in tools_test.go locks in this ordering to prevent future regressions.

Complexity Field Migration

The complexity field has been moved entirely to a new preflight pass system:

  • Removed from main review tools (ToolArgs struct no longer has Complexity)
  • Added new PreflightRequest/PreflightResult types
  • New submit_preflight tool handles complexity + observations

This architectural change makes sense - complexity estimation can happen on a cheaper model before the full review.

Database Schema Addition

New migration adds an observations JSONB column to store preflight findings. The migration looks correct with proper up/down scripts.

Security Observations

The automated scanner flagged potential secrets in several test fixture files:

  • internal/reviewer/testdata/fixtures/02-security-hardcoded-token/diff.patch contains what looks like a Forgejo token
  • Similar flags in other fixture files

However, examining these files shows they're intentionally crafted test fixtures designed to test the reviewer's ability to detect hardcoded secrets. The token fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy in the security fixture is clearly a test case, not a real credential.

Architecture Changes

The diff shows a significant refactor from Python to Go, but this appears to be part of the broader system rewrite. The new fixture test system in fixtures_test.go provides good regression testing for the LLM's ability to catch obvious bugs.

Configuration Changes

New environment variables added (PREFLIGHT_MODEL, DASHBOARD_BASE_URL) with sensible defaults. The preflight model defaults to a Haiku-class model for cost efficiency.

Potential Issues

  1. The long line flagged in tools.go (500+ chars) is just a detailed comment explaining the tool schema - not a code issue
  2. Version "1.26" for Go in the workflow file might seem high, but Go versions can legitimately be that recent (my training cutoff could be older than the latest releases)
  3. Some test files lack corresponding tests, but this appears to be a broader codebase pattern rather than a regression

The core change - reordering tool fields to put reasoning before verdict - directly addresses the stated streaming problem. The model will now write its analysis in the body field first, then synthesize that into verdict_descriptive, then commit to the final verdict. This should eliminate the issue where verdicts were locked in before the reasoning was complete.

*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id `r-f9a92a-dbf783`* *This is an AI-generated review and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: LGTM** — The field reordering fix addresses a real streaming issue, test fixtures contain intentional fake credentials for testing, and the architectural changes look sound. **Complexity:** complex # Tool Schema Reordering Review This PR reorganizes the `submit_review` tool schema to put `body` first and `verdict` last, addressing the streaming problem where the model was committing to verdicts before reasoning through the diff. I'll work through the key changes: ## Field Reordering Changes Looking at `internal/reviewer/tools.go`, the required fields order changed from: - OLD: `["verdict", "verdict_descriptive", "complexity", "body"]` - NEW: `["body", "verdict_descriptive", "verdict"]` This is a sensible fix - putting `body` first lets the model write out its reasoning before committing to a verdict. The test in `tools_test.go` locks in this ordering to prevent future regressions. ## Complexity Field Migration The `complexity` field has been moved entirely to a new preflight pass system: - Removed from main review tools (`ToolArgs` struct no longer has `Complexity`) - Added new `PreflightRequest`/`PreflightResult` types - New `submit_preflight` tool handles complexity + observations This architectural change makes sense - complexity estimation can happen on a cheaper model before the full review. ## Database Schema Addition New migration adds an `observations` JSONB column to store preflight findings. The migration looks correct with proper up/down scripts. ## Security Observations The automated scanner flagged potential secrets in several test fixture files: - `internal/reviewer/testdata/fixtures/02-security-hardcoded-token/diff.patch` contains what looks like a Forgejo token - Similar flags in other fixture files However, examining these files shows they're **intentionally crafted test fixtures** designed to test the reviewer's ability to detect hardcoded secrets. The token `fjt_2bSk9aRpLnQwVxYzM7HvBcDeFgT3kJq8xPmW1zRy` in the security fixture is clearly a test case, not a real credential. ## Architecture Changes The diff shows a significant refactor from Python to Go, but this appears to be part of the broader system rewrite. The new fixture test system in `fixtures_test.go` provides good regression testing for the LLM's ability to catch obvious bugs. ## Configuration Changes New environment variables added (`PREFLIGHT_MODEL`, `DASHBOARD_BASE_URL`) with sensible defaults. The preflight model defaults to a Haiku-class model for cost efficiency. ## Potential Issues 1. The long line flagged in `tools.go` (500+ chars) is just a detailed comment explaining the tool schema - not a code issue 2. Version "1.26" for Go in the workflow file might seem high, but Go versions can legitimately be that recent (my training cutoff could be older than the latest releases) 3. Some test files lack corresponding tests, but this appears to be a broader codebase pattern rather than a regression The core change - reordering tool fields to put reasoning before verdict - directly addresses the stated streaming problem. The model will now write its analysis in the `body` field first, then synthesize that into `verdict_descriptive`, then commit to the final `verdict`. This should eliminate the issue where verdicts were locked in before the reasoning was complete.
fix: remove misplaced fields
Some checks failed
ci / check (pull_request) Failing after 43s
0079051d10
Collaborator

Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id r-f9aab3-e11a6d
This is an AI-genrated review and may contain mistakes.

Status: Completed


Verdict: LGTM — Internal tool schema reorder with test coverage to improve model reasoning flow - no safety concerns detected.
Complexity: trivial

Safety Check Review

This PR reorders the fields in a tool schema to improve model behavior during streaming tool calls. The changes are:

  1. Tool schema reordering: The required field order in toolArgsSchema is changed from [verdict, verdict_descriptive, complexity, body] to [body, complexity, verdict_descriptive, verdict]. This ensures the model writes its reasoning (body) before committing to a verdict.

  2. Description cleanup: Removes "Set this last, after working through observations" from the complexity field description since it's no longer the last field.

  3. Test coverage: Adds a test that locks in the required field order to prevent regression.

The change addresses a real problem where streaming tool calls would emit the verdict first, causing the model to commit to a decision before fully analyzing the diff. This is a logical improvement to the tool design.

Security/Safety considerations:

  • No security vulnerabilities introduced
  • No secrets or sensitive data exposed
  • No breaking API changes (internal refactor)
  • The test ensures the fix won't regress silently

Code correctness:

  • The diff shows proper Go syntax
  • Test structure follows standard Go testing patterns
  • Field reordering is internally consistent
  • No logic errors detected

This is a sensible improvement to model behavior with appropriate test coverage.

*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id `r-f9aab3-e11a6d`* *This is an AI-genrated review and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: LGTM** — Internal tool schema reorder with test coverage to improve model reasoning flow - no safety concerns detected. **Complexity:** trivial # Safety Check Review This PR reorders the fields in a tool schema to improve model behavior during streaming tool calls. The changes are: 1. **Tool schema reordering**: The `required` field order in `toolArgsSchema` is changed from `[verdict, verdict_descriptive, complexity, body]` to `[body, complexity, verdict_descriptive, verdict]`. This ensures the model writes its reasoning (`body`) before committing to a verdict. 2. **Description cleanup**: Removes "Set this last, after working through observations" from the complexity field description since it's no longer the last field. 3. **Test coverage**: Adds a test that locks in the required field order to prevent regression. The change addresses a real problem where streaming tool calls would emit the verdict first, causing the model to commit to a decision before fully analyzing the diff. This is a logical improvement to the tool design. **Security/Safety considerations:** - No security vulnerabilities introduced - No secrets or sensitive data exposed - No breaking API changes (internal refactor) - The test ensures the fix won't regress silently **Code correctness:** - The diff shows proper Go syntax - Test structure follows standard Go testing patterns - Field reordering is internally consistent - No logic errors detected This is a sensible improvement to model behavior with appropriate test coverage.
fix(tests): update expectations
All checks were successful
ci / check (pull_request) Successful in 43s
cb4b32d68e
Collaborator

Automated review by pr-reviewer v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id r-f9ab5a-3d53fa
This is an AI-generated review and may contain mistakes.

Status: Completed


Verdict: LGTM — The changes logically reorder tool parameters to improve AI review quality with no security concerns or breaking changes.
Complexity: trivial

This PR addresses a tool design issue where the submit_review function's field ordering was causing the AI model to commit to a verdict before fully analyzing the code. The change reorders required fields from [verdict, verdict_descriptive, complexity, body] to [body, complexity, verdict_descriptive, verdict].

Looking at the diff:

  1. tools.go: Removes instruction text from complexity field description that said "Set this last, after working through observations" - this makes sense since complexity is no longer last in the order.

  2. tools_test.go: Adds a new test that locks in the required field order as [body, verdict_descriptive, verdict]. The test checks both TierQuick and TierFull schemas.

The PR description explains this is fixing a real problem where streaming tool calls were emitting verdicts before the analysis body, leading to inconsistent reviews where the body would contradict the already-committed verdict.

This is a logical fix that should improve the quality of automated reviews by ensuring the AI works through its analysis before committing to a decision. No security issues, no accidental commits, no breaking changes - just a sensible reordering of tool parameters.

*Automated review by [pr-reviewer](https://git.brooktrails.org/brooktrails/pr-reviewer) v0.28.0 | Safety Check | Claude Sonnet 4 | tracking id `r-f9ab5a-3d53fa`* *This is an AI-generated review and may contain mistakes.* **Status:** ✅ Completed --- **✅ Verdict: LGTM** — The changes logically reorder tool parameters to improve AI review quality with no security concerns or breaking changes. **Complexity:** trivial This PR addresses a tool design issue where the submit_review function's field ordering was causing the AI model to commit to a verdict before fully analyzing the code. The change reorders required fields from [verdict, verdict_descriptive, complexity, body] to [body, complexity, verdict_descriptive, verdict]. Looking at the diff: 1. **tools.go**: Removes instruction text from complexity field description that said "Set this last, after working through observations" - this makes sense since complexity is no longer last in the order. 2. **tools_test.go**: Adds a new test that locks in the required field order as [body, verdict_descriptive, verdict]. The test checks both TierQuick and TierFull schemas. The PR description explains this is fixing a real problem where streaming tool calls were emitting verdicts before the analysis body, leading to inconsistent reviews where the body would contradict the already-committed verdict. This is a logical fix that should improve the quality of automated reviews by ensuring the AI works through its analysis before committing to a decision. No security issues, no accidental commits, no breaking changes - just a sensible reordering of tool parameters.
rcsheets deleted branch fix/tool-schema-verdict-last 2026-05-05 08:34:46 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
brooktrails/pr-reviewer!61
No description provided.