feat: structured verdict output via tool use #9

Merged
rcsheets merged 2 commits from feat/verdict-tracking into main 2026-03-21 02:16:29 +00:00
Owner

Replaced free-text verdict parsing with structured output using tool
use (Anthropic) / function calling (vLLM). The LLM now returns three
separate fields:

  • verdict: enum (lgtm/flag/approve/request_changes)
  • verdict_descriptive: one-sentence explanation of the verdict
  • body: the review text in Markdown

This guarantees a verdict on every review and enables reliable tracking
of false positives. "missing" is recorded if the structured call fails.
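For illustration, the three fields above might map onto a tool definition along these lines (a minimal sketch in the Anthropic tool-schema style; everything except the three field names and their enums is an assumption, not the actual schema in this PR):

```python
# Hypothetical sketch of a submit_review tool schema; the field names and
# enum values mirror the PR description above, the rest is illustrative.
SUBMIT_REVIEW_TOOL = {
    "name": "submit_review",
    "description": "Submit the final code review verdict and body.",
    "input_schema": {
        "type": "object",
        "properties": {
            "verdict": {
                "type": "string",
                "enum": ["lgtm", "flag", "approve", "request_changes"],
            },
            "verdict_descriptive": {
                "type": "string",
                "description": "One-sentence explanation of the verdict.",
            },
            "body": {
                "type": "string",
                "description": "The review text in Markdown.",
            },
        },
        "required": ["verdict", "verdict_descriptive", "body"],
    },
}
```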

Reviews now display the verdict prominently above the review body.
Dashboard shows verdict_descriptive on hover over the verdict badge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Collaborator

Automated review by pr-reviewer v0.10.0 | Safety Check | anthropic | tracking id r-bdfd50-698faa

Status: Completed


Looking at this PR, I can see it's refactoring the LLM review system to use structured output (tool use/function calling) instead of parsing verdicts from free text.

The changes look technically sound:

  • Adds proper tool schemas for both Anthropic and OpenAI/vLLM backends
  • Updates database schema to store verdict fields
  • Handles fallback cases when tool calls fail
  • Maintains backward compatibility with existing data

I don't see any obvious bugs, security issues, or breaking changes. The database migrations are additive-only, the API changes are internal, and the structured approach should be more reliable than regex parsing.

Verdict: LGTM

Collaborator

Automated review by pr-reviewer v0.10.0 | Full Review | anthropic | tracking id r-bdfd99-8c6926

Status: Completed


This is a well-structured PR that implements structured verdict output for the code review bot. Let me provide a detailed review:

Overall Assessment

The changes successfully replace free-text verdict parsing with structured output using tool use (Anthropic) and function calling (vLLM). This is a solid architectural improvement that guarantees verdict extraction and enables better tracking.

Strengths

  1. Robust Architecture: Using tool use/function calling eliminates the fragility of parsing verdicts from free text
  2. Comprehensive Implementation: All components (reviewer, tracker, dashboard, main) are consistently updated
  3. Good Error Handling: Fallback to "missing" verdict when structured calls fail
  4. UI Enhancement: Dashboard now shows verdicts with hover tooltips for descriptions
  5. Database Schema Evolution: Proper column additions with migrations

Detailed Review

reviewer.py (Lines 66, 91)

Good: The system prompts are updated to use Submit your review using the submit_review tool instead of asking for formatted verdicts. This aligns with the new structured approach.

reviewer.py (Lines 93-177)

Good: The tool schemas are well-defined with appropriate enums for each tier:

  • Quick scan: ["lgtm", "flag"]
  • Full review: ["approve", "request_changes"]

The separation between Anthropic tools and OpenAI functions is clean.

reviewer.py (Lines 311-394)

Good: Both backend implementations properly handle structured output:

  • Anthropic: Iterates through message content blocks to find tool_use
  • vLLM: Parses function call arguments from JSON

Good: Fallback handling when tool calls fail, with appropriate logging.
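The extraction path described above can be sketched roughly as follows (a hypothetical helper; the real reviewer.py function names and response shapes may differ):

```python
import json

def extract_verdict(response, backend):
    """Illustrative sketch: pull the structured fields out of either
    backend's response, falling back to a 'missing' verdict when no
    submit_review tool call is present."""
    if backend == "anthropic":
        # Anthropic: the tool input arrives as a `tool_use` content block.
        for block in response.content:
            if block.type == "tool_use" and block.name == "submit_review":
                return block.input
    else:
        # OpenAI/vLLM: the arguments arrive as a JSON string on the call.
        for call in response.choices[0].message.tool_calls or []:
            if call.function.name == "submit_review":
                return json.loads(call.function.arguments)
    # Structured call failed: record "missing" rather than raising.
    return {"verdict": "missing", "verdict_descriptive": "", "body": ""}
```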

tracker.py (Lines 29-30, 56-57, 70-71)

Good: Database schema properly extended with verdict fields and migration handled correctly.

tracker.py (Lines 140-150)

Good: INSERT/UPDATE statement properly includes new verdict fields.

main.py (Lines 131-157)

Good: Verdict formatting with emoji and descriptive text is well-implemented.

main.py (Lines 532-537)

Good: Verdict fields are properly extracted from ReviewResult and stored in the event.

main.py (Lines 540-542)

Good: Verdict display is prepended to the review body, making it prominent.

dashboard.py (Lines 33-39)

Good: Verdict badges with hover tooltips provide good UX. CSS classes are appropriately styled.

Minor Issues

dashboard.py (Line 37)

Potential Issue: HTML attribute values should be properly escaped. The current code:

verdict_title = f" title='{e.verdict_descriptive}'" if e.verdict_descriptive else ""

Could break if verdict_descriptive contains single quotes. Consider using html.escape():

import html
verdict_title = f" title='{html.escape(e.verdict_descriptive, quote=True)}'" if e.verdict_descriptive else ""

reviewer.py (Line 379)

Minor: The import json is inside a function. While functional, it's more conventional to put imports at the top of the file.

tracker.py (Lines 184, 215)

Inconsistency: The database queries in find_latest_full_review and find_recent_quick_scan don't include the new verdict columns in their SELECT statements, but try to access them in the ReviewEvent constructor (rows[15], rows[16]). This will cause an IndexError.

Fix needed:

# In find_latest_full_review and find_recent_quick_scan
row = conn.execute(
    """
    SELECT id, timestamp, repo, pr_number, action, tier, status, backend,
           input_tokens, output_tokens, diff_chars, error, duration_ms, 
           request_id, app_version, verdict, verdict_descriptive
    FROM review_events
    WHERE repo = %s AND pr_number = %s AND tier = 'full' AND status = 'completed'
    ORDER BY timestamp DESC LIMIT 1
    """,
    (repo, pr_number),
).fetchone()

tracker.py (Line 380)

Similar Issue: The get_events function also needs to include verdict columns in its SELECT statement.

Security & Performance

Good: No security issues identified. The structured output approach actually improves security by eliminating the need to parse potentially malicious LLM output.

Good: Performance impact is minimal - the database schema changes are properly indexed.

Testing Considerations

The PR doesn't include tests, but the changes affect critical paths. Consider testing:

  • Tool use parsing for both backends
  • Database migration with existing data
  • Dashboard rendering with verdict badges
  • Fallback behavior when LLM doesn't use tools correctly

Breaking Changes

Minor: The database schema changes require migration, but it's handled properly with ALTER TABLE ... ADD COLUMN IF NOT EXISTS.

Verdict: Request Changes

The main issues are the database query inconsistencies in tracker.py that will cause runtime errors when accessing verdict fields from historical data. Once those SELECT statements are updated to include the new columns, this will be ready to merge.

- Fix IndexError: add verdict, verdict_descriptive to all SELECT queries
- Move json import to top of reviewer.py
- HTML-escape verdict_descriptive in dashboard title attributes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

Automated review by pr-reviewer v0.10.0 | Full Review | anthropic | tracking id r-bdfef1-c47028

Status: Completed


This is a well-structured PR that replaces free-text verdict parsing with structured output using tool use/function calling. The changes look good overall and represent a significant improvement in reliability. Here's my detailed review:

Architecture and Design

The move from parsing free-text verdicts to structured output is excellent. This eliminates parsing errors and guarantees a verdict on every review. The three-field structure (verdict, verdict_descriptive, body) is well-designed and provides both machine-readable and human-readable information.

The tool/function schemas in reviewer.py are properly defined with appropriate verdict enums for each tier:

  • Quick scans: "lgtm" or "flag"
  • Full reviews: "approve" or "request_changes"

Implementation Quality

reviewer.py (lines 312-400): The backend implementations properly handle both Anthropic tool use and OpenAI/vLLM function calling. Good error handling with fallback to message content if tool parsing fails, and appropriate logging when tool use doesn't work as expected.

main.py (lines 529-540): Proper integration of the new verdict fields into the event tracking, with appropriate logging when verdicts are missing.

tracker.py: Clean database schema evolution with proper migration handling. The new columns are added safely with IF NOT EXISTS and appropriate defaults.

Database Changes

The database migrations in tracker.py (lines 70-75) are handled correctly:

  • Uses ALTER TABLE ... ADD COLUMN IF NOT EXISTS for safety
  • Appropriate default values for new columns
  • Maintains backward compatibility
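The additive migration pattern described above could be sketched like this (Postgres-style statements, matching the %s placeholders used elsewhere in the review; the statement text and helper name are illustrative, not the PR's actual tracker.py code):

```python
# Hypothetical sketch of an idempotent, additive-only migration.
MIGRATIONS = [
    "ALTER TABLE review_events ADD COLUMN IF NOT EXISTS verdict TEXT DEFAULT 'missing'",
    "ALTER TABLE review_events ADD COLUMN IF NOT EXISTS verdict_descriptive TEXT DEFAULT ''",
]

def migrate(conn):
    # Safe to re-run on every startup: IF NOT EXISTS makes each
    # statement a no-op once the column exists.
    with conn.cursor() as cur:
        for stmt in MIGRATIONS:
            cur.execute(stmt)
    conn.commit()
```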

UI/UX Improvements

dashboard.py (lines 33-39): The verdict badge display is well-implemented:

  • Proper HTML escaping of verdict_descriptive for tooltip safety
  • Good visual distinction between verdict types (green for approve/lgtm, red for flag/request_changes, yellow for missing)
  • Clean CSS styling that fits the existing design
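A rough sketch of the escaped badge rendering described above (the CSS class names and helper are illustrative, not the PR's actual dashboard.py code):

```python
import html

# Illustrative verdict-to-CSS-class mapping following the color scheme
# described above; not the PR's actual class names.
VERDICT_CLASS = {
    "approve": "badge-green", "lgtm": "badge-green",
    "flag": "badge-red", "request_changes": "badge-red",
    "missing": "badge-yellow",
}

def render_badge(verdict, descriptive):
    """Render a verdict badge with an HTML-escaped hover tooltip."""
    # quote=True escapes both " and ', so the title attribute stays safe.
    title = f' title="{html.escape(descriptive, quote=True)}"' if descriptive else ""
    css = VERDICT_CLASS.get(verdict, "badge-yellow")
    return f'<span class="{css}"{title}>{html.escape(verdict)}</span>'
```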

main.py (lines 148-155): The verdict formatting for review comments is clean and prominent, placing the verdict at the top of the review body where users will see it immediately.

Error Handling

Good fallback behavior throughout:

  • "missing" verdict recorded when structured output fails
  • Appropriate logging when verdicts are missing
  • Graceful degradation in dashboard when verdict data is unavailable

Potential Concerns

reviewer.py (line 325): The tool_choice forcing could potentially cause issues if the LLM refuses to use the tool for some reason. However, this is likely rare and the fallback handling should catch it.
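For reference, forcing a specific tool with the Anthropic Messages API looks roughly like this (a sketch of the request shape only; the helper and the model name are illustrative, not this PR's code):

```python
# Hypothetical sketch: tool_choice={"type": "tool", "name": ...} is the
# Messages API mechanism for requiring the model to call one named tool.
def build_request_kwargs(tools, prompt):
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 4096,
        "tools": tools,
        "tool_choice": {"type": "tool", "name": "submit_review"},
        "messages": [{"role": "user", "content": prompt}],
    }
```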

Performance: The structured output approach should actually be more reliable than regex parsing, so this is an improvement.

Testing Considerations

The changes maintain backward compatibility well - existing events without verdicts will display correctly. The migration handles existing data appropriately.

Documentation

The code comments and docstrings have been updated to reflect the new structured approach. The module-level documentation in reviewer.py clearly explains the new architecture.

Security

Proper HTML escaping in the dashboard prevents XSS from verdict_descriptive content.

Minor Observations

  • The Co-Authored-By credit to Claude is a nice touch showing transparency about AI assistance
  • Version handling warning in system prompts is maintained appropriately
  • The enum constraints in the tool schemas ensure only valid verdicts are returned

Verdict: Approve — This is a solid architectural improvement that eliminates a common source of parsing errors while adding useful structured data. The implementation is clean, well-tested through migration handling, and maintains backward compatibility.

rcsheets deleted branch feat/verdict-tracking 2026-03-21 02:16:29 +00:00