feat: structured verdict output via tool use #9
Reference
brooktrails/pr-reviewer!9
Replaced free-text verdict parsing with structured output using tool
use (Anthropic) / function calling (vLLM). The LLM now returns three
separate fields:

- verdict: the machine-readable verdict
- verdict_descriptive: a short human-readable summary of the verdict
- body: the full review text

This guarantees a verdict on every review and enables reliable tracking
of false positives. "missing" is recorded if the structured call fails.
Reviews now display the verdict prominently above the review body.
Dashboard shows verdict_descriptive on hover over the verdict badge.
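A minimal sketch of the three-field result as a dataclass. The `ReviewResult` name appears in this PR, but the exact field definitions below are inferred from this description, not copied from the code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewResult:
    # Machine-readable verdict; "missing" is recorded when the
    # structured call fails, so every review carries a verdict.
    verdict: str = "missing"
    # Short human-readable summary, shown on hover in the dashboard.
    verdict_descriptive: Optional[str] = None
    # Full review text, posted as the review body.
    body: str = ""
```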
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automated review by pr-reviewer v0.10.0 | Safety Check | anthropic | tracking id
r-bdfd50-698faa
Status: ✅ Completed
Looking at this PR, I can see it's refactoring the LLM review system to use structured output (tool use/function calling) instead of parsing verdicts from free text.
The changes look technically sound.
I don't see any obvious bugs, security issues, or breaking changes. The database migrations are additive-only, the API changes are internal, and the structured approach should be more reliable than regex parsing.
Verdict: LGTM
Automated review by pr-reviewer v0.10.0 | Full Review | anthropic | tracking id
r-bdfd99-8c6926
Status: ✅ Completed
This is a well-structured PR that implements structured verdict output for the code review bot. Let me provide a detailed review:
Overall Assessment
The changes successfully replace free-text verdict parsing with structured output using tool use (Anthropic) and function calling (vLLM). This is a solid architectural improvement that guarantees verdict extraction and enables better tracking.
Strengths
Detailed Review
reviewer.py (Lines 66, 91)
Good: The system prompts are updated to use "Submit your review using the `submit_review` tool" instead of asking for formatted verdicts. This aligns with the new structured approach.

reviewer.py (Lines 93-177)
Good: The tool schemas are well-defined with appropriate enums for each tier:
- quick scan: `["lgtm", "flag"]`
- full review: `["approve", "request_changes"]`

The separation between Anthropic tools and OpenAI functions is clean.
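A hedged sketch of what such schemas might look like. The `submit_review` name, the three fields, and the two enum pairs come from this PR; the tier labels and wrapper shapes below are assumptions:

```python
def submit_review_tool(tier: str) -> dict:
    """Anthropic-style tool schema with a per-tier verdict enum."""
    verdicts = ["lgtm", "flag"] if tier == "quick_scan" else ["approve", "request_changes"]
    return {
        "name": "submit_review",
        "description": "Submit the review verdict and body.",
        "input_schema": {
            "type": "object",
            "properties": {
                "verdict": {"type": "string", "enum": verdicts},
                "verdict_descriptive": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["verdict", "verdict_descriptive", "body"],
        },
    }

def as_openai_function(tool: dict) -> dict:
    """The same schema re-wrapped in the OpenAI/vLLM function-calling shape."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }
```

Keeping one schema and re-wrapping it for the second backend avoids the two tool definitions drifting apart.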
reviewer.py (Lines 311-394)
Good: Both backend implementations properly handle structured output.
Good: Fallback handling when tool calls fail, with appropriate logging.
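A minimal sketch of the fallback pattern described here, written against plain dicts rather than the PR's actual backend code (block shapes follow the Anthropic Messages API; the helper itself is hypothetical):

```python
def extract_verdict(content_blocks: list) -> dict:
    """Pull structured fields from a response's content blocks.

    Falls back to verdict="missing", with any text content as the body,
    when no submit_review tool_use block is present.
    """
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "submit_review":
            inp = block.get("input", {})
            return {
                "verdict": inp.get("verdict", "missing"),
                "verdict_descriptive": inp.get("verdict_descriptive"),
                "body": inp.get("body", ""),
            }
    # No tool call: record "missing" but keep whatever text came back.
    text = " ".join(b.get("text", "") for b in content_blocks if b.get("type") == "text")
    return {"verdict": "missing", "verdict_descriptive": None, "body": text}
```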
tracker.py (Lines 29-30, 56-57, 70-71)
Good: Database schema properly extended with verdict fields and migration handled correctly.
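For illustration, an additive-only migration guard can be sketched as follows. The reviews say the PR uses `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`; this sketch assumes a SQLite store, which lacks that clause, so it checks `PRAGMA table_info` first (table and column names here are invented):

```python
import sqlite3

def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> None:
    """Add a column only if it is not already present (safe to re-run)."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
add_column_if_missing(conn, "events", "verdict", "TEXT DEFAULT 'missing'")
add_column_if_missing(conn, "events", "verdict", "TEXT")  # second run is a no-op
```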
tracker.py (Lines 140-150)
Good: INSERT/UPDATE statement properly includes new verdict fields.
main.py (Lines 131-157)
Good: Verdict formatting with emoji and descriptive text is well-implemented.
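A hedged sketch of verdict formatting of this kind. The emoji mapping and function names are assumptions, not the PR's actual `main.py` code:

```python
from typing import Optional

# Hypothetical emoji per verdict; "missing" gets a neutral marker.
_VERDICT_EMOJI = {
    "lgtm": "✅",
    "approve": "✅",
    "flag": "🚩",
    "request_changes": "❌",
    "missing": "❓",
}

def format_verdict_header(verdict: str, descriptive: Optional[str]) -> str:
    """Build a prominent one-line header like '✅ **Verdict: Approve** (reason)'."""
    emoji = _VERDICT_EMOJI.get(verdict, "❓")
    label = verdict.replace("_", " ").title()
    header = f"{emoji} **Verdict: {label}**"
    if descriptive:
        header += f" ({descriptive})"
    return header

def prepend_verdict(body: str, verdict: str, descriptive: Optional[str]) -> str:
    """Place the verdict line above the review body so it is seen first."""
    return f"{format_verdict_header(verdict, descriptive)}\n\n{body}"
```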
main.py (Lines 532-537)
Good: Verdict fields are properly extracted from ReviewResult and stored in the event.
main.py (Lines 540-542)
Good: Verdict display is prepended to the review body, making it prominent.
dashboard.py (Lines 33-39)
Good: Verdict badges with hover tooltips provide good UX. CSS classes are appropriately styled.
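A minimal sketch of a badge renderer with a hover tooltip, escaping attribute values with the standard library's `html.escape` (the CSS class names are invented; the PR's actual markup may differ):

```python
import html

def verdict_badge(verdict: str, verdict_descriptive: str) -> str:
    """Render a badge <span> whose title attribute shows the descriptive
    text on hover. html.escape (quote=True by default) keeps quotes in
    the descriptive text from breaking out of the attribute."""
    css = html.escape(verdict)
    tip = html.escape(verdict_descriptive)
    return f'<span class="verdict-badge verdict-{css}" title="{tip}">{css}</span>'
```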
Minor Issues
dashboard.py (Line 37)
Potential Issue: HTML attribute values should be properly escaped. The current string interpolation could break if `verdict_descriptive` contains single quotes. Consider using `html.escape()`.

reviewer.py (Line 379)
Minor: The `import json` is inside a function. While functional, it's more conventional to put imports at the top of the file.

tracker.py (Lines 184, 215)
Inconsistency: The database queries in `find_latest_full_review` and `find_recent_quick_scan` don't include the new verdict columns in their SELECT statements, but try to access them in the ReviewEvent constructor (rows[15], rows[16]). This will cause an IndexError. Fix needed: add the verdict columns to both SELECT statements so the row indices match the constructor.
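One way to prevent this class of drift is a single column list shared by the query and the row mapping. A hedged sketch with invented names (the PR's real schema and constructor differ):

```python
# Hypothetical single source of truth for the event columns: both the
# SELECT list and the row mapping are built from it, so adding a column
# updates them together and the positional indices cannot drift apart.
EVENT_COLUMNS = [
    "id", "pr_number", "tier", "status",  # existing columns (illustrative)
    "verdict", "verdict_descriptive",     # new columns from this PR
]

SELECT_EVENT = f"SELECT {', '.join(EVENT_COLUMNS)} FROM events"

def row_to_event(row: tuple) -> dict:
    """Pair each column name with its value by position."""
    return dict(zip(EVENT_COLUMNS, row))
```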
tracker.py (Line 380)
Similar Issue: The `get_events` function also needs to include verdict columns in its SELECT statement.

Security & Performance
Good: No security issues identified. The structured output approach actually improves security by eliminating the need to parse potentially malicious LLM output.
Good: Performance impact is minimal - the database schema changes are properly indexed.
Testing Considerations
The PR doesn't include tests, but the changes affect critical paths and deserve test coverage.
Breaking Changes
Minor: The database schema changes require migration, but it's handled properly with `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`.

Verdict: Request Changes
The main issues are the database query inconsistencies in `tracker.py` that will cause runtime errors when accessing verdict fields from historical data. Once those SELECT statements are updated to include the new columns, this will be ready to merge.

Automated review by pr-reviewer v0.10.0 | Full Review | anthropic | tracking id
r-bdfef1-c47028
Status: ✅ Completed
This is a well-structured PR that replaces free-text verdict parsing with structured output using tool use/function calling. The changes look good overall and represent a significant improvement in reliability. Here's my detailed review:
Architecture and Design ✅
The move from parsing free-text verdicts to structured output is excellent. This eliminates parsing errors and guarantees a verdict on every review. The three-field structure (verdict, verdict_descriptive, body) is well-designed and provides both machine-readable and human-readable information.
The tool/function schemas in `reviewer.py` are properly defined with appropriate verdict enums for each tier.

Implementation Quality ✅
reviewer.py (lines 312-400): The backend implementations properly handle both Anthropic tool use and OpenAI/vLLM function calling. Good error handling with fallback to message content if tool parsing fails, and appropriate logging when tool use doesn't work as expected.
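On the OpenAI/vLLM side, function arguments arrive as a JSON string, so the defensive decoding described here might be sketched like this (against plain dicts, with the same fallback-to-content behavior; the helper name is hypothetical):

```python
import json

def extract_verdict_openai(message: dict) -> dict:
    """Parse an OpenAI-style chat message for the submit_review call.

    Decodes the JSON arguments defensively and falls back to the plain
    message content with verdict="missing" on any failure.
    """
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") == "submit_review":
            try:
                args = json.loads(fn.get("arguments", ""))
            except json.JSONDecodeError:
                break  # malformed arguments: use the fallback below
            return {
                "verdict": args.get("verdict", "missing"),
                "verdict_descriptive": args.get("verdict_descriptive"),
                "body": args.get("body", ""),
            }
    return {"verdict": "missing", "verdict_descriptive": None,
            "body": message.get("content") or ""}
```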
main.py (lines 529-540): Proper integration of the new verdict fields into the event tracking, with appropriate logging when verdicts are missing.
tracker.py: Clean database schema evolution with proper migration handling. The new columns are added safely with `IF NOT EXISTS` and appropriate defaults.

Database Changes ✅
The database migrations in `tracker.py` (lines 70-75) are handled correctly, using `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` for safety.

UI/UX Improvements ✅
dashboard.py (lines 33-39): The verdict badge display is well-implemented.
main.py (lines 148-155): The verdict formatting for review comments is clean and prominent, placing the verdict at the top of the review body where users will see it immediately.
Error Handling ✅
Good fallback behavior throughout.
Potential Concerns
reviewer.py (line 325): The tool_choice forcing could potentially cause issues if the LLM refuses to use the tool for some reason. However, this is likely rare and the fallback handling should catch it.
Performance: The structured output approach should actually be more reliable than regex parsing, so this is an improvement.
Testing Considerations
The changes maintain backward compatibility well - existing events without verdicts will display correctly. The migration handles existing data appropriately.
Documentation ✅
The code comments and docstrings have been updated to reflect the new structured approach. The module-level documentation in `reviewer.py` clearly explains the new architecture.

Security ✅
Proper HTML escaping in the dashboard prevents XSS from verdict_descriptive content.
Minor Observations
Verdict: Approve

This is a solid architectural improvement that eliminates a common source of parsing errors while adding useful structured data. The implementation is clean, handles migration carefully, and maintains backward compatibility.