brooktrails/pr-reviewer

feat: pre-LLM analysis engine with diff stats, anomaly detection, and repo context #16

Merged

rcsheets merged 2 commits from feat/diff-analysis into main

2026-03-22 08:04:05 +00:00

rcsheets commented

2026-03-22 07:48:11 +00:00

Owner

Adds an analysis layer that runs before the LLM call:

Diff analysis (no API calls):

- Line counts, file types, new/deleted files
- Test file detection (are tests included in this change?)
- Sensitive path classification (CI, auth, migrations, deps, k8s)
- Shannon entropy per file (detects encoded data, minified code, secrets)
- Secret/credential pattern matching (AWS keys, GitHub tokens, etc)
- Large new file warnings (vendored/generated code)
- Very long line detection
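
As an illustration of the per-file entropy metric listed above, here is a minimal sketch using only the standard library. The function name `shannon_entropy` is an assumption for this example; the PR's actual implementation is not shown on this page.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string with a single repeated character scores 0 bits/char, while a string that uses 64 distinct characters equally often scores 6 bits/char, which is why encoded blobs stand out from ordinary source code.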

Repo analysis (Forgejo API):

- Author PR history (new contributor vs established)
- Test coverage check (do changed source files have corresponding tests?)
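
One way the "corresponding tests" check could work is by matching file names against common test naming conventions. The sketch below is an assumption, not the PR's code: the helper names and the exact conventions covered are hypothetical.

```python
from pathlib import PurePosixPath


def candidate_test_names(source_path: str) -> set[str]:
    """Return test file names that would conventionally cover `source_path`.

    Hypothetical helper; the conventions below are common ones, not
    necessarily the ones the PR uses.
    """
    p = PurePosixPath(source_path)
    stem, suffix = p.stem, p.suffix
    if suffix == ".py":
        return {f"test_{stem}.py", f"{stem}_test.py"}
    if suffix == ".go":
        return {f"{stem}_test.go"}
    if suffix in {".js", ".jsx", ".ts", ".tsx"}:
        return {f"{stem}.test{suffix}", f"{stem}.spec{suffix}"}
    return set()


def has_matching_test(source_path: str, repo_files: list[str]) -> bool:
    """True if any file in the repo has a conventional test name for the source."""
    wanted = candidate_test_names(source_path)
    return any(PurePosixPath(f).name in wanted for f in repo_files)
```

Matching on the file name alone ignores directory layout, so it can report a test that actually belongs to a different module with the same name.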

Diff filtering:

- High-entropy file diffs replaced with a note before sending to LLM,
  saving tokens without losing information
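
The filtering step might look roughly like the sketch below, which splits a unified diff per file and swaps any high-entropy section for a one-line note. The function name, the note wording, and the choice to measure only added lines are all assumptions; the 5.5 bits/char default is taken from the bot review further down this page.

```python
import math
import re
from collections import Counter

ENTROPY_THRESHOLD = 5.5  # bits/char; value quoted in the review, used here as a default


def _entropy(text: str) -> float:
    """Shannon entropy in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def filter_high_entropy(diff_text: str, threshold: float = ENTROPY_THRESHOLD) -> str:
    """Replace per-file diff sections whose added lines exceed `threshold`."""
    # Split immediately before each "diff --git" header so the header stays
    # attached to its own section.
    sections = re.split(r"^(?=diff --git )", diff_text, flags=re.MULTILINE)
    out = []
    for section in sections:
        if not section.startswith("diff --git "):
            out.append(section)  # preamble or empty leading chunk
            continue
        lines = section.splitlines()
        added = "".join(
            line[1:] for line in lines
            if line.startswith("+") and not line.startswith("+++")
        )
        score = _entropy(added)
        if score > threshold:
            out.append(
                f"{lines[0]}\n[diff omitted: high-entropy content, "
                f"{score:.2f} bits/char]\n"
            )
        else:
            out.append(section)
    return "".join(out)
```

Keeping the `diff --git` header means the LLM still sees that the file changed, even though the body is withheld.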

Complexity categorization:

- LLM now returns trivial/moderate/complex via structured output
- Stored in DB and shown in review comments
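
Because the rating comes back from the model, it is worth validating before it is stored. The following sketch assumes the structured output is a JSON object with a `complexity` field; that field name, the enum, and the parser are hypothetical and not taken from the PR.

```python
import json
from enum import Enum
from typing import Optional


class Complexity(str, Enum):
    """The three ratings named in the PR description."""
    TRIVIAL = "trivial"
    MODERATE = "moderate"
    COMPLEX = "complex"


def parse_complexity(raw_json: str) -> Optional[Complexity]:
    """Return the validated rating, or None if the output is malformed.

    Assumes a JSON object with a "complexity" key (hypothetical schema).
    """
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    value = str(payload.get("complexity", "")).strip().lower()
    try:
        return Complexity(value)
    except ValueError:
        return None
```

Returning `None` for anything outside the three allowed values lets the caller store a NULL rather than an arbitrary string.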

All analysis results are included in the LLM prompt as structured
context under "Diff statistics", "Repository context", and
"Automated observations (non-AI)" headings.
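
A rough sketch of how the three headings could be assembled into prompt context follows. Only the heading titles come from the description above; the function name, argument shapes, and markdown-style layout are assumptions.

```python
def build_context(
    diff_stats: dict,
    repo_context: dict,
    observations: list[str],
) -> str:
    """Render analysis results as three headed sections (illustrative layout)."""

    def section(title: str, lines: list[str]) -> str:
        # Fall back to an explicit marker so an empty section is still visible.
        body = "\n".join(f"- {line}" for line in lines) or "- (none)"
        return f"## {title}\n{body}"

    return "\n\n".join([
        section("Diff statistics", [f"{k}: {v}" for k, v in diff_stats.items()]),
        section("Repository context", [f"{k}: {v}" for k, v in repo_context.items()]),
        section("Automated observations (non-AI)", observations),
    ])
```

Labeling the last section "(non-AI)" tells the model that those observations came from deterministic checks rather than from an earlier model pass.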

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rcsheets added 1 commit

2026-03-22 07:48:11 +00:00

feat: pre-LLM analysis engine with diff stats, anomaly detection, and repo context ef2fffb582

pr-reviewer-bot commented

2026-03-22 07:48:14 +00:00

Collaborator

Automated review by pr-reviewer v0.14.1 | Safety Check | anthropic | tracking id r-bf9ebd-6add21
This is an AI-generated review and may contain mistakes.

Status: ✅ Completed

✅ Verdict: LGTM — The pre-LLM analysis engine adds valuable features like anomaly detection and test coverage checking without introducing any obvious security vulnerabilities or breaking changes.

Pre-LLM Analysis Engine Review

This PR adds a comprehensive analysis layer that runs before LLM calls, providing structured context about diffs and repository information. The implementation looks solid overall.

Key Observations

✅ Security: No exposed credentials or secrets detected. The secret detection patterns in _SECRET_PATTERNS are appropriate for catching common credential formats.

✅ Architecture: Clean separation between diff analysis (local text processing) and repo analysis (API calls). The filtering mechanism for high-entropy content is a smart token-saving optimization.

✅ Database: Schema migration is handled properly with ALTER TABLE ... ADD COLUMN IF NOT EXISTS for the new complexity column.

✅ Error Handling: HTTP requests are wrapped in try/except blocks with timeouts, and database operations handle exceptions appropriately.

✅ Backwards Compatibility: Changes to existing APIs maintain compatibility by adding optional parameters and providing sensible defaults.

Minor Notes

- The entropy threshold of 5.5 bits/char and long line threshold of 500 chars seem reasonable for detecting encoded content
- Test file detection patterns cover the major ecosystems (Go, Python, JS/TS)
- The Co-Authored-By attribution to "Claude Opus 4.6" is unusual but not problematic
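
To put the two thresholds in context: the entropy of a string can never exceed log2 of its alphabet size, so a 5.5 bits/char cutoff can be reached by base64-like data (64 symbols, at most 6 bits) but never by hex-encoded data (16 symbols, at most 4 bits). The sketch below shows that bound alongside a simple long-line check; both function names are hypothetical.

```python
import math

LONG_LINE_THRESHOLD = 500  # characters; value quoted in the review


def max_entropy_bits(alphabet_size: int) -> float:
    """Upper bound on bits/char for text drawn from `alphabet_size` symbols."""
    return math.log2(alphabet_size)


def find_long_lines(
    text: str, threshold: int = LONG_LINE_THRESHOLD
) -> list[tuple[int, int]]:
    """Return (1-based line number, length) for lines longer than `threshold`."""
    return [
        (lineno, len(line))
        for lineno, line in enumerate(text.splitlines(), start=1)
        if len(line) > threshold
    ]
```

One consequence is that a hex-encoded secret would have to be caught by the pattern matcher or the long-line check, because the entropy check alone cannot flag it at this cutoff.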

The implementation appears well-tested and production-ready.

😆 1

rcsheets added 1 commit

2026-03-22 07:57:45 +00:00

doc: explain why entropy is computed inline rather than imported 729c2f751f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pr-reviewer-bot commented

2026-03-22 07:57:47 +00:00

Collaborator

Automated review by pr-reviewer v0.14.1 | Safety Check | anthropic | tracking id r-bfa0fb-939fb2
This is an AI-generated review and may contain mistakes.

Status: ✅ Completed

✅ Verdict: LGTM — Comprehensive diff analysis system with proper error handling and no obvious security issues.

Safety Check Review

This is a substantial feature addition that introduces pre-LLM analysis capabilities to the PR reviewer system. The implementation appears well-structured and safe.

Key Components Reviewed

analysis.py (new file)

- Shannon entropy calculation for detecting encoded/encrypted content
- Pattern matching for sensitive files (CI/CD, auth, secrets, etc.)
- Secret detection with common patterns (GitHub tokens, AWS keys, OpenAI keys)
- Test file detection and coverage analysis
- Repo API integration for author history
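
Sensitive-path classification of the kind listed above can be approximated with a small table of compiled patterns. The category names follow the PR description (CI, auth, migrations, deps, k8s); the table name `_SENSITIVE_PATTERNS` and every regex in it are assumptions made for this sketch.

```python
import re

# Hypothetical pattern table; the real patterns in analysis.py are not shown.
_SENSITIVE_PATTERNS = {
    "ci": re.compile(r"(^|/)(\.github/workflows|\.forgejo/workflows|\.gitlab-ci\.yml)"),
    "auth": re.compile(r"(^|/)[^/]*(auth|login|session)[^/]*$", re.IGNORECASE),
    "migrations": re.compile(r"(^|/)migrations?/"),
    "deps": re.compile(
        r"(^|/)(requirements[^/]*\.txt|pyproject\.toml|package(-lock)?\.json|go\.(mod|sum))$"
    ),
    "k8s": re.compile(r"(^|/)(k8s|kubernetes|helm)/"),
}


def classify_path(path: str) -> list[str]:
    """Return every sensitive category whose pattern matches `path`."""
    return [name for name, pattern in _SENSITIVE_PATTERNS.items() if pattern.search(path)]
```

A path can fall into more than one category, so the function returns a list rather than a single label.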

Database Schema Changes

- Adds complexity column to track LLM-assigned complexity ratings
- Proper migration with ADD COLUMN IF NOT EXISTS
- Updates all relevant queries and data structures
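
The point of `ADD COLUMN IF NOT EXISTS` is that the migration can run on every startup without failing. This page does not say which database the project uses. PostgreSQL accepts that clause directly; SQLite does not, so the sketch below emulates the same idempotent behavior with the standard-library `sqlite3` module purely so it can run standalone. The table name `reviews` and the helper name are hypothetical.

```python
import sqlite3


def add_column_if_missing(
    conn: sqlite3.Connection, table: str, column: str, decl: str
) -> bool:
    """Add `column` to `table` unless it already exists. Returns True if added.

    `table`, `column`, and `decl` are interpolated into SQL, so they must be
    trusted constants, never user input.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, verdict TEXT)")
    add_column_if_missing(conn, "reviews", "complexity", "TEXT")
    add_column_if_missing(conn, "reviews", "complexity", "TEXT")  # safe to repeat
```

On PostgreSQL the equivalent is a single statement of the form `ALTER TABLE reviews ADD COLUMN IF NOT EXISTS complexity TEXT`.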

Integration Points

- Filters high-entropy content before sending to LLM (token optimization)
- Passes structured analysis as context to LLM prompts
- Updates verdict display to include complexity ratings

Security Considerations

✅ API tokens handled properly - Uses existing settings.forgejo_token with appropriate headers
✅ Input validation - Regex patterns are reasonable and not vulnerable to ReDoS
✅ Error handling - API failures are caught and logged, don't crash the system
✅ No credential exposure - Secret detection helps find accidentally committed credentials
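
For a concrete picture of the secret detection and the ReDoS remark, here is a minimal sketch that scans only the added lines of a diff. The pattern table is an assumption and is not the PR's `_SECRET_PATTERNS`; the two regexes reflect the widely documented shapes of AWS access key IDs and classic GitHub personal access tokens. Neither uses nested quantifiers, which is what keeps matching time linear in the input length.

```python
import re

# Illustrative patterns only; not the contents of the PR's _SECRET_PATTERNS.
_EXAMPLE_SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (1-based diff line number, pattern name) for each hit.

    Only lines added by the diff are scanned; "+++" file headers are skipped.
    The matched text itself is deliberately not returned, so findings can be
    logged without repeating the secret.
    """
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in _EXAMPLE_SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Scanning only added lines avoids flagging a credential that the change is removing.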

No Alarming Issues Found

The code follows established patterns in the codebase, has appropriate error handling, and the new analysis features enhance security by detecting potential credential leaks and sensitive file changes.
