feat: pre-LLM analysis engine with diff stats, anomaly detection, and repo context #16

Merged
rcsheets merged 2 commits from feat/diff-analysis into main 2026-03-22 08:04:05 +00:00
Owner

Adds an analysis layer that runs before the LLM call:

Diff analysis (no API calls):

  • Line counts, file types, new/deleted files
  • Test file detection (are tests included in this change?)
  • Sensitive path classification (CI, auth, migrations, deps, k8s)
  • Shannon entropy per file (detects encoded data, minified code, secrets)
  • Secret/credential pattern matching (AWS keys, GitHub tokens, etc.)
  • Large new file warnings (vendored/generated code)
  • Very long line detection
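As a rough illustration of the entropy and long-line checks listed above, here is a minimal sketch. The function names are hypothetical, and the thresholds (5.5 bits/char, 500 chars) are taken from the review notes further down rather than from the PR's code, so treat them as assumptions.

```python
import math
from collections import Counter

# Assumed thresholds, taken from the review discussion, not from analysis.py.
ENTROPY_THRESHOLD = 5.5    # bits per character
LONG_LINE_THRESHOLD = 500  # characters


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def has_long_line(text: str, limit: int = LONG_LINE_THRESHOLD) -> bool:
    """True if any line is longer than `limit` characters."""
    return any(len(line) > limit for line in text.splitlines())


def looks_encoded(text: str) -> bool:
    """Flag content whose entropy exceeds the assumed threshold."""
    return shannon_entropy(text) > ENTROPY_THRESHOLD
```

Ordinary source code usually lands well below the threshold, while base64 blobs and compressed data tend to sit above it, which is why a single per-file number can be a cheap first filter.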

Repo analysis (Forgejo API):

  • Author PR history (new contributor vs established)
  • Test coverage check (do changed source files have corresponding tests?)
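The test coverage check above could be approximated by mapping each changed source file to the test file names it would conventionally have. This is a sketch under assumed naming conventions; `candidate_test_names` and `untested_sources` are hypothetical helpers, and the real check may use the Forgejo API to list repository files.

```python
from pathlib import PurePosixPath


def candidate_test_names(source_path: str) -> list[str]:
    """Conventional test file names for a source file (assumed conventions)."""
    p = PurePosixPath(source_path)
    stem, suffix = p.stem, p.suffix
    if suffix == ".py":
        return [f"test_{stem}.py", f"{stem}_test.py"]
    if suffix == ".go":
        return [f"{stem}_test.go"]
    if suffix in {".js", ".ts", ".jsx", ".tsx"}:
        return [f"{stem}.test{suffix}", f"{stem}.spec{suffix}"]
    return []  # unknown file type: no expectation of a test file


def untested_sources(changed: list[str], repo_files: set[str]) -> list[str]:
    """Changed source files with no conventionally named test in the repo."""
    repo_names = {PurePosixPath(f).name for f in repo_files}
    missing = []
    for path in changed:
        names = candidate_test_names(path)
        if names and not any(name in repo_names for name in names):
            missing.append(path)
    return missing
```

The sketch assumes `changed` has already had test files filtered out; otherwise a test file would be reported as lacking a test of its own.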

Diff filtering:

  • High-entropy file diffs replaced with a note before sending to LLM,
    saving tokens without losing information
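The filtering step might look roughly like the following. It is only a sketch: it assumes a git-style unified diff split on `diff --git` headers, measures entropy over added lines only, and uses a hypothetical placeholder note. The PR's actual splitting and wording are not shown in this thread.

```python
import math
import re
from collections import Counter


def _entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def filter_high_entropy(diff: str, threshold: float = 5.5) -> str:
    """Replace per-file diffs whose added content is high-entropy with a note."""
    # Zero-width split keeps each "diff --git" header attached to its body.
    chunks = re.split(r"(?m)^(?=diff --git )", diff)
    kept = []
    for chunk in chunks:
        if not chunk:
            continue
        lines = chunk.splitlines()
        header = lines[0]
        added = "".join(
            line[1:] for line in lines
            if line.startswith("+") and not line.startswith("+++")
        )
        bits = _entropy(added)
        if bits > threshold:
            # Hypothetical note format; keeps the file header so the LLM
            # still knows the file changed.
            kept.append(f"{header}\n[diff omitted: high-entropy content, {bits:.2f} bits/char]\n")
        else:
            kept.append(chunk)
    return "".join(kept)
```

Keeping the file header in place means the model still sees that the file was touched, even though the body is withheld.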

Complexity categorization:

  • LLM now returns trivial/moderate/complex via structured output
  • Stored in DB and shown in review comments
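One way to validate the complexity value before storing it is sketched below. The JSON field name `complexity` and the helper `parse_complexity` are assumptions for illustration; only the three category names come from the description above.

```python
import json
from enum import Enum
from typing import Optional


class Complexity(str, Enum):
    TRIVIAL = "trivial"
    MODERATE = "moderate"
    COMPLEX = "complex"


def parse_complexity(raw: str) -> Optional[Complexity]:
    """Extract a complexity rating from a JSON reply, or None if invalid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    value = payload.get("complexity")  # assumed field name
    try:
        return Complexity(str(value).lower())
    except ValueError:
        # Anything outside the three known categories is rejected.
        return None
```

Returning `None` for unknown values lets the caller store a NULL instead of an arbitrary string the model happened to emit.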

All analysis results are included in the LLM prompt as structured
context under "Diff statistics", "Repository context", and
"Automated observations (non-AI)" headings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
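To make the prompt structure described above concrete, here is a small sketch of how the three headings might be assembled. Only the heading names come from the description; the function `build_context`, the dictionary keys, and the markdown-style `##` prefix are assumptions.

```python
def build_context(
    diff_stats: dict[str, object],
    repo_context: dict[str, object],
    observations: list[str],
) -> str:
    """Render analysis results as three headed sections for the LLM prompt."""

    def section(title: str, lines: list[str]) -> str:
        body = "\n".join(f"- {line}" for line in lines) or "- (none)"
        return f"## {title}\n{body}"

    return "\n\n".join([
        section("Diff statistics", [f"{k}: {v}" for k, v in diff_stats.items()]),
        section("Repository context", [f"{k}: {v}" for k, v in repo_context.items()]),
        section("Automated observations (non-AI)", observations),
    ])
```

Labeling the last section as non-AI presumably helps the model treat those observations as deterministic facts rather than earlier model output.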

Collaborator

Automated review by pr-reviewer (https://git.brooktrails.org/brooktrails/pr-reviewer) v0.14.1 | Safety Check | anthropic | tracking id r-bf9ebd-6add21
This is an AI-generated review and may contain mistakes.

Status: Completed


Verdict: LGTM — The pre-LLM analysis engine adds valuable features like anomaly detection and test coverage checking without introducing any obvious security vulnerabilities or breaking changes.

Pre-LLM Analysis Engine Review

This PR adds a comprehensive analysis layer that runs before LLM calls, providing structured context about diffs and repository information. The implementation looks solid overall.

Key Observations

Security: No exposed credentials or secrets detected. The secret detection patterns in _SECRET_PATTERNS are appropriate for catching common credential formats.
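For readers unfamiliar with this kind of check, a pattern table along these lines is typical. The two regexes below follow the widely known shapes of AWS access key IDs and classic GitHub personal access tokens; they are illustrative assumptions and not the contents of the PR's `_SECRET_PATTERNS`.

```python
import re

# Illustrative patterns only; the PR's actual table is not shown in this thread.
EXAMPLE_SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub personal access token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) for each suspected secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in EXAMPLE_SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Reporting the line number and label, rather than the matched text, avoids copying the suspected secret into logs or review comments.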

Architecture: Clean separation between diff analysis (local text processing) and repo analysis (API calls). The filtering mechanism for high-entropy content is a smart token-saving optimization.

Database: Schema migration is handled properly with ALTER TABLE ... ADD COLUMN IF NOT EXISTS for the new complexity column.

Error Handling: HTTP requests are wrapped in try/except blocks with timeouts, and database operations handle exceptions appropriately.

Backwards Compatibility: Changes to existing APIs maintain compatibility by adding optional parameters and providing sensible defaults.

Minor Notes

  • The entropy threshold of 5.5 bits/char and long line threshold of 500 chars seem reasonable for detecting encoded content
  • Test file detection patterns cover the major ecosystems (Go, Python, JS/TS)
  • The Co-Authored-By attribution to "Claude Opus 4.6" is unusual but not problematic
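The test file detection mentioned in the notes above can be sketched with a handful of path patterns. These regexes are assumptions based on common Go, Python, and JS/TS conventions, not the patterns shipped in the PR.

```python
import re

# Assumed conventions: test_*.py, *_test.py, *_test.go, *.test.ts, *.spec.js,
# plus anything under a test/, tests/, or __tests__/ directory.
EXAMPLE_TEST_PATTERNS = [
    re.compile(r"(^|/)test_[^/]+\.py$"),
    re.compile(r"[^/]+_test\.(py|go)$"),
    re.compile(r"[^/]+\.(test|spec)\.[jt]sx?$"),
    re.compile(r"(^|/)(tests?|__tests__)/"),
]


def is_test_file(path: str) -> bool:
    """True if the path matches any of the assumed test file conventions."""
    return any(pattern.search(path) for pattern in EXAMPLE_TEST_PATTERNS)


def change_includes_tests(changed_paths: list[str]) -> bool:
    """Answer 'are tests included in this change?' for a list of paths."""
    return any(is_test_file(p) for p in changed_paths)
```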

The implementation appears well-tested and production-ready.

Collaborator

Automated review by pr-reviewer v0.14.1 | Safety Check | anthropic | tracking id r-bfa0fb-939fb2
This is an AI-generated review and may contain mistakes.

Status: Completed


Verdict: LGTM — Comprehensive diff analysis system with proper error handling and no obvious security issues.

Safety Check Review

This is a substantial feature addition that introduces pre-LLM analysis capabilities to the PR reviewer system. The implementation appears well-structured and safe.

Key Components Reviewed

analysis.py (new file)

  • Shannon entropy calculation for detecting encoded/encrypted content
  • Pattern matching for sensitive files (CI/CD, auth, secrets, etc.)
  • Secret detection with common patterns (GitHub tokens, AWS keys, OpenAI keys)
  • Test file detection and coverage analysis
  • Repo API integration for author history

Database Schema Changes

  • Adds complexity column to track LLM-assigned complexity ratings
  • Proper migration with ADD COLUMN IF NOT EXISTS
  • Updates all relevant queries and data structures
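The point of `ADD COLUMN IF NOT EXISTS` is that the migration can run on every startup without failing. Databases such as PostgreSQL accept that clause directly; this thread does not say which database the project uses. To keep the sketch runnable without a server, the version below uses SQLite, which lacks the clause, and emulates it with an explicit check. The table name `reviews` and the helper `ensure_column` are hypothetical.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, ddl_type: str) -> bool:
    """Add `column` to `table` if it is missing. Returns True if it was added.

    Identifiers are interpolated into SQL, so pass only trusted constants.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY)")
    print(ensure_column(conn, "reviews", "complexity", "TEXT"))  # first run adds it
    print(ensure_column(conn, "reviews", "complexity", "TEXT"))  # second run is a no-op
```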

Integration Points

  • Filters high-entropy content before sending to LLM (token optimization)
  • Passes structured analysis as context to LLM prompts
  • Updates verdict display to include complexity ratings

Security Considerations

API tokens handled properly - Uses existing settings.forgejo_token with appropriate headers
Input validation - Regex patterns are reasonable and not vulnerable to ReDoS
Error handling - API failures are caught and logged rather than crashing the system
No credential exposure - Secret detection helps find accidentally committed credentials
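The token and error handling points above could be realized along these lines. This is a sketch using only the standard library; `fetch_json` is a hypothetical helper, the `Authorization: token ...` header scheme is an assumption about how the Forgejo token is sent, and the PR may well use a different HTTP client.

```python
import json
import logging
import urllib.request

log = logging.getLogger(__name__)


def fetch_json(url: str, token: str, timeout: float = 10.0, opener=urllib.request.urlopen):
    """GET `url` and decode JSON, returning None on any network or parse error.

    `opener` is injectable so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"token {token}", "Accept": "application/json"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError, and socket timeouts;
        # ValueError covers malformed JSON and bad encodings.
        log.warning("Forgejo API request failed: %s", exc)
        return None
```

Returning `None` instead of raising lets the review continue with the local diff analysis when the repository lookup is unavailable.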

No Alarming Issues Found

The code follows established patterns in the codebase, has appropriate error handling, and the new analysis features enhance security by detecting potential credential leaks and sensitive file changes.

rcsheets deleted branch feat/diff-analysis 2026-03-22 08:04:05 +00:00
Reference
brooktrails/pr-reviewer!16