Retrieval Benchmarks
Trevec is a codebase memory layer, not an autonomous agent. We benchmark retrieval quality (does the engine find the right files?), not agent solve rate.
The Scope
We isolate the retrieval step to prove Trevec provides the optimal context bundle for any downstream LLM.
SWE-bench Lite: 300 real-world, carefully curated GitHub issues from popular Python repositories (Django, pytest, scikit-learn). The industry-standard benchmark for bug localization and codebase reasoning.
For each of the 300 instances, we pass the raw issue text into Trevec as a single query. No multi-step LLM reasoning loops, no agentic exploration. Pure retrieval.
Hit@1 (Recall@1)
Was the exact file needed to fix the bug the #1 result?
Recall@5
Was the exact file needed in the top 5 results?
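The two metrics above reduce to a membership check over the ranked file list. A minimal sketch of the scoring harness (the instance IDs and file paths below are illustrative, not from the actual benchmark run):

```python
def hit_at_k(ranked_files, gold_file, k):
    """True if the gold file appears in the top-k retrieved files."""
    return gold_file in ranked_files[:k]

def evaluate(predictions, gold):
    """predictions: {instance_id: ranked file list}; gold: {instance_id: gold file}.
    Returns (Hit@1, Recall@5) as fractions over all instances."""
    n = len(gold)
    hit1 = sum(hit_at_k(predictions[i], gold[i], 1) for i in gold) / n
    recall5 = sum(hit_at_k(predictions[i], gold[i], 5) for i in gold) / n
    return hit1, recall5

# Toy run with two instances: one Hit@1, one Hit@5-only.
preds = {
    "django-001": ["django/db/models/query.py", "tests/queries/tests.py"],
    "pytest-002": ["src/_pytest/fixtures.py", "src/_pytest/main.py"],
}
gold = {
    "django-001": "django/db/models/query.py",
    "pytest-002": "src/_pytest/main.py",
}
hit1, recall5 = evaluate(preds, gold)  # → (0.5, 1.0)
```

Because each instance has a single gold file, Recall@5 and Hit@5 coincide here.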
SWE-bench Lite Results
Hit@1 (Recall@1)
Recall@5
Query Latency (P50)
A single-shot retrieval pass completes in under 100 ms, with no slow multi-step LLM reasoning loops or agentic exploration.
Retrieval Quality
Trevec's zero-shot graph retrieval vs lexical search and agentic loops on SWE-bench Lite.
| Architecture | System | Top-1 |
|---|---|---|
| Zero-Shot Graph | Trevec | 43.3% |
| Agentic Loop | Agentless | 34.4% |
| Agentic Loop | SWE-Agent | 31.2% |
| Zero-Shot Lexical | BM25 RAG | 18.9% |
SWE-bench Lite (300 instances). BM25 baseline from Jimenez et al. (ICLR 2024). SWE-Agent and Agentless numbers from their respective papers. Trevec numbers directly measured via single-shot retrieval.
Trade-off Analysis
Trevec runs entirely on-device. No API calls, no cloud indexing, no data leaving your machine.
Engineering Insight
When a user submits a bug report, the text describes the symptoms of the bug. The repository's test files share this exact vocabulary. BM25 and vector search engines predictably rank test files at #1, crowding out the actual source files that need to be fixed.
In our baseline runs, nearly 50% of top predictions were test files.
Instead of expensive LLM loops to filter out tests, Trevec's graph architecture applies structural heuristics that distinguish production code from test infrastructure, deterministically promoting the files that actually need to be fixed, without any additional API calls or reasoning steps.
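The flavor of such a heuristic can be sketched as a deterministic re-rank that demotes files matching common Python test-path conventions. This is an illustrative simplification, not Trevec's actual graph logic:

```python
import re

# Common Python test-file conventions: tests/ directories, test_*.py, *_test.py.
TEST_PATTERNS = re.compile(r"(^|/)(tests?|testing)(/|_)|_test\.py$|^test_|/test_")

def is_test_file(path):
    """Heuristic: does this path look like test infrastructure?"""
    return bool(TEST_PATTERNS.search(path))

def rerank(ranked_files):
    """Stable re-rank: production files first, test files after.
    Relative order within each group is preserved."""
    prod = [f for f in ranked_files if not is_test_file(f)]
    tests = [f for f in ranked_files if is_test_file(f)]
    return prod + tests

rerank(["tests/queries/tests.py", "django/db/models/query.py"])
# → ["django/db/models/query.py", "tests/queries/tests.py"]
```

Because the re-rank is a pure function of file paths, it adds no API calls and no latency beyond a string match per candidate.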
Test File Pollution
Hit@1 Accuracy
Hit@1 increased by 60% (relative)
From 27.0% to 43.3% with structural heuristics
End-to-End Results
A single retrieval pass plus one LLM call resolves 125 of 300 SWE-bench Lite issues (41.7%). No agent loops, no iterative exploration. A feedback-aware retry on failed instances adds 13 more resolves, for 138 total (46.0%).
Issues Resolved
Resolve Rate
Cost per Fix
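The end-to-end pipeline can be sketched as one retrieval, one patch generation, and at most one feedback-aware retry. The `retrieve`, `generate_patch`, and `validate` callables below are hypothetical stand-ins for Trevec, the downstream LLM, and the repo's test suite; none of these names come from the actual Trevec API:

```python
def resolve(issue_text, retrieve, generate_patch, validate):
    """Single retrieval pass + one LLM call, with one feedback-aware retry.

    retrieve(query)            -> ranked context files (e.g. a Trevec query)
    generate_patch(issue, ctx) -> candidate patch (one LLM call)
    validate(patch)            -> (ok, feedback), e.g. from running the tests
    """
    context = retrieve(issue_text)
    patch = generate_patch(issue_text, context)
    ok, feedback = validate(patch)
    if ok:
        return patch
    # Retry once: fold the failure feedback into the retrieval query,
    # so the second pass can surface files the symptoms alone missed.
    context = retrieve(issue_text + "\n" + feedback)
    patch = generate_patch(issue_text, context)
    ok, _ = validate(patch)
    return patch if ok else None
```

Capping the pipeline at two retrievals and two LLM calls keeps the cost per fix bounded, in contrast to open-ended agent loops.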