Retrieval Benchmarks

Measured, not marketed.

Trevec is a codebase memory layer, not an autonomous agent. We benchmark retrieval quality (does the engine find the right files?), not agent solve rate.

The Scope

What we measured

We isolate the retrieval step to measure whether Trevec hands any downstream LLM the right context bundle.

SWE-bench Lite

300 real-world, curated GitHub issues from popular Python repositories (Django, pytest, scikit-learn). The industry-standard baseline for bug localization and codebase reasoning.

Single-Shot Query

For each of the 300 instances, we pass the raw issue text into Trevec as a single query. No multi-step LLM reasoning loops, no agentic exploration. Pure retrieval.

Hit@1 (Recall@1)

Was the exact file needed to fix the bug the #1 result?

Recall@5

Was the exact file needed in the top 5 results?
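Both metrics reduce to a membership check on a ranked list. A minimal sketch in Python (the instance IDs, file paths, and `score` helper below are illustrative, not Trevec's actual evaluation harness):

```python
# Hit@1 and Recall@5 over ranked retrieval results.
# Each benchmark instance maps an issue to the gold file its fix touches;
# the retriever returns a ranked list of candidate file paths.

def hit_at_k(ranked: list[str], gold: str, k: int) -> bool:
    """True if the gold file appears in the top-k results."""
    return gold in ranked[:k]

def score(results: dict[str, list[str]], gold: dict[str, str]) -> dict[str, float]:
    """Aggregate Hit@1 and Recall@5 across all instances."""
    n = len(gold)
    hit1 = sum(hit_at_k(results[i], g, 1) for i, g in gold.items())
    rec5 = sum(hit_at_k(results[i], g, 5) for i, g in gold.items())
    return {"hit@1": hit1 / n, "recall@5": rec5 / n}

# Toy example with two (hypothetical) instances:
gold = {"django-001": "django/db/models/query.py",
        "pytest-002": "src/_pytest/fixtures.py"}
results = {
    "django-001": ["django/db/models/query.py", "tests/queries/tests.py"],
    "pytest-002": ["testing/python/fixtures.py", "src/_pytest/fixtures.py"],
}
print(score(results, gold))  # hit@1 = 0.5, recall@5 = 1.0
```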

SWE-bench Lite Results

300/300 instances. Zero shortcuts.

43.3%

Hit@1 (Recall@1)

n/a

Recall@5

~40ms

Query Latency (P50)

A single-shot retrieval pass completes in under 100ms, with no slow multi-step LLM reasoning loops or agentic exploration.

Retrieval Quality

How Trevec compares

Trevec's zero-shot graph retrieval vs lexical search and agentic loops on SWE-bench Lite.

System     Architecture       Latency  Cost          Top-1
Trevec     Zero-Shot Graph    0.04s    Free (Local)  43.3%
Agentless  Agentic Loop       ~209s    High ($)      34.4%
SWE-Agent  Agentic Loop       ~381s    High ($)      31.2%
BM25 RAG   Zero-Shot Lexical  1.5s     Free          18.9%

SWE-bench Lite (300 instances). BM25 baseline from Jimenez et al. (ICLR 2024). SWE-Agent and Agentless numbers from their respective papers. Trevec numbers directly measured via single-shot retrieval.

Trade-off Analysis

Speed meets privacy.

Trevec runs entirely on-device. No API calls, no cloud indexing, no data leaving your machine.

[Chart: Retrieval Latency (ms) vs. Privacy Score for Trevec, Cursor RAG, Sourcegraph Cody, Mem0, and LangChain RAG; Trevec sits in the labeled sweet spot. Lower latency + higher privacy = top-left quadrant wins.]

Engineering Insight

Solving the "test file" decoy.

Why naive vector search fails

When a user submits a bug report, the text describes the symptoms of the bug. The repository's test files share this exact vocabulary. BM25 and vector search engines predictably rank test files at #1, crowding out the actual source files that need to be fixed.

In our baseline runs, nearly 50% of top predictions were test files.

Trevec's solution

Instead of running expensive LLM loops to filter out tests, Trevec's graph architecture applies structural heuristics that distinguish production code from test infrastructure, deterministically promoting the files that actually need to be fixed, with no additional API calls or reasoning steps.
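The exact heuristics are internal to Trevec, but the core idea can be sketched as a path-based re-ranker. The `TEST_PATTERNS` regex and stable-partition `rerank` below are illustrative assumptions, not Trevec's implementation:

```python
import re

# Minimal sketch of a structural re-ranker (illustrative only).
# Candidates whose paths look like test infrastructure are demoted
# so production source files surface first.
TEST_PATTERNS = re.compile(
    r"(^|/)(tests?|testing)(/|$)|(^|/)test_|_test\.py$|conftest\.py$"
)

def looks_like_test(path: str) -> bool:
    return bool(TEST_PATTERNS.search(path))

def rerank(ranked: list[str]) -> list[str]:
    """Stable partition: keep relative order, but move test files below source files."""
    source = [p for p in ranked if not looks_like_test(p)]
    tests = [p for p in ranked if looks_like_test(p)]
    return source + tests

ranked = [
    "tests/queries/test_bulk_update.py",  # lexical match on the bug report
    "django/db/models/query.py",          # the file that actually needs fixing
    "tests/queries/models.py",
]
print(rerank(ranked))
# → ['django/db/models/query.py', 'tests/queries/test_bulk_update.py', 'tests/queries/models.py']
```

Because the partition is deterministic string matching, it adds effectively zero latency to the retrieval pass.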

Test File Pollution

Before: 49.7% → After: 13%

Hit@1 Accuracy

Before: 27.0% → After: 43.3%

Hit@1 increased by 60% (relative)

From 27.0% to 43.3% with structural heuristics.

End-to-End Results

End-to-End Bug Fixing: 41.7% resolved

A single retrieval pass plus one LLM call resolves 125 of 300 SWE-bench Lite issues. No agent loops, no iterative exploration. A feedback-aware retry on failed instances adds 13 more resolves (138/300, 46.0%).

125/300

Issues Resolved

41.7%

Resolve Rate

$0

Cost per Fix
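The retrieve-once, patch-once, retry-once loop described above can be sketched as follows. Every callable here (`fake_retrieve`, `fake_llm`, `fake_tests`) is a stand-in assumption, not Trevec's real API:

```python
from typing import Callable, Optional

def solve(issue: str,
          retrieve: Callable[[str], list[str]],
          propose_patch: Callable[[str, list[str], Optional[str]], str],
          run_tests: Callable[[str], tuple[bool, str]],
          max_retries: int = 1) -> Optional[str]:
    """One retrieval pass, one LLM patch attempt, and a feedback-aware retry."""
    context = retrieve(issue)   # single-shot retrieval, no agent loop
    feedback = None
    for _ in range(1 + max_retries):
        patch = propose_patch(issue, context, feedback)  # one LLM call
        ok, log = run_tests(patch)
        if ok:
            return patch
        feedback = log          # feed the failure log into the retry
    return None

# Demo with stand-in components: the first patch fails, the retry succeeds.
def fake_retrieve(issue):
    return ["django/db/models/query.py"]

def fake_llm(issue, context, feedback):
    return "patch-v2" if feedback else "patch-v1"

def fake_tests(patch):
    return (patch == "patch-v2", "FAILED test_bulk_update - AssertionError")

print(solve("QuerySet.bulk_update crashes", fake_retrieve, fake_llm, fake_tests))
# → patch-v2
```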

Ready to try Trevec?

Install in 30 seconds. No API keys. Runs entirely on your machine.

Read the Docs