Your AI Agent Is Flying Blind Without Local Code Intelligence
Your AI coding agent doesn’t understand your code. It reads it, token by token, line by line, but it doesn’t understand it. It can’t trace a function call through four modules. It can’t tell you which types implement a trait. It doesn’t know that renaming `handleAuth` will break three downstream consumers it’s never seen.
That’s because most AI coding tools treat source code the same way they treat a blog post: as text. Local code intelligence treats it as code.
After 1,000+ rounds of benchmarking a local code intelligence engine against real-world codebases, I’ve found that structure-aware, on-device code search consistently outperforms text-based approaches, scoring 9.93 out of 10 on relevance across 15 diverse queries. No cloud API required. No code leaving your machine.
Here’s why local code intelligence is the missing layer in your AI workflow, and what it takes to build one that actually works.
Key Takeaways
- Most AI coding tools treat code as flat text, missing call graphs, type hierarchies, and module boundaries that make code navigable.
- Local code intelligence combines Tree-Sitter parsing, hybrid search (BM25 + vector with cross-encoder reranking), and on-device LLM inference, all running on your machine via three GGUF models on Apple Silicon Metal.
- Benchmarked across 1,000+ iterations: relevance scores improved from 5.8 to 9.93/10 through systematic signal tuning.
- One `npx` command connects 32 structure-aware tools to any MCP-compatible AI agent.
What’s the Conventional Approach? Just Send More Context to the Cloud
The Stack Overflow 2025 Developer Survey found that 84% of developers are using or planning to use AI tools, with 51% using them daily (Stack Overflow, 2025). GitHub Copilot has crossed 20 million users. Cursor hit $2 billion in annualized revenue by February 2026, doubling in three months (Bloomberg, 2026). The explosion is real.
But look at how most of these tools actually work under the hood. The dominant pattern is straightforward: grab files from your repo, chunk them, maybe run them through a cloud-hosted embedding model, and stuff as many tokens as possible into a context window. Need to understand a function? Send the file. Need broader context? Send the whole directory. Hit the token limit? Hope the model figures out the important parts.
In practice, this is the approach behind most AI coding assistants today. GitHub Copilot indexes your repo for context. Cursor uses RAG over your codebase. Cloud-based coding agents pull files into their context window before answering questions.
And it works, to a point. For single-file completions, autocomplete, and “explain this function” queries, text-based context is often good enough. But there’s a growing tension. That same Stack Overflow survey found that 46% of developers actively distrust AI accuracy, up from roughly 30% in prior years (Stack Overflow, 2025). People are using AI tools and doubting their output. Simultaneously.
Why? Because “good enough for autocomplete” isn’t the same as “understands your codebase.” As AI agents take on more ambitious tasks (refactoring across modules, tracing bugs through call chains, assessing the impact of a change), the gap between text search and code understanding becomes a chasm.
Why Does Treating Code as Text Fall Apart?
At its core, the problem is structural blindness. A Qodo study found that 54% of developers who manually select context for AI tools say the AI still misses relevance. And the problem gets worse with seniority, rising from 41% for juniors to 52% for seniors (Qodo, 2025). Senior developers work on harder problems with deeper cross-module dependencies. Text-based context selection can’t keep up.
The text-search ceiling is real. When I started building a code intelligence engine, early benchmarks averaged 5.8 out of 10 on relevance. The root cause wasn’t bad algorithms. It was that text search fundamentally can’t distinguish “code that mentions a concept” from “code that implements it.” A function called `extract_concept_tags` that checks for JSON patterns in string literals will rank #1 for “JSON serialization” queries, even though it’s a pattern-detection utility, not a serializer. BM25 can’t tell the difference. Only structural understanding can.
Three specific problems compound:
Your code leaves your machine. Every query to a cloud-based code tool means sending proprietary source code to a third-party API. A 2025 Infragistics survey of 250 tech leaders found that security (51%), AI code reliability (45%), and data privacy (41%) are the top software development challenges (GlobeNewsWire, 2025). Meanwhile, 50% of organizations expect data leakage through generative AI tools in the next 12 months (Acuvity, 2025). Your code’s dependency graph, internal API contracts, and architectural patterns are intellectual property. They shouldn’t need to leave your laptop for good search results.
Structure isn’t recoverable from text. Here’s a counterintuitive finding: a METR randomized controlled trial found that AI tools made experienced open-source developers 19% slower on real tasks, even though those same developers perceived a 20% speedup (METR, 2025). One likely contributor: AI tools that suggest plausible-looking code without understanding the structural context of where it lands. Code has call graphs, type hierarchies, import chains, and module boundaries. Text search sees none of it. When your agent searches for “authentication middleware,” text search returns every file that mentions authentication: tests, config files, comments, error messages. Structure-aware search finds the actual middleware function, its callers, its dependencies, and the tests that exercise it.
Cloud round-trips cost time and money. Claude Opus 4.6 costs $25 per million output tokens. Claude Sonnet 4.6 runs $15 per million (Anthropic, 2026). Daily Claude Code users report spending $500 to $2,000 per month on API costs, with one developer consuming 10 billion tokens over eight months (Morph, 2026). Enterprise monorepos span “several million tokens” while even frontier models max out at roughly 1 million. Chroma’s research on 18 LLMs found that “performance grows increasingly unreliable as input length grows” (Chroma Research, 2025). More context doesn’t mean better understanding.
| | Cloud Context (RAG) | Local Code Intelligence |
|---|---|---|
| Code privacy | Source sent to third-party APIs | Everything stays on your machine |
| Understanding | Text similarity (flat) | Structural: call graphs, types, imports |
| Cost per query | $0.01-0.10+ in API tokens | $0 (on-device inference) |
| Latency | Network round-trip per query | Local, sub-second |
| Navigation | File-level search results | Symbol-level with 32 navigation tools |
| Multi-repo | One project at a time | Cross-repo search and dependency exploration |
| Offline capable | No | Yes |
What Does Local Code Intelligence Actually Mean?
Local code intelligence means building a search engine that runs entirely on your machine, understands code as code, and exposes that understanding through a standard protocol your AI agent already speaks.
Not text search with extra steps. Not “local embeddings.” A proper indexing pipeline that parses source code into structured symbols, builds a graph of their relationships, and lets you query that graph with hybrid search, cross-encoder reranking, and structural ranking signals on top.
This is what Code Intelligence MCP Server does. It’s a Rust engine that indexes your codebase into three complementary storage layers, runs three GGUF models on Apple Silicon Metal (an embedding model, a description LLM, and a cross-encoder reranker), and exposes 32 navigation tools through the Model Context Protocol (MCP). Your AI agent connects once and gets structure-aware code understanding without any code leaving your machine.
Let me walk you through how it works.
How Does the Pipeline Actually Work?
Indexing the Code Intelligence MCP Server’s own codebase produces several thousand symbols, over a million edges between them, and one LLM-generated description per symbol. The entire index builds in seconds on Apple Silicon. Here’s what happens.
Stage 1: Parsing with Tree-Sitter
Everything starts with Tree-Sitter, which parses source files into abstract syntax trees (ASTs). Not regex. Not line-matching. Actual language-aware parsing that understands the difference between a function declaration, a type definition, and an import statement.
For each file, Tree-Sitter produces a syntax tree. Language-specific extractors then walk that tree and pull out symbols: functions, structs, enums, traits, type aliases, constants, imports, and their relationships. Each symbol carries metadata: kind, visibility (exported or private), line span, the raw code text, and edges connecting it to other symbols (calls, references, type relationships, imports, reads, writes).
Eight languages are supported: Rust, TypeScript, JavaScript, Python, Go, Java, C, and C++. Adding a new one means writing a single extractor file. The parsing infrastructure handles the rest.
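The per-symbol record described above can be sketched as a plain data structure. This is an illustrative Python sketch, not the engine’s actual Rust types; the field and edge names mirror the metadata listed in the text, and the example symbol is hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Edge:
    kind: str    # "calls", "references", "imports", "reads", "writes", ...
    target: str  # name of the symbol on the other end of the edge

@dataclass
class Symbol:
    name: str
    kind: str                   # "function", "struct", "trait", ...
    exported: bool              # visibility: public API vs. private helper
    file: str
    line_span: Tuple[int, int]  # (start_line, end_line)
    code: str                   # raw source text of the definition
    edges: List[Edge] = field(default_factory=list)

# Hypothetical record for a file-watching function
sym = Symbol(
    name="spawn_watch_loop",
    kind="function",
    exported=True,
    file="src/watcher.rs",
    line_span=(42, 88),
    code="pub fn spawn_watch_loop(...) { ... }",
    edges=[Edge("calls", "poll_events"), Edge("reads", "POLL_INTERVAL")],
)
```

A graph query like “what calls this function?” is then just a filter over incoming edges of kind `calls`.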
Stage 2: Three Storage Layers, LLM-Enriched
Extracted symbols don’t go into a single index. They go into three complementary storage systems, each optimized for a different kind of query:
SQLite stores the structured metadata: every symbol, every edge between symbols, file metadata, and the LLM-generated descriptions. This is the graph backbone. When your agent asks “what calls this function?” or “what types implement this trait?”, SQLite answers directly through edge traversal. No search needed.
Tantivy (a Rust-native search engine using BM25 ranking) handles full-text keyword search. But it doesn’t just index raw code. The indexing pipeline strips comments (to prevent meta-matching), generates morphological variants of symbol names (`watch` to `watcher`, `handle` to `handler`), extracts concept tags and framework patterns, and appends LLM-generated descriptions. What emerges is a BM25 index that can match natural-language queries like “file watching” to a function called `spawn_watch_loop`.
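A minimal sketch of the morphological-variant idea, assuming simple suffix rules (the engine’s actual rules aren’t documented here; this only illustrates why a query for “watching” can hit a symbol named `watch` via BM25):

```python
def name_variants(token: str) -> set:
    """Index-time suffix expansion for one identifier token.
    All variants get indexed alongside the original name."""
    t = token.lower()
    variants = {t}
    if t.endswith("e"):
        # handle -> handler, handling
        variants |= {t + "r", t[:-1] + "ing"}
    else:
        # watch -> watcher, watching
        variants |= {t + "er", t + "ing"}
    return variants
```

At query time nothing changes: BM25 simply finds the expanded terms in the index.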
LanceDB stores 1536-dimensional vector embeddings generated by jina-code-embeddings-1.5b, an on-device embedding model running through llama.cpp with Metal GPU offload. The model uses Matryoshka representation, so the first N dimensions of the full 1536-dim vector retain meaningful semantic structure: you can truncate down to a smaller dimension for memory savings without retraining. These embeddings capture semantic similarity. When keyword search can’t bridge the vocabulary gap between your query and the code, vector search often can.
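The Matryoshka truncation trick is easy to show with toy numbers (these vectors are made up, not real model outputs). Keep the leading dimensions, re-normalize, and cosine similarity remains well-defined at the smaller size:

```python
import math

def truncate_matryoshka(vec, dims):
    """Keep the first `dims` components of a Matryoshka embedding,
    then re-normalize to unit length for cosine comparisons."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    # Both vectors are unit-normalized, so the dot product is the cosine
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dim "embeddings" truncated to 2 dims for memory savings
a = truncate_matryoshka([0.5, 0.5, 0.5, 0.5], 2)
b = truncate_matryoshka([0.5, 0.5, -0.5, 0.5], 2)
```

The trade-off is graceful: fewer dimensions means less memory and slightly coarser similarity, with no retraining.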
And then there’s the on-device description LLM. A Qwen2.5-Coder-1.5B model (quantized to Q4_K_M, about 1.0 GB) runs locally via llama.cpp with full Metal GPU offload. It generates a one-sentence natural-language description for every symbol in the codebase. These descriptions get indexed into Tantivy, enriching BM25 search with human-readable terms a developer would actually search for. Each description takes about 0.32 seconds; thousands of symbols get described in a single background pass while you work. After generation completes, the LLM is freed to release ~1.0 GB of RAM. The embedding model stays resident for queries.
Stage 3: Hybrid Search with Structural Ranking
When a query comes in, two searches run in parallel: BM25 keyword search through Tantivy and vector similarity search through LanceDB. Results merge through Reciprocal Rank Fusion (RRF), a technique that combines ranked lists by reciprocal position rather than raw scores, making it reliable across different scoring scales.
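RRF itself is only a few lines. A sketch using the common `k = 60` damping constant (the document IDs here are invented for illustration):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each doc scores the sum of 1/(k + rank)
    over every ranked list it appears in. Rank is 1-based; k keeps any
    single list from dominating. Scores from the input lists are
    ignored entirely, which is what makes RRF scale-independent."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["auth_middleware", "auth_test", "auth_config"]
vector = ["auth_middleware", "auth_config", "login_handler"]
merged = rrf_merge([bm25, vector])
```

A document ranked highly by both searches (like `auth_middleware` above) wins over one ranked highly by only one.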
But raw search scores aren’t enough. On top of the merged results, structural signals reshape the ranking. These are signals only a code-aware engine can provide:
- Test file penalty: test files get downranked unless the query explicitly targets tests (multi-layer detection covers file path, symbol name, and AST-level analysis like `#[test]` and `mod tests`).
- Export status boost: public API symbols represent the primary surface and rank higher.
- Intent detection: query patterns like “schema for…” or “error handling” trigger intent-specific multipliers (definitions get 1.5x, schema queries get 50-75x).
- Popularity by edge count: functions called from many places rank higher (PageRank-style graph signal).
- Framework-pattern injection: routes, middleware, decorators surface alongside symbol matches.
- Edge expansion: high-ranking symbols pull in structurally related code (callers, type members), with parent-derived scores stripped of intent multipliers so children compete fairly.
- Diversification: caps how many results come from a single file or kind, so one heavy file can’t flood the top-N.
- Score-gap detection: if there’s a 2.5x or larger drop between consecutive results, trailing noise gets cut.
- Sub-query coverage: for multi-term queries, each sub-query must have at least two matching results before the system declares the query covered.
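As one concrete example from the list above, score-gap detection is a short loop. The 2.5x threshold comes from the text; the result names and scores are made up:

```python
def cut_at_score_gap(results, ratio=2.5):
    """Truncate a ranked (doc, score) list at the first point where a
    result's score drops by `ratio`x or more versus the previous one.
    Everything after a genuine relevance cliff is treated as noise."""
    kept = []
    for doc, score in results:
        if kept and score > 0 and kept[-1][1] / score >= ratio:
            break
        kept.append((doc, score))
    return kept

ranked = [("middleware", 10.0), ("auth_config", 9.0),
          ("unrelated_util", 3.0), ("noise", 2.0)]
top = cut_at_score_gap(ranked)  # 9.0 -> 3.0 is a 3x drop, so cut there
```

The calibration work is in choosing `ratio` so the cut fires only on real cliffs, not on ordinary score decay.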
Stage 4: Cross-Encoder Reranking
After structural ranking, the top candidates pass through a cross-encoder reranker: bge-reranker-v2-m3 (Q8_0, ~600 MB), running on llama.cpp with Metal GPU. Cross-encoders are different from the bi-encoder embeddings used for retrieval: they take the query and a candidate document together and produce a single relevance score, capturing fine-grained interactions that retrieval embeddings can’t. The cost is that you can only afford to run them on a small top-K (default 20). The win is precision: the top three results almost always land where they should.
Reranker output goes through a result cache so duplicate (query, document) pairs don’t re-score, keeping warm-search latency under a couple hundred milliseconds end to end. The reranker is on by default; toggle it with `RERANKER_ENABLED=false` if you want pure RRF.
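The top-K-only reranking plus (query, document) cache can be sketched as follows. This is an illustrative shape, not the engine’s Rust code; `fake_cross_encoder` is a stand-in for the real model and just counts invocations so the cache effect is visible:

```python
def rerank_with_cache(query, candidates, score_fn, cache, top_k=20):
    """Re-score only the top_k candidates with an expensive
    cross-encoder-style score_fn(query, doc); memoize on (query, doc)
    so repeated searches skip the model entirely."""
    head, tail = candidates[:top_k], candidates[top_k:]
    scored = []
    for doc in head:
        key = (query, doc)
        if key not in cache:          # cache miss: pay the model cost once
            cache[key] = score_fn(query, doc)
        scored.append((doc, cache[key]))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored] + tail

calls = []
def fake_cross_encoder(query, doc):
    calls.append(doc)
    return len(doc)  # dummy relevance score

cache = {}
first  = rerank_with_cache("auth", ["a", "bbb", "cc"], fake_cross_encoder, cache)
second = rerank_with_cache("auth", ["a", "bbb", "cc"], fake_cross_encoder, cache)
```

The second identical search hits the cache for all three pairs and never touches the model.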
Put concretely, a naive text search for “authentication middleware” might return 20 files that mention “auth.” This pipeline surfaces the actual auth middleware, its configuration, its callers, and the tests that cover it. In that order.
What Tools Does Your Agent Actually Get?
Code Intelligence MCP Server exposes 32 tools through MCP. Your AI agent doesn’t just get search. It gets a navigation toolkit, an analysis toolkit, a description-lifecycle toolkit, and (in standalone mode) a cross-repo toolkit.
Search and navigation (9 tools). Your primary entry point is `search_code`, hybrid semantic and keyword search with the full structural ranking and reranking pipeline. Beyond that: `get_definition` jumps to a symbol’s full definition. `find_references` enumerates every usage. `get_call_hierarchy` traces who calls a function and what it calls. `get_type_graph` navigates type hierarchies in any direction. `explore_dependency_graph` maps module-level imports and exports. `get_file_symbols` lists everything defined in a file. `get_usage_examples` pulls real call sites from the codebase. `get_context_bundle` is the most agent-native of the bunch: hand it a task description and it returns a pre-assembled bundle with the relevant definitions, call chains, test coverage, and similar code in a single call.
Analysis (10 tools). `find_affected_code` performs reverse dependency analysis. `predict_impact` goes further, combining structural dependencies with git co-change history (files that historically change together) to produce ranked impact predictions with confidence scores. `trace_data_flow` follows variable reads and writes across functions. `find_similar_code` and `get_similarity_cluster` operate over embedding space. `find_duplicates` flags semantically near-duplicate symbols (the kind of dupe that survives a rename). `find_dead_code` lists symbols with zero incoming references. `explain_search` returns a scoring breakdown of any query, useful when an agent needs to understand why a result ranked where it did. `summarize_file` and `get_module_summary` provide overview-level context.
Frameworks, tests, and description lifecycle (6 tools). `find_tests_for_symbol`, `search_todos`, `search_decorators`, and `search_framework_patterns` cover the obvious cases. The interesting pair is `find_undocumented_symbols` (symbols that haven’t been described yet) and `find_stale_descriptions` (symbols whose descriptions are out of sync with the current code, detected via content-hash mismatch). These are how the description LLM stays caught up with edits without a full reindex.
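The content-hash staleness check behind that pair of tools reduces to a simple audit. A hedged sketch (dictionary shapes and symbol names are invented; the engine’s actual hash function and storage are not specified in this article):

```python
import hashlib

def content_hash(code: str) -> str:
    """Any edit to a symbol's source text changes its hash."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

def audit_descriptions(current_code, described_hashes):
    """current_code: {symbol: source text right now}
    described_hashes: {symbol: hash recorded when it was described}
    Returns (stale, undocumented) symbol name lists."""
    stale, undocumented = [], []
    for name, code in current_code.items():
        if name not in described_hashes:
            undocumented.append(name)       # never described
        elif described_hashes[name] != content_hash(code):
            stale.append(name)              # code changed since describing
    return stale, undocumented

code_now = {"f": "fn f() { /* edited */ }", "g": "fn g() {}", "h": "fn h() {}"}
recorded = {"f": content_hash("fn f() {}"), "g": content_hash("fn g() {}")}
stale, undocumented = audit_descriptions(code_now, recorded)
```

Only the symbols in `stale` and `undocumented` need a fresh LLM pass, which is why edits never trigger a full re-description.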
Cross-repo (2 tools, standalone mode). `search_across_repos` runs a single query across every indexed repository and merges results by score. `explore_cross_repo_dependencies` walks dependency edges that cross repo boundaries. For monorepo-by-convention setups (a backend repo, a frontend repo, a shared types repo), this collapses what would be three separate searches into one structured query.
Index management and learning (5 tools). `hydrate_symbols`, `report_selection`, `report_file_access`, `refresh_index`, `get_index_stats`. The two `report_` tools feed an opt-in learning system that boosts symbols and files you’ve selected or visited recently.
These tools compose naturally. An agent investigating a bug can: search for the relevant function, get its definition, trace its callers, check what tests cover it, and assess the impact of a fix. All through local MCP calls, in seconds, with zero cloud dependency.
How Do You Prove Code Intelligence Is Actually Intelligent?
Our finding: After 1,000+ benchmark rounds across two codebases, systematic signal tuning improved search relevance from 5.8 to 9.93 out of 10, a 71% improvement driven by targeted fixes to specific ranking pathologies.
Most code search tools ship with vague claims about “smart search” or “AI-powered results.” Nobody publishes relevance scores. We do.
The benchmark system runs two suites: a self-benchmark (15 queries against the engine’s own codebase) and a cross-repo benchmark (15 queries against an entirely separate real-world project). Each query runs through the live MCP server. An automated evaluator scores each result set from 1 to 10 on relevance.
Here’s how scores progressed across key milestones:
Self-benchmark (15 queries, own codebase):
| Round | Avg Score | Key Fix |
|---|---|---|
| R12 | 5.80 | Baseline |
| R25 | 6.73 | Comment stripping, concept tags |
| R59 | 7.00 | Import tag scoping to exported symbols |
| R61 | 7.20 | SQL test detection, edge expansion filtering |
| R63 | 7.80 | Asymmetric embedding fix (BGE vs Jina) |
| R97 | 8.00 | LanceDB data loss auto-repair |
Cross-repo benchmark (15 queries, separate real-world project):
| Round | Avg Score | Key Fix |
|---|---|---|
| R103 | ~8.4 | Baseline on new repo |
| R120 | 9.67 | Sub-query coverage, file-concentration diversity |
| R121 | 9.80 | Narrowed glue-code penalty to actual barrel files |
| R124 | 9.87 | Framework injection cap, BM25 fallback |
| R1044 | 9.93 | Stem dedup, post-gap score fill |
The final per-query breakdown at R1044 tells the full story. Fourteen of fifteen queries score a perfect 10:
| Query | Topic | Score |
|---|---|---|
| Q1 | Scoring and ranking logic | 10 |
| Q2 | Database schema and migrations | 10 |
| Q3 | Middleware and request handling | 10 |
| Q4 | Error handling patterns | 10 |
| Q5 | Rate limiting and throttling | 10 |
| Q6 | Authentication and authorization | 10 |
| Q7 | Admin dashboard components | 9 |
| Q8 | API endpoint definitions | 10 |
| Q9 | Background job processing | 10 |
| Q10 | Data validation and sanitization | 10 |
| Q11 | Tauri desktop integration | 10 |
| Q12 | Roles and permissions | 10 |
| Q13 | Shared state management | 10 |
| Q14 | File upload handling | 10 |
| Q15 | Notification system | 10 |
Each round isolates a specific failure. Some examples from the journey:
- Round 59: Import tags were being appended to private helper functions, causing a 5-line utility to rank for “embeddings” queries because its file happened to import an embeddings module. Fix: scope import tags to exported symbols only.
- Round 115: Score-gap detection was introduced. When consecutive results show a 2.5x or larger score drop, trailing noise gets cut. Calibrated so it only triggers on genuine relevance cliffs.
- Round 118: The Elysia framework pattern extractor was wildly overeager. Any `.get()` call got classified as an HTTP route, including `Map.get(key)` and `headers.get('content-type')`. 149 of 221 detected patterns were false positives. Fix: require the first argument to be a string literal starting with `/`.
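The Round 118 rule is easy to illustrate. The actual fix lives in the Rust extractor and operates on the AST; this Python regex is only a sketch of the predicate “first argument is a string literal starting with `/`”:

```python
import re

# A .get(...) call counts as an HTTP route only when its first
# argument is a string literal that begins with "/"
ROUTE_GET = re.compile(r"""\.get\(\s*['"]/""")

def looks_like_route(call_text: str) -> bool:
    return ROUTE_GET.search(call_text) is not None
```

This single constraint eliminates the `Map.get(key)` and `headers.get('content-type')` false positives while keeping genuine route registrations.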
Importantly, the evaluator has a noise floor of about ±0.5 points per round, so changes must produce statistically meaningful improvements. Running `python3 scripts/run_benchmark.py --live` takes about 4 seconds for a full 15-query suite. That fast feedback loop is what makes rigorous iteration possible. The same shape (record, replay, score, attribute regressions) is the one I argue for more generally in Backtesting AI Agents.
This isn’t hand-wavy “AI-powered” marketing. It’s engineering discipline applied to search quality, measured against real queries, across real codebases.
What About Cross-Repo and Large Monorepos?
Single-repo search is the easy case. Real teams almost never live in one repo. A backend repo, a frontend repo, a shared types or design-system repo, plus whatever vendor packages live as siblings. Any answer that requires walking from one to another (where does this API consumer live? which packages depend on this types module?) is exactly the kind of question text search fails on.
Standalone mode is the answer. Run one server: `npx @iceinvein/code-intelligence-mcp-standalone`. Point any number of MCP clients at `http://localhost:3333/mcp`. Each session auto-detects its workspace through the MCP roots capability and shares the registry. The three GGUF models (~3.2 GB combined) load once and stay resident; per-repo indexes live under `~/.code-intelligence/repos/<hash>/`.
Two tools then operate across that registry. `search_across_repos` runs a single query against every indexed repo and merges by score: the same 32-tool agent that worked in one project now works across the constellation. `explore_cross_repo_dependencies` walks dependency edges that cross repo boundaries, so you can ask “which symbols in the backend repo reference this type from the shared repo?” without manually correlating two separate indexes.
The savings compound when you’re running multiple AI sessions against multiple repos. Three GGUF models loaded once across five sessions costs ~3.2 GB; loading per-session costs ~16 GB. For developers running parallel Claude Code sessions across a monorepo-by-convention layout, that’s the difference between “this works on my laptop” and “this needs a workstation.”
Why Does the MCP Protocol Make This Possible?
The Model Context Protocol (MCP), introduced by Anthropic in late 2024, is the connective tissue that makes local code intelligence practical. In one year, MCP grew from roughly 100,000 SDK downloads to 97 million monthly downloads, with over 10,000 servers and 300+ clients (Pento.ai, 2025). OpenAI adopted it in March 2025. Google DeepMind followed in April. By December 2025, the protocol was donated to the Linux Foundation (MCP Blog, 2025).
Before MCP, connecting a code intelligence engine to an AI agent meant building a bespoke plugin for each client. MCP changes that. Code Intelligence MCP Server implements the protocol once and works with any MCP-compatible client: Claude Code, Cursor, OpenCode, Trae, or any future agent that speaks MCP.
Two transport modes are available. Stdio mode is the default: each AI session spawns its own server process, with communication over standard input/output. Standalone mode runs a single long-lived HTTP server that multiple sessions connect to simultaneously, sharing the embedding model, description LLM, and reranker across every connection.
What Does On-Device Inference Actually Get You?
Nearly a third of developers use macOS (Stack Overflow, 2024), and Apple Silicon has turned every MacBook Pro into a capable local AI platform. Community benchmarks show M3 Pro achieving 45-60 tokens per second with quantized 8B models via llama.cpp (Scalastic, 2025). That’s more than enough for code intelligence workloads.
Every query against Code Intelligence MCP Server runs entirely on your machine. Three GGUF models do the work, all on llama.cpp with full Metal GPU offload:
- jina-code-embeddings-1.5b (Q8_0, ~1.5 GB, 1536-dim Matryoshka): generates query and document vectors. Symmetric embeddings, so queries and documents share the same space without instruction prefixes. Stays resident.
- Qwen2.5-Coder-1.5B-Instruct (Q4_K_M, ~1.0 GB): generates one-line natural-language descriptions per symbol at index time, ~0.32s per symbol. Freed from memory after the description pass completes.
- bge-reranker-v2-m3 (Q8_0, ~600 MB): cross-encoder that re-scores the top-K candidates after structural ranking. Stays resident, cached against duplicate (query, doc) pairs.
Total first-launch download: ~3.2 GB. Resident steady-state memory: ~2.1 GB (embedding + reranker; the description LLM is freed after generation).
What this means in practice: zero per-query cloud costs. No API rate limits. No network latency on every search. And no code ever leaves your machine. Compare that to $500-2,000 per month in cloud API costs for daily AI coding workflows. Over a year, a team of five developers could save $30,000-120,000, while getting faster results and keeping proprietary code on their own hardware.
How Do You Get Started?
Getting started takes one command:
```bash
npx @iceinvein/code-intelligence-mcp
```
Add the server to your AI agent’s MCP configuration (e.g., `~/.claude.json` for Claude Code):
```json
{
  "mcpServers": {
    "code-intelligence": {
      "command": "npx",
      "args": ["-y", "@iceinvein/code-intelligence-mcp"],
      "env": {}
    }
  }
}
```
On first launch, the server downloads three GGUF models to `~/.code-intelligence/models/`: the embedding model, the description LLM, and the cross-encoder reranker (~3.2 GB combined). These are cached, so subsequent launches are instant. The server auto-detects your working directory and begins indexing in the background.
For teams running multiple AI sessions or working across multiple repos, standalone mode is the right shape:
```bash
npx @iceinvein/code-intelligence-mcp-standalone
```
Then point each client at `http://localhost:3333/mcp`. One server, shared models, multiple sessions, cross-repo search.
Out of the box, the engine indexes Rust, TypeScript/TSX, JavaScript, Python, Go, Java, C, and C++. File watching is enabled by default. Edit a file and the index updates automatically. Description regeneration follows: edits flag descriptions as stale via content hash; the next refresh pass regenerates the affected ones without rebuilding the entire description set.
What Are the Limitations?
Intellectual honesty: this isn’t a universal solution yet.
macOS only. The server relies on Apple Silicon’s Metal GPU for on-device inference. Linux and Windows support would require switching to CUDA or CPU-only inference, which is feasible but not yet implemented.
First-index takes time. A large codebase (thousands of files) may take a few minutes for the initial parse and embedding pass. Description generation runs as a background pass at ~0.32s per symbol, so a 3,000-symbol codebase finishes describing in roughly 15-20 minutes while you keep working. Subsequent incremental updates are fast (file watcher detects changes within seconds).
Model downloads are ~3.2 GB. Three models, one-time cost, but it’s still a chunky first-launch experience on slow connections.
It complements, not replaces, cloud AI tools. Code Intelligence MCP Server gives your agent better code understanding. It doesn’t generate code, write tests, or do the work that Copilot or Claude does. It’s the intelligence layer underneath, making those tools smarter by giving them structural context instead of raw text. For the broader picture of where MCP-backed retrieval fits among the four context surfaces, see Context Engineering in Practice.
Frequently Asked Questions
But don’t cloud-based tools already have great code search?
Cloud tools have gotten better at RAG over codebases, yes. But they’re fundamentally limited to text-level similarity. They can find files that mention your query terms. They can’t trace call hierarchies, navigate type graphs, or identify that renaming a function breaks three indirect consumers. Structure-aware search across a million-edge graph isn’t something you get from embeddings alone, whether cloud or local.
Why three models instead of one?
Different jobs need different shapes. The embedding model (a bi-encoder) is built for fast retrieval at scale: encode every document once, encode the query, compute similarity. The description LLM is a generative model that produces human-readable summaries to enrich BM25, bridging the gap between how you search and how code is named. The cross-encoder reranker takes the query and a candidate together and computes a single relevance score, capturing interaction details bi-encoders can’t. Combining bi-encoder retrieval with cross-encoder reranking is the standard high-precision retrieval pattern; what’s local here is that all three run on your laptop instead of behind an API.
What if my codebase uses a language that isn’t supported?
The server currently supports eight languages: Rust, TypeScript, JavaScript, Python, Go, Java, C, and C++. Adding a new language requires writing one Tree-Sitter extractor file. The parsing, indexing, and retrieval infrastructure handles the rest. Languages not yet supported still get file-level indexing; they just miss symbol-level structure.
How does this compare to IDE code navigation like “Go to Definition”?
IDE navigation gives you one hop at a time: click a function, jump to its definition. Code Intelligence MCP Server gives your AI agent the full graph. It can trace an entire call chain, find all implementations of a trait, identify every consumer of a type, and assess the impact of a change. All programmatically, all composable. It’s not replacing your IDE; it’s giving your AI agent the same structural understanding your IDE has, plus the ability to query it at scale.
Does the cross-encoder reranker add a lot of latency?
It’s the slowest single step in the search pipeline, but on top-20 candidates with Metal GPU offload it adds ~20-50 ms. End-to-end warm search lands in the 50-200 ms range, which is well under the threshold where AI agents start to feel sluggish. If you ever want to disable it, set `RERANKER_ENABLED=false` and you’ll fall back to RRF-only ranking.
Does this work in a monorepo?
Yes. Standalone mode is built for this. Index every package as its own repo (or treat the monorepo root as the indexed path), point your AI sessions at the same standalone server, and use `search_across_repos` to query the whole tree at once. `explore_cross_repo_dependencies` walks edges that cross package boundaries.
The Case for Local Code Intelligence
Code intelligence shouldn’t require sending your intellectual property to a cloud API. It shouldn’t depend on token budgets or network availability. And it shouldn’t treat the richly structured artifact that is source code as a bag of words.
Local code intelligence that parses code into symbols, builds relationship graphs, runs hybrid search with cross-encoder reranking, and applies structural ranking signals is what AI coding agents have been missing. The benchmark data proves it works: 9.93 out of 10 relevance across 15 queries, earned through 1,000+ rounds of systematic engineering.
One npx command. Three on-device models. Thirty-two tools. Zero code leaves your machine.
Your agent shouldn’t be flying blind. Give it a map.
If it was useful, pass it along.