Coding agents can reason over code, but the retrieval tools in the loop need a deterministic map of the repository before that reasoning becomes reliable.

The Agent Looked Competent Until The Repository Got Real
Picture a team asking a coding agent to make a small change to a shared data contract.
The task sounds ordinary. Add one validation rule, update the code path that consumes the field, make sure the tests still cover the behavior, and keep the downstream tools stable. The repository is not exotic. It has a Python package, a Rust worker, a small CLI, a Kotlin client, a few notebooks used by the ML team, some generated code, a coverage report, and enough Git history that certain files have clearly learned to change together.
The agent starts well. It finds the data type. It finds one consumer. It reads a test that looks related. It edits the validation rule and adds a plausible assertion. The patch is coherent in the narrow slice it saw.
Then review finds the missing context.
Another command path consumes the same contract. A generated Kotlin serializer depends on the field shape. A fixture provides the validation dependency indirectly. The failing production path lives in a neighboring parametrized test, not in the test the agent happened to read. A notebook assumes the old column shape. The model is mentioned in documentation that needs a small wording change. Git history shows that another file usually changes with this one because the producer and consumer drifted once before.
These are repository facts, available to a human reviewer who knows where to look. The agent had to rediscover them through search, inference, and a short-lived reading path through the repository.
That is the gap that led me to build repo-k-graph, or rkg.
rkg is a Rust command-line tool and MCP server for building deterministic repository knowledge graphs. It indexes source code, tests, documentation, framework metadata, workspace manifests, coverage reports, and Git history into a local SQLite store, then exposes that structure through CLI commands and Model Context Protocol tools. The current release, 1.0.12, supports Python, Rust, F#, Mojo, Kotlin, Swift, Android resource linkage, full-text search, coverage import, context packing, and benchmarking.
Those features matter, but this article is about the design claim behind them: deterministic retrieval first, semantic reasoning second.
I Wanted To Improve Agentic Workflow
My starting point was practical. I wanted multi-step agent runs in real repositories to become more predictable and easier to review, the same way I would improve any other engineering workflow that kept failing in the same places.
The useful decomposition, for me, is three parts: an LLM that reasons, tool-calling that touches the repository, and a harness around the loop that observes state, chooses actions, applies policy, validates output, and hands work off across sessions. The loop itself is fairly legible now. The harder questions are what it can see, what it is allowed to call, and how much of the contract survives the next run.
I started with the harness layer. Explicit coding preferences, repo-resident instructions, reusable skills, review norms, and durable artifacts made runs more consistent. The agent stopped re-deriving my taste from scratch every session. That work paid off quickly. The problem space is the one I wrote about in From Tool-Calling Loops to Repository Contracts; this essay picks up where that left off.
Once the harness carried more of the contract, the bottleneck moved to what the loop actually calls. For repository tasks, the dominant tool cost is retrieval: finding symbols, tests, documentation, impact, and co-change history without rebuilding structure from grep on every run. Agents tend to treat their tools as given. If the environment gives them line-oriented search and file reads, they build their understanding out of line-oriented search and file reads. If the environment gives them symbol lookup, callers, tests, docs, impact analysis, coverage, and token-budgeted context packs, they can work at a higher level.
With frontier models such as GPT-5.5 and Claude Opus 4.8, my experience is that repository retrieval often succeeds in the end. If I give the agent enough time, enough tool calls, and enough correction, it usually finds the relevant file, notices the missing test, or recovers the symbol it missed on the first pass. The frustrating part is often the recovery path: rg, read a file, summarize, search again, open another file, realize the first match was the wrong layer, search with a different term, and slowly reconstruct structure that the repository already contained.
That loop consumes tokens and reviewer attention. A human has to separate the agent’s real uncertainty from the noise created by low-level exploration. Better retrieval is one of the places where engineers can make agent use less wasteful and more inspectable without waiting for the next model release. AI can absolutely help build better retrieval tools; rkg itself was built with agent assistance. During a research or coding task, though, an agent usually inherits the repository substrate already exposed by the surrounding environment. rkg is my attempt at that retrieval tool layer: the fact substrate the loop should inherit.
Repository Context Is More Than Nearby Text
For coding agents, retrieval is often the tool layer the loop invokes most.
Most coding-agent workflows already have some way to retrieve context. The agent can use grep or ripgrep. It can ask for filenames. It can run tests. It can inspect call sites by text search. Some systems add embeddings or semantic chunk retrieval. IDE integrations can expose language-server information. These are all useful, with different blind spots.
I still reach for rg constantly as a human. It is fast, honest, and one of the best tools we have for codebase navigation. The problem is that line-oriented search is too low-level as the primary interface for an agent’s repository context. It returns matches. The model then has to spend tokens deciding whether those matches are definitions, callers, tests, generated artifacts, stale docs, framework registrations, or unrelated strings.
Embeddings can surface conceptually related files, while structural questions still need a different kind of evidence: which function calls another, which CLI path consumes a type, which package depends on a module, which test fixture configures the path under review. LSP-style views can be precise inside one language, while many real repositories are polyglot and full of framework conventions, generated symbols, documentation, resource files, notebooks, coverage reports, and Git behavior that sit outside a single editor abstraction.
A repository is a set of relationships as well as a pile of text:
- this symbol is defined in this file
- this command or handler reaches this code path
- this test calls this implementation path
- this type appears in this model and that response
- this Android layout id is referenced from this Kotlin code
- this unsafe Rust block sits behind this wrapper
- this file often changes with that one
- this documentation block names this qualified symbol
When a human engineer navigates a codebase, they reconstruct those relationships constantly. They do it with search, memory, compiler feedback, tests, local conventions, and experience. A coding agent tries to do the same thing through tool calls and context windows.
That works better when the relationships are already materialized.
The key shift is to make repository context a reusable substrate. Some context should be computed once, stored locally, and queried directly. The agent can still reason, summarize, and choose. The raw map of the repository should be available before the model happens to read the right five files in the right order.
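A minimal sketch of what "materialized" can mean, assuming a single hypothetical edges table in an in-memory SQLite database. The table layout, edge kinds, and symbol names are illustrative and are not rkg's actual schema.

```python
import sqlite3

# Hypothetical, simplified layout: one row per relationship, with a typed kind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (src TEXT NOT NULL, kind TEXT NOT NULL, dst TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("cli.export_cmd", "calls", "contract.validate_record"),
        ("tests.test_contract.test_rejects_empty", "tests", "contract.validate_record"),
        ("docs/contract.md", "mentions", "contract.validate_record"),
        ("worker/src/consume.rs", "co_changes_with", "contract.validate_record"),
    ],
)


def relationships_to(symbol: str) -> list[tuple[str, str]]:
    """Return (kind, source) pairs pointing at `symbol`, in a stable order."""
    rows = conn.execute(
        "SELECT kind, src FROM edges WHERE dst = ? ORDER BY kind, src", (symbol,)
    )
    return [(kind, src) for kind, src in rows]


for kind, src in relationships_to("contract.validate_record"):
    print(f"{kind:16} {src}")
```

The point of the sketch is the shape of the question: one indexed lookup returns callers, tests, docs, and co-change partners together, instead of four separate text searches.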
Where rkg Fits Among Existing Tools
Several projects are trying to make repositories more usable for AI systems. The surrounding tool space is moving in a similar direction, which I take as evidence that the pain is real.
Repomix is useful when the goal is to package a repository into an AI-friendly artifact. It turns a codebase into a compact representation that can be handed to a model, with filtering and token-awareness around the packing step. That is a good fit for workflows where the main problem is “give the model a well-shaped snapshot.” rkg focuses on persistent, queryable facts. It lets a human or agent ask narrower questions about symbols, tests, docs, commands, routes, coverage, Git history, and impact.
Serena is closer in spirit because it exposes code intelligence through MCP and leans on language-server-style symbolic operations. That direction makes sense to me. Agents need interfaces above raw file search. rkg treats the repository as a local fact graph with SQLite persistence and explicit query surfaces for workspace metadata, framework metadata, co-change, coverage, context packing, and benchmarking. LSP-style symbolic retrieval is one important layer; repository intelligence also has to account for the non-LSP facts that repeatedly matter in real tasks.
Aider is another important reference point because its repo-map idea is agent-native and practical. It tries to keep the model oriented inside the repository without dumping everything into context. I see rkg as complementary to that style. A repo map gives an agent a compact structural overview inside a particular workflow. rkg tries to make repository structure available as a standalone service that multiple clients can query through CLI or MCP.
There is also a graph-native research thread. Codebase-Memory explicitly targets the repeated file-reading and grep-searching pattern, building Tree-sitter-based knowledge graphs for MCP-based code exploration. Repository Intelligence Graph / SPADE frames repository structure as a deterministic architectural map for LLM code assistants, with particular attention to build and test structure. Those projects differ in implementation and emphasis, and they point at the same broad lesson: stronger models make the missing repository map more visible.
The charitable comparison is this. Repomix helps package context. Serena and Aider improve the interaction between agents and code structure. Graph-native systems push toward persistent repository maps. rkg sits in that family, with a specific bet on local SQLite facts, multi-language ecosystem extraction, provenance-first context packing, and MCP tools that can replace repeated low-level search loops.
Deterministic Retrieval First, Semantic Reasoning Second
The design principle behind rkg is simple: keep deterministic repository facts separate from agent reasoning. In loop terms, the tools return deterministic facts and the LLM does the reasoning.
That separation matters because the two jobs have different failure modes. Static extraction can be incomplete, especially in dynamic languages and framework-heavy code. When it records a source span, a symbol name, an import edge, a test record, or a Git co-change relationship, the result has provenance. It can be inspected, tested, regenerated, and compared across runs.
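One way to picture "compared across runs" is a fact record that carries its own source span, plus a fingerprint over the sorted fact set. This is a sketch under stated assumptions: the Fact fields and the fingerprint helper are hypothetical and are not taken from rkg.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True, order=True)
class Fact:
    kind: str        # e.g. "symbol", "import", "test"
    subject: str     # qualified name the fact is about
    file: str        # provenance: file the fact was extracted from
    start_line: int  # provenance: first line of the source span
    end_line: int    # provenance: last line of the source span


def fingerprint(facts: list[Fact]) -> str:
    """Hash the sorted fact set so two index runs can be compared exactly."""
    payload = json.dumps([asdict(f) for f in sorted(facts)], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = [
    Fact("symbol", "contract.validate_record", "contract.py", 10, 24),
    Fact("import", "contract.validate_record", "cli.py", 3, 3),
]
run_b = list(reversed(run_a))  # same facts, different discovery order

same = fingerprint(run_a) == fingerprint(run_b)
print("runs identical:", same)
```

Sorting before hashing is what makes the comparison independent of the order in which files happened to be visited.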
Agent reasoning is different. It is flexible, useful, and often surprisingly good. It is also probabilistic. It can connect facts, explain tradeoffs, and propose changes. The fact layer should already exist before that reasoning begins.
This distinction changes the shape of the workflow. Instead of asking an agent to “look around the repo and figure out what matters,” a team can give it tools that answer narrower questions, such as rkg context <symbol> and rkg impact <symbol> --depth 2.
These commands make the starting point less arbitrary. Human judgment still decides whether the graph is missing a dynamic edge or whether a context pack over-included a noisy neighbor. That is a better failure mode than forcing the agent to rebuild the repository map out of loosely related text chunks on every run.
In practice, this reduces avoidable token waste. Agents spend a lot of their budget rediscovering structure: opening files to check definitions, scanning tests to find likely coverage, searching for a route, looking for documentation, asking whether a name is a type or a function, and repeating similar work after a session reset. A local fact layer lets those questions become cheap.
What rkg Builds
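At a high level, rkg is a pipeline from repository files to queryable facts: ingestion, language adapters, SQLite storage, and a shared query layer behind the CLI and the MCP server.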
The ingestion layer finds the repository root, honors .gitignore and .rkgignore, computes file hashes, detects language, and supports incremental indexing. That part is deliberately boring. If the file set changes nondeterministically, every higher-level query becomes suspect.
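The hashing and incremental parts of that step can be sketched as follows. This is an illustration only: hash_files and plan_reindex are hypothetical names, ignore-file handling is left out for brevity, and nothing here reflects how rkg implements ingestion.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_files(root: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 of its bytes, in sorted order."""
    files = [p for p in root.rglob("*") if p.is_file()]
    hashes = {}
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        hashes[path.relative_to(root).as_posix()] = hashlib.sha256(
            path.read_bytes()
        ).hexdigest()
    return hashes


def plan_reindex(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Decide which files need re-extraction and which records must be dropped."""
    return {
        "changed": sorted(p for p in current if previous.get(p) != current[p]),
        "removed": sorted(p for p in previous if p not in current),
    }


# Usage on a throwaway directory: edit one file, delete one, add one.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("x = 1\n")
    (root / "b.py").write_text("y = 2\n")
    before = hash_files(root)
    (root / "a.py").write_text("x = 3\n")
    (root / "b.py").unlink()
    (root / "c.py").write_text("z = 0\n")
    after = hash_files(root)

plan = plan_reindex(before, after)
print(plan)
```

Content hashes rather than timestamps keep the plan reproducible: the same bytes always produce the same decision, regardless of checkout time or filesystem.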
Language adapters extract structured facts. The project began with Python because Python is widely used with coding agents and has enough dynamic behavior to make context retrieval painful. It then expanded to Rust, F#, Mojo, Kotlin, and Swift. The adapters use Tree-sitter where practical and keep language-specific parsing separate from persistence and CLI formatting. The detailed support matrix belongs in the user manual; the essay-level point is that each ecosystem hides different facts from plain text search.
The database stores the facts in SQLite. Core tables cover repositories, files, symbols, edges, docs, tests, and index runs. Additional tables capture Git history, routes, Pydantic models, workspace dependencies, concurrency topology, Rust safety metadata, coverage, Android components, Android resources, generated symbols, and FTS5 indexes for ranked symbol and documentation search.
The query layer sits above the database so the CLI and MCP server share behavior. A command-line user and an agent tool call should be asking the same underlying question.
That last point matters more than it sounds. A common failure in agent tooling is that human-facing and model-facing interfaces drift apart. The CLI grows one set of behaviors. The agent tool grows another. Documentation describes a third. In rkg, the CLI and MCP server are both surfaces over the same repository intelligence layer.
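A small sketch of that arrangement, assuming a hypothetical in-memory edge list in place of the real database. The function names are made up for illustration; the only claim is structural: both surfaces format the result of one shared query function.

```python
import json

# Hypothetical stand-in for indexed call edges: (caller, callee).
CALL_EDGES = [
    ("cli.export_cmd", "contract.validate_record"),
    ("api.ingest", "contract.validate_record"),
]


def callers_of(symbol: str) -> list[str]:
    """Shared query: the only place that decides what 'callers' means."""
    return sorted(src for src, dst in CALL_EDGES if dst == symbol)


def render_cli(symbol: str) -> str:
    """Human-facing surface: plain text lines."""
    return "\n".join(f"caller: {name}" for name in callers_of(symbol))


def render_tool_result(symbol: str) -> str:
    """Agent-facing surface: JSON over the same query result."""
    return json.dumps({"symbol": symbol, "callers": callers_of(symbol)})


print(render_cli("contract.validate_record"))
print(render_tool_result("contract.validate_record"))
```

Because neither renderer touches the edge data directly, a change to the query's semantics shows up in both outputs at once.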
The Useful Graph Extends Past Calls
Call graphs are useful. Agent work needs a broader graph.
Suppose the target is a validation function inside a shared library. A direct call graph tells the agent who calls it and what it calls. That is a start. The agent also needs to know which tests exercise it, which CLI command or service path reaches it, which type annotation references it, which docs mention it, whether a notebook uses the same data shape, whether coverage is thin, and whether Git history suggests a downstream file usually changes at the same time.
That is the better way to read the language adapter work. The important question is which facts each ecosystem makes easy to miss.
Tests and coverage are one category. If the agent sees a target symbol and one obvious test, the patch can look responsible while still missing the fixture that configures the behavior or the coverage report showing that a branch is untested. The useful fact is which tests are linked to the symbol, which dependencies those tests carry, and where coverage is thin.
Workspace and package metadata are another category. A change in a shared crate, package, or project rarely lives inside one file. Manifests, lockfiles, project references, generated-source directories, and dependency edges often explain why a small local edit affects a downstream build or client.
Generated code and resource links form a third category. Source search can find the hand-written class and still miss the generated serializer, resource identifier, notebook cell, or UI/resource reference that depends on the same shape. The relevant fact may live in a manifest, generated-symbol table, XML file, markdown heading, or notebook cell rather than in a neighboring source file.
Concurrency, safety, and framework boundaries form a fourth category. A Rust async worker, Kotlin coroutine flow, Swift task, or FFI boundary carries risk that a text match will not explain. The graph does not need to model every runtime behavior perfectly to help. It needs to surface the static facts that change what a reviewer or agent should inspect next.
This is why multi-language support matters as design work rather than a checkbox. Each adapter is a claim about what repository facts matter for that ecosystem. A generic text chunker can treat all of these files as strings. An agent doing engineering work needs the extra shape.
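Of the facts discussed in this section, co-change is the simplest to sketch. The example below assumes the commit history has already been reduced to a list of touched files per commit (for instance, from Git log output); the helper name and the sample history are hypothetical, and rkg's actual co-change computation may weight or filter commits differently.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits: list[list[str]]) -> Counter:
    """Count how often each unordered pair of files appears in the same commit."""
    counts: Counter = Counter()
    for files in commits:
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


# Hypothetical history: each inner list is the set of files touched by one commit.
history = [
    ["contract.py", "worker/src/consume.rs", "tests/test_contract.py"],
    ["contract.py", "worker/src/consume.rs"],
    ["README.md"],
    ["contract.py", "docs/contract.md"],
]

counts = co_change_counts(history)

# Rank the partners of one file: most frequent first, then by name for stability.
partners = sorted(
    ((n, pair) for pair, n in counts.items() if "contract.py" in pair),
    key=lambda item: (-item[0], item[1]),
)
for n, pair in partners:
    print(n, pair)
```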
Context Packing Is Where The Graph Becomes Useful
A large graph still needs a usable interface. Dumping every neighboring symbol into the context window simply recreates the original problem with more ceremony.
The practical interface is context packing. A context pack is the retrieval tool’s output contract: the structured neighborhood the loop receives after a targeted query.
rkg context <symbol> builds a deterministic, token-budgeted context pack around a target. It can include the target definition, direct relationships, relevant tests, documentation, imports, type references, and impact information. The output can be Markdown for a human-readable agent prompt or JSON for a more structured workflow.
The budget matters. Agents need a useful subset ordered by provenance and relevance to the task. A context pack should help the agent start in the right place instead of burying it under every edge the indexer knows.
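A budgeted pack can be sketched as a greedy selection over prioritized candidates. Everything here is an assumption made for illustration: the Item type, the priority scheme, and the four-characters-per-token estimate are not rkg's, and a real packer would likely use a proper tokenizer and richer ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    priority: int  # lower is more important (0 = the target itself)
    text: str


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def pack(items: list[Item], budget: int) -> list[Item]:
    """Greedily keep the highest-priority items that still fit in the budget."""
    chosen, used = [], 0
    # Sort by (priority, name) so ties never depend on input order.
    for item in sorted(items, key=lambda i: (i.priority, i.name)):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


candidates = [
    Item("docs/contract.md#validation", 2, "d" * 80),
    Item("contract.validate_record", 0, "t" * 120),
    Item("tests.test_contract.test_rejects_empty", 1, "x" * 100),
    Item("cli.export_cmd", 1, "c" * 400),
]
packed = [item.name for item in pack(candidates, budget=80)]
print(packed)
```

In this example the large caller body is skipped because it would exhaust the budget, while the smaller test and documentation entries still fit after the target.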
Impact analysis is the companion query. rkg impact <symbol> --depth 2 traces upstream and downstream relationships, affected tests, and affected documentation. For a human reviewer, that helps answer “what should I inspect before trusting this patch?” For an agent, it helps decide which files to read or edit next.
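A depth-limited traversal of this kind can be sketched as a breadth-first walk in both directions. The edge list is hypothetical, and the sketch assumes "upstream" means symbols that depend on the target and "downstream" means symbols the target depends on; rkg's own definitions and edge types may differ.

```python
from collections import deque

# Hypothetical dependency edges: (source, target) means "source depends on target".
EDGES = [
    ("cli.export_cmd", "contract.validate_record"),
    ("api.ingest", "cli.export_cmd"),
    ("jobs.nightly", "api.ingest"),
    ("contract.validate_record", "contract.parse_field"),
]


def impact(symbol: str, depth: int) -> dict[str, list[str]]:
    """Breadth-first walk in both directions, stopping after `depth` hops."""

    def walk(neighbors):
        seen, queue = {symbol}, deque([(symbol, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist == depth:
                continue  # do not expand beyond the requested depth
            for nxt in sorted(neighbors(node)):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return sorted(seen - {symbol})

    upstream = walk(lambda n: [s for s, t in EDGES if t == n])
    downstream = walk(lambda n: [t for s, t in EDGES if s == n])
    return {"upstream": upstream, "downstream": downstream}


result = impact("contract.validate_record", depth=2)
print(result)
```

With a depth of 2, the nightly job three hops away is left out, which is the trade the depth parameter makes between completeness and noise.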
This is also where deterministic and semantic retrieval can work together. The graph can assemble a factual neighborhood. The model can then reason over that neighborhood, explain a change plan, decide which tests to run, or ask for more context. The model is still doing the work that models are good at. It is doing that work over a better substrate, with fewer recovery loops and fewer token-expensive searches.
MCP Turns The Fact Layer Into Agent Infrastructure
The CLI is useful for humans, and rkg is also meant to be consumed by coding agents. That is why it includes a Model Context Protocol stdio server.
The server exposes deterministic tools such as:
- find_symbol
- get_symbol
- get_callers
- get_callees
- get_docs
- get_tests
- get_impact_analysis
- get_context_pack
This interface changes the role of the project. Beyond command-line exploration, rkg becomes a local repository intelligence service that an agent can query during a run.
The local stdio design is intentional. Repository facts can be sensitive. The database lives in the working tree under ./.rkg/rkg.db. The agent can ask for the facts through a local process without requiring a hosted account or sending the repository graph to a remote indexing service. The project also includes transcript-style smoke coverage for MCP flows so stdout stays protocol-only and diagnostics stay on stderr. That kind of boundary is plain engineering work, and it matters when the tool is meant to sit inside other agent workflows.
The broader point is that MCP is how retrieval tools plug into the same loop every client runs. The model observes a task, calls a tool, receives structured repository facts, updates its plan, and continues. If those tool results are deterministic, the loop has a firmer footing. A tool call like get_context_pack can replace several rounds of rg, file reads, partial summaries, and another search term guessed from the partial summary.
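For readers who have not looked at the wire format, a tool call in MCP is a JSON-RPC 2.0 request with the method tools/call, sent as a single line over the stdio transport. The sketch below only builds such a request. The tool name comes from the list above, but the argument names (symbol, max_tokens) are hypothetical; the real input schema for get_context_pack may differ.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one JSON-RPC 2.0 `tools/call` request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the message stays on one line.
    return json.dumps(message, separators=(",", ":"))


# Argument names are assumptions made for this example only.
line = tools_call_request(
    1,
    "get_context_pack",
    {"symbol": "contract.validate_record", "max_tokens": 2000},
)
print(line)
```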
The Architecture Is Local Because Review Is Local
I wanted rkg to be local-first for a practical reason: code review and debugging happen against local artifacts.
The database is SQLite, regenerated from the repository. The CLI prints source spans, qualified names, and provenance. The schema is documented. The language adapter guide explains how to add a new rkg-lang-* crate while keeping parsing, database writes, and CLI formatting in separate layers.
That architecture makes the project less magical. It also makes it easier to audit.
If a query result looks wrong, an engineer can inspect the indexed file, the relevant parser test, the table shape, and the query behavior. If an adapter misses a framework convention, the fix can be a focused parser improvement and fixture. If a stale record survives reindexing, the cascade behavior and file-scoped deletes are visible. Repository infrastructure needs that kind of boring maintainability.
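The stale-record case can be illustrated with SQLite foreign keys. This is a minimal sketch with a hypothetical two-table layout, not rkg's schema: deleting a file row cascades to every record derived from that file, so reindexing cannot leave orphans behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when enabled
conn.executescript(
    """
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
    CREATE TABLE symbols (
        id INTEGER PRIMARY KEY,
        file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
        name TEXT NOT NULL
    );
    """
)


def reindex_file(path: str, symbol_names: list[str]) -> None:
    """Replace every record derived from `path` in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM files WHERE path = ?", (path,))
        file_id = conn.execute(
            "INSERT INTO files (path) VALUES (?)", (path,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO symbols (file_id, name) VALUES (?, ?)",
            [(file_id, name) for name in symbol_names],
        )


reindex_file("contract.py", ["validate_record", "legacy_check"])
reindex_file("contract.py", ["validate_record"])  # legacy_check was removed from the file

names = [row[0] for row in conn.execute("SELECT name FROM symbols ORDER BY name")]
print(names)
```

The delete is scoped to one file, so the rest of the index is untouched while that file's facts are rebuilt.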
The implementation is split along those boundaries:
- rkg-core holds shared domain types.
- rkg-indexer handles repository discovery, file metadata, Git history, coverage parsing, workspace metadata, and resource parsing.
- rkg-lang-* crates own language-specific extraction.
- rkg-db owns SQLite persistence.
- rkg-query owns shared query behavior.
- rkg-cli and rkg-mcp expose the behavior to humans and agents.
That split is partly ordinary Rust workspace hygiene. It is also a way to keep the fact layer honest. Parsers stay away from terminal formatting. The MCP server calls shared impact analysis. Core types stay independent from SQLite and CLI code. Once the project grows across languages and ecosystems, those boundaries become load-bearing.
Benchmarks Are A Guardrail Against Vibes
Agent tooling is easy to describe in ways that sound plausible. It is harder to measure.
Release 1.0.12 added rkg bench, a benchmark harness built around network-isolated fixture tasks across Python, Rust, F#, Mojo, Kotlin, and Swift. It includes a static grep-based baseline simulator and scoring for file precision, symbol precision, recall, F1, latency, token reduction, and task success.
The benchmark is a guardrail rather than the final word on repository intelligence. It scores retrieval tool quality directly: precision, recall, tokens, and latency for the part of the workflow that often hides inside a confident agent summary. If the claim is that deterministic repository facts improve context selection, then the project needs repeatable tasks where context selection can be scored. A satisfying demo is too easy to confuse with a durable improvement.
This matters because retrieval tools often fail softly. A bad context pack may still contain the target symbol and one plausible test. A noisy search result may still include enough relevant text for the model to produce a coherent patch. The review failure appears later, when someone notices the missing fixture, the downstream file that usually co-changes, or the coverage gap that should have changed the test plan.
That failure mode is easy to misdiagnose. It can look like the model made a bad engineering decision, when the earlier retrieval step gave it an incomplete picture that looked complete enough. This is why binary task success is too blunt on its own. A patch can pass the first test command and still be under-contextualized. A context pack can feel useful while omitting the one edge that would have changed the implementation.
The useful measurements are closer to retrieval quality: did the context include the right files, the right symbols, and the relevant tests? How much irrelevant material did it include? How much token budget did it spend to get there? How long did indexing and retrieval take? Precision, recall, F1, latency, and token reduction measure the retrieval step that usually precedes the agent summary.
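These measurements use the standard set-based definitions. The sketch below is not the rkg bench scorer: the function names and sample data are hypothetical, and token reduction is assumed here to mean the fraction of baseline tokens saved.

```python
def retrieval_scores(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one retrieval result against ground truth."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def token_reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """Fraction of the baseline's tokens that the candidate avoided spending."""
    return 1 - candidate_tokens / baseline_tokens


# Hypothetical task: four files matter; the context included three of them plus one extra.
relevant = {"contract.py", "cli.py", "tests/test_contract.py", "worker/src/consume.rs"}
retrieved = {"contract.py", "cli.py", "tests/test_contract.py", "README.md"}

scores = retrieval_scores(retrieved, relevant)
saved = token_reduction(baseline_tokens=12000, candidate_tokens=3000)
print(scores, saved)
```

The missing worker file lowers recall even though a patch built from this context might still pass its first test, which is the soft failure the surrounding paragraphs describe.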
Where Static Facts Need Boundaries
Static repository intelligence has real limits.
rkg is strongest where repositories leave durable artifacts behind: declared symbols, file paths, imports, type references, documentation, test definitions, package manifests, coverage reports, Git history, co-change patterns, and common framework or resource conventions. These facts cover a large share of the context an agent repeatedly spends tokens rediscovering.
The ceiling appears where the decisive behavior exists mostly at runtime. Dynamic dispatch, reflection, dependency injection, build flags, macros, notebook state, framework registration, generated code unavailable on disk, and cross-service contracts can all exceed what a local static index can know with certainty. Some edges will remain unresolved. Some will have lower confidence. Some facts will require language-specific heuristics. A stale index can mislead an agent after repository changes that have not been indexed yet.
Those limits are a reason to represent uncertainty plainly.
A graph that says “this edge is unresolved” is more useful than an agent confidently inventing the missing target. A context pack with source spans is easier to review than a prose summary with no provenance. Static framework tables, workspace metadata, resource links, and dependency facts give the agent a better starting point than another round of nearby file reads, even when runtime behavior still requires review.
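One way to make "this edge is unresolved" a first-class value is to allow an edge with no target while keeping its provenance. The Edge type below is a hypothetical illustration, not rkg's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    kind: str
    target: Optional[str]  # None when static analysis could not resolve it
    file: str
    line: int

    @property
    def status(self) -> str:
        return "resolved" if self.target is not None else "unresolved"


def describe(edge: Edge) -> str:
    """Render an edge with its provenance, marking unresolved targets explicitly."""
    target = edge.target if edge.target is not None else "<unresolved>"
    return f"{edge.source} -{edge.kind}-> {target} [{edge.status}] ({edge.file}:{edge.line})"


edges = [
    Edge("cli.export_cmd", "calls", "contract.validate_record", "cli.py", 41),
    # Hypothetical dynamic dispatch: the callee is only known at runtime.
    Edge("plugins.run_hook", "calls", None, "plugins.py", 17),
]

needs_review = [e for e in edges if e.status == "unresolved"]
for edge in edges:
    print(describe(edge))
```

The unresolved edge still points a reviewer at plugins.py line 17, which is more actionable than either silence or a guessed target.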
This is the right division of labor. The fact layer should be deterministic where it can be deterministic, explicit where it is partial, and inspectable when it is wrong. The agent should reason over those facts and ask for more evidence when the static view is insufficient. For messy repositories, the goal is not a perfect graph. The goal is to move uncertainty out of fluent model prose and into artifacts that humans can inspect.
The Repository Becomes An Interface For Agents
The more I use coding agents, the more I think about agentic workflow as a system rather than a model choice. Strong models can write a lot of code. The harder question is whether the repository presents itself in a form the loop can use without repeated guesswork.
That means repositories need more than source files and tests. They need durable instructions, stable workflows, executable checks, and queryable repository facts. The codebase itself becomes an interface. Humans read it through editors, tests, docs, and review. Agents read it through tools, context packs, and protocols.
rkg is my attempt to make the retrieval part of that interface concrete. It gives the agent a factual map before the reasoning starts. It turns symbols, relationships, tests, documentation, commands, routes, coverage, Git history, and ecosystem metadata into local queries. It exposes those queries through a CLI for humans and MCP tools for agents.
Return to the opening contract change. With a fact layer in place, the agent can locate the target symbol, inspect callers and callees, fetch linked tests and docs, examine impact, include the command or model context, and notice co-changing files before it starts editing. The reviewer can ask the same questions from the command line. If the graph missed something, that miss becomes a concrete adapter or indexing problem rather than a vague complaint that the model “should have known.”
That is the practical arc I had in mind when improving agentic workflow. The harness carries taste and review contract. The retrieval tools expose what the repository already knows. The model reasons over that substrate. Deterministic repository context makes agent work less dependent on a lucky reading path through the codebase. It gives both the model and the reviewer a shared surface for asking what the repository already knows.
The model can still generate the next patch. The repository should provide the map.