This is the third post in the series about making technical documentation available for use in your AI agent or knowledge base, based on our work on Morsel, a knowledge base that improves itself using AI agents. In the first post I described how we crawl documentation sites, clean the page content, and generate descriptions for images. In the second post I shared practical gotchas we ran into when crawling complete techdocs.
Here I want to describe what we do afterwards to structure the crawled documentation further and make it available in a more useful form - for example, for your local AI coding agent. At a high level, we classify pages, embed them with a local model, and then build a knowledge graph that combines explicit hyperlinks with semantic similarity edges.
Classifying Pages
A lot of pages in technical documentation are not actual content. Many are purely navigation hubs - index pages that just link to other, more concrete pages. Others only show legal terms or changelogs. If you are doing further work on the data, you probably want to ignore most of those.
We do a first rule-based pass to cover as much as possible quickly, locally, and cheaply, without involving LLMs. To do so, we check the URL against a set of patterns for each category, plus a simple content heuristic for navigation pages:
LEGAL_PATTERNS = ["/legal/", "/privacy", "/terms", "/eula", "/cookie"]

def classify_by_rules(url: str, title: str, content: str) -> str | None:
    if any(p in url.lower() for p in LEGAL_PATTERNS):
        return "legal"
    # ... same pattern for changelog, reference, etc.
    # Navigation: short content containing at least one markdown link
    # (a crude proxy for "mostly links")
    if len(content.split()) < 200 and "[" in content:
        return "navigation"
    return None  # needs LLM classification
Everything the rules cannot classify gets sent to a local LLM. We give it the URL, title, the first 200 words of the page, and a list of its headings, then ask it to classify the page by primary intent. The classes are: conceptual, tutorial, how-to, example (for the main content pages, based on the Diátaxis framework for documentation), as well as structural pages (navigation, reference, legal, changelog) and others (broken, misc).
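To make the second pass more concrete, here is a minimal sketch of how the prompt could be assembled and the reply validated. The class names come from the list above; the prompt wording, the `misc` fallback for unexpected replies, and the `ask_llm` callable (a stand-in for whatever local model you run) are assumptions for illustration, not the exact code we use.

```python
from typing import Callable

# Class names are taken from the post; everything else is illustrative.
CLASSES = [
    "conceptual", "tutorial", "how-to", "example",
    "navigation", "reference", "legal", "changelog",
    "broken", "misc",
]


def build_prompt(url: str, title: str, content: str, headings: list[str]) -> str:
    """Assemble the classification prompt from URL, title, excerpt, and headings."""
    excerpt = " ".join(content.split()[:200])  # first 200 words of the page
    heading_list = "\n".join(f"- {h}" for h in headings) or "- (none)"
    return (
        "Classify this documentation page by its primary intent.\n"
        f"Answer with exactly one of: {', '.join(CLASSES)}.\n\n"
        f"URL: {url}\nTitle: {title}\nHeadings:\n{heading_list}\n\n"
        f"Excerpt:\n{excerpt}\n"
    )


def parse_label(raw: str) -> str:
    """Map a model reply onto a known class; anything unexpected becomes 'misc'."""
    cleaned = raw.strip().lower().strip(".\"'` ")
    return cleaned if cleaned in CLASSES else "misc"


def classify_with_llm(
    url: str,
    title: str,
    content: str,
    headings: list[str],
    ask_llm: Callable[[str], str],  # hypothetical wrapper around a local model
) -> str:
    return parse_label(ask_llm(build_prompt(url, title, content, headings)))
```

Validating the reply against a fixed list keeps a chatty or malformed answer from leaking an unknown category into the database.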
With this two-pass approach, the rule-based step handles the easy cases cheaply, and the LLM only sees what it actually needs to. The result is that we can filter out legal, changelog, navigation, and reference pages when doing further work on the data, and focus only on actual content pages.
Embedding Pages
With pages classified, the next step is embedding them. We use a local sentence transformer model, which avoids API costs and speeds up the process. So far, the embedding quality has been good enough for this use case.
If a page exceeds the token limit, we split it at heading boundaries and average the resulting chunk embeddings:
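import re
import numpy as np

# `model` is the local sentence transformer model, loaded elsewhere.
def embed_page(content: str) -> list[float]:
    # Simplified: splits at level 1-3 headings whenever any are present;
    # the token-limit check is omitted here.
    chunks = re.split(r'(?m)^#{1,3} ', content)
    if len(chunks) == 1:
        return model.encode(content).tolist()
    embeddings = [model.encode(chunk) for chunk in chunks if chunk.strip()]
    # Average the chunk vectors, then re-normalize to unit length.
    avg = np.mean(embeddings, axis=0)
    return (avg / np.linalg.norm(avg)).tolist()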
We embed from the cleaned markdown rather than plain text, because headings, code blocks, and list structure have semantic meaning that helps the model understand page structure.
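One caveat that follows from embedding markdown with code blocks: a plain `^#{1,3} ` pattern also matches comment lines inside fenced code (for example `# install deps` in a shell snippet), which would split a page in the middle of an example. Below is a sketch of a splitter that tracks fences to avoid that. It assumes fences are opened and closed with three backticks or three tildes, and unlike a plain `re.split` it keeps the heading line in each chunk; treat it as one possible approach rather than our exact implementation.

```python
import re

HEADING = re.compile(r"^#{1,3} ")
# Built with string repetition so this snippet can itself live in a fenced block.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def split_at_headings(markdown: str) -> list[str]:
    """Split at level 1-3 headings, ignoring '#' lines inside fenced code."""
    chunks: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence  # entering or leaving a code block
        elif not in_fence and HEADING.match(line) and chunks[-1]:
            chunks.append([])  # a real heading starts a new chunk
        chunks[-1].append(line)
    # Drop chunks that contain only blank lines.
    return ["\n".join(c) for c in chunks if any(part.strip() for part in c)]
```

Each returned chunk can then be embedded separately and averaged as before.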
Building a Knowledge Graph
With the embeddings, we can build a graph that combines semantic similarity with the explicit hyperlinks in the documentation. We store two types of edges: link edges, which are explicit hyperlinks from page A to page B, stored as directed, unweighted edges; and semantic edges, which connect pages with high embedding similarity and are stored as two directed edges (one in each direction) weighted by their cosine similarity.
We only include actual content pages in the semantic graph; pages classified as navigation, legal, reference, etc. are excluded. Additionally, we set a configurable similarity threshold (currently 0.75) and only add edges between pages whose similarity exceeds it. Finally, we cap the number of neighbors per page (currently 20) so you don’t end up with a few massively connected hubs.
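The threshold and neighbor cap can be sketched in a few lines. This version uses only the standard library and compares every pair of pages, which is fine for illustration; for a large corpus a vectorized matrix product would be much faster. The function names are hypothetical, and the cap is applied to each page's outgoing edges, which is one possible reading of "neighbors per page": with this simplification a pair can end up with an edge in only one direction.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_edges(
    embeddings: dict[str, list[float]],  # content pages only, keyed by URL
    threshold: float = 0.75,
    max_neighbors: int = 20,
) -> list[tuple[str, str, float]]:
    """Return directed (source, target, weight) edges between similar pages."""
    edges: list[tuple[str, str, float]] = []
    for src, src_vec in embeddings.items():
        scored = [
            (cosine(src_vec, dst_vec), dst)
            for dst, dst_vec in embeddings.items()
            if dst != src
        ]
        # Keep neighbors above the threshold, strongest first, capped per page.
        strong = sorted((s for s in scored if s[0] > threshold), reverse=True)
        edges.extend((src, dst, sim) for sim, dst in strong[:max_neighbors])
    return edges
```

Filtering out navigation, legal, and reference pages before calling this keeps those pages out of the semantic graph entirely.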
All of this (page data, classifications, embeddings, and graph edges) ends up in the same SQLite database.
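For a sense of what such a database could look like, here is a minimal sketch with one table for pages and one for edges. The table names, column names, and the choice to store embeddings as JSON text are assumptions for illustration; the actual schema may differ.

```python
import json
import sqlite3

# Illustrative layout only, not the actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url       TEXT PRIMARY KEY,
    title     TEXT,
    content   TEXT,      -- cleaned markdown
    category  TEXT,      -- e.g. 'how-to', 'navigation', 'legal'
    embedding TEXT       -- JSON-encoded list of floats
);
CREATE TABLE IF NOT EXISTS edges (
    source TEXT NOT NULL REFERENCES pages(url),
    target TEXT NOT NULL REFERENCES pages(url),
    kind   TEXT NOT NULL CHECK (kind IN ('link', 'semantic')),
    weight REAL,         -- cosine similarity; NULL for link edges
    PRIMARY KEY (source, target, kind)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?, ?, ?)",
    (
        "https://example.com/docs/install",  # placeholder page
        "Install",
        "# Install\nSome cleaned markdown.",
        "how-to",
        json.dumps([0.1, 0.2]),
    ),
)
conn.commit()
```

Keeping everything in one file means the whole knowledge base can be copied or shipped as a single artifact.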
The Result
The whole flow: you crawl the technical documentation as described in the second post, extract the page content and describe images as in the first post, and then run these post-processing steps on the crawled data.
What you end up with is a completely self-contained, local SQLite database with the full documentation stored in a form your AI can use easily. Agents can query it with plain SQL, and models are quite good at that. They can use the classification to filter out noise and read only actual content pages, or to run more fine-grained queries like “Show me all how-to pages related to topic X”. And they can use the embeddings to find semantically similar pages, or navigate the knowledge graph using both the explicit hyperlinks written by the documentation authors and the implicit similarity edges we derived from the content.
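As a rough illustration of the kind of queries an agent might run, here is a sketch against a hypothetical two-table layout (`pages` and `edges`). The table names, column names, and sample rows are all made up for the example, and the topic match is a simple title search rather than an embedding lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real database layout may differ.
conn.executescript("""
CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, category TEXT);
CREATE TABLE edges (source TEXT, target TEXT, kind TEXT, weight REAL);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("/docs/auth", "Authentication overview", "conceptual"),
    ("/docs/auth/tokens", "How to rotate API tokens", "how-to"),
    ("/docs/legal/terms", "Terms of service", "legal"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", [
    ("/docs/auth", "/docs/auth/tokens", "link", None),
    ("/docs/auth", "/docs/auth/tokens", "semantic", 0.82),
])

# "Show me all how-to pages related to topic X", here via a title match.
how_tos = conn.execute(
    "SELECT url FROM pages WHERE category = 'how-to' AND title LIKE ?",
    ("%token%",),
).fetchall()

# Content pages reachable from a start page over either edge type.
# SQLite sorts NULL weights last in descending order, so semantic
# edges come first and unweighted link edges follow.
neighbors = conn.execute(
    """
    SELECT p.url, e.kind, e.weight
    FROM edges e JOIN pages p ON p.url = e.target
    WHERE e.source = ?
      AND p.category IN ('conceptual', 'tutorial', 'how-to', 'example')
    ORDER BY e.weight DESC
    """,
    ("/docs/auth",),
).fetchall()
```

Because the category is just a column, filtering out legal or navigation pages is a plain `WHERE` clause rather than a separate processing step.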
If you have built something similar or think this is interesting, I would love to hear about it. How are you using local documentation with your coding agents? What would you do differently?