
Learnings From Crawling Technical Documentation

This is the second post in a series about making technical documentation available to your AI agent or knowledge base, based on our work on Morsel, an AI-first knowledge base. Crawling technical documentation is really helpful for vendoring documentation into your software engineering projects - for example, so a local coding agent can consult the documentation of the software you use, or so you can process it further, as we do.

This post is about the small gotchas we ran into when crawling at scale, so others doing similar work don’t hit the same problems.

Our approach

Our approach is quite pragmatic. At a high level, we have a Python script that takes an entry page of technical documentation. We crawl that entry point, extract every link from it, save the link, the full content of the page, a crawled-at date, and a status field in an SQLite database. We then enqueue every extracted link to be crawled in the future. With that, we are able to resume the crawl later and to share the results easily as one self-contained file.
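Concretely, a minimal version of that storage layer might look like the sketch below. The table and column names are illustrative assumptions, not our exact schema; the key idea is that the database doubles as the crawl queue.

```python
import sqlite3

# Hypothetical schema - column names are illustrative, not the exact ones we use.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url        TEXT PRIMARY KEY,  -- normalized URL, also serves as queue key
    title      TEXT,
    content    TEXT,              -- full rendered page content
    crawled_at TEXT,              -- crawl timestamp, NULL if not yet visited
    status     TEXT NOT NULL DEFAULT 'not_visited'
);
"""

def open_db(path: str = "crawl.db") -> sqlite3.Connection:
    """Open (or create) the self-contained crawl database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Add a URL to the queue; ignore it if we have seen it before."""
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    conn.commit()
```

Because every URL is a primary key, enqueueing an already-seen link is a no-op, which is what makes sharing one self-contained file and resuming later straightforward.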

Do you need help with data science? I can help and am available on a freelance basis :).

Send me an Email ↗

Learnings

1. Scoping and restricting the size of the crawl

We restrict the scope of the crawling using two parameters. One is a --scope parameter, which is a URL prefix we stay within - by default, the origin of the entry page. This prevents us from following links that lead outside the documentation we are crawling.

We also added an --exclude parameter. Any URL starting with that prefix is skipped, even if it is in scope. We added this because a lot of documentation has a default language (most often English) and then hosts translations under sub-paths - for example, /de/ for German. This way we can exclude translated parts of the documentation.

2. Rendering the page fully before extraction

Before extracting links and content, you should let the whole page render and let JavaScript execute. We do this by waiting for networkidle using Playwright, which waits until there have been no network connections for at least 500 ms. As described in the earlier blog post, we also handle cookie and consent banners that would otherwise block the page content. We do this once at the entry URL, using an LLM to identify the correct dismiss button from the list of visible button labels; the resulting session state is then reused for the rest of the crawl.

3. Extracting links

You can extract links from the rendered content via href attributes. Make sure to filter out links that start with mailto: or javascript: - those are technically valid URLs, but you do not want to crawl them, as they do not point to content.
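That filtering step can be sketched as follows, assuming the href values were already pulled out of the rendered HTML (the helper name is illustrative; it also resolves relative links against the page URL):

```python
from urllib.parse import urljoin

# Schemes that are valid URLs but carry no crawlable content.
SKIP_SCHEMES = ("mailto:", "javascript:")

def collect_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve href values to absolute URLs, dropping non-content schemes."""
    links = []
    for href in hrefs:
        if href.startswith(SKIP_SCHEMES):
            continue
        links.append(urljoin(page_url, href))
    return links
```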

4. Handling content that is not HTML

The most relevant non-HTML content for documentation is PDF files, which we encountered quite often. We enqueue them the same way we enqueue HTML pages, but when we reach a PDF we download it and extract all the text using Python. That gives us a title and text content in the same form as for HTML pages, so we can discard the PDF file and save the text and title in the database.

import io
import httpx
import pypdf

def fetch_pdf_text(url: str) -> tuple[str, str, int]:
    """Download a PDF and extract its title and text."""
    response = httpx.get(url, follow_redirects=True, timeout=30)
    response.raise_for_status()
    reader = pypdf.PdfReader(io.BytesIO(response.content))
    # Prefer the embedded document title; fall back to the file name.
    if reader.metadata and reader.metadata.title:
        title = reader.metadata.title
    else:
        title = url.split("/")[-1]
    # extract_text() can return None for pages without a text layer.
    pages_text = [page.extract_text() or "" for page in reader.pages]
    text = "\n\n".join(pages_text).strip()
    return title, text, response.status_code

5. Normalize URLs

We normalize URLs to deduplicate them, since URLs with and without trailing slashes are often used interchangeably in docs. We use urlparse from Python’s urllib.parse to strip the fragment and remove trailing slashes:

from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    parsed = urlparse(url)
    # Drop the #fragment - it points within a page we crawl anyway.
    normalized = parsed._replace(fragment="")
    result = normalized.geturl()
    # Strip trailing slashes, but leave a bare origin like "https://x/" alone.
    if result.endswith("/") and len(parsed.path) > 1:
        result = result.rstrip("/")
    return result

6. Make the script resumable and idempotent

We save the status of every page we crawl: either rendered, error, or not yet visited. This lets us rerun the script any number of times. We ignore already-rendered pages, retry error pages, and continue visiting previously unvisited pages. The script can be safely rerun at any time.
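In SQLite terms, resuming then amounts to a couple of status queries. A rough sketch, assuming a pages table with url and status columns (names are illustrative, not our exact schema):

```python
import sqlite3

def pending_urls(conn: sqlite3.Connection, retry_errors: bool = True) -> list[str]:
    """Return the URLs still to be crawled; optionally retry failed ones."""
    statuses = ("not_visited", "error") if retry_errors else ("not_visited",)
    placeholders = ",".join("?" for _ in statuses)
    rows = conn.execute(
        f"SELECT url FROM pages WHERE status IN ({placeholders})", statuses
    ).fetchall()
    return [url for (url,) in rows]

def mark(conn: sqlite3.Connection, url: str, status: str) -> None:
    """Record the outcome ('rendered' or 'error') so reruns can skip the page."""
    conn.execute("UPDATE pages SET status = ? WHERE url = ?", (status, url))
    conn.commit()
```

Because every outcome is persisted immediately, killing the script mid-crawl loses at most the page currently being rendered.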

Full crawl loop

The full crawl loop looks roughly like this (with additional error handling, reporting, etc. removed):

initialize DB and queue with entry URL
retry any previously errored pages

for each URL in queue:
    if URL ends with .pdf:
        download PDF, extract title and text
    else:
        render page with Playwright (execute JS, dismiss cookie banners)
        extract title, HTML, and text
        find all links on the page
        for each link: if in-scope, not excluded, and not yet visited -> enqueue

    save result (title, text, status) to SQLite DB

report final counts: total pages, rendered, errors

At the end, every page has been visited, fully rendered, and saved in an SQLite database file. That file can then be used for further processing - handed to your local coding agent to search through the documentation, or exported to Markdown or any other format you want. For example, in an earlier blog post I wrote about how we enrich individual pages by downloading and describing the images they contain.


About Me

I am an indie maker & researcher with a doctorate in computer science, interested in (among others): Software engineering, open data, data science, startups and esports.

See /about for details.

Have feedback, comments? Email me: philip@heltweg.org.

I (very occasionally) send out a newsletter when publishing new articles like this.

