Last week I was running autoloop optimization experiments on my vault recommender, trying to improve recall. I've written before about how a shorter embedding body improves semantic search recall. One of the variables I tested was which text gets sent to the embedding model. When I included wiki-links as a Related: field in the embedding input, recall@10 jumped from 0.91 to 0.97, the single largest improvement across fifteen experiments.
The reason is simpler than I first thought. Wiki-links are human-curated topic descriptors. When I write [[Machine Learning]] in a note, I'm adding a concise label that says "this note relates to Machine Learning." The embedding model (all-MiniLM-L6-v2 in my case) doesn't understand link structure or graph topology. It sees tokens. But those tokens happen to be high-signal, human-chosen words that compress what the note is about. Notes that link to the same targets share those topic words in their input text, so their vectors end up closer together.
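To make the mechanism concrete, here's a toy illustration using bag-of-words cosine similarity as a stand-in for the embedding model (the real model is a transformer, but the token-overlap intuition is the same). The note texts and the `cosine` helper are invented for this sketch:

```python
import re
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity: a crude proxy for how shared
    # tokens pull two texts' vectors closer together.
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# Two notes with unrelated bodies...
note_a = "Body: notes on gradient descent tricks"
note_b = "Body: reading list for the week"

# ...that both link to [[Machine Learning]]. Including the link targets
# as a Related: field injects shared, high-signal tokens.
note_a_linked = note_a + " Related: Machine Learning, Neural Networks"
note_b_linked = note_b + " Related: Machine Learning"

print(cosine(note_a, note_b))                # low overlap
print(cosine(note_a_linked, note_b_linked))  # higher: shared link tokens
```

The link target names do the work: they are a few tokens each, but they are exactly the tokens a human chose to describe the topic.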
This is the same principle behind anchor text in web search. Eiron and McCurley showed back in SIGIR 2003 that hyperlink text is a compact, human-authored description of the linked page. Wiki-links in a personal knowledge base work the same way; they're your own anchor text for your own knowledge graph.
I want to be precise about what's happening here. The vectors aren't "graph-aware" in any structural sense. The model has no concept of edges or adjacency. What the wiki-links provide is additional semantic signal, high-quality metadata that concentrates relevance into a few tokens. It's the same reason structured labels (Title:, Tags:, Topic:) outperformed raw body text in my experiments; the embedding model's input window is limited (256-512 tokens for sentence-transformers models), and human-curated metadata packs more signal per token than prose.
The common alternative is to embed plain text and then boost retrieval scores after the fact by counting shared backlinks or direct connections. That approach works, but it adds a separate scoring pass. Including wiki-links in the embedding input gets you most of the benefit in a single operation, because the link target names carry the semantic weight of the relationship.
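For contrast, a post-hoc graph boost might look something like the sketch below. Everything here is illustrative: the function name, the `weight` parameter, and the note structures are assumptions, not the original implementation:

```python
def graph_boost(scores, links, query_note, weight=0.1):
    # Second scoring pass: bump each candidate's base cosine score by a
    # fixed weight per wiki-link target it shares with the query note.
    # `scores` maps note id -> base score; `links` maps note id -> set
    # of wiki-link targets. All names are hypothetical.
    shared_with = links.get(query_note, set())
    return {
        note: base + weight * len(shared_with & links.get(note, set()))
        for note, base in scores.items()
    }

scores = {"note-b": 0.62, "note-c": 0.60}
links = {
    "note-a": {"Machine Learning", "Neural Networks"},
    "note-b": {"Gardening"},
    "note-c": {"Machine Learning"},
}
boosted = graph_boost(scores, links, "note-a")
# note-c overtakes note-b once the shared "Machine Learning" link counts
```

This works, but it is a second pass with its own tunable weight. Putting the link targets directly into the embedding input folds the same signal into the one similarity computation you were already doing.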
I still use a graph-boost layer in vault-recommender for edge cases where the link structure matters beyond what the text captures. But the wiki-link inclusion did most of the heavy lifting. The embedding input format that produced the 0.97 recall@10 looks like this:
```
Topic: areas/programming
Title: Note Title
Tags: python, ml
Related: Machine Learning, Neural Networks
Body: first 300 characters...
```
The full implementation is in the _prepare_text function if you want to see how it's constructed.
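As a rough sketch of how such an input string might be assembled (field names follow the format above; the function, the `note` dict shape, and the 300-character truncation are assumptions for illustration, not the actual _prepare_text code):

```python
def prepare_text(note: dict, body_chars: int = 300) -> str:
    # Build the structured embedding input: labeled metadata first,
    # then a truncated body. The Related: line carries the wiki-link
    # target names as plain text, which is all the model ever sees.
    parts = [
        f"Topic: {note['topic']}",
        f"Title: {note['title']}",
        f"Tags: {', '.join(note['tags'])}",
        f"Related: {', '.join(note['links'])}",
        f"Body: {note['body'][:body_chars]}",
    ]
    return "\n".join(parts)

note = {
    "topic": "areas/programming",
    "title": "Note Title",
    "tags": ["python", "ml"],
    "links": ["Machine Learning", "Neural Networks"],
    "body": "Some note body text that would be truncated at 300 characters...",
}
print(prepare_text(note))
```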