TIL: Shorter Embedding Body Improves Semantic Search Recall

Last week I ran a set of autoloop experiments on my Obsidian vault recommender, comparing different text preparation strategies for embedding. I wanted to find out how much the input format affects recall when you're using sentence-transformers to surface related notes.

I measured quality with recall@10: of a note's outgoing wiki-links, what fraction appear in its top 10 recommendations? After some initial tuning I'd gotten recall@10 up to 0.87, but the text prep was still naive: dump the full note body into the embedding model and hope for the best. So I used the autoloop runs to systematically test what happens when you reshape the input.
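To make the metric concrete, here is a minimal sketch of recall@10 as described above. The function names and the dictionary layout are hypothetical, and averaging per note (skipping notes with no outgoing links) is an assumption about how the scores are aggregated, not something taken from the recommender itself.

```python
def recall_at_k(recommended, linked, k=10):
    """Fraction of a note's outgoing links found in its top-k recommendations."""
    wanted = set(linked)
    if not wanted:
        return None  # nothing to recover; caller skips this note
    top_k = set(recommended[:k])
    return len(top_k & wanted) / len(wanted)


def mean_recall_at_k(recommendations, links, k=10):
    """Average recall@k over notes that have at least one outgoing link."""
    scores = []
    for note, linked in links.items():
        score = recall_at_k(recommendations.get(note, []), linked, k)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: one note with three outgoing links, two of which are recommended.
links = {"spaced-repetition": ["anki-setup", "mochi-workflow", "active-recall"]}
recs = {"spaced-repetition": ["anki-setup", "zettelkasten", "active-recall"]}
print(mean_recall_at_k(recs, links))
```

Averaging per note keeps heavily linked notes from dominating the score; pooling all links across the vault before dividing would be an equally reasonable choice and can give a different number.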

The experiment that moved the needle the most truncated the note body to its first 300 characters and front-loaded the input with structured metadata. The winning format looked like this:

Topic: resources/learning
Title: Spaced Repetition for Technical Concepts
Tags: learning, memory, flashcards
Related: anki-setup, mochi-workflow, active-recall
Body: [first 300 chars of note body]

Recall@10 jumped from 0.87 to 1.00.
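A minimal sketch of assembling that input format follows. The Note dataclass and build_embedding_text are hypothetical names for illustration; the recommender's real data model may look different. Collapsing whitespace before truncating is also an assumption on my part.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    # Hypothetical container, not the recommender's actual data model.
    topic: str
    title: str
    tags: list = field(default_factory=list)
    related: list = field(default_factory=list)
    body: str = ""


def build_embedding_text(note, max_body_chars=300):
    """Put metadata first, then the first max_body_chars characters of the body."""
    # Assumption: collapse newlines and repeated spaces so the character
    # budget is spent on words rather than formatting.
    body = " ".join(note.body.split())[:max_body_chars]
    lines = [
        f"Topic: {note.topic}",
        f"Title: {note.title}",
        f"Tags: {', '.join(note.tags)}",
        f"Related: {', '.join(note.related)}",
        f"Body: {body}",
    ]
    return "\n".join(lines)


note = Note(
    topic="resources/learning",
    title="Spaced Repetition for Technical Concepts",
    tags=["learning", "memory", "flashcards"],
    related=["anki-setup", "mochi-workflow", "active-recall"],
    body="Spaced repetition schedules reviews at growing intervals. " * 20,
)
print(build_embedding_text(note))
```

Keeping max_body_chars as a parameter makes it easy to rerun the same experiment with other body lengths.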

The reason this works, as far as I can tell, comes down to how attention mechanisms distribute weight across tokens. When a note body is long, the model spreads attention across hundreds of tokens of prose, and the structured metadata (topic, title, tags, related notes) gets diluted. For retrieval, those metadata fields are the highest-signal parts of a note: they're intentional labels that reflect the note's topic. Body text is often exploratory, tangential, or repetitive.
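A back-of-the-envelope way to picture the dilution: many sentence-transformers models build the final vector by mean-pooling token embeddings, so each token gets roughly equal weight in the average. Under that assumption, and using character counts as a crude stand-in for token counts, the metadata's share of the input shrinks quickly as the body grows. The 150-character metadata size below is a made-up illustrative figure, not a measurement from my vault, and this ignores what attention does inside the model.

```python
def metadata_share(metadata_chars, body_chars):
    """Rough fraction of the input that is metadata, using characters as a token proxy."""
    total = metadata_chars + body_chars
    return metadata_chars / total if total else 0.0


# Hypothetical sizes: about 150 characters of metadata against bodies of varying length.
for body_chars in (300, 1000, 4000):
    share = metadata_share(150, body_chars)
    print(f"body={body_chars:>4} chars -> metadata share ~ {share:.0%}")
```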

Truncating the body and keeping metadata dense at the front means the embedding captures what the note is about rather than what the note says in detail. Both the query and the note vectors end up shaped more by their core subject than by surface-level word overlap.
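Once each note's formatted text has been embedded, recommending neighbors is a nearest-vector lookup. The sketch below assumes cosine similarity as the ranking metric (the post doesn't depend on that choice) and uses toy 2-D vectors; in practice the vectors would come from the embedding model, for example via SentenceTransformer.encode. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_neighbors(query_id, vectors, k=10):
    """Return the ids of the k notes most similar to query_id, best first."""
    query = vectors[query_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != query_id  # a note should not recommend itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Toy vectors standing in for real embeddings.
vectors = {
    "spaced-repetition": [1.0, 0.0],
    "anki-setup": [0.9, 0.1],
    "tax-receipts": [0.0, 1.0],
}
print(top_k_neighbors("spaced-repetition", vectors, k=2))
```

For a large vault, a vectorized or approximate nearest-neighbor search would replace this loop, but the ranking logic stays the same.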

The tradeoff is that you lose body-level semantic detail. For a recommender that surfaces related notes, that's acceptable; you want thematic neighbors, not paraphrase matches. For full-text passage retrieval, shorter inputs would hurt. Knowing which mode you're building for determines how much body text to include.

The vault recommender is on GitHub if you want to try it on your own Obsidian vault.