The “Magic” of Text Anchoring, Demystified
I stared at my screen in disbelief. A 10,000-word document, and somehow this tool knew exactly where to highlight “Nintendo has pricing power”—down to the character. My first thought? “This has to be pure AI magic.” My second thought? “Wait, how does an LLM even know character positions?”
Ever wondered how tools like LangExtract can highlight the exact location of a quote in a giant document? Spoiler: It’s not LLM magic—it’s clever, classic computer science. Let’s break down how text anchoring really works, and how you can re-implement it yourself.
Recently, I stumbled across LangExtract, a Google open-source project that seems to pull off something almost magical: you give it a 10,000-word document and a vague prompt, and suddenly it can extract precise quotes—and show you exactly where they appear in the source, down to the character. Imagine asking “find evidence that Nintendo has pricing power” and having the exact sentences light up in your Markdown file like a beacon.
The first time you see this, it feels like pure AI wizardry. But here’s the beautiful truth: it’s not magic at all. It’s classic engineering, with a healthy respect for good old computer science algorithms. And honestly? That’s way cooler than magic.
Wait—How Can an LLM Know Offsets in My Document?
If you’ve ever worked with language models, you know they’re great at generating or extracting text—less so at returning the precise character position of something inside a 10,000-word document. So how does something like LangExtract bridge the gap?
Here’s the trick: the LLM isn’t guessing the index. Instead, it’s asked to output the relevant quote (“evidence”) as a substring from your document. Then, the library post-processes that quote, searching for its location in the original text. If the LLM is loyal to the source (by careful prompting), you get a direct match. If not—maybe there are typos or whitespace differences—the library falls back to fuzzy matching. That’s where the actual magic happens (hint: it’s not magic, it’s difflib.SequenceMatcher).
Let me walk you through this, both conceptually and with working code. And if you want to jump right to the algorithm, skip ahead to the re-implementation below.
How Text Anchoring Works: The 30-Second Tour
Imagine you have a giant Markdown file, and you ask an LLM (or yourself) to pull out a quote: “Nintendo can set the price unchallenged”. You want to know exactly where that quote appears in the original text—so you can, say, highlight it in your app.
Step 1: Extraction
- The LLM (or your code) outputs one or more quotes it found relevant.
Step 2: Source Grounding
- For each quote, LangExtract (or your own pipeline) tries to find its location in the original document string.
- First try: exact match (text.find(quote)).
- If that fails: use a fuzzy matching algorithm to look for close-enough spans, even if there are small typos or formatting differences.
- Once a match is found, record its [start, end] character span.
Step 3: Multiple Matches and Edge Cases
- If the same quote appears more than once, you can:
- Pick the first match.
- Return all matches (let your UI or user disambiguate).
- Use context windows (e.g., a few words before/after) to make the match unique.
And that’s it! The LLM provides semantic extraction, and the alignment code gives you precise, highlightable spans.
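In code, the whole tour fits in one small function. Here’s a deliberately minimal sketch (character-level matching with a brute-force window scan; the fuller, configurable version comes later in this post):

```python
from difflib import SequenceMatcher

def anchor_quote(document: str, quote: str, threshold: float = 0.85):
    """Return the [start, end] character span of quote in document, or None."""
    # Step 2, first try: exact match is cheap, so always attempt it first.
    start = document.find(quote)
    if start != -1:
        return (start, start + len(quote))
    # Fallback: slide a quote-sized window over the document and score each position.
    best_score, best_start = 0.0, -1
    for pos in range(len(document) - len(quote) + 1):
        window = document[pos:pos + len(quote)]
        score = SequenceMatcher(None, quote, window).ratio()
        if score > best_score:
            best_score, best_start = score, pos
    # Accept only if the best window is similar enough.
    if best_score >= threshold:
        return (best_start, best_start + len(quote))
    return None
```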
Under the Hood: It’s Not LLMs, It’s Computer Science
This is where the curtain gets pulled back. The quote-to-offset step is handled by a classic algorithm: fuzzy string matching. Specifically, LangExtract’s core resolver uses Python’s built-in difflib.SequenceMatcher, the same algorithm you’ve probably encountered for diffing files, spellchecking, or syntax correction. It’s fast, well-tested, and scales well to long documents (we’ll talk about performance in a second).
Here’s the workflow, in a nutshell:
- Tokenization: Break both the candidate quote and the document into tokens (words, roughly).
- Matching: Look for the quote in the document.
- Try exact match.
- If not found, slide a window of the quote’s length over the document and use SequenceMatcher to compare.
- If the similarity score is above a threshold (say, 0.85), accept it as a match.
- Offsets: Return the [start, end] character indices in the original document.
What About Performance?
You might wonder: “Will this be slow for giant documents?” The worst-case time complexity is O(n*m) (where n is document length and m is quote length), but smart optimizations make it much faster in practice:
- Windowed matching: Instead of comparing against the entire document, slide a window roughly the size of your quote.
- Early exit: Exact matches short-circuit the expensive fuzzy logic.
- Token heuristics: Seed search windows around rare words from your quote (sketched right after this list).
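That last heuristic deserves a quick illustration. A minimal version (using the quote’s longest word as a cheap stand-in for its rarest one) might look like this:

```python
from difflib import SequenceMatcher

def seeded_match(document: str, quote: str, threshold: float = 0.85):
    """Fuzzy-match only around occurrences of a distinctive word from the quote."""
    seed = max(quote.split(), key=len)   # cheap proxy for the "rarest" word
    seed_offset = quote.find(seed)       # where the seed sits inside the quote
    pos = document.find(seed)
    while pos != -1:
        # Align a quote-sized window so the seed lines up in both strings.
        start = max(0, pos - seed_offset)
        candidate = document[start:start + len(quote)]
        if SequenceMatcher(None, quote, candidate).ratio() >= threshold:
            return (start, start + len(candidate))
        pos = document.find(seed, pos + 1)
    return None
```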
For typical documents (tens of thousands of characters), difflib performs well, and you can always tune the window size or stride for your use case. In practice, you’re looking at sub-100ms performance for most real-world documents.
Here’s what this looks like in practice. Let’s say your LLM returns: “Nintendo can set prices without competition” but your original text says “Nintendo can set the price unchallenged”. Exact match fails, but fuzzy matching sees the two strings overlap heavily and, with a suitably tuned threshold, correctly identifies the span. The magic isn’t in perfect matching—it’s in knowing when “close enough” is exactly what we need.
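You can check the scoring step for that pair directly with difflib; whatever ratio it prints is the number your threshold has to clear (character-level and token-level scores differ, so tune the threshold to how you compare):

```python
from difflib import SequenceMatcher

llm_quote = "Nintendo can set prices without competition"
source    = "Nintendo can set the price unchallenged"

# ratio() returns a similarity score between 0 and 1; a candidate window
# counts as a match when this score clears the configured threshold.
print(SequenceMatcher(None, llm_quote, source).ratio())
```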
Why Not Just Wrap the Library?
If you’re already using LangExtract for everything else, it makes sense to use their resolver. But if you just want this one function (quote-to-offset alignment), pulling in a whole extra dependency feels like overkill. Here’s where I had my second realization: sometimes the most powerful tools are the ones you can hold in your head. When I looked at what LangExtract was actually doing for quote alignment, it wasn’t some arcane LLM ritual—it was about 50 lines of classic computer science.
Indirection makes it harder to debug, and you end up packaging a library for a single feature you could implement yourself in ~50 lines of code.
So—let’s just write it ourselves!
Building Your Own: A Dependency-Free Text Alignment Algorithm
You can find the original code from LangExtract here: langextract @ langextract/resolver.py.
Below is a clean, dependency-free sketch that mirrors LangExtract’s core behavior. The class layout follows the breakdown in the next section; the implementation details are one reasonable way to fill it in, not a line-for-line port of the library. We’ll start with the essential algorithm, then discuss production hardening:
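```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class MatchingConfig:
    """All tunable knobs in one place -- no magic numbers in the logic below."""
    fuzzy_threshold: float = 0.85  # minimum SequenceMatcher ratio to accept
    window_extra: int = 0          # extra chars per window (bigger windows cap the max ratio)
    step_size: int = 1             # window stride; >1 is faster but can miss alignments


class TextNormalizer:
    """Text preprocessing: collapse whitespace and lowercase before matching."""

    @staticmethod
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip().lower()

    @staticmethod
    def create_index_map(original: str) -> List[int]:
        """Map each index of normalize(original) back to an index in original."""
        index_map: List[int] = []
        pending_space = False
        for i, ch in enumerate(original):
            if ch.isspace():
                pending_space = bool(index_map)  # leading whitespace is stripped
            else:
                if pending_space:
                    index_map.append(i - 1)      # the single collapsed space
                    pending_space = False
                index_map.append(i)
        return index_map


class ExactMatcher:
    """Exact string matching: the O(n) fast path."""

    @staticmethod
    def find(document: str, quote: str) -> Optional[Tuple[int, int]]:
        start = document.find(quote)
        return (start, start + len(quote)) if start != -1 else None


class FuzzyMatcher:
    """Fuzzy matching: slide a quote-sized window and score it with SequenceMatcher."""

    def __init__(self, config: MatchingConfig):
        self.config = config

    def find(self, document: str, quote: str) -> Optional[Tuple[int, int]]:
        if not quote or len(quote) > len(document):
            return None
        size = len(quote) + self.config.window_extra
        best_score, best_start = 0.0, -1
        for start in range(0, len(document) - len(quote) + 1, self.config.step_size):
            candidate = document[start:start + size]
            score = SequenceMatcher(None, quote, candidate).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_score >= self.config.fuzzy_threshold:
            return (best_start, min(best_start + size, len(document)))
        return None


class QuoteAligner:
    """Orchestrates the pipeline: exact match first, then normalized fuzzy matching."""

    def __init__(self, config: Optional[MatchingConfig] = None):
        self.config = config or MatchingConfig()
        self.fuzzy = FuzzyMatcher(self.config)

    def align(self, document: str, quote: str) -> Optional[Tuple[int, int]]:
        # Fast path: the LLM quoted the source verbatim.
        span = ExactMatcher.find(document, quote)
        if span:
            return span
        # Slow path: normalize both sides, then exact-or-fuzzy on the normalized text.
        norm_doc = TextNormalizer.normalize(document)
        norm_quote = TextNormalizer.normalize(quote)
        span = ExactMatcher.find(norm_doc, norm_quote) or self.fuzzy.find(norm_doc, norm_quote)
        if span is None:
            return None
        # Translate normalized offsets back into offsets in the original document.
        index_map = TextNormalizer.create_index_map(document)
        start, end = span
        return (index_map[start], index_map[min(end, len(index_map)) - 1] + 1)
```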
What Makes This Code Better
This refactored version follows SOLID principles and best practices:
🔧 Single Responsibility Principle
- TextNormalizer: Handles only text preprocessing
- ExactMatcher: Handles only exact string matching
- FuzzyMatcher: Handles only fuzzy matching logic
- QuoteAligner: Orchestrates the matching process
⚙️ Configuration-Driven
- MatchingConfig centralizes all tunable parameters
- Easy to adjust thresholds, window sizes, and step sizes
- No more magic numbers scattered throughout the code
🧪 Testable Components
- Each class has a clear interface and can be tested independently
- Mock dependencies easily for unit testing
- Clear separation between algorithm logic and configuration
📈 Performance Optimizations
- Early exit on exact matches (O(n) fast path)
- Configurable sliding window with smart step sizes
- Efficient difflib usage with proper window sizing
Key Features & Usage
Simple Interface:
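Using the sketch above, the common case is a single call (quote_aligner is just whatever module you saved the sketch in, and report.md is a placeholder path):

```python
from quote_aligner import QuoteAligner  # hypothetical module name

aligner = QuoteAligner()
document = open("report.md", encoding="utf-8").read()  # placeholder path

span = aligner.align(document, "Nintendo can set the price unchallenged")
if span:
    start, end = span
    print(f"Found at [{start}, {end}]: {document[start:end]!r}")
```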
Advanced Usage:
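When the input is noisier (OCR output, sloppy Markdown), you can loosen the knobs via the config; the values below are illustrative, not recommendations:

```python
from quote_aligner import MatchingConfig, QuoteAligner  # hypothetical module name

# A more forgiving setup: lower acceptance threshold, slightly wider
# windows, and a coarser stride that trades a little recall for speed.
config = MatchingConfig(fuzzy_threshold=0.75, window_extra=10, step_size=2)
aligner = QuoteAligner(config)

span = aligner.align(document, quote)  # document as above; quote is any extracted string
```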
Production Considerations:
- Index Mapping: The TextNormalizer.create_index_map() method provides a foundation for mapping normalized positions back to original text positions
- Multiple Matches: Currently returns the first/best match; extend QuoteAligner to return all candidates if needed (a sketch follows after this list)
- Performance: For documents >100KB, consider pre-chunking by paragraphs and using keyword-based window seeding
- Memory: The sliding window approach keeps memory usage constant regardless of document size
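For the multiple-match case, a minimal extension of the sketch could look like this (align_all is a hypothetical helper, not part of LangExtract’s API):

```python
class MultiQuoteAligner(QuoteAligner):
    """Collects every span that clears the threshold, not just the first."""

    def align_all(self, document: str, quote: str):
        spans, offset = [], 0
        while True:
            span = self.align(document[offset:], quote)
            if span is None or span[1] == span[0]:  # no match, or empty quote
                break
            spans.append((offset + span[0], offset + span[1]))
            offset += span[1]  # resume scanning just past this match
        return spans
```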
Technical Notes
- Time Complexity: O(n×m) worst case, but typically much faster due to early exits and windowing
- Space Complexity: O(k) where k is the window size (constant memory usage)
- Accuracy: Approximates LangExtract’s behavior while being more maintainable and testable
Why This Matters (And Why It Isn’t “Just LLM Magic”)
It’s tempting to look at modern LLM-powered tools and assume they’re pure AI wizardry—mysterious, unreachable intelligence that somehow “just knows” where text lives in a document. But here’s the thing: the secret sauce isn’t in the LLM at all.
The real breakthrough is in the engineering: LLMs extract meaning, while classic algorithms ground that meaning in reality. When you see a quote highlighted perfectly in a 10,000-word document, you’re witnessing a beautiful marriage: semantic understanding from language models grounded by decades-old computer science fundamentals like difflib.SequenceMatcher.
This is why the most exciting innovations aren’t coming from people who just throw bigger models at problems, or from those who dismiss AI entirely. They’re coming from engineers who genuinely understand both worlds: who can combine foundational computer science with cutting-edge AI, instead of treating them as opposing forces.
The future belongs to bridge-builders. While others argue whether LLMs will replace traditional programming, the real practitioners are busy connecting the dots—using LLMs for what they do best (semantic extraction, natural language understanding) while relying on proven algorithms for what they do best (precise computation, deterministic matching, performance optimization).
So next time you see something that looks like AI magic, peek behind the curtain. You might discover that the most impressive breakthroughs are often the most beautifully engineered ones—and that the “magic” isn’t in any single technology, but in the thoughtful integration of old wisdom with new capabilities.
Final Takeaways
- Text anchoring (finding and highlighting quotes in big docs) is a blend of LLM prompting and classic text alignment.
- Tools like LangExtract are open about their approach—it isn’t some mysterious AI trick, but careful engineering and time-tested algorithms like difflib.SequenceMatcher.
- If you just need quote alignment, you can (and probably should) build your own version, dependency-free.
- The future belongs to those who understand both the old and the new—and aren’t afraid to peek behind the curtain.
The next time you see a tool doing something that feels like magic, resist the urge to either worship it or dismiss it. Instead, ask: “What’s the actual bridge here between what the AI can do and what I need?” Because that’s where the real innovation happens—not in the models themselves, but in the thoughtful engineering that connects them to human needs.
(Written by Human, improved using AI where applicable.)