The Notebook-to-Production Gap
An LLM extraction pipeline that works perfectly on ten clean examples in a notebook behaves very differently once it's processing thousands of real-world documents with inconsistent formatting, OCR noise, and edge cases nobody anticipated.
What Actually Breaks
- Token limits get hit by documents you didn't expect to be long
- Confident-sounding hallucinations are harder to catch than obvious errors
- Costs add up fast when no tiered validation filters documents before the expensive model call
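The token-limit failure in the first bullet can be caught before any model call. The sketch below is one possible guard, not a description of the pipeline discussed here: it estimates token count from character length and splits over-budget text on paragraph boundaries. The function names, the `max_tokens` parameter, and the default of roughly 4 characters per token are all assumptions for illustration; a real pipeline would count with the model's own tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-chars-per-token default is an assumed
    heuristic for English text, not a property of any specific model."""
    return math.ceil(len(text) / chars_per_token)


def split_to_budget(text: str, max_tokens: int,
                    chars_per_token: float = 4.0) -> list[str]:
    """Split text into chunks that each fit the estimated token budget.

    Paragraphs (separated by blank lines) are packed greedily; a single
    paragraph that is over budget on its own is hard-split by length.
    """
    max_chars = int(max_tokens * chars_per_token)
    if max_chars <= 0:
        raise ValueError("max_tokens is too small to hold any text")

    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Break an oversized paragraph into budget-sized pieces.
        pieces = [para[i:i + max_chars]
                  for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The current chunk is full: emit it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Checking `estimate_tokens` on every incoming document and logging the ones that need splitting also shows how often unexpectedly long documents actually occur.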
What Held Up
Layering cheap, deterministic checks (structural validation, OCR text density) ahead of any LLM call cut costs substantially, because documents that failed those checks never reached the model, and it did so without sacrificing accuracy on the documents that actually needed it.
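A minimal sketch of that layering follows. It assumes that "structural validation" means checking for expected section markers, and that "OCR text density" means characters per page plus the share of alphanumeric characters. The thresholds, the `Triage` type, and the `triage` function are hypothetical and would need tuning on real documents.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against your own corpus.
MIN_CHARS_PER_PAGE = 200
MIN_ALNUM_RATIO = 0.6


@dataclass
class Triage:
    route: str   # "skip" or "llm"
    reason: str


def text_density_ok(text: str, page_count: int) -> bool:
    """Cheap OCR sanity check: enough text per page, mostly alphanumeric."""
    non_space = [c for c in text if not c.isspace()]
    if page_count <= 0 or not non_space:
        return False
    chars_per_page = len(text) / page_count
    alnum_ratio = sum(c.isalnum() for c in non_space) / len(non_space)
    return (chars_per_page >= MIN_CHARS_PER_PAGE
            and alnum_ratio >= MIN_ALNUM_RATIO)


def structure_ok(text: str, required_markers: tuple[str, ...]) -> bool:
    """Structural check (assumed form): every expected marker is present."""
    lowered = text.lower()
    return all(marker.lower() in lowered for marker in required_markers)


def triage(text: str, page_count: int,
           required_markers: tuple[str, ...]) -> Triage:
    """Run the deterministic checks in order; only survivors reach the LLM."""
    if not text_density_ok(text, page_count):
        return Triage("skip", "low OCR text density")
    if not structure_ok(text, required_markers):
        return Triage("skip", "missing expected structure")
    return Triage("llm", "passed deterministic checks")
```

Each check costs almost nothing to run, and the recorded `reason` makes it possible to audit which documents were filtered out and why.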