Data Preparation for RAG

"In a lot of the companies that I have seen, the biggest performance in their RAG solutions comes from better data preparations, not agonizing over what vector databases to use." - Chip Huyen

What It Is

A framework for understanding that Retrieval-Augmented Generation (RAG) quality is primarily determined by how you prepare your data, not by your choice of infrastructure. While teams often debate vector databases and embedding models, the real performance gains come from thoughtful data preparation strategies.

RAG works by providing models with relevant context so they can answer questions better. But the quality of retrieval depends heavily on how data is structured, chunked, and annotated—not just how it's stored.

How It Works

RAG has three key components, in order of impact on quality:

Data Preparation (highest impact)
- How documents are chunked
- What metadata is added
- How content is rewritten for AI consumption
Retrieval Strategy (medium impact)
- How queries are formed
- How results are ranked and filtered
- Multi-step retrieval approaches
Infrastructure (lower impact)
- Vector database choice
- Embedding model selection
- Latency and scale optimization

How to Apply It

Chunk Design:

Balance chunk size carefully—too long captures more context but limits retrieval variety; too short increases variety but loses context
Consider the natural structure of your documents, not arbitrary character limits

Contextual Enhancement:

Add summaries and metadata to chunks so they can be retrieved even when query terms don't appear
Generate "hypothetical questions" for each chunk—questions the chunk could answer—to improve retrieval matching

Format Optimization:

Rewrite documentation from human-readable to AI-readable format
Add annotation layers explaining concepts that humans understand implicitly but AI doesn't
Convert narrative content into question-answer format when appropriate

AI-Specific Annotations:

Explain domain-specific terms, scales, and conventions
Document context that experts know implicitly (e.g., "temperature=1 in this function means high randomness, not actual temperature")

When to Use It

When building any RAG-based application
When RAG retrieval quality is disappointing
When debating which vector database to use (stop debating, focus on data)
When scaling RAG systems and prioritizing engineering effort
When evaluating why AI responses have poor grounding

Source

Guest: Chip Huyen
Episode: "AI Engineering with Chip Huyen"
Key Discussion: (00:34:02) - Deep dive on data preparation for RAG
YouTube: Watch on YouTube

Related Frameworks

What Actually Improves AI Apps - Data preparation over infrastructure debates
Evals as PRD - Measuring RAG quality requires good evals