Test Time Compute

"Spending more compute on inference... can give you much better final answers. But the base model itself does not change." - Chip Huyen

What It Is

A strategy for improving AI output quality by investing more computational resources at inference time (when the model generates responses) rather than at training time. This allows teams to get better results from existing models without retraining them.

The key insight: perceived performance can improve dramatically without the underlying model's capabilities changing—by thinking longer, generating multiple candidates, and verifying answers.

How It Works

The Compute Allocation Decision:

AI systems have three phases where compute is allocated:

Pre-training - Building the base model (extremely expensive, done by labs)
Fine-tuning/Post-training - Adapting the model (moderate cost)
Inference - Generating responses (cost per query)

Test time compute shifts investment toward inference, accepting higher per-query costs for better results.

Strategies for Test Time Compute:

Multiple Candidates - Generate several answers, then select or ensemble the best
Majority Voting - If 3 of 4 generated answers say "42," that's likely correct
Extended Reasoning - Allow the model to generate more "thinking tokens" before the final answer
Verification Steps - Have the model check its work before responding

How to Apply It

Identify quality-critical queries - Not all queries need extra compute; prioritize where quality matters most
Implement candidate generation - Generate 3-5 responses and select the best using a scoring function or majority vote
Enable reasoning tokens - For complex questions, allow models to "think" more before answering
Design verification steps - Build in checks where the model validates its response
Monitor cost/quality tradeoffs - Test time compute increases cost; ensure the quality improvement justifies it

When to Use It

When base model quality is plateauing
When you can't fine-tune but need better results
For high-stakes queries where accuracy is critical
When users can tolerate slightly higher latency for better answers
For complex reasoning tasks like math, coding, or research synthesis

Source

Guest: Chip Huyen
Episode: "AI Engineering with Chip Huyen"
Key Discussion: (01:06:20) - Explanation of test time compute as a strategy
YouTube: Watch on YouTube

Related Frameworks

What Actually Improves AI Apps - Practical approaches to improving AI quality