Cheapest LLM for RAG

RAG pipelines pay for the retrieved context on every query, so input price and a large-enough context window are what decide the bill. These models clear a 128K-token floor and rank cheapest first for a retrieval-heavy workload.

The cheapest pick: Llama 3.1 8B Instruct
$1.15/mo for a RAG app stuffing ~50M input and ~5M output tokens a month · $0.02 in / $0.03 out per 1M · Meta
The ranking

Cheapest models for RAG

Monthly cost for a RAG app stuffing ~50M input and ~5M output tokens a month. Sorted cheapest first.

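| # | Model | Provider | Context | Input $/M | Output $/M | Monthly cost |
|---|-------|----------|---------|-----------|------------|--------------|
| 1 | Llama 3.1 8B Instruct | Meta | 128K | $0.02 | $0.03 | $1.15 |
| 2 | Amazon Nova Micro | Amazon | 128K | $0.035 | $0.14 | $2.45 |
| 3 | Command R7B | Cohere | 128K | $0.037 | $0.15 | $2.63 |
| 4 | Amazon Nova Lite | Amazon | 300K | $0.06 | $0.24 | $4.20 |
| 5 | Qwen-Flash | Alibaba | 1M | $0.05 | $0.4 | $4.50 |
| 6 | Llama 4 Scout (17B-16E Instruct) | Meta | 10M | $0.1 | $0.3 | $6.50 |
| 7 | Llama 3.3 70B Instruct | Meta | 128K | $0.1 | $0.32 | $6.60 |
| 8 | Qwen3.5-Flash | Alibaba | 1M | $0.1 | $0.4 | $7.00 |
| 9 | Ministral 3 8B | Mistral | 256K | $0.15 | $0.15 | $8.25 |
| 10 | Llama 4 Maverick (17B-128E Instruct) | Meta | 1M | $0.15 | $0.6 | $10.50 |
| 11 | Mistral Small 4 | Mistral | 256K | $0.15 | $0.6 | $10.50 |
| 12 | Command R (08-2024) | Cohere | 128K | $0.15 | $0.6 | $10.50 |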

Estimate only; excludes prompt caching, batch discounts and free tiers. Different volumes change the ranking, so run your own numbers. Prices verified against official docs · catalog updated 2026-06-28.
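To see how volume changes the ranking, here is a minimal Python sketch that reprices two rows of the table under a different traffic mix. The prices are copied from the table; the function names and the 100M-in / 2M-out workload are illustrative assumptions, not part of the catalog.

```python
# Prices in USD per 1M tokens (input, output), copied from the ranking table.
PRICES = {
    "Amazon Nova Lite": (0.06, 0.24),
    "Qwen-Flash": (0.05, 0.40),
}


def monthly_cost(input_millions, output_millions, price_in, price_out):
    """Monthly cost in USD; volumes are given in millions of tokens."""
    return input_millions * price_in + output_millions * price_out


def cheapest(input_millions, output_millions, prices=PRICES):
    """Return the name of the cheapest model for the given monthly volumes."""
    return min(
        prices,
        key=lambda name: monthly_cost(input_millions, output_millions, *prices[name]),
    )


# The table's workload: 50M input, 5M output (10:1).
print(cheapest(50, 5))    # Amazon Nova Lite

# A hypothetical, more input-heavy workload: 100M input, 2M output (50:1).
print(cheapest(100, 2))   # Qwen-Flash
```

For this pair the break-even sits at an input-to-output ratio of 16:1. Below it the lower output price of Nova Lite wins; above it the lower input price of Qwen-Flash does.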

Methodology

RAG is among the most input-heavy LLM patterns: each answer carries multiple retrieved chunks, so input volume can reach 10× the output. We rank a 50M-in / 5M-out monthly workload, which reflects that 10:1 ratio, and require at least a 128K context window so retrieved passages actually fit.
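To relate the 50M-in / 5M-out workload to your own traffic, the sketch below estimates monthly token volumes from per-query figures. Every number in it (query count, chunk count, chunk size, prompt and answer length) is a hypothetical chosen so the result lands on the workload above; substitute measurements from your own pipeline.

```python
def monthly_tokens(queries, chunks_per_query, tokens_per_chunk,
                   prompt_tokens, answer_tokens):
    """Estimate (input_tokens, output_tokens) for a month of RAG traffic.

    Input per query = retrieved chunks + the fixed prompt (instructions
    plus the user question). Output per query = the generated answer.
    """
    input_per_query = chunks_per_query * tokens_per_chunk + prompt_tokens
    return queries * input_per_query, queries * answer_tokens


# Hypothetical app: 25,000 queries a month, 4 chunks of 450 tokens each,
# a 200-token prompt and a 200-token answer.
tokens_in, tokens_out = monthly_tokens(
    queries=25_000,
    chunks_per_query=4,
    tokens_per_chunk=450,
    prompt_tokens=200,
    answer_tokens=200,
)

print(f"input:  {tokens_in / 1e6:.0f}M tokens")    # input:  50M tokens
print(f"output: {tokens_out / 1e6:.0f}M tokens")   # output: 5M tokens
print(f"ratio:  {tokens_in / tokens_out:.0f}:1")   # ratio:  10:1
```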

FAQ

What is the cheapest LLM for RAG?

Llama 3.1 8B Instruct (Meta) is the cheapest generally available model we track for RAG, at $0.02 per 1M input tokens and $0.03 per 1M output tokens. That comes to about $1.15/month for a RAG app stuffing ~50M input and ~5M output tokens a month. Amazon Nova Micro is the next cheapest at $2.45/month.

How is "cheapest for RAG" calculated?

We price a representative monthly workload (a RAG app stuffing ~50M input and ~5M output tokens a month) against every generally available model, then rank by total cost. Only models with at least a 128K-token context window are included. All prices are USD per 1M tokens, sourced from official provider documentation.
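The calculation described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's actual code: the `Model` class and `rank` function are made-up names, "128K" is treated as 128,000 tokens, four rows are copied from the ranking table, and one hypothetical 32K model is added only to show the context filter at work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int
    price_in: float    # USD per 1M input tokens
    price_out: float   # USD per 1M output tokens


CATALOG = [
    Model("Qwen-Flash", 1_000_000, 0.05, 0.40),
    Model("Llama 3.1 8B Instruct", 128_000, 0.02, 0.03),
    Model("Amazon Nova Lite", 300_000, 0.06, 0.24),
    Model("Amazon Nova Micro", 128_000, 0.035, 0.14),
    # Hypothetical entry, not in the catalog: cheap, but its window is too small.
    Model("hypothetical-32k-model", 32_000, 0.01, 0.01),
]


def rank(catalog, input_millions=50, output_millions=5, min_context=128_000):
    """Drop models below the context floor, price the workload, sort by cost."""
    eligible = [m for m in catalog if m.context_tokens >= min_context]
    costed = [
        (m.name, round(input_millions * m.price_in + output_millions * m.price_out, 2))
        for m in eligible
    ]
    return sorted(costed, key=lambda row: row[1])


for name, cost in rank(CATALOG):
    print(f"{name}: ${cost:.2f}/mo")
# Llama 3.1 8B Instruct: $1.15/mo
# Amazon Nova Micro: $2.45/mo
# Amazon Nova Lite: $4.20/mo
# Qwen-Flash: $4.50/mo
```

The hypothetical 32K model would cost less than every listed row, yet it never appears in the output because it fails the context floor before pricing.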

Is the cheapest model always the right choice for RAG?

No. Price is one axis; quality, latency, rate limits and reliability matter too. Use this ranking to shortlist, then test the top candidates on your own RAG workload before committing. Cost is easy to measure — fit is not.