Cheapest LLM for RAG
RAG pipelines pay for the retrieved context on every query, so input price and a large-enough context window are what decide the bill. These models clear a 128K-token floor and rank cheapest first for a retrieval-heavy workload.
Cheapest models for RAG
Estimated monthly cost for a RAG app that processes ~50M input and ~5M output tokens per month, sorted cheapest first.
| # | Model | Provider | Context (tokens) | Input $/M | Output $/M | Monthly cost |
|---|---|---|---|---|---|---|
| 1 | Llama 3.1 8B Instruct | Meta | 128K | $0.02 | $0.03 | $1.15 |
| 2 | Amazon Nova Micro | Amazon | 128K | $0.035 | $0.14 | $2.45 |
| 3 | Command R7B | Cohere | 128K | $0.037 | $0.15 | $2.63 |
| 4 | Amazon Nova Lite | Amazon | 300K | $0.06 | $0.24 | $4.20 |
| 5 | Qwen-Flash | Alibaba | 1M | $0.05 | $0.40 | $4.50 |
| 6 | Llama 4 Scout (17B-16E Instruct) | Meta | 10M | $0.10 | $0.30 | $6.50 |
| 7 | Llama 3.3 70B Instruct | Meta | 128K | $0.10 | $0.32 | $6.60 |
| 8 | Qwen3.5-Flash | Alibaba | 1M | $0.10 | $0.40 | $7.00 |
| 9 | Ministral 3 8B | Mistral | 256K | $0.15 | $0.15 | $8.25 |
| 10 | Llama 4 Maverick (17B-128E Instruct) | Meta | 1M | $0.15 | $0.60 | $10.50 |
| 11 | Mistral Small 4 | Mistral | 256K | $0.15 | $0.60 | $10.50 |
| 12 | Command R (08-2024) | Cohere | 128K | $0.15 | $0.60 | $10.50 |
Estimate only; excludes prompt caching, batch discounts and free tiers. Different volumes change the ranking, so run your own numbers. Prices verified against official docs; catalog updated 2026-06-28.
RAG is the most input-heavy pattern: each answer carries multiple retrieved chunks, so input can be 10× the output. We rank a 50M-in / 5M-out monthly workload and require at least a 128K context window so retrieved passages actually fit.
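The monthly figures follow from simple per-token arithmetic. As a sketch, with symbols introduced here for illustration (they are not the catalog's own notation): $T_{\text{in}}$ and $T_{\text{out}}$ are monthly input and output tokens, and $p_{\text{in}}$ and $p_{\text{out}}$ are the prices in USD per 1M tokens.

```latex
\[
  \text{monthly cost}
    = \frac{T_{\text{in}}}{10^{6}}\, p_{\text{in}}
    + \frac{T_{\text{out}}}{10^{6}}\, p_{\text{out}}
\]

% Worked example: Llama 3.1 8B Instruct at 50M input / 5M output tokens
\[
  50 \times 0.02 + 5 \times 0.03 = 1.00 + 0.15 = \$1.15
\]
```

Recomputing a row from the displayed prices can differ from the listed total by a few cents (Command R7B works out to $2.60 against a listed $2.63), possibly because the displayed prices are rounded.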
What is the cheapest LLM for RAG?
Llama 3.1 8B Instruct (Meta) is the cheapest generally-available model we track for RAG, at $0.02 per 1M input tokens and $0.03 per 1M output tokens. That comes to about $1.15/month for a RAG app processing ~50M input and ~5M output tokens a month. Amazon Nova Micro is the next cheapest at $2.45/month.
How is "cheapest for RAG" calculated?
We price a representative monthly workload (~50M input and ~5M output tokens) against every generally-available model, then rank by total cost. Only models with at least a 128K-token context window are included. All prices are USD per 1M tokens, sourced from official provider documentation.
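The method described here can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the `Model` class, `monthly_cost`, and `rank` are hypothetical names, the catalog holds only four rows copied from the table, and "128K" and "256K" are assumed to mean 128,000 and 256,000 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int   # assumed: "128K" means 128_000 tokens
    input_per_m: float    # USD per 1M input tokens
    output_per_m: float   # USD per 1M output tokens


# Subset of the table, for illustration only.
CATALOG = [
    Model("Llama 3.1 8B Instruct", 128_000, 0.02, 0.03),
    Model("Amazon Nova Micro", 128_000, 0.035, 0.14),
    Model("Llama 3.3 70B Instruct", 128_000, 0.10, 0.32),
    Model("Ministral 3 8B", 256_000, 0.15, 0.15),
]


def monthly_cost(model, input_tokens, output_tokens):
    """Total USD for one month at the given token volumes."""
    return (input_tokens / 1e6) * model.input_per_m \
        + (output_tokens / 1e6) * model.output_per_m


def rank(catalog, input_tokens, output_tokens, min_context=128_000):
    """Drop models below the context floor, then sort cheapest first."""
    eligible = [m for m in catalog if m.context_tokens >= min_context]
    costed = [
        (m.name, round(monthly_cost(m, input_tokens, output_tokens), 2))
        for m in eligible
    ]
    return sorted(costed, key=lambda pair: pair[1])


if __name__ == "__main__":
    # The page's workload: 50M input / 5M output tokens per month.
    for name, cost in rank(CATALOG, 50_000_000, 5_000_000):
        print(f"{name}: ${cost:.2f}")
```

Changing the volumes shows why the ranking is workload-specific. With an output-heavy mix such as `rank(CATALOG, 10_000_000, 50_000_000)`, Ministral 3 8B ($9.00) moves ahead of Llama 3.3 70B Instruct ($17.00), the reverse of their order at 50M in / 5M out.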
Is the cheapest model always the right choice for RAG?
No. Price is one axis; quality, latency, rate limits and reliability matter too. Use this ranking to shortlist, then test the top candidates on your own RAG workload before committing. Cost is easy to measure — fit is not.