Comprehensive comparison of 40+ Large Language Models — benchmarks, pricing, capabilities & hardware requirements
All data is sourced from official docs and verified benchmarks. NEW = newly added model; LEGACY = superseded by a newer release.
| Model | Developer | Params (B) | Open Source | License | Context (tokens) | Vision | Code Gen | Func Calling | Self-Host | API | MoE | Quant Avail | MMLU | HumanEval | In $/M tok | Out $/M tok | Min VRAM | Fine-tune | Reasoning/CoT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $1.25 | $10.00 | N/A | ❌ | ✅ |
| GPT-4.1 | OpenAI | N/A | ❌ | Proprietary | 1,047,576 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $2.00 | $8.00 | N/A | ✅ | ❌ |
| GPT-4.1 mini | OpenAI | N/A | ❌ | Proprietary | 1,047,576 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $0.40 | $1.60 | N/A | ✅ | ❌ |
| GPT-4.1 nano | OpenAI | N/A | ❌ | Proprietary | 1,047,576 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $0.10 | $0.40 | N/A | ✅ | ❌ |
| o3 | OpenAI | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $2.00 | $8.00 | N/A | ❌ | ✅ |
| o3-mini | OpenAI | N/A | ❌ | Proprietary | 200,000 | ❌ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $1.10 | $4.40 | N/A | ❌ | ✅ |
| GPT-4o | OpenAI | ~200 | ❌ | Proprietary | 128,000 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | 88.7 | 90.2 | $2.50 | $10.00 | N/A | ✅ | ❌ |
| GPT-4o-mini | OpenAI | ~8 | ❌ | Proprietary | 128,000 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | 82.0 | 87.2 | $0.15 | $0.60 | N/A | ✅ | ❌ |
| GPT-4-Turbo LEGACY | OpenAI | ~200 | ❌ | Proprietary | 128,000 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | 86.5 | 87.1 | $10.00 | $30.00 | N/A | ❌ | ❌ |
| Claude Opus 4.6 | Anthropic | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $5.00 | $25.00 | N/A | ❌ | ✅ |
| Claude Sonnet 4.6 | Anthropic | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $3.00 | $15.00 | N/A | ❌ | ✅ |
| Claude Haiku 4.5 | Anthropic | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $1.00 | $5.00 | N/A | ❌ | ✅ |
| Claude 3.5 Sonnet LEGACY | Anthropic | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | 88.7 | 92.0 | $3.00 | $15.00 | N/A | ❌ | ❌ |
| Claude 3 Opus LEGACY | Anthropic | N/A | ❌ | Proprietary | 200,000 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | 86.8 | 84.9 | $15.00 | $75.00 | N/A | ❌ | ❌ |
| Gemini 2.5 Pro | Google DeepMind | N/A | ❌ | Proprietary | 1,048,576 | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | N/A | N/A | $1.25 | $10.00 | N/A | ❌ | ✅ |
| Gemini 2.5 Flash | Google DeepMind | N/A | ❌ | Proprietary | 1,048,576 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $0.15 | $0.60 | N/A | ❌ | ✅ |
| Gemini 2.0 Flash | Google DeepMind | N/A | ❌ | Proprietary | 1,048,576 | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | ❌ | N/A | N/A | $0.10 | $0.40 | N/A | ❌ | ❌ |
| Gemini 1.5 Pro LEGACY | Google DeepMind | N/A | ❌ | Proprietary | 2,097,152 | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | 85.9 | N/A | $1.25 | $5.00 | N/A | ✅ | ❌ |
| Gemma 2 27B NEW | Google DeepMind | 27 | ✅ | Gemma License | 8,192 | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | 75.2 | N/A | Free (self-host) | Free (self-host) | ~54 GB (FP16) | ✅ | ❌ |
| Gemma 2 9B NEW | Google DeepMind | 9.2 | ✅ | Gemma License | 8,192 | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | 71.3 | N/A | Free (self-host) | Free (self-host) | ~18 GB (FP16) | ✅ | ❌ |
| Llama 4 Scout | Meta AI | 109 (17B active) | ✅ | Llama 4 Community | 10,000,000 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | N/A | N/A | $0.15 | $0.60 | ~218 GB (FP16) | ✅ | ❌ |
| Llama 3.3 70B | Meta AI | 70.6 | ✅ | Llama 3.3 Community | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 86.0 | 88.4 | $0.60 | $0.60 | ~140 GB (FP16) | ✅ | ❌ |
| Llama 3.1 405B | Meta AI | 405 | ✅ | Llama 3.1 Community | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 88.6 | 89.0 | $3.00 | $3.00 | ~810 GB (FP16) | ✅ | ❌ |
| Llama 3.1 70B | Meta AI | 70.6 | ✅ | Llama 3.1 Community | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 86.0 | 80.5 | $0.88 | $0.88 | ~140 GB (FP16) | ✅ | ❌ |
| Llama 3.1 8B | Meta AI | 8.0 | ✅ | Llama 3.1 Community | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 73.0 | 72.6 | $0.06 | $0.06 | ~16 GB (FP16) | ✅ | ❌ |
| Mistral Small 3.1 | Mistral AI | 24 | ✅ | Apache 2.0 | 128,000 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | N/A | $0.10 | $0.30 | ~48 GB (FP16) | ✅ | ❌ |
| Mixtral 8x22B LEGACY | Mistral AI | 141 (39B active) | ✅ | Apache 2.0 | 65,536 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 77.8 | 75.0 | $2.00 | $6.00 | ~282 GB (FP16) | ✅ | ❌ |
| Mixtral 8x7B LEGACY | Mistral AI | 46.7 (12.9B active) | ✅ | Apache 2.0 | 32,768 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 70.6 | 40.2 | $0.24 | $0.24 | ~93 GB (FP16) | ✅ | ❌ |
| Mistral Large 2 LEGACY | Mistral AI | 123 | ❌ | Proprietary | 128,000 | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | 84.0 | N/A | $2.00 | $6.00 | N/A | ❌ | ❌ |
| DeepSeek-V3.2 | DeepSeek | 671 (37B active) | ✅ | DeepSeek License | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 88.5 | 82.6 | $0.28 | $0.42 | ~1.3 TB (FP16) | ✅ | ❌ |
| DeepSeek-R1 | DeepSeek | 671 (37B active) | ✅ | MIT | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 90.8 | 85.0 | $0.28 | $0.42 | ~1.3 TB (FP16) | ✅ | ✅ |
| Grok-1 NEW | xAI | 314 (MoE) | ✅ | Apache 2.0 | 8,192 | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | 73.0 | N/A | Free (self-host) | Free (self-host) | ~628 GB (FP16) | ✅ | ❌ |
| Qwen2.5-72B | Alibaba Cloud | 72.7 | ✅ | Qwen License | 131,072 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 86.1 | 86.6 | $0.90 | $0.90 | ~145 GB (FP16) | ✅ | ❌ |
| Qwen2.5-32B NEW | Alibaba Cloud | 32.5 | ✅ | Apache 2.0 | 131,072 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 83.0 | 81.7 | $0.40 | $0.40 | ~65 GB (FP16) | ✅ | ❌ |
| Qwen2.5-Coder-32B | Alibaba Cloud | 32.5 | ✅ | Apache 2.0 | 131,072 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | 92.7 | $0.40 | $0.40 | ~65 GB (FP16) | ✅ | ❌ |
| Phi-3 Medium (14B) | Microsoft | 14 | ✅ | MIT | 128,000 | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | 78.0 | 62.2 | $0.14 | $0.56 | ~28 GB (FP16) | ✅ | ❌ |
| Phi-3-mini (3.8B) NEW | Microsoft | 3.8 | ✅ | MIT | 128,000 | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | 70.9 | 58.5 | Free (self-host) | Free (self-host) | ~8 GB (FP16) | ✅ | ❌ |
| Command R+ | Cohere | 104 | ✅ | CC-BY-NC-4.0 | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 75.7 | N/A | $2.50 | $10.00 | ~208 GB (FP16) | ✅ | ❌ |
| Falcon-2 11B NEW | TII (UAE) | 11 | ✅ | TII Falcon 2.0 (Apache-based) | 8,192 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | N/A | N/A | Free (self-host) | Free (self-host) | ~22 GB (FP16) | ✅ | ❌ |
| Falcon-180B NEW | TII (UAE) | 180 | ✅ | Falcon-180B TII License | 2,048 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | N/A | N/A | Free (self-host) | Free (self-host) | ~400 GB (FP16) | ✅ | ❌ |
| DBRX (132B MoE) NEW | Databricks | 132 (36B active) | ✅ | Databricks Open | 32,768 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 73.7 | 70.1 | N/A | N/A | ~264 GB (FP16) | ✅ | ❌ |
| Jamba 1.5 Large NEW | AI21 Labs | 398 (94B active) | ✅ | Jamba Open Model License | 256,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 81.2 | N/A | N/A | N/A | ~796 GB (FP16) | ✅ | ❌ |
| Nemotron-70B NEW | NVIDIA | 70.6 | ✅ | Llama 3.1 Community | 128,000 | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | N/A | N/A | Free (self-host) | Free (self-host) | ~140 GB (FP16) | ✅ | ❌ |
DeepSeek offers deepseek-chat (non-thinking) and deepseek-reasoner (thinking mode) at the same price point.

Model names encode a lot of information. Here's how to decode them:
The number before "B" indicates the model's parameter count in billions.
- Llama-3.1-8B → 8 billion parameters
- Llama-3.1-70B → 70 billion parameters
- Llama-3.1-405B → 405 billion parameters

More parameters generally means more capable, but also more resource-hungry.
An "-Instruct" suffix indicates the model has been fine-tuned for instruction following and conversation.
- Llama-3.1-8B → base model (completion only)
- Llama-3.1-8B-Instruct → tuned for chat/instructions

Always use the Instruct variant for chat applications.
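A minimal sketch of the difference in practice, using the Hugging Face transformers library; the model ID is illustrative (and gated on the Hub):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain MoE in one sentence."},
]

# The Instruct model was fine-tuned on prompts wrapped in this chat template;
# a base model has no template and simply continues raw text.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```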
Models like Mixtral 8x7B and Llama 4 Scout use a MoE architecture: multiple expert networks exist, but only a subset is active per token ("8x7B" reads as 8 experts of roughly 7B parameters each).
Quantized model names include the format and bit depth:
- Q4_K_M → 4-bit quantization, K-quant, Medium quality
- Q5_K_S → 5-bit, K-quant, Small (more compression)
- Q8_0 → 8-bit quantization
- GPTQ-Int4 → GPTQ format, 4-bit integers
- AWQ → Activation-aware Weight Quantization

OpenAI now uses a cleaner versioning scheme:
- GPT-4.1 → non-reasoning, succeeds GPT-4o
- GPT-5 → reasoning flagship, succeeds o3
- o3 / o4-mini → reasoning model line

Other vendors follow their own schemes:

- Claude Opus 4.6 → Anthropic's latest flagship
- Gemini 2.5 / 3.x → Google's generation numbering
- Qwen2.5 → 2nd generation, 5th revision

Common naming patterns for model sizes run nano < mini < small < medium < large, with Flash (Gemini) and Haiku (Claude) marking each family's fast, low-cost tier.
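To make these conventions concrete, here is a small, hypothetical filename parser. The regex covers only the common "Family-size-B[-Instruct][-quant]" shape; real repository names vary, so treat it as a sketch:

```python
import re

# Hypothetical helper: decode a GGUF-style model filename into its parts.
NAME_RE = re.compile(
    r"(?P<family>[A-Za-z][A-Za-z0-9.]*(?:-\d[\d.]*)?)"  # e.g. Llama-3.1, Qwen2.5
    r"-(?P<size>\d+(?:\.\d+)?)[Bb]\b"                   # parameter count in billions
    r"(?P<instruct>-Instruct)?"                         # chat-tuned variant?
    r"(?:-(?P<quant>Q\d\w*))?"                          # e.g. Q4_K_M
)

def decode(name: str) -> dict:
    m = NAME_RE.search(name)
    if not m:
        raise ValueError(f"unrecognized name: {name}")
    return {
        "family": m["family"],
        "params_b": float(m["size"]),
        "chat_tuned": m["instruct"] is not None,
        "quant": m["quant"] or "full precision",
    }

print(decode("Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
# {'family': 'Llama-3.1', 'params_b': 8.0, 'chat_tuned': True, 'quant': 'Q4_K_M'}
```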
Params (B): Total number of trainable weights, in billions. Larger models generally perform better but require more compute. MoE models have high total counts but fewer active parameters per inference.
Context (tokens): Maximum number of tokens (input + output) the model can process at once. 1 token ≈ 0.75 English words, so a 128K context ≈ 96,000 words ≈ a 300-page book. Llama 4 Scout leads with 10M tokens.
MMLU: Massive Multitask Language Understanding — tests knowledge across 57 subjects (STEM, humanities, social sciences). Scores are 0–100%; top models score 85–90%+. Note: many newer models report MMLU-Pro instead.
HumanEval: Measures code generation ability via 164 Python programming problems, reported as pass@1 (% correct on the first try). Top models score 85–92%+. Newer models often report LiveCodeBench instead.
In/Out $/M tok: Cost per million tokens for API usage. Input tokens (your prompt) are usually cheaper than output tokens (the model's response), and cached input prices are lower still. Reasoning tokens (hidden CoT) are billed as output.
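The pricing math is simple enough to script; a small sketch with a hypothetical helper, using prices straight from the table:

```python
# Dollar cost of one request at the given per-million-token prices.
# Remember that reasoning models bill hidden chain-of-thought as output.
def api_cost(in_tokens: int, out_tokens: int,
             in_price: float, out_price: float) -> float:
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example: a 10K-token prompt with a 2K-token answer on GPT-4o
# ($2.50 in, $10.00 out per the table above).
print(f"${api_cost(10_000, 2_000, 2.50, 10.00):.4f}")  # $0.0450
```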
MoE: Mixture of Experts — routes each token to a subset of specialized "expert" sub-networks, enabling larger total capacity with lower inference cost per token. Used by DeepSeek, Mixtral, DBRX, Jamba, and Llama 4.
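A toy sketch may help make routing concrete. All shapes and weights below are illustrative; the point is the mechanism: 8 experts exist, but only k=2 run per token, so "active" parameters are a fraction of the total (as in Mixtral 8x7B):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 16

router_w = rng.normal(size=(d, n_experts))          # router projection
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                           # one score per expert
    top = np.argsort(logits)[-k:]                   # pick the k best experts
    gates = np.exp(logits[top] - logits[top].max()) # stable softmax over k
    gates /= gates.sum()
    # Only k expert matrices are ever multiplied: the "active parameters".
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_layer(rng.normal(size=d)).shape)          # (16,)
```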
Reasoning/CoT: Chain-of-Thought reasoning mode — the model "thinks" step by step before answering. DeepSeek-R1, o3, GPT-5, and Claude with Extended Thinking use this. Reasoning tokens are billed but not always visible.
Self-Host: Whether you can download and run the model on your own hardware, which requires significant GPU VRAM. Open-source/open-weight models are self-hostable; proprietary models are API-only.
Best picks for cost-effective usage: Llama 3.1 8B ($0.06/M in and out), GPT-4.1 nano and Gemini 2.0 Flash ($0.10 in / $0.40 out), and GPT-4o-mini ($0.15 in / $0.60 out).
Best for code generation and debugging: Qwen2.5-Coder-32B (92.7 HumanEval), Claude Sonnet 4.6, GPT-5, and DeepSeek-V3.2 for budget API use.
Best for multi-step logic, math proofs, and research: GPT-5, o3, Claude Opus 4.6, and DeepSeek-R1 (MIT-licensed and self-hostable).
Best for image understanding and multimodal tasks: GPT-4o, Gemini 2.5 Pro, Claude Sonnet 4.6, and Llama 4 Scout among open-weight models.
Best for running on your own infrastructure: Llama 3.3 70B, Qwen2.5-72B, and DeepSeek-R1 at the high end; Phi-3-mini and Llama 3.1 8B on modest GPUs.
Best for processing very long inputs: Llama 4 Scout (10M tokens), then GPT-4.1 and Gemini 2.5 Pro/Flash (~1M tokens each).
Quantization reduces model size and VRAM needs by using lower-precision numbers for weights.
GGUF: The most popular format for CPU + GPU inference, via llama.cpp and ollama.
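A minimal local-inference sketch with the llama-cpp-python bindings; the GGUF path is a placeholder, so download a quantized file first:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU; 0 = CPU only
)

out = llm("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```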
GPTQ: GPU-only quantization optimized for fast inference on NVIDIA GPUs. Supported by AutoGPTQ, vLLM, and TGI.

AWQ (Activation-aware Weight Quantization): A newer method that preserves quality better than GPTQ. Supported by vLLM, TGI, and AutoAWQ.

| Quant Level | Bits/Weight | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 100% | None (baseline) | Maximum quality, research |
| Q8_0 | 8 | ~50% | Negligible | When you have the VRAM |
| Q6_K | 6 | ~38% | Very minimal | Quality-sensitive workloads |
| Q5_K_M | 5 | ~31% | Slight | Good balance |
| Q4_K_M | 4 | ~25% | Minor | Recommended default |
| Q3_K_M | 3 | ~19% | Noticeable | Tight VRAM constraints |
| Q2_K | 2 | ~13% | Significant | Experimentation only |
VRAM requirements for self-hosting follow a simple rule of thumb: VRAM (GB) ≈ Parameters (B) × Bytes per Weight. FP16 uses ~2 bytes per parameter, so a 7B model needs ~14 GB of VRAM.
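A quick sketch of that rule of thumb. This counts weights only, ignoring the KV cache and activations (budget roughly 10–20% extra), and the ~4.8 effective bits for Q4_K_M is an approximation:

```python
def vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weights-only VRAM in GB."""
    return params_b * bits_per_weight / 8

for label, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.8)]:
    print(f"70B @ {label}: ~{vram_gb(70, bits):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ Q8_0: ~70 GB
# 70B @ Q4_K_M: ~42 GB
```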
| Model | FP16 VRAM | Q4_K_M VRAM | Suggested GPU(s) |
|---|---|---|---|
| Phi-3-mini (3.8B) | ~8 GB | ~2 GB | Any modern GPU, even integrated |
| Llama 3.1 8B | ~16 GB | ~5 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Gemma 2 9B | ~18 GB | ~6 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Falcon-2 11B | ~22 GB | ~7 GB | RTX 3090 24GB, RTX 4070 Ti SUPER 16GB |
| Phi-3 14B | ~28 GB | ~9 GB | RTX 3090 24GB, RTX 4070 Ti SUPER 16GB |
| Gemma 2 27B | ~54 GB | ~16 GB | RTX 4090 24GB, RTX 5090 32GB |
| Mistral Small 3.1 (24B) | ~48 GB | ~15 GB | RTX 4090 24GB, RTX 5090 32GB |
| Qwen2.5-32B / Coder-32B | ~65 GB | ~20 GB | RTX 3090 24GB, RTX 4090 24GB |
| Mixtral 8x7B | ~93 GB | ~26 GB | RTX 4090 24GB + CPU offload, or 2× RTX 3090 |
| Llama 3.3 70B / 3.1 70B / Nemotron-70B | ~140 GB | ~40 GB | 2× RTX 4090, A100 80GB |
| Qwen2.5-72B | ~145 GB | ~42 GB | 2× RTX 4090, A100 80GB |
| Command R+ (104B) | ~208 GB | ~60 GB | 3× RTX 4090, 2× A100 80GB |
| DBRX (132B MoE) | ~264 GB | ~75 GB | 4× RTX 4090, 3× A100 80GB |
| Llama 4 Scout (109B MoE) | ~218 GB | ~62 GB | 3× RTX 4090, 3× A100 80GB |
| Mixtral 8x22B | ~282 GB | ~80 GB | 4× RTX 4090, 4× A100 80GB |
| Falcon-180B | ~400 GB | ~115 GB | 8× RTX 4090, 5× A100 80GB |
| Grok-1 (314B MoE) | ~628 GB | ~180 GB | 8× A100 80GB, 8× H100 |
| Jamba 1.5 Large (398B) | ~796 GB | ~230 GB | 8× H100 80GB |
| Llama 3.1 405B | ~810 GB | ~230 GB | 8× A100 80GB, 8× H100 |
| DeepSeek-V3.2 / R1 | ~1.3 TB | ~370 GB | 8× H100 80GB minimum |
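As a closing sketch, here is one way to pick the highest-quality quant that fits a VRAM budget. The effective bits-per-weight values approximate llama.cpp K-quants, and the 15% overhead for KV cache and activations is an assumption:

```python
QUANTS = [("FP16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
          ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 2.6)]

def best_fit(params_b: float, budget_gb: float, overhead: float = 1.15) -> str:
    for name, bits in QUANTS:                  # ordered best quality first
        if params_b * bits / 8 * overhead <= budget_gb:
            return name
    return "won't fit: shard across GPUs or offload layers to CPU"

print(best_fit(8, 12))    # Llama 3.1 8B on an RTX 3060 12GB -> 'Q8_0'
print(best_fit(70, 24))   # a 70B on a single 24GB card -> won't fit
```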
Apple Silicon can run models from unified memory via the llama.cpp Metal backend. It is slower than NVIDIA GPUs but works well for local testing.
All data in this guide comes from verified, official sources: provider documentation, model cards, and published benchmark results.