🤖 LLM Chooser v3 — Feb 2026

Comprehensive comparison of 40+ Large Language Models — benchmarks, pricing, capabilities & hardware requirements

🆕 v3 Updates — Feb 20, 2026

Major expansion adding a Developer column and 10 new open-source models:

  • Developer column added — every model now shows its creator (OpenAI, Anthropic, Meta AI, etc.)
  • 10 new OSS models: Grok-1, Falcon-2 11B, Falcon-180B, Qwen2.5-32B, Gemma 2 27B, Gemma 2 9B, Phi-3-mini (3.8B), DBRX (132B MoE), Jamba 1.5 Large, Nemotron-70B
  • Open source coverage dramatically improved — table now includes models from xAI, TII, Alibaba Cloud, Google DeepMind, Microsoft, Databricks, AI21 Labs, and NVIDIA
  • All OSS/Self-Hostable flags verified against actual licenses on Hugging Face model cards

v2 Updates — Feb 20, 2026

Deep research iteration with verified data from official sources:

  • 12 new models added: GPT-5, GPT-4.1, o3, o3-mini, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, Llama 4 Scout, Llama 3.3 70B, Mistral Small 3.1
  • All pricing verified against official API pricing pages (OpenAI, Anthropic, Google, DeepSeek, Mistral)
  • Legacy models marked — GPT-4-Turbo, Claude 3 Opus, Claude 3.5 Sonnet/Haiku, Gemini 1.5 Pro now labeled as legacy
  • DeepSeek unified — V3.2 now serves both chat and reasoning modes via a single API
  • Context windows updated — GPT-4.1 now 1M tokens, Claude Opus/Sonnet 4.6 support 1M (beta)
  • Decision framework refreshed with current model recommendations
  • Sources section expanded with direct links to official documentation

📊 Model Comparison Table

All data sourced from official docs and verified benchmarks. NEW = added in v3, LEGACY = superseded.

| Model | Developer | Params (B) | License | Context (tokens) | MMLU | HumanEval | In $/M tok | Out $/M tok | Min VRAM |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $1.25 | $10.00 | N/A |
| GPT-4.1 | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $2.00 | $8.00 | N/A |
| GPT-4.1 mini | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $0.40 | $1.60 | N/A |
| GPT-4.1 nano | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $0.10 | $0.40 | N/A |
| o3 | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $2.00 | $8.00 | N/A |
| o3-mini | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $1.10 | $4.40 | N/A |
| GPT-4o | OpenAI | ~200 | Proprietary | 128,000 | 88.7 | 90.2 | $2.50 | $10.00 | N/A |
| GPT-4o-mini | OpenAI | ~8 | Proprietary | 128,000 | 82.0 | 87.2 | $0.15 | $0.60 | N/A |
| GPT-4-Turbo (LEGACY) | OpenAI | ~200 | Proprietary | 128,000 | 86.5 | 87.1 | $10.00 | $30.00 | N/A |
| Claude Opus 4.6 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $5.00 | $25.00 | N/A |
| Claude Sonnet 4.6 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $3.00 | $15.00 | N/A |
| Claude Haiku 4.5 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $1.00 | $5.00 | N/A |
| Claude 3.5 Sonnet (LEGACY) | Anthropic | N/A | Proprietary | 200,000 | 88.7 | 92.0 | $3.00 | $15.00 | N/A |
| Claude 3 Opus (LEGACY) | Anthropic | N/A | Proprietary | 200,000 | 86.8 | 84.9 | $15.00 | $75.00 | N/A |
| Gemini 2.5 Pro | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $1.25 | $10.00 | N/A |
| Gemini 2.5 Flash | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $0.15 | $0.60 | N/A |
| Gemini 2.0 Flash | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $0.10 | $0.40 | N/A |
| Gemini 1.5 Pro (LEGACY) | Google DeepMind | N/A | Proprietary | 2,097,152 | 85.9 | N/A | $1.25 | $5.00 | N/A |
| Gemma 2 27B (NEW) | Google DeepMind | 27 | Gemma License | 8,192 | 75.2 | N/A | Free (self-host) | Free (self-host) | ~54 GB (FP16) |
| Gemma 2 9B (NEW) | Google DeepMind | 9.2 | Gemma License | 8,192 | 71.3 | N/A | Free (self-host) | Free (self-host) | ~18 GB (FP16) |
| Llama 4 Scout | Meta AI | 109 (17B active) | Llama 4 Community | 10,000,000 | N/A | N/A | $0.15 | $0.60 | ~218 GB (FP16) |
| Llama 3.3 70B | Meta AI | 70.6 | Llama 3.3 Community | 128,000 | 86.0 | 88.4 | $0.60 | $0.60 | ~140 GB (FP16) |
| Llama 3.1 405B | Meta AI | 405 | Llama 3.1 Community | 128,000 | 88.6 | 89.0 | $3.00 | $3.00 | ~810 GB (FP16) |
| Llama 3.1 70B | Meta AI | 70.6 | Llama 3.1 Community | 128,000 | 86.0 | 80.5 | $0.88 | $0.88 | ~140 GB (FP16) |
| Llama 3.1 8B | Meta AI | 8.0 | Llama 3.1 Community | 128,000 | 73.0 | 72.6 | $0.06 | $0.06 | ~16 GB (FP16) |
| Mistral Small 3.1 | Mistral AI | 24 | Apache 2.0 | 128,000 | N/A | N/A | $0.10 | $0.30 | ~48 GB (FP16) |
| Mixtral 8x22B (LEGACY) | Mistral AI | 141 (39B active) | Apache 2.0 | 65,536 | 77.8 | 75.0 | $2.00 | $6.00 | ~282 GB (FP16) |
| Mixtral 8x7B (LEGACY) | Mistral AI | 46.7 (12.9B active) | Apache 2.0 | 32,768 | 70.6 | 40.2 | $0.24 | $0.24 | ~93 GB (FP16) |
| Mistral Large 2 (LEGACY) | Mistral AI | 123 | Proprietary | 128,000 | 84.0 | N/A | $2.00 | $6.00 | N/A |
| DeepSeek-V3.2 | DeepSeek | 671 (37B active) | DeepSeek License | 128,000 | 88.5 | 82.6 | $0.28 | $0.42 | ~1.3 TB (FP16) |
| DeepSeek-R1 | DeepSeek | 671 (37B active) | MIT | 128,000 | 90.8 | 85.0 | $0.28 | $0.42 | ~1.3 TB (FP16) |
| Grok-1 (NEW) | xAI | 314 (MoE) | Apache 2.0 | 8,192 | 73.0 | N/A | Free (self-host) | Free (self-host) | ~628 GB (FP16) |
| Qwen2.5-72B | Alibaba Cloud | 72.7 | Qwen License | 131,072 | 86.1 | 86.6 | $0.90 | $0.90 | ~145 GB (FP16) |
| Qwen2.5-32B (NEW) | Alibaba Cloud | 32.5 | Apache 2.0 | 131,072 | 83.0 | 81.7 | $0.40 | $0.40 | ~65 GB (FP16) |
| Qwen2.5-Coder-32B | Alibaba Cloud | 32.5 | Apache 2.0 | 131,072 | N/A | 92.7 | $0.40 | $0.40 | ~65 GB (FP16) |
| Phi-3 Medium (14B) | Microsoft | 14 | MIT | 128,000 | 78.0 | 62.2 | $0.14 | $0.56 | ~28 GB (FP16) |
| Phi-3-mini (3.8B) (NEW) | Microsoft | 3.8 | MIT | 128,000 | 70.9 | 58.5 | Free (self-host) | Free (self-host) | ~8 GB (FP16) |
| Command R+ | Cohere | 104 | CC-BY-NC-4.0 | 128,000 | 75.7 | N/A | $2.50 | $10.00 | ~208 GB (FP16) |
| Falcon-2 11B (NEW) | TII (UAE) | 11 | TII Falcon 2.0 (Apache-based) | 8,192 | N/A | N/A | Free (self-host) | Free (self-host) | ~22 GB (FP16) |
| Falcon-180B (NEW) | TII (UAE) | 180 | Falcon-180B TII License | 2,048 | N/A | N/A | Free (self-host) | Free (self-host) | ~400 GB (FP16) |
| DBRX (132B MoE) (NEW) | Databricks | 132 (36B active) | Databricks Open | 32,768 | 73.7 | 70.1 | N/A | N/A | ~264 GB (FP16) |
| Jamba 1.5 Large (NEW) | AI21 Labs | 398 (94B active) | Jamba Open Model License | 256,000 | 81.2 | N/A | N/A | N/A | ~796 GB (FP16) |
| Nemotron-70B (NEW) | NVIDIA | 70.6 | Llama 3.1 Community | 128,000 | N/A | N/A | Free (self-host) | Free (self-host) | ~140 GB (FP16) |
💡 Notes on data: Parameter counts for proprietary models (GPT-4o/5, Claude, Gemini) are not officially disclosed — estimates marked with "~" are based on published reports. Benchmark scores (MMLU, HumanEval) come from official model papers and independent evaluations. Many newer models (GPT-4.1, GPT-5, Claude 4.x, Gemini 2.5) have not published traditional MMLU/HumanEval scores and instead report newer benchmarks (GPQA, AIME, LiveCodeBench). Pricing is from official API pricing pages as of Feb 20, 2026.

DeepSeek note: As of Feb 2026, DeepSeek's API serves DeepSeek-V3.2 for both deepseek-chat (non-thinking) and deepseek-reasoner (thinking mode) at the same price point.

Anthropic note: Current Anthropic models are Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5. Claude Opus 4.6 and Sonnet 4.6 support 1M token context (beta). The Claude 3.x series are legacy.

Self-host pricing: Models marked "Free (self-host)" have no API pricing — they are open-weight models you download and run on your own hardware. The only cost is compute/electricity.

📛 How to Read Model Names

Model names encode a lot of information. Here's how to decode them:

B = Billion Parameters

The number before "B" indicates the model's parameter count in billions.

  • Llama-3.1-8B → 8 billion parameters
  • Llama-3.1-70B → 70 billion parameters
  • Llama-3.1-405B → 405 billion parameters

More parameters generally = more capable but more resource-hungry.

Instruct / Chat

Indicates the model has been fine-tuned for instruction following and conversation.

  • Llama-3.1-8B → base model (completion only)
  • Llama-3.1-8B-Instruct → tuned for chat/instructions

Always use the Instruct variant for chat applications.

MoE = Mixture of Experts

Models like Mixtral 8x7B and Llama 4 Scout use MoE architecture: multiple expert networks, but only a subset is active per token (see the routing sketch after this list).

  • Llama 4 Scout: 109B total, ~17B active per token
  • DeepSeek-V3: 671B total, ~37B active per token
  • DBRX: 132B total, ~36B active per token
  • Faster inference than equivalent dense models
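
To make the routing concrete, here is a minimal, illustrative top-2 router in Python. All sizes, names, and weights are invented for demonstration; real MoE layers add load balancing, capacity limits, and often shared experts.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2   # toy sizes, not a real model

# Each "expert" here is just one weight matrix; the router is a small gate.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
router = rng.standard_normal((N_EXPERTS, D_MODEL)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Send one token's hidden state through its top-2 experts only."""
    logits = router @ x                   # one gating score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over just the chosen experts
    # Only TOP_K of the N_EXPERTS matrices touch this token, which is why a
    # 109B-total / 17B-active model can run like a much smaller dense model.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

print(moe_layer(rng.standard_normal(D_MODEL)).shape)  # (16,)
```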

Quantization Suffixes

Quantized model names include the format and bit depth:

  • Q4_K_M → 4-bit quantization, K-quant, Medium quality
  • Q5_K_S → 5-bit, K-quant, Small (more compression)
  • Q8_0 → 8-bit quantization
  • GPTQ-Int4 → GPTQ format, 4-bit integers
  • AWQ → Activation-aware Weight Quantization

Version Numbers & Generations

Vendors have moved to cleaner versioning schemes:

  • GPT-4.1 → non-reasoning, succeeds GPT-4o
  • GPT-5 → reasoning flagship, succeeds o3
  • o3 / o4-mini → reasoning model line
  • Claude Opus 4.6 → Anthropic's latest flagship
  • Gemini 2.5 / 3.x → Google's generation system
  • Qwen2.5 → the 2.5 generation of Alibaba's Qwen family

Size Tiers

Common naming patterns for model sizes:

  • Nano/Mini → lightweight, fast, cheapest
  • Small/Medium → balanced capability
  • Large/Pro/Opus → highest capability tier
  • Flash/Haiku → optimized for speed
  • Sonnet → balanced price/performance (Anthropic)
  • Scout/Maverick → Meta's Llama 4 tiers

📋 What Each Parameter Means

Parameter Count

Total number of trainable weights in billions. Larger models generally perform better but require more compute. MoE models have high total counts but fewer active parameters per inference.

Context Window

Maximum number of tokens (input + output) the model can process at once. 1 token ≈ 0.75 English words. A 128K context ≈ ~96,000 words ≈ a 300-page book. Llama 4 Scout leads with 10M tokens.
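
As a rough sanity check, the heuristic above is easy to script. A minimal sketch (the provider's real tokenizer is authoritative for billing):

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English text; tokenizers vary

def estimated_tokens(word_count: int) -> int:
    return round(word_count / WORDS_PER_TOKEN)

def fits(word_count: int, context_tokens: int) -> bool:
    return estimated_tokens(word_count) <= context_tokens

print(estimated_tokens(96_000))  # 128000: the 300-page book example above
print(fits(96_000, 128_000))     # True, with no room left for the reply
```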

MMLU Score

Massive Multitask Language Understanding — tests knowledge across 57 subjects (STEM, humanities, social sciences). Scores are 0–100%. Top models score 85–90%+. Note: many newer models use MMLU-Pro instead.

HumanEval Score

Measures code generation ability via 164 Python programming problems. Reports pass@1 (% correct on first try). Top models score 85–92%+. Newer models often report LiveCodeBench instead.

API Pricing

Cost per million tokens for API usage. Input tokens (your prompt) are usually cheaper than output tokens (model's response). Cached input prices are even lower. Reasoning tokens (hidden CoT) are billed as output.
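
The arithmetic is simple enough to script. A minimal cost estimator, using GPT-4o-mini's row from the table above ($0.15 in / $0.60 out) as the example prices:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_m: float, out_per_m: float) -> float:
    """Cost in dollars, given per-million-token prices for input and output."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# A 2,000-token prompt with a 500-token reply on GPT-4o-mini:
print(f"${request_cost(2_000, 500, in_per_m=0.15, out_per_m=0.60):.4f}")  # $0.0006
# On reasoning models, remember that hidden reasoning tokens count as output.
```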

MoE Architecture

Mixture of Experts — routes each token to a subset of specialized "expert" sub-networks. Enables larger total capacity with lower inference cost per token. Used by DeepSeek, Mixtral, DBRX, Jamba, and Llama 4.

Reasoning / CoT

Chain-of-Thought reasoning mode — the model "thinks step-by-step" before answering. Models like DeepSeek-R1, o3, GPT-5, and Claude with Extended Thinking use this. Reasoning tokens are billed but not always visible.
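
Anthropic, for example, exposes this as "extended thinking" with an explicit token budget. A hedged sketch using the anthropic Python SDK; the model ID below is an assumption for illustration, so verify it against current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",    # assumed ID for illustration; check current docs
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},  # cap on reasoning tokens
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# Thinking blocks arrive before the final answer and are billed as output tokens.
for block in response.content:
    print(block.type)  # "thinking" block(s), then a "text" block
```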

Self-Hostable

Whether you can download and run the model on your own hardware. Requires significant GPU VRAM. Open-source/open-weight models are self-hostable; proprietary models are API-only.

🎯 Decision Framework: When to Use What

💰 Budget-Conscious / High Volume

Best picks for cost-effective usage:

  • GPT-4.1 nano — $0.10/$0.40, 1M context, excellent for simple tasks
  • Gemini 2.0 Flash — $0.10/$0.40, cheapest with vision + 1M context
  • Gemini 2.5 Flash — $0.15/$0.60, reasoning at flash prices
  • GPT-4o-mini — $0.15/$0.60, proven all-rounder
  • DeepSeek-V3.2 — $0.28/$0.42, top-tier quality at budget price
  • Llama 3.1 8B — self-host for near-zero marginal cost

💻 Coding & Development

Best for code generation and debugging:

  • Claude Opus 4.6 — Anthropic's most capable for coding/agents
  • GPT-4.1 — 1M context, excels at instruction following + tool calling
  • Qwen2.5-Coder-32B — 92.7% HumanEval, open source, self-hostable
  • Claude Sonnet 4.6 — fast + intelligent, great balance
  • DeepSeek-R1 — reasoning mode great for algorithmic problems

🧠 Complex Reasoning & Math

Best for multi-step logic, math proofs, research:

  • GPT-5 — OpenAI's flagship reasoning model
  • o3 — top-tier math/science/coding reasoning
  • Claude Opus 4.6 — extended thinking for deep analysis
  • Gemini 2.5 Pro — Google's thinking model, strong analytical
  • DeepSeek-R1 — open-source reasoning champion, MIT license

👁️ Vision & Multimodal

Best for image understanding and multimodal tasks:

  • Gemini 2.0/2.5 Flash — cheapest vision, 1M context
  • GPT-4.1 — vision + 1M context window
  • Claude Opus/Sonnet 4.6 — excellent document/chart analysis
  • Llama 4 Scout — open source multimodal, 10M context
  • Mistral Small 3.1 — self-hostable vision, Apache 2.0

🏠 Self-Hosting / Privacy

Best for running on your own infrastructure:

  • Phi-3-mini (3.8B) — runs on any GPU, MIT license, ~2GB Q4
  • Gemma 2 9B — excellent for its size, runs on consumer GPUs
  • Llama 3.1 8B — runs on a single consumer GPU (16GB FP16, ~5GB Q4)
  • Mistral Small 3.1 — 24B with vision, Apache 2.0
  • Qwen2.5-72B — best quality-to-size ratio for 70B class
  • Nemotron-70B — NVIDIA-optimized Llama 3.1, top alignment scores

📄 Long Documents / RAG

Best for processing very long inputs:

  • Llama 4 Scout — 10M token context window (open source!)
  • Jamba 1.5 Large — 256K context, SSM-Transformer hybrid, fast inference
  • GPT-4.1 — ~1M tokens, excellent instruction following
  • Gemini 2.5 Pro/Flash — 1M tokens, multimodal
  • Claude Opus/Sonnet 4.6 — 200K (1M beta), excellent recall
  • Command R+ — built specifically for RAG use cases

⚡ Quantization Explained

Quantization reduces model size and VRAM needs by using lower-precision numbers for weights.

GGUF (llama.cpp)

The most popular format for CPU + GPU inference via llama.cpp and Ollama (a minimal loading sketch follows this list).

  • Supports mixed CPU/GPU offloading
  • Multiple quant levels: Q2_K through Q8_0
  • K-quant variants (K_S, K_M, K_L) trade size vs quality
  • Q4_K_M is the sweet spot for most users
  • Best for: consumer hardware, local inference
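
A minimal loading sketch with llama-cpp-python; the file path is a placeholder for whatever Q4_K_M GGUF you have downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads all layers to GPU; lower it to spill to CPU RAM
    n_ctx=8192,       # context length to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One-line summary of GGUF quantization?"}],
)
print(out["choices"][0]["message"]["content"])
```

The n_gpu_layers knob is the mixed CPU/GPU offloading mentioned above: set it below the model's layer count and the remaining layers run on CPU.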

GPTQ

GPU-only quantization optimized for fast inference on NVIDIA GPUs.

  • Requires calibration dataset during quantization
  • Typically 4-bit (Int4) or 8-bit
  • Faster than GGUF on pure GPU workloads
  • Used with AutoGPTQ, vLLM, TGI
  • Best for: production GPU serving

AWQ (Activation-Aware)

A newer quantization method that preserves quality better than GPTQ (a serving sketch follows this list).

  • Protects the most important weights (based on activation patterns)
  • Typically 4-bit
  • Slightly better quality than GPTQ at same bit depth
  • Used with vLLM, TGI, AutoAWQ
  • Best for: quality-sensitive production deployment
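
A hedged serving sketch with vLLM; the repository ID is an assumption, so substitute whichever AWQ checkpoint you actually deploy:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed Hugging Face repo ID
    quantization="awq",                     # tell vLLM the checkpoint's format
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```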

Quantization Impact on Quality & Size

| Quant Level | Bits/Weight | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 100% | None (baseline) | Maximum quality, research |
| Q8_0 | 8 | ~50% | Negligible | When you have the VRAM |
| Q6_K | 6 | ~38% | Very minimal | Quality-sensitive workloads |
| Q5_K_M | 5 | ~31% | Slight | Good balance |
| Q4_K_M | 4 | ~25% | Minor | Recommended default |
| Q3_K_M | 3 | ~19% | Noticeable | Tight VRAM constraints |
| Q2_K | 2 | ~13% | Significant | Experimentation only |

VRAM Formula

Quick VRAM estimate: VRAM (GB) ≈ Parameters (B) × Bytes per Weight
FP16 = 2 bytes → 7B model ≈ 14 GB | Q4 ≈ 0.5 bytes → 7B model ≈ 3.5 GB
Add ~10–20% overhead for KV cache, activations, and framework. Longer contexts need more KV cache memory.
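
The same formula as a small function, with the overhead folded in; the bytes-per-weight values mirror the quantization table above:

```python
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q6": 0.75, "q5": 0.625,
                    "q4": 0.5, "q3": 0.375, "q2": 0.25}

def vram_gb(params_b: float, quant: str = "fp16", overhead: float = 0.15) -> float:
    """Rough VRAM need (GB): weights for a params_b-billion model plus ~15% slack."""
    return params_b * BYTES_PER_WEIGHT[quant] * (1 + overhead)

print(f"{vram_gb(7):.1f} GB")        # ~16.1 GB: 7B at FP16 plus overhead
print(f"{vram_gb(7, 'q4'):.1f} GB")  # ~4.0 GB: same model at 4-bit
print(f"{vram_gb(70, 'q4'):.1f} GB") # ~40.2 GB: matches the 70B Q4 row below
```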

🖥️ Hardware Requirements Guide

VRAM requirements for self-hosting. Rule of thumb: ~2 bytes per parameter at FP16, so a 7B model needs ~14 GB VRAM.

| Model | FP16 VRAM | Q4_K_M VRAM | Suggested GPU(s) |
|---|---|---|---|
| Phi-3-mini (3.8B) | ~8 GB | ~2 GB | Any modern GPU, even integrated |
| Llama 3.1 8B | ~16 GB | ~5 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Gemma 2 9B | ~18 GB | ~6 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Falcon-2 11B | ~22 GB | ~7 GB | RTX 3090 24GB, RTX 4070 Ti Super 16GB |
| Phi-3 14B | ~28 GB | ~9 GB | RTX 3090 24GB, RTX 4070 Ti Super 16GB |
| Gemma 2 27B | ~54 GB | ~16 GB | RTX 4090 24GB, RTX 5090 32GB |
| Mistral Small 3.1 (24B) | ~48 GB | ~15 GB | RTX 4090 24GB, RTX 5090 32GB |
| Qwen2.5-32B / Coder-32B | ~65 GB | ~20 GB | RTX 3090 24GB, RTX 4090 24GB |
| Mixtral 8x7B | ~93 GB | ~26 GB | RTX 4090 24GB + CPU offload, or 2× RTX 3090 |
| Llama 3.3 70B / 3.1 70B / Nemotron-70B | ~140 GB | ~40 GB | 2× RTX 4090, A100 80GB |
| Qwen2.5-72B | ~145 GB | ~42 GB | 2× RTX 4090, A100 80GB |
| Command R+ (104B) | ~208 GB | ~60 GB | 3× RTX 4090, 2× A100 80GB |
| DBRX (132B MoE) | ~264 GB | ~75 GB | 4× RTX 4090, 3× A100 80GB |
| Llama 4 Scout (109B MoE) | ~218 GB | ~62 GB | 3× RTX 4090, 3× A100 80GB |
| Mixtral 8x22B | ~282 GB | ~80 GB | 4× RTX 4090, 4× A100 80GB |
| Falcon-180B | ~400 GB | ~115 GB | 8× RTX 4090, 5× A100 80GB |
| Grok-1 (314B MoE) | ~628 GB | ~180 GB | 8× A100 80GB, 8× H100 |
| Jamba 1.5 Large (398B) | ~796 GB | ~230 GB | 8× H100 80GB |
| Llama 3.1 405B | ~810 GB | ~230 GB | 8× A100 80GB, 8× H100 |
| DeepSeek-V3.2 / R1 | ~1.3 TB | ~370 GB | 8× H100 80GB minimum |
⚠️ Important: VRAM estimates are for inference only. Training and fine-tuning require 2–4× more memory. Context length also affects VRAM — longer contexts need more memory for KV cache. These are approximate values for the model weights only.

Quick Reference: GPU VRAM

Consumer GPUs

  • RTX 3060 — 12 GB
  • RTX 3080 — 10 GB
  • RTX 3090 — 24 GB
  • RTX 4060 Ti — 16 GB
  • RTX 4070 Ti Super — 16 GB
  • RTX 4080 — 16 GB
  • RTX 4090 — 24 GB
  • RTX 5090 — 32 GB

Data Center GPUs

  • A10G — 24 GB
  • A40 — 48 GB
  • A100 — 40 GB or 80 GB
  • H100 — 80 GB
  • H200 — 141 GB
  • MI300X (AMD) — 192 GB
  • B200 — 192 GB

Apple Silicon (Unified Memory)

  • M1/M2 — 8–24 GB
  • M1/M2 Pro — 16–32 GB
  • M1/M2/M3 Max — 32–128 GB
  • M1/M2 Ultra — 64–192 GB
  • M4 Max — up to 128 GB
  • M4 Ultra — up to 256 GB

Apple Silicon can run models using unified memory with llama.cpp Metal backend. Slower than NVIDIA GPUs but works well for local testing.

📚 Sources

All data in this guide comes from verified, official sources:

  • Official API pricing & documentation
  • Model cards & technical reports
  • Independent benchmarks & analysis

📅 Last Updated: February 20, 2026 (v3 — Developer column + OSS expansion)
⚠️ Disclaimer: LLM pricing, benchmarks, and capabilities change frequently. Always check official sources for the most current information. Benchmark scores may vary across evaluation methodologies. Many newer models (GPT-4.1+, Claude 4.x, Gemini 2.5+) have moved away from MMLU/HumanEval to newer benchmarks like GPQA, AIME, and LiveCodeBench — their cells show N/A for traditional benchmarks.