🤖 LLM Chooser v3 — Feb 2026

Comprehensive comparison of 40+ Large Language Models — benchmarks, pricing, capabilities & hardware requirements

🆕 v3 Updates — Feb 20, 2026

Major expansion adding a Developer column and 10 new open-source models:

  • Developer column added — every model now shows its creator (OpenAI, Anthropic, Meta AI, etc.)
  • 10 new OSS models: Grok-1, Falcon-2 11B, Falcon-180B, Qwen2.5-32B, Gemma 2 27B, Gemma 2 9B, Phi-3-mini (3.8B), DBRX (132B MoE), Jamba 1.5 Large, Nemotron-70B
  • Open source coverage dramatically improved — table now includes models from xAI, TII, Alibaba Cloud, Google DeepMind, Microsoft, Databricks, AI21 Labs, and NVIDIA
  • All OSS/Self-Hostable flags verified against actual licenses on Hugging Face model cards

v2 Updates — Feb 20, 2026

Deep research iteration with verified data from official sources:

  • 12 new models added: GPT-5, GPT-4.1, o3, o3-mini, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, Llama 4 Scout, Llama 3.3 70B, Mistral Small 3.1
  • All pricing verified against official API pricing pages (OpenAI, Anthropic, Google, DeepSeek, Mistral)
  • Legacy models marked — GPT-4-Turbo, Claude 3 Opus, Claude 3.5 Sonnet/Haiku, Gemini 1.5 Pro now labeled as legacy
  • DeepSeek unified — V3.2 now serves both chat and reasoning modes via a single API
  • Context windows updated — GPT-4.1 now 1M tokens, Claude Opus/Sonnet 4.6 support 1M (beta)
  • Decision framework refreshed with current model recommendations
  • Sources section expanded with direct links to official documentation

📊 Model Comparison Table

All data sourced from official docs and verified benchmarks. NEW = added in v3, LEGACY = superseded.

| Model | Developer | Params (B) | License | Context (tokens) | MMLU | HumanEval | In $/M tok | Out $/M tok | Min VRAM |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $1.25 | $10.00 | N/A |
| GPT-4.1 | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $2.00 | $8.00 | N/A |
| GPT-4.1 mini | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $0.40 | $1.60 | N/A |
| GPT-4.1 nano | OpenAI | N/A | Proprietary | 1,047,576 | N/A | N/A | $0.10 | $0.40 | N/A |
| o3 | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $2.00 | $8.00 | N/A |
| o3-mini | OpenAI | N/A | Proprietary | 200,000 | N/A | N/A | $1.10 | $4.40 | N/A |
| GPT-4o | OpenAI | ~200 | Proprietary | 128,000 | 88.7 | 90.2 | $2.50 | $10.00 | N/A |
| GPT-4o-mini | OpenAI | ~8 | Proprietary | 128,000 | 82.0 | 87.2 | $0.15 | $0.60 | N/A |
| GPT-4-Turbo (LEGACY) | OpenAI | ~200 | Proprietary | 128,000 | 86.5 | 87.1 | $10.00 | $30.00 | N/A |
| Claude Opus 4.6 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $5.00 | $25.00 | N/A |
| Claude Sonnet 4.6 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $3.00 | $15.00 | N/A |
| Claude Haiku 4.5 | Anthropic | N/A | Proprietary | 200,000 | N/A | N/A | $1.00 | $5.00 | N/A |
| Claude 3.5 Sonnet (LEGACY) | Anthropic | N/A | Proprietary | 200,000 | 88.7 | 92.0 | $3.00 | $15.00 | N/A |
| Claude 3 Opus (LEGACY) | Anthropic | N/A | Proprietary | 200,000 | 86.8 | 84.9 | $15.00 | $75.00 | N/A |
| Gemini 2.5 Pro | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $1.25 | $10.00 | N/A |
| Gemini 2.5 Flash | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $0.15 | $0.60 | N/A |
| Gemini 2.0 Flash | Google DeepMind | N/A | Proprietary | 1,048,576 | N/A | N/A | $0.10 | $0.40 | N/A |
| Gemini 1.5 Pro (LEGACY) | Google DeepMind | N/A | Proprietary | 2,097,152 | 85.9 | N/A | $1.25 | $5.00 | N/A |
| Gemma 2 27B (NEW) | Google DeepMind | 27 | Gemma License | 8,192 | 75.2 | N/A | Free (self-host) | Free (self-host) | ~54 GB (FP16) |
| Gemma 2 9B (NEW) | Google DeepMind | 9.2 | Gemma License | 8,192 | 71.3 | N/A | Free (self-host) | Free (self-host) | ~18 GB (FP16) |
| Llama 4 Scout | Meta AI | 109 (17B active) | Llama 4 Community | 10,000,000 | N/A | N/A | $0.15 | $0.60 | ~218 GB (FP16) |
| Llama 3.3 70B | Meta AI | 70.6 | Llama 3.3 Community | 128,000 | 86.0 | 88.4 | $0.60 | $0.60 | ~140 GB (FP16) |
| Llama 3.1 405B | Meta AI | 405 | Llama 3.1 Community | 128,000 | 88.6 | 89.0 | $3.00 | $3.00 | ~810 GB (FP16) |
| Llama 3.1 70B | Meta AI | 70.6 | Llama 3.1 Community | 128,000 | 86.0 | 80.5 | $0.88 | $0.88 | ~140 GB (FP16) |
| Llama 3.1 8B | Meta AI | 8.0 | Llama 3.1 Community | 128,000 | 73.0 | 72.6 | $0.06 | $0.06 | ~16 GB (FP16) |
| Mistral Small 3.1 | Mistral AI | 24 | Apache 2.0 | 128,000 | N/A | N/A | $0.10 | $0.30 | ~48 GB (FP16) |
| Mixtral 8x22B (LEGACY) | Mistral AI | 141 (39B active) | Apache 2.0 | 65,536 | 77.8 | 75.0 | $2.00 | $6.00 | ~282 GB (FP16) |
| Mixtral 8x7B (LEGACY) | Mistral AI | 46.7 (12.9B active) | Apache 2.0 | 32,768 | 70.6 | 40.2 | $0.24 | $0.24 | ~93 GB (FP16) |
| Mistral Large 2 (LEGACY) | Mistral AI | 123 | Proprietary | 128,000 | 84.0 | N/A | $2.00 | $6.00 | N/A |
| DeepSeek-V3.2 | DeepSeek | 671 (37B active) | DeepSeek License | 128,000 | 88.5 | 82.6 | $0.28 | $0.42 | ~1.3 TB (FP16) |
| DeepSeek-R1 | DeepSeek | 671 (37B active) | MIT | 128,000 | 90.8 | 85.0 | $0.28 | $0.42 | ~1.3 TB (FP16) |
| Grok-1 (NEW) | xAI | 314 (MoE) | Apache 2.0 | 8,192 | 73.0 | N/A | Free (self-host) | Free (self-host) | ~628 GB (FP16) |
| Qwen2.5-72B | Alibaba Cloud | 72.7 | Qwen License | 131,072 | 86.1 | 86.6 | $0.90 | $0.90 | ~145 GB (FP16) |
| Qwen2.5-32B (NEW) | Alibaba Cloud | 32.5 | Apache 2.0 | 131,072 | 83.0 | 81.7 | $0.40 | $0.40 | ~65 GB (FP16) |
| Qwen2.5-Coder-32B | Alibaba Cloud | 32.5 | Apache 2.0 | 131,072 | N/A | 92.7 | $0.40 | $0.40 | ~65 GB (FP16) |
| Phi-3 Medium (14B) | Microsoft | 14 | MIT | 128,000 | 78.0 | 62.2 | $0.14 | $0.56 | ~28 GB (FP16) |
| Phi-3-mini (3.8B) (NEW) | Microsoft | 3.8 | MIT | 128,000 | 70.9 | 58.5 | Free (self-host) | Free (self-host) | ~8 GB (FP16) |
| Command R+ | Cohere | 104 | CC-BY-NC-4.0 | 128,000 | 75.7 | N/A | $2.50 | $10.00 | ~208 GB (FP16) |
| Falcon-2 11B (NEW) | TII (UAE) | 11 | TII Falcon 2.0 (Apache-based) | 8,192 | N/A | N/A | Free (self-host) | Free (self-host) | ~22 GB (FP16) |
| Falcon-180B (NEW) | TII (UAE) | 180 | Falcon-180B TII License | 2,048 | N/A | N/A | Free (self-host) | Free (self-host) | ~400 GB (FP16) |
| DBRX (132B MoE) (NEW) | Databricks | 132 (36B active) | Databricks Open | 32,768 | 73.7 | 70.1 | N/A | N/A | ~264 GB (FP16) |
| Jamba 1.5 Large (NEW) | AI21 Labs | 398 (94B active) | Jamba Open Model License | 256,000 | 81.2 | N/A | N/A | N/A | ~796 GB (FP16) |
| Nemotron-70B (NEW) | NVIDIA | 70.6 | Llama 3.1 Community | 128,000 | N/A | N/A | Free (self-host) | Free (self-host) | ~140 GB (FP16) |
💡 Notes on data: Parameter counts for proprietary models (GPT-4o/5, Claude, Gemini) are not officially disclosed — estimates marked with "~" are based on published reports. Benchmark scores (MMLU, HumanEval) come from official model papers and independent evaluations. Many newer models (GPT-4.1, GPT-5, Claude 4.x, Gemini 2.5) have not published traditional MMLU/HumanEval scores and instead report newer benchmarks (GPQA, AIME, LiveCodeBench). Pricing is from official API pricing pages as of Feb 20, 2026.

DeepSeek note: As of Feb 2026, DeepSeek's API serves DeepSeek-V3.2 for both deepseek-chat (non-thinking) and deepseek-reasoner (thinking mode) at the same price point.

Anthropic note: Current Anthropic models are Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5. Claude Opus 4.6 and Sonnet 4.6 support 1M token context (beta). The Claude 3.x series are legacy.

Self-host pricing: Models marked "Free (self-host)" have no API pricing — they are open-weight models you download and run on your own hardware. The only cost is compute/electricity.

📛 How to Read Model Names

Model names encode a lot of information. Here's how to decode them:

B = Billion Parameters

The number before "B" indicates the model's parameter count in billions.

  • Llama-3.1-8B → 8 billion parameters
  • Llama-3.1-70B → 70 billion parameters
  • Llama-3.1-405B → 405 billion parameters

More parameters generally = more capable but more resource-hungry.

Instruct / Chat

Indicates the model has been fine-tuned for instruction following and conversation.

  • Llama-3.1-8B → base model (completion only)
  • Llama-3.1-8B-Instruct → tuned for chat/instructions

Always use the Instruct variant for chat applications.

MoE = Mixture of Experts

Models like Mixtral 8x7B and Llama 4 Scout use MoE architecture: multiple expert networks, but only a subset is active per token (see the routing sketch after this list).

  • Llama 4 Scout: 109B total, ~17B active per token
  • DeepSeek-V3: 671B total, ~37B active per token
  • DBRX: 132B total, ~36B active per token
  • Faster inference than equivalent dense models
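
To make the routing concrete, here is a minimal, illustrative top-2 router in Python. All sizes, names, and weights are invented for demonstration; real MoE layers add load balancing, capacity limits, and often shared experts.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2   # toy sizes, not a real model

# Each "expert" here is just one weight matrix; the router is a small gate.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
router = rng.standard_normal((N_EXPERTS, D_MODEL)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Send one token's hidden state through its top-2 experts only."""
    logits = router @ x                   # one gating score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over just the chosen experts
    # Only TOP_K of the N_EXPERTS matrices touch this token, which is why a
    # 109B-total / 17B-active model can run like a much smaller dense model.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

print(moe_layer(rng.standard_normal(D_MODEL)).shape)  # (16,)
```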

Quantization Suffixes

Quantized model names include the format and bit depth:

  • Q4_K_M → 4-bit quantization, K-quant, Medium quality
  • Q5_K_S → 5-bit, K-quant, Small (more compression)
  • Q8_0 → 8-bit quantization
  • GPTQ-Int4 → GPTQ format, 4-bit integers
  • AWQ → Activation-aware Weight Quantization

Version Numbers & Generations

Vendors have moved to cleaner versioning schemes:

  • GPT-4.1 → non-reasoning, succeeds GPT-4o
  • GPT-5 → reasoning flagship, succeeds o3
  • o3 / o4-mini → reasoning model line
  • Claude Opus 4.6 → Anthropic's latest flagship
  • Gemini 2.5 / 3.x → Google's generation system
  • Qwen2.5 → the 2.5 generation of Alibaba's Qwen family

Size Tiers

Common naming patterns for model sizes:

  • Nano/Mini → lightweight, fast, cheapest
  • Small/Medium → balanced capability
  • Large/Pro/Opus → highest capability tier
  • Flash/Haiku → optimized for speed
  • Sonnet → balanced price/performance (Anthropic)
  • Scout/Maverick → Meta's Llama 4 tiers

📋 What Each Parameter Means

Parameter Count

Total number of trainable weights in billions. Larger models generally perform better but require more compute. MoE models have high total counts but fewer active parameters per inference.

Context Window

Maximum number of tokens (input + output) the model can process at once. 1 token ≈ 0.75 English words. A 128K context ≈ ~96,000 words ≈ a 300-page book. Llama 4 Scout leads with 10M tokens.
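
As a rough sanity check, the heuristic above is easy to script. A minimal sketch (the provider's real tokenizer is authoritative for billing):

```python
WORDS_PER_TOKEN = 0.75  # rough heuristic for English text; tokenizers vary

def estimated_tokens(word_count: int) -> int:
    return round(word_count / WORDS_PER_TOKEN)

def fits(word_count: int, context_tokens: int) -> bool:
    return estimated_tokens(word_count) <= context_tokens

print(estimated_tokens(96_000))  # 128000: the 300-page book example above
print(fits(96_000, 128_000))     # True, with no room left for the reply
```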

MMLU Score

Massive Multitask Language Understanding — tests knowledge across 57 subjects (STEM, humanities, social sciences). Scores are 0–100%. Top models score 85–90%+. Note: many newer models use MMLU-Pro instead.

HumanEval Score

Measures code generation ability via 164 Python programming problems. Reports pass@1 (% correct on first try). Top models score 85–92%+. Newer models often report LiveCodeBench instead.

API Pricing

Cost per million tokens for API usage. Input tokens (your prompt) are usually cheaper than output tokens (model's response). Cached input prices are even lower. Reasoning tokens (hidden CoT) are billed as output.
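
The arithmetic is simple enough to script. A minimal cost estimator, using GPT-4o-mini's row from the table above ($0.15 in / $0.60 out) as the example prices:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_m: float, out_per_m: float) -> float:
    """Cost in dollars, given per-million-token prices for input and output."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# A 2,000-token prompt with a 500-token reply on GPT-4o-mini:
print(f"${request_cost(2_000, 500, in_per_m=0.15, out_per_m=0.60):.4f}")  # $0.0006
# On reasoning models, remember that hidden reasoning tokens count as output.
```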

MoE Architecture

Mixture of Experts — routes each token to a subset of specialized "expert" sub-networks. Enables larger total capacity with lower inference cost per token. Used by DeepSeek, Mixtral, DBRX, Jamba, and Llama 4.

Reasoning / CoT

Chain-of-Thought reasoning mode — the model "thinks step-by-step" before answering. Models like DeepSeek-R1, o3, GPT-5, and Claude with Extended Thinking use this. Reasoning tokens are billed but not always visible.
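
Anthropic, for example, exposes this as "extended thinking" with an explicit token budget. A hedged sketch using the anthropic Python SDK; the model ID below is an assumption for illustration, so verify it against current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",    # assumed ID for illustration; check current docs
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},  # cap on reasoning tokens
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# Thinking blocks arrive before the final answer and are billed as output tokens.
for block in response.content:
    print(block.type)  # "thinking" block(s), then a "text" block
```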

Self-Hostable

Whether you can download and run the model on your own hardware. Requires significant GPU VRAM. Open-source/open-weight models are self-hostable; proprietary models are API-only.

🎯 Decision Framework: When to Use What

💰 Budget-Conscious / High Volume

Best picks for cost-effective usage:

  • GPT-4.1 nano — $0.10/$0.40, 1M context, excellent for simple tasks
  • Gemini 2.0 Flash — $0.10/$0.40, cheapest with vision + 1M context
  • Gemini 2.5 Flash — $0.15/$0.60, reasoning at flash prices
  • GPT-4o-mini — $0.15/$0.60, proven all-rounder
  • DeepSeek-V3.2 — $0.28/$0.42, top-tier quality at budget price
  • Llama 3.1 8B — self-host for near-zero marginal cost

💻 Coding & Development

Best for code generation and debugging:

  • Claude Opus 4.6 — Anthropic's most capable for coding/agents
  • GPT-4.1 — 1M context, excels at instruction following + tool calling
  • Qwen2.5-Coder-32B — 92.7% HumanEval, open source, self-hostable
  • Claude Sonnet 4.6 — fast + intelligent, great balance
  • DeepSeek-R1 — reasoning mode great for algorithmic problems

🧠 Complex Reasoning & Math

Best for multi-step logic, math proofs, research:

  • GPT-5 — OpenAI's flagship reasoning model
  • o3 — top-tier math/science/coding reasoning
  • Claude Opus 4.6 — extended thinking for deep analysis
  • Gemini 2.5 Pro — Google's thinking model, strong analytical
  • DeepSeek-R1 — open-source reasoning champion, MIT license

👁️ Vision & Multimodal

Best for image understanding and multimodal tasks:

  • Gemini 2.0/2.5 Flash — cheapest vision, 1M context
  • GPT-4.1 — vision + 1M context window
  • Claude Opus/Sonnet 4.6 — excellent document/chart analysis
  • Llama 4 Scout — open source multimodal, 10M context
  • Mistral Small 3.1 — self-hostable vision, Apache 2.0

🏠 Self-Hosting / Privacy

Best for running on your own infrastructure:

  • Phi-3-mini (3.8B) — runs on any GPU, MIT license, ~2GB Q4
  • Gemma 2 9B — excellent for its size, runs on consumer GPUs
  • Llama 3.1 8B — runs on a single consumer GPU (16GB FP16, ~5GB Q4)
  • Mistral Small 3.1 — 24B with vision, Apache 2.0
  • Qwen2.5-72B — best quality-to-size ratio for 70B class
  • Nemotron-70B — NVIDIA-optimized Llama 3.1, top alignment scores

📄 Long Documents / RAG

Best for processing very long inputs:

  • Llama 4 Scout — 10M token context window (open source!)
  • Jamba 1.5 Large — 256K context, SSM-Transformer hybrid, fast inference
  • GPT-4.1 — ~1M tokens, excellent instruction following
  • Gemini 2.5 Pro/Flash — 1M tokens, multimodal
  • Claude Opus/Sonnet 4.6 — 200K (1M beta), excellent recall
  • Command R+ — built specifically for RAG use cases

⚡ Quantization Explained

Quantization reduces model size and VRAM needs by using lower-precision numbers for weights.

GGUF (llama.cpp)

The most popular format for CPU + GPU inference via llama.cpp and Ollama (a minimal loading sketch follows this list).

  • Supports mixed CPU/GPU offloading
  • Multiple quant levels: Q2_K through Q8_0
  • K-quant variants (K_S, K_M, K_L) trade size vs quality
  • Q4_K_M is the sweet spot for most users
  • Best for: consumer hardware, local inference
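
A minimal loading sketch with llama-cpp-python; the file path is a placeholder for whatever Q4_K_M GGUF you have downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads all layers to GPU; lower it to spill to CPU RAM
    n_ctx=8192,       # context length to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One-line summary of GGUF quantization?"}],
)
print(out["choices"][0]["message"]["content"])
```

The n_gpu_layers knob is the mixed CPU/GPU offloading mentioned above: set it below the model's layer count and the remaining layers run on CPU.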

GPTQ

GPU-only quantization optimized for fast inference on NVIDIA GPUs.

  • Requires calibration dataset during quantization
  • Typically 4-bit (Int4) or 8-bit
  • Faster than GGUF on pure GPU workloads
  • Used with AutoGPTQ, vLLM, TGI
  • Best for: production GPU serving

AWQ (Activation-Aware)

A newer quantization method that preserves quality better than GPTQ (a serving sketch follows this list).

  • Protects the most important weights (based on activation patterns)
  • Typically 4-bit
  • Slightly better quality than GPTQ at same bit depth
  • Used with vLLM, TGI, AutoAWQ
  • Best for: quality-sensitive production deployment
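
A hedged serving sketch with vLLM; the repository ID is an assumption, so substitute whichever AWQ checkpoint you actually deploy:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed Hugging Face repo ID
    quantization="awq",                     # tell vLLM the checkpoint's format
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```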

Quantization Impact on Quality & Size

| Quant Level | Bits/Weight | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 100% | None (baseline) | Maximum quality, research |
| Q8_0 | 8 | ~50% | Negligible | When you have the VRAM |
| Q6_K | 6 | ~38% | Very minimal | Quality-sensitive workloads |
| Q5_K_M | 5 | ~31% | Slight | Good balance |
| Q4_K_M | 4 | ~25% | Minor | Recommended default |
| Q3_K_M | 3 | ~19% | Noticeable | Tight VRAM constraints |
| Q2_K | 2 | ~13% | Significant | Experimentation only |

VRAM Formula

Quick VRAM estimate: VRAM (GB) ≈ Parameters (B) × Bytes per Weight
FP16 = 2 bytes → 7B model ≈ 14 GB | Q4 ≈ 0.5 bytes → 7B model ≈ 3.5 GB
Add ~10–20% overhead for KV cache, activations, and framework. Longer contexts need more KV cache memory.
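
The same formula as a small function, with the overhead folded in; the bytes-per-weight values mirror the quantization table above:

```python
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q6": 0.75, "q5": 0.625,
                    "q4": 0.5, "q3": 0.375, "q2": 0.25}

def vram_gb(params_b: float, quant: str = "fp16", overhead: float = 0.15) -> float:
    """Rough VRAM need (GB): weights for a params_b-billion model plus ~15% slack."""
    return params_b * BYTES_PER_WEIGHT[quant] * (1 + overhead)

print(f"{vram_gb(7):.1f} GB")        # ~16.1 GB: 7B at FP16 plus overhead
print(f"{vram_gb(7, 'q4'):.1f} GB")  # ~4.0 GB: same model at 4-bit
print(f"{vram_gb(70, 'q4'):.1f} GB") # ~40.2 GB: matches the 70B Q4 row below
```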

🖥️ Hardware Requirements Guide

VRAM requirements for self-hosting. Rule of thumb: ~2 bytes per parameter at FP16, so a 7B model needs ~14 GB VRAM.

| Model | FP16 VRAM | Q4_K_M VRAM | Suggested GPU(s) |
|---|---|---|---|
| Phi-3-mini (3.8B) | ~8 GB | ~2 GB | Any modern GPU, even integrated |
| Llama 3.1 8B | ~16 GB | ~5 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Gemma 2 9B | ~18 GB | ~6 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| Falcon-2 11B | ~22 GB | ~7 GB | RTX 3090 24GB, RTX 4070 Ti Super 16GB |
| Phi-3 14B | ~28 GB | ~9 GB | RTX 3090 24GB, RTX 4070 Ti Super 16GB |
| Gemma 2 27B | ~54 GB | ~16 GB | RTX 4090 24GB, RTX 5090 32GB |
| Mistral Small 3.1 (24B) | ~48 GB | ~15 GB | RTX 4090 24GB, RTX 5090 32GB |
| Qwen2.5-32B / Coder-32B | ~65 GB | ~20 GB | RTX 3090 24GB, RTX 4090 24GB |
| Mixtral 8x7B | ~93 GB | ~26 GB | RTX 4090 24GB + CPU offload, or 2× RTX 3090 |
| Llama 3.3 70B / 3.1 70B / Nemotron-70B | ~140 GB | ~40 GB | 2× RTX 4090, A100 80GB |
| Qwen2.5-72B | ~145 GB | ~42 GB | 2× RTX 4090, A100 80GB |
| Command R+ (104B) | ~208 GB | ~60 GB | 3× RTX 4090, 2× A100 80GB |
| DBRX (132B MoE) | ~264 GB | ~75 GB | 4× RTX 4090, 3× A100 80GB |
| Llama 4 Scout (109B MoE) | ~218 GB | ~62 GB | 3× RTX 4090, 3× A100 80GB |
| Mixtral 8x22B | ~282 GB | ~80 GB | 4× RTX 4090, 4× A100 80GB |
| Falcon-180B | ~400 GB | ~115 GB | 8× RTX 4090, 5× A100 80GB |
| Grok-1 (314B MoE) | ~628 GB | ~180 GB | 8× A100 80GB, 8× H100 |
| Jamba 1.5 Large (398B) | ~796 GB | ~230 GB | 8× H100 80GB |
| Llama 3.1 405B | ~810 GB | ~230 GB | 8× A100 80GB, 8× H100 |
| DeepSeek-V3.2 / R1 | ~1.3 TB | ~370 GB | 8× H100 80GB minimum |
⚠️ Important: VRAM estimates are for inference only. Training and fine-tuning require 2–4× more memory. Context length also affects VRAM — longer contexts need more memory for KV cache. These are approximate values for the model weights only.

Quick Reference: GPU VRAM

Consumer GPUs

  • RTX 3060 — 12 GB
  • RTX 3080 — 10 GB
  • RTX 3090 — 24 GB
  • RTX 4060 Ti — 16 GB
  • RTX 4070 Ti Super — 16 GB
  • RTX 4080 — 16 GB
  • RTX 4090 — 24 GB
  • RTX 5090 — 32 GB

Data Center GPUs

  • A10G — 24 GB
  • A40 — 48 GB
  • A100 — 40 GB or 80 GB
  • H100 — 80 GB
  • H200 — 141 GB
  • MI300X (AMD) — 192 GB
  • B200 — 192 GB

Apple Silicon (Unified Memory)

  • M1/M2 — 8–24 GB
  • M1/M2 Pro — 16–32 GB
  • M1/M2/M3 Max — 32–128 GB
  • M1/M2 Ultra — 64–192 GB
  • M4 Max — up to 128 GB
  • M4 Ultra — up to 256 GB

Apple Silicon can run models using unified memory with llama.cpp Metal backend. Slower than NVIDIA GPUs but works well for local testing.

📚 Sources

All data in this guide comes from verified, official sources:

  • Official API pricing & documentation
  • Model cards & technical reports
  • Independent benchmarks & analysis

📅 Last Updated: February 20, 2026 (v3 — Developer column + OSS expansion)
⚠️ Disclaimer: LLM pricing, benchmarks, and capabilities change frequently. Always check official sources for the most current information. Benchmark scores may vary across evaluation methodologies. Many newer models (GPT-4.1+, Claude 4.x, Gemini 2.5+) have moved away from MMLU/HumanEval to newer benchmarks like GPQA, AIME, and LiveCodeBench — their cells show N/A for traditional benchmarks.