Explore a comprehensive collection of cutting-edge AI models from leading providers.
anthropic
Claude Opus 4 is Anthropic's most powerful model, setting new standards for coding, advanced reasoning, and AI agents. It excels at long-running tasks and complex problem-solving, with capabilities like extended thinking with tool use and improved memory.
Claude Sonnet 4 is a significant upgrade to Sonnet 3.7, delivering superior coding and reasoning while responding more precisely to instructions. It balances performance and efficiency for various use cases.
Claude 3.7 Sonnet offers Extended Thinking Scaffolds that boost SWE-bench coding accuracy from 62.3% to 70.3%, with 81.2% accuracy in retail automation tasks, outperforming Claude Sonnet 3.6 (2024-10-22) by 13.6%.
Claude 3.7 Thinking Sonnet exposes the full chain-of-thought process during problem-solving, including error backtracking and alternative solution exploration. Scores 86.1% on GPQA Diamond benchmark for expert-level Q&A.
Claude 3.5 Sonnet offers a cost-efficient API ($3/million input tokens vs. $5 for GPT-4o) and uses embedded alignment techniques that reduce harmful outputs by 34% compared to Claude 2.1.
Claude 3 Haiku is Anthropic's fastest model with 21 ms response time for real-time applications and 98.7% accuracy on JLPT N1 benchmarks for Japanese language specialization.
Claude 3 Opus is Anthropic's most powerful model with versatile capabilities ranging from complex reasoning to advanced problem-solving.
Anthropic's Claude 2 model, featuring a large 100K token context window and strong performance on various benchmarks. Known for helpful, honest, and harmless AI conversations.
A temporary research demo version of Claude 3 Sonnet (active for 24 hours on May 23, 2024) specifically engineered by Anthropic to demonstrate feature steering. The model was manipulated to obsessively focus on the Golden Gate Bridge in its responses, showcasing research into model interpretability and safety.
deepseek
DeepSeek R1 is the world's first reasoning model developed entirely via reinforcement learning, offering cost efficiency at $0.14/million tokens vs. OpenAI o1's $15, and reducing Python runtime errors by 71% via static analysis integration.
DeepSeek V3 0324 (March 2025) shows significant improvements in reasoning capabilities, with enhanced MMLU-Pro (81.2%), GPQA (68.4%), AIME (59.4%), and LiveCodeBench (49.2%) scores. Features improved front-end web development, Chinese writing proficiency, and function calling accuracy.
A 671B parameter model, speculated to be geared towards logic and mathematics. Likely an upgrade from DeepSeek-Prover-V1.5. Released on Hugging Face without an announcement or description.
google
Gemini 2.5 Flash May 20th Checkpoint is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling. Note: this model is available in two variants, thinking and non-thinking, and output pricing varies significantly depending on whether the thinking capability is active. If you select the standard variant (without the ":thinking" suffix), the model will explicitly avoid generating thinking tokens. To utilize the thinking capability and receive thinking tokens, you must choose the ":thinking" variant, which incurs the higher thinking-output pricing. Additionally, Gemini 2.5 Flash is configurable through the "max tokens for reasoning" parameter.
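Since variant selection happens through the model slug, here is a minimal sketch of how this might look against an OpenAI-compatible endpoint such as OpenRouter. The exact slug (`google/gemini-2.5-flash-preview-05-20:thinking`) and the `reasoning` pass-through field are assumptions drawn from OpenRouter's published conventions, not a definitive API:

```python
# Minimal sketch (assumptions noted above): selecting the ":thinking" variant
# of Gemini 2.5 Flash through an OpenAI-compatible client pointed at OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder credential
)

response = client.chat.completions.create(
    # The ":thinking" suffix opts in to thinking tokens (and their pricing);
    # drop the suffix to explicitly avoid generating thinking tokens.
    model="google/gemini-2.5-flash-preview-05-20:thinking",  # assumed slug
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # Assumed pass-through for the "max tokens for reasoning" budget.
    extra_body={"reasoning": {"max_tokens": 2048}},
)
print(response.choices[0].message.content)
```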
Gemini 1.5 Pro handles extremely long contexts via a Mixture-of-Experts architecture, with 99% retrieval accuracy at 750k tokens, and generates chapter summaries for 2-hour videos with 92% accuracy.
Gemini 2.5 Pro Experimental is Google's advanced model with improved multimodal reasoning, long context understanding with 1 million tokens, and specialized video comprehension.
Gemini 2.0 Pro builds interactive 3D environments from text descriptions and offers hypothetical reasoning for scientific simulations.
Gemini 2.0 Flash Thinking offers subsecond reasoning with 840 ms median response time for financial forecasting and an energy-efficient architecture using 0.8 kWh per million tokens (40% less than Gemini 1.5).
Google's state-of-the-art workhorse model, designed for advanced reasoning, coding, mathematics, and scientific tasks. Features hybrid reasoning (thinking on/off) with configurable budgets, balancing quality, cost, and latency.
Gemini 2.5 Flash is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling.
PaLM 2 by Google features improved multilingual, reasoning, and coding capabilities. Optimized for chat-based interactions.
Google's flagship multimodal model at the time of its release, designed for natural language tasks, multi-turn chat, code generation, and understanding image inputs.
Google's most advanced reasoning model, capable of solving complex problems. Best for multimodal understanding, reasoning over complex prompts, tackling multi-step code, math, and STEM problems, coding (especially web development), and analyzing large datasets, codebases, and documents with long context. Knowledge cutoff: January 2025.
Gemma 3n E4B-it is optimized for efficient execution on mobile and low-resource devices such as phones, laptops, and tablets. It supports multimodal inputs, including text, visual data, and audio, enabling diverse tasks such as text generation, speech recognition, translation, and image analysis. Leveraging innovations like Per-Layer Embedding (PLE) caching and the MatFormer architecture, Gemma 3n dynamically manages memory usage and computational load by selectively activating model parameters, significantly reducing runtime resource requirements. The model covers a wide linguistic range (trained in over 140 languages) and features a flexible 32K token context window, making it well-suited for privacy-focused, offline-capable applications and on-device AI solutions.
meta
Llama 3 70B is a large language model from Meta with strong performance and efficiency for real-time interactions.
Llama 3.1 70B offers a dramatically expanded context window and improved performance on mathematical reasoning and general knowledge tasks.
Llama 3.1 405B is Meta's most powerful open-source model, outperforming even proprietary models on various benchmarks.
Llama 4 Maverick is Meta's multimodal expert model with 17B active parameters and 128 experts (400B total parameters). It outperforms GPT-4o and Gemini 2.0 Flash across various benchmarks, achieving an Elo score of 1417 on LMArena. Designed for sophisticated AI applications with excellent image understanding and creative writing.
Llama 4 Scout is Meta's compact yet powerful multimodal model with 17B active parameters and 16 experts (109B total parameters). It fits on a single H100 GPU with Int4 quantization and offers an industry-leading 10M token context window, outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across various benchmarks.
Llama 4 Behemoth is Meta's most powerful model yet with 288B active parameters and 16 experts (nearly 2T total parameters), outperforming GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks.
midjourney
The first public release of Midjourney, introducing AI image generation to a wider audience through its Discord-based interface.
Midjourney v2 improved on the original model with better coherence, detail, and more consistent style application.
Midjourney v3 introduced significantly improved artistic capabilities with better understanding of prompt nuances and artistic styles.
Midjourney v4 marked a major leap forward with dramatically improved photorealism, coherence, and prompt understanding, trained on Google TPUs for the first time.
Midjourney v5 delivered a major step up in photorealism, with more natural lighting and detail, better rendering of hands, and support for wider aspect ratios.
Midjourney v6 further improved realism and prompt adherence, with longer-prompt understanding and basic in-image text rendering.
Midjourney v6.1 introduced a native web interface alongside Discord, with improved detail rendering, better text handling, and enhanced image coherence.
mistral
Mistral Large is a powerful model with strong multilingual capabilities and reasoning, featuring a 32K token context window.
Mistral Large 2 features a 128K context window with enhanced code generation, mathematics, reasoning, and multilingual support.
Mistral Medium 3 is a high-performance enterprise-grade language model designed to deliver frontier-level capabilities at significantly reduced operational cost. It balances state-of-the-art reasoning and multimodal performance with 8× lower cost compared to traditional large models, making it suitable for scalable deployments across professional and industrial use cases. Excels in coding, STEM reasoning, and enterprise adaptation, supporting hybrid, on-prem, and in-VPC deployments.
Mistral NeMo is a 12B parameter model with a 128K token context length, built by Mistral in collaboration with NVIDIA.
openai
OpenAI's most powerful reasoning model, pushing the frontier across coding, math, science, and visual perception. Trained to think longer before responding and agentically use tools (web search, code execution, image generation) to solve complex problems. Sets new SOTA on benchmarks like Codeforces and MMMU.
A smaller, cost-efficient reasoning model from OpenAI optimized for speed. Achieves remarkable performance for its size, particularly in math, coding, and visual tasks. Supports significantly higher usage limits than o3 and can agentically use tools.
OpenAI o4-mini-high is the same model as o4-mini but defaults to a high reasoning effort setting. It's a compact reasoning model optimized for speed and cost-efficiency, retaining strong multimodal and agentic capabilities, especially in math, coding, and visual tasks.
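As a rough illustration of the agentic tool use described in these entries, a hedged sketch using OpenAI's Responses API; the `web_search_preview` tool type and the `reasoning` effort parameter follow OpenAI's docs at the time of writing and should be verified against the current API reference:

```python
# Hedged sketch: a reasoning model deciding when to call a built-in tool via
# OpenAI's Responses API. Tool and parameter names follow OpenAI's docs at the
# time of writing; verify against the current API reference.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "high"},  # the setting o4-mini-high defaults to
    tools=[{"type": "web_search_preview"}],  # built-in web search tool
    input="Find and summarize this week's Python 3.13 release-note changes.",
)
print(response.output_text)
```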
DALL-E 3 auto-improves user inputs via ChatGPT integration and blocks prohibited content with 99.9% precision using multimodal classifiers.
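The prompt auto-improvement is visible in the API itself: each generated image carries a `revised_prompt` field. A minimal sketch, assuming the standard OpenAI Python SDK image endpoint:

```python
# Minimal sketch: DALL-E 3's automatic prompt rewriting surfaces in the API as
# a `revised_prompt` field on each generated image (standard OpenAI Python SDK).
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="a watercolor fox reading a newspaper",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)             # hosted URL of the generated image
print(result.data[0].revised_prompt)  # the auto-improved prompt actually used
```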
GPT-4o processes text, images, and audio through a unified transformer architecture and offers real-time translation for 154 languages with 89.2% BLEU score on low-resource languages.
GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.
o3 Mini is a smaller, more efficient version of the o3 model, optimized for faster response times and lower computational costs while maintaining high-quality outputs.
o1 achieves 86% accuracy on Mathematics Olympiad benchmarks (vs. GPT-4o's 13%), offers PhD-level STEM proficiency, and maintains a 0.17% deceptive response rate in synthetic testing.
GPT-4.5 is a step forward in scaling up pre-training and post-training. With broader knowledge, improved intent understanding, and greater 'EQ', it excels at natural conversations, writing, programming, and practical problem solving with reduced hallucinations. GPT-4.5 achieved 62.5% accuracy on SimpleQA with a 37.1% hallucination rate (lower is better), significantly outperforming GPT-4o and other models.
An updated version of GPT-4o that feels more intuitive, creative, and collaborative. It follows instructions more accurately, handles coding tasks more smoothly, and communicates in a clearer, more natural way, with more concise responses and lighter use of markdown.
GPT-4o mini is OpenAI's newest model after GPT-4o ("Omni"), supporting both text and image inputs with text outputs. As OpenAI's most advanced small model, it is many times more affordable than other recent frontier models and more than 60% cheaper than GPT-3.5 Turbo. It maintains state-of-the-art intelligence while being significantly more cost-effective.
For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It's ideal for tasks like classification or autocompletion.
GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider's polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.
GPT-3.5 Turbo is OpenAI's fastest model. It can understand and generate natural language or code, and is optimized for chat and traditional completion tasks.
OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning capabilities. Training data: up to Sep 2021.
GPT-2 is a direct scale-up of GPT-1 with 1.5 billion parameters, trained on 8 million web pages. Known for generating coherent text that is sometimes indistinguishable from human writing, though it can be repetitive.
The first large-scale transformer-based language model released by OpenAI, trained on the BooksCorpus dataset. This version is accessed via the Hugging Face model hub (`openai-community/openai-gpt`).
codex-mini-latest is a fine-tuned version of o4-mini built specifically for use in Codex CLI. For direct API use, OpenAI recommends starting with gpt-4.1.
openrouter
This is a cloaked model provided to the community to gather feedback. It's a powerful, all-purpose model supporting long-context tasks, including code generation. All prompts and completions for this model are logged by the provider as well as OpenRouter.
A stealth, powerful, all-purpose model supporting long-context tasks, including code generation, provided to the community to gather feedback.
qwen
QwQ is the reasoning model of the Qwen series. Unlike conventional instruction-tuned models, QwQ is capable of thinking and reasoning, achieving significantly enhanced performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, achieving performance competitive with state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.
The latest generation Qwen model (30.5B params, 3.3B activated MoE) excels in reasoning, multilingual support, and agent tasks. Features a unique thinking/non-thinking mode switch. Supports up to 131K context with YaRN. Free tier on OpenRouter.
Qwen3-235B-A22B is a 235B parameter mixture-of-experts (MoE) model from Alibaba's Qwen team, activating 22B parameters per forward pass. Features seamless switching between 'thinking' mode (complex tasks) and 'non-thinking' mode (general conversation). Strong reasoning, multilingual (100+), instruction-following, and tool-calling. 32K context, extendable to 131K.
A 0.6B parameter dense model from the Qwen3 family. Supports seamless switching between 'thinking' mode (complex tasks) and 'non-thinking' mode (general conversation). Trained on 36 trillion tokens across 119 languages. Features enhanced reasoning, instruction-following, agent capabilities, and multilingual support.
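Across the Qwen3 family, the thinking/non-thinking switch is exposed as a chat-template flag. A minimal sketch using Hugging Face transformers and the 0.6B dense checkpoint; the `enable_thinking` argument follows Qwen's model-card usage and may change:

```python
# Sketch: toggling Qwen3's thinking mode via the Hugging Face transformers chat
# template. The `enable_thinking` flag follows Qwen's model-card usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # the 0.6B dense checkpoint described above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True emits <think>...</think> reasoning tokens before the
# final answer; False switches the model to direct, conversational replies.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```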
xai
Grok 3 is a cutting-edge AI model from xAI with Big Brain Mode for complex problems, Colossus Supercomputer integration, and Reinforcement Learning optimization. Achieves 1402 Elo on LMArena benchmarks and 93.3% on AIME 2025 mathematics competition.
Grok 3 Thinking exposes the full chain-of-thought process during problem-solving, including error backtracking and alternative solution exploration. Scores 84.6% on GPQA Diamond benchmark for expert-level Q&A.
Grok 3 Mini is a lightweight, smaller thinking model ideal for reasoning-heavy tasks that don't demand extensive domain knowledge. It shines in math-specific and quantitative use cases, and its 'thinking' traces are transparent and accessible.
Grok 3 Beta is xAI's flagship model, excelling at enterprise use cases like data extraction, coding, and text summarization. It possesses deep domain knowledge in finance, healthcare, law, and science, and outperforms Grok 3 Mini on high-effort thinking tasks.