Evolution of LLMs: a brief overview of the market

A brief analysis of key players and emerging technologies

Less than 2 percentage points separate the top LLMs on key benchmarks: the technology war ended in a tie. The real battle of 2025 is fought over ecosystems, deployment, and cost. DeepSeek showed that a competitive model can be trained for about $5.6M, against an estimated $78-191M for GPT-4 and Gemini Ultra. ChatGPT dominates on brand (76% awareness) even though Claude wins 65% of technical benchmarks. For companies, the winning strategy is not to choose "the best model" but to orchestrate complementary models for different use cases.

The War of Language Models 2025: From Technical Parity to the Battle of Ecosystems

The development of Large Language Models reached a tipping point in 2025: competition is no longer decided by the fundamental capabilities of the models, which are now essentially equivalent on the main benchmarks, but by ecosystem, integration, and deployment strategy. While Anthropic's Claude Sonnet 4.5 maintains narrow margins of technical superiority on specific benchmarks, the real battle has shifted to different terrain.

The Technical Draw: When the Numbers Equalize

MMLU (Massive Multitask Language Understanding) benchmark:

  • Claude Sonnet 4.5: 88.7%
  • GPT-4o: 88.0%
  • Gemini 2.0 Flash: 86.9%
  • DeepSeek-V3: 87.1%

The differences are marginal: less than 2 percentage points separate the top performers. According to Stanford's AI Index Report 2025, "the convergence of core capabilities of language models represents one of the most significant trends of 2024-2025, with profound implications for the competitive strategies of AI companies."

Reasoning Ability (GPQA Diamond)

  • Claude 3.5 Sonnet: 65.0%
  • GPT-4o: 53.6%
  • Gemini 2.0 Pro: 59.1%

Claude maintains a significant advantage on complex reasoning tasks, but GPT-4o leads on response speed (average latency of 1.2s vs. 2.1s for Claude) and Gemini leads on native multimodal processing.

The DeepSeek Revolution: The Chinese Game-Changer

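The turn of 2025 saw the disruptive entry of DeepSeek-V3 (released in late December 2024), which demonstrated that a competitive model can be trained for about $5.6 million, against estimates of $78 million for GPT-4 and $191 million for Gemini Ultra. When the open-weights reasoning model DeepSeek-R1 followed in January 2025, Marc Andreessen called it "one of the most amazing breakthroughs" and, as open source, "a profound gift to the world."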

DeepSeek-V3 specifications:

  • 671 billion total parameters (37B active via Mixture-of-Experts)
  • Training cost: $5.576M
  • Performance: outperforms GPT-4o on some mathematical benchmarks
  • Architecture: Multi-head Latent Attention (MLA) + DeepSeekMoE

The impact: Nvidia shares fell 17% in a single session after the announcement, as the market reevaluated the barriers to entry in model development.

Public Perception vs. Technical Reality

ChatGPT maintains unchallenged dominance in brand awareness: Pew Research Center research (Feb. 2025) shows that 76% of Americans associate "conversational AI" exclusively with ChatGPT, while only 12% know Claude and 8% actively use Gemini.

Paradox: Claude Sonnet 4 beats GPT-4o on 65% of technical benchmarks but has only an 8% consumer market share vs. 71% for ChatGPT (Similarweb data, March 2025).

Google responds with massive integration: Gemini 2.0 built natively into Search, Gmail, Docs, and Drive, an ecosystem strategy rather than a standalone product. The 2.1 billion Google Workspace users represent instant deployment without customer acquisition costs.

Computer Use and Agents: The Next Frontier

Claude Computer Use (beta October 2024, production Q1 2025)

  • Capabilities: direct mouse/keyboard control, browser navigation, application interaction
  • Adoption: 12% of Anthropic's enterprise clients use Computer Use in production
  • Limitations: still a 14% failure rate on complex multi-step tasks

GPT-4o with Vision and Actions

  • Zapier integration: 6000+ controllable apps
  • Custom GPTs: 3 million published, 800K actively used
  • Revenue sharing for GPT creators: $10M distributed in Q4 2024

Gemini Deep Research (January 2025)

  • Autonomous multi-source research with comparative analysis
  • Generates complete reports from single prompt
  • Average time: 8-12 minutes per 5000+ word report

Gartner predicts that 33% of knowledge workers will use autonomous AI agents by the end of 2025, vs. 5% today.

Philosophical Differences on Safety

OpenAI: "Safety Through Restriction"

  • Rejects 8.7% of consumer prompts (leaked internal OpenAI data)
  • Strict content policy causes 23% developer churn toward alternatives
  • Public Preparedness Framework with continuous red-teaming

Anthropic: "Constitutional AI"

  • Model trained on explicit ethical principles
  • Selective rejection: 3.1% of prompts (more permissive than OpenAI)
  • Transparent decision-making: explains why it refuses requests

Google: "Maximum Safety, Minimum Controversy"

  • Tightest filters on the market: 11.2% of prompts blocked
  • The Gemini image generation failure of February 2024 (bias overcorrection) drives extreme caution
  • Enterprise focus reduces risk tolerance

Meta Llama 3.1: zero built-in filters, with responsibility left to the implementer. It is the opposite philosophy.

Vertical Specialization: The True Differentiator

Healthcare:

  • Med-PaLM 2 (Google): 85.4% on MedQA (vs. 77% for the best human doctors)
  • Claude in Epic Systems: adopted by 305 US hospitals for clinical decision support

Legal:

  • Harvey AI (customized GPT-4): 102 law firm clients, including top-100 firms, $100M ARR
  • CoCounsel (Thomson Reuters + Claude): 98% accuracy on legal research

Finance:

  • BloombergGPT: trained on 363B proprietary financial tokens
  • Goldman Sachs Marcus AI (GPT-4 base): approves loans 40% faster

Verticalization generates 3.5x the willingness to pay of generic models (McKinsey survey, 500 enterprise buyers).

Llama 3.1: Meta's Open Source Strategy

405B parameters, capabilities competitive with GPT-4o on many benchmarks, fully open weights. Meta's strategy: commoditize the infrastructure layer to compete on the product layer (Ray-Ban Meta glasses, WhatsApp AI).

Llama 3.1 adoption:

  • 350K+ downloads in the first month
  • 50+ startups building vertical AI products on Llama
  • Cost of self-managed hosting: $12K/month vs. $50K+ in closed-model API costs for equivalent usage

Counterintuitive: Meta loses billions of dollars on Reality Labs but invests massively in open AI to protect its core advertising business.

Context Windows: The Race for Millions of Tokens

  • Claude Sonnet 4.5: 200K tokens
  • Gemini 2.0 Pro: 2M tokens (longest commercially available)
  • GPT-4 Turbo: 128K tokens

Gemini's 2M-token context makes it possible to analyze entire codebases, 10+ hours of video, or thousands of pages of documentation, a transformative enterprise use case. Google Cloud reports that 43% of enterprise POCs use contexts larger than 500K tokens.
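As a rough illustration of how these limits constrain model choice, the sketch below filters models by whether a prompt of a given size fits in the window. The window sizes are the figures quoted in this section. The function name, the `reserved_output_tokens` parameter, and its default value are hypothetical, and real token counts depend on each vendor's tokenizer.

```python
# Context windows in tokens, as quoted in this article (not fetched from any API).
CONTEXT_WINDOWS = {
    "Claude Sonnet 4.5": 200_000,
    "Gemini 2.0 Pro": 2_000_000,
    "GPT-4 Turbo": 128_000,
}


def models_that_fit(prompt_tokens, reserved_output_tokens=4_000):
    """Return the models whose window holds the prompt plus room for a reply.

    reserved_output_tokens is an illustrative safety margin, not a vendor setting.
    """
    needed = prompt_tokens + reserved_output_tokens
    return sorted(
        name for name, window in CONTEXT_WINDOWS.items() if window >= needed
    )


if __name__ == "__main__":
    for size in (100_000, 150_000, 500_000):
        print(size, models_that_fit(size))
```

Under these assumptions, a prompt above roughly 196K tokens leaves only the 2M-token model as an option, which is the practical meaning of the 500K-token POC figure.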

Adaptability and Customization

Claude Projects & Styles:

  • Custom instructions that persist across conversations
  • Style presets: Formal, Concise, Explanatory
  • Knowledge base uploads (up to 5GB of documents)

GPT Store & Custom GPTs:

  • 3M GPTs published, 800K in active monthly use
  • Top creator earns $63K/month (revenue sharing)
  • 71% of enterprises use at least one custom GPT internally

Gemini Extensions:

  • Native integration with Gmail, Calendar, Drive, Maps
  • Workspace context: reads email and calendar for proactive suggestions
  • 1.2B Workspace actions executed in Q4 2024

Key shift: from "single prompt" to "persistent assistant with memory and context across sessions."

Q1 2025 Developments and Future Trajectories

Trend 1: Mixture-of-Experts Dominance

Most top-tier 2025 models are reported to use MoE (activating only a subset of parameters per token):

  • Inference costs reduced by 40-60%
  • Better latency while maintaining quality
  • DeepSeek-V3 is MoE-based; GPT-4 and Gemini Ultra are widely reported to be as well
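The idea of activating only a subset of parameters can be sketched with generic top-k gating. This is a minimal, framework-free illustration and not DeepSeekMoE or any vendor's actual routing scheme. The function names and toy experts are hypothetical; the 37B/671B figures are the DeepSeek-V3 numbers quoted earlier in this article.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def active_fraction(active_params, total_params):
    """Share of parameters that participate in one forward pass."""
    return active_params / total_params


if __name__ == "__main__":
    # Four toy "experts"; only two of them run for this input.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
    print(moe_forward(1.0, experts, [0.1, 2.0, -1.0, 1.5], k=2))
    print(f"{active_fraction(37e9, 671e9):.1%} of parameters active")
```

With the quoted DeepSeek-V3 figures, roughly one parameter in eighteen is active for each token, which is where the inference savings come from.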

Trend 2: Native Multimodality

Gemini 2.0 is natively multimodal (not separate modules glued together):

  • Simultaneously understands text, images, audio, and video
  • Cross-modal reasoning: "compare the architectural style of the building in this photo with a textual description of the historical period"

Trend 3: Test-Time Compute (Reasoning Models)

OpenAI o1 and DeepSeek-R1 spend more processing time on complex reasoning:

  • o1: 30-60s for a complex math problem vs. 2s for GPT-4o
  • AIME 2024 accuracy: 83.3% vs. 13.4% for GPT-4o
  • Explicit latency/accuracy trade-off

Trend 4: Agentic Workflows

Model Context Protocol (MCP), Anthropic, November 2024:

  • Open standard for AI agents to interact with tools and databases
  • Adopted by 50+ partners in the first 3 months
  • Allows agents to build persistent "memory" across interactions

Cost and Pricing Wars

API pricing per 1M input tokens:

  • GPT-4o: $2.50
  • Claude Sonnet 4: $3.00
  • Gemini 2.0 Flash: $0.075 (33x cheaper than GPT-4o)
  • DeepSeek-V3: $0.27 (open source; self-hosting is possible at hosting cost)

Gemini Flash case study: an AI summarization startup cut costs by 94% by switching from GPT-4o, with the same quality and comparable latency.

Commoditization is accelerating: inference costs fell 70% year-on-year between 2023 and 2024 (Epoch AI data).

Strategic Implications for Companies

Decision Framework: Which Model to Choose?

Scenario 1: Enterprise Safety-Critical → Claude Sonnet 4

  • Healthcare, legal, finance where mistakes cost millions
  • Constitutional AI reduces liability risks
  • Premium pricing justified by risk mitigation

Scenario 2: High-Volume, Cost-Sensitive → Gemini Flash or DeepSeek

  • Customer service chatbots, content moderation, classification
  • Performance is "good enough" at 10x-100x the volume
  • Cost is the main differentiator

Scenario 3: Ecosystem Lock-In → Gemini for Google Workspace, GPT for Microsoft

  • Already invested in the ecosystem
  • Native integration outweighs marginally superior performance
  • Lower training costs, since employees stay on the existing platform

Scenario 4: Customization/Control → Llama 3.1 or open DeepSeek

  • Specific compliance requirements (data residency, audit)
  • Heavy fine-tuning on proprietary data
  • Self-hosting becomes cheap at volume
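The four scenarios can be read as a simple routing rule. The sketch below is one hypothetical encoding of them. The `Workload` fields, the function name, and especially the priority order (control first, then safety, then ecosystem, then volume) are assumptions made for illustration, since the framework itself does not rank the scenarios.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    """Hypothetical description of a use case; field names are illustrative."""
    safety_critical: bool = False
    high_volume: bool = False
    needs_self_hosting: bool = False   # data residency, audit, heavy fine-tuning
    ecosystem: Optional[str] = None    # "google", "microsoft", or None


def choose_model(w: Workload) -> str:
    """Map a workload to the model family named in the matching scenario."""
    if w.needs_self_hosting:           # Scenario 4: customization/control
        return "Llama 3.1 or open DeepSeek"
    if w.safety_critical:              # Scenario 1: enterprise safety-critical
        return "Claude Sonnet 4"
    if w.ecosystem == "google":        # Scenario 3: ecosystem lock-in
        return "Gemini"
    if w.ecosystem == "microsoft":
        return "GPT"
    if w.high_volume:                  # Scenario 2: high-volume, cost-sensitive
        return "Gemini Flash or DeepSeek"
    return "no scenario matched"


if __name__ == "__main__":
    print(choose_model(Workload(high_volume=True)))
    print(choose_model(Workload(safety_critical=True, ecosystem="google")))
```

A real organization would likely weight these criteria rather than apply them in a fixed order; the fixed order only keeps the sketch short.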

Conclusion: From Technology War to Platform War

The 2025 competition among LLMs is no longer about "which model reasons best" but about "which ecosystem captures the most value." OpenAI dominates the consumer brand, Google leverages distribution to billions of users, Anthropic wins safety-conscious enterprises, and Meta commoditizes infrastructure.

Predictions for 2026-2027:

  • Further convergence of core performance (~90% MMLU for all of the top 5)
  • Differentiation on speed, cost, integrations, and vertical specialization
  • Multi-step autonomous agents become mainstream (33% of knowledge workers)
  • Open source closes the quality gap while keeping its cost and customization advantage

Final winner? Probably not a single player but complementary ecosystems serving different use-case clusters. As with smartphone operating systems, where iOS and Android coexist, the outcome is not "winner takes all" but "winner takes segment."

For enterprises, a multi-model strategy becomes standard: GPT for generic tasks, Claude for high-stakes reasoning, Gemini Flash for volume, and a custom-tuned Llama for proprietary workloads.

The year 2025 is not the year of the "best model" but of intelligent orchestration between complementary models.

Sources:

  • Stanford AI Index Report 2025
  • Anthropic Model Card Claude Sonnet 4.5
  • OpenAI GPT-4o Technical Report
  • Google DeepMind Gemini 2.0 System Card
  • DeepSeek-V3 Technical Paper (arXiv)
  • Epoch AI - Trends in Machine Learning
  • Gartner AI & Analytics Summit 2025
  • McKinsey State of AI Report 2025
  • Pew Research Center AI Adoption Survey
  • Similarweb Platform Intelligence

The development of Large Language Models reached a turning point in 2025: competition no longer revolves around the models’ core capabilities, which have now largely converged on the major benchmarks, but rather around ecosystem, integration, and distribution. Who has the “best” model changes every three months. Who owns the distribution channel does not.

The technical stalemate

The 2025 Stanford AI Index Report quantifies this convergence: on the Chatbot Arena leaderboard, the gap between the first and tenth models narrowed from 11.9% to 5.4% in one year. On general knowledge benchmarks such as MMLU, the top models from OpenAI, Anthropic, Meta, and DeepSeek all fall within a range of a couple of percentage points.

Residual differences emerge only on specific tasks. On complex scientific reasoning (GPQA Diamond), Claude 3.5 Sonnet scored 65.0% versus GPT-4o’s 53.6%—a real advantage, but on a single axis. GPT-4o remains faster in latency, while Gemini is stronger in native multimodal processing. No model dominates on all fronts, and that is precisely the point: when quality converges, the game shifts elsewhere.

The DeepSeek revolution

December 2024 and January 2025 changed the industry’s economic landscape. DeepSeek-V3 demonstrated that a competitive model can be trained with approximately $5.6 million in compute (2.79 million H800 GPU-hours), compared to estimates of $78 million for GPT-4 and $191 million for Gemini Ultra (AI Index 2024). This figure must be interpreted with caution—it covers only the final training run, not research, failed experiments, or personnel costs—but even with that caveat, the order of magnitude is entirely different.
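The headline figure is simple arithmetic and can be checked in a few lines. The GPU-hour count and the comparison estimates are the ones quoted above. The $2 per GPU-hour rate is not stated in this article: it is the rental price assumed in the DeepSeek-V3 technical report and is treated as an assumption here. The function names are illustrative.

```python
GPU_HOURS = 2.79e6            # H800 GPU-hours for the final training run (quoted above)
USD_PER_GPU_HOUR = 2.0        # assumed rental price, per the DeepSeek-V3 report
GPT4_ESTIMATE_USD = 78e6      # AI Index 2024 estimate (quoted above)
GEMINI_ULTRA_ESTIMATE_USD = 191e6


def training_run_cost(gpu_hours, usd_per_gpu_hour):
    """Compute-only cost of one training run; excludes research and staff."""
    return gpu_hours * usd_per_gpu_hour


def cost_ratio(other_cost_usd, baseline_cost_usd):
    """How many times more expensive the other run is than the baseline."""
    return other_cost_usd / baseline_cost_usd


if __name__ == "__main__":
    deepseek = training_run_cost(GPU_HOURS, USD_PER_GPU_HOUR)
    print(f"DeepSeek-V3 final run: ${deepseek / 1e6:.2f}M")
    print(f"GPT-4 estimate is {cost_ratio(GPT4_ESTIMATE_USD, deepseek):.0f}x higher")
    print(f"Gemini Ultra estimate is {cost_ratio(GEMINI_ULTRA_ESTIMATE_USD, deepseek):.0f}x higher")
```

Changing `USD_PER_GPU_HOUR` shows how sensitive the comparison is to the rental-price assumption, which is one more reason to read the figure with caution.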

The specs: 671 billion total parameters, of which only 37 billion are active per token thanks to the Mixture-of-Experts architecture, combined with Multi-head Latent Attention. When DeepSeek-R1, the open-weights reasoning model, arrived in January, Marc Andreessen called it a profound gift to the world. The market reacted less poetically: on January 27, 2025, Nvidia lost 17% in a single trading session (about $590 billion in market capitalization, the largest single-day destruction of value in market history) as investors reassessed the barriers to entry in the sector.

Public perception versus technical reality

Benchmarks say one thing; the consumer market says another. According to Similarweb traffic data, ChatGPT captures the vast majority of visits to conversational AI platforms—around 80%—with Gemini at under 10% and Claude in the low single digits. Technical superiority on specific benchmarks does not translate into consumer market share: the first-mover brand dominates.

Google’s response to this imbalance is distribution: Gemini, natively integrated into Search, Gmail, Docs, and Drive, instantly reaches billions of Google Workspace users, without a single euro in acquisition costs. It’s the ecosystem strategy versus the standalone product—and it doesn’t require having the best model, just a good enough model in the right place.

Agents: the next frontier

2025 is the year when labs stopped selling answers and started selling actions.

Anthropic led the way with Computer Use (October 2024): the model directly controls the mouse, keyboard, and browser. The limitations are evident in the numbers—on the OSWorld benchmark, the initial system completed 14.9% of tasks, compared to about 72% for a human operator. A few months later, OpenAI responded with Operator (January 2025), and both Google and OpenAI launched Deep Research tools that produce autonomous multi-source reports in minutes rather than seconds.

On the infrastructure front, the most significant move is Anthropic’s Model Context Protocol (November 2024): an open standard for connecting agents to tools and databases. The definitive signal came when OpenAI itself adopted it in March 2025—competitors converging on a rival’s standard is proof that the agent game is played on interoperability, not on the model. Gartner predicts that by 2028, 33% of enterprise software will incorporate agentic AI, compared to less than 1% in 2024.

Divergent safety philosophies

Beneath the technical convergence lie different philosophies, more evident in public documents than in proclamations. OpenAI has formalized a Preparedness Framework with continuous red-teaming. Anthropic has built its identity on Constitutional AI—models trained on explicit principles—and on a Responsible Scaling Policy that ties release to capabilities. Google, following the Gemini image generator incident (February 2024), has adopted a policy of maximum caution, consistent with an enterprise focus that tolerates no scandals. Meta stands at the opposite extreme: open weights, minimal filters, responsibility shifted to the implementer.

There is no single winner in this taxonomy. There are different customers: those who buy risk reduction pay Anthropic’s premium; those who buy total control choose Meta’s open weights.

Vertical specialization: where margins are defended

If the generic model is a commodity, value shifts toward vertical specialization. In the legal sector, Harvey—built on OpenAI models and adopted by a growing share of AmLaw 100 firms—reached a $3 billion valuation in February 2025. In finance, BloombergGPT was trained on 363 billion tokens of proprietary financial data that no competitor can replicate. In healthcare, Google’s Med-PaLM 2 achieved 86.5% on MedQA, a human-expert level.

The pattern is the same everywhere: the moat isn’t the model; it’s the proprietary data, industry-specific distribution, and integration into existing workflows.

Open source: the commoditization strategy

Meta is playing a different game. Llama 3.1, with 405 billion parameters and fully open-weights, competes with GPT-4o on many benchmarks—and the Llama family had surpassed 650 million cumulative downloads by December 2024. The logic is counterintuitive only at first glance: Meta doesn’t sell APIs; it sells advertising. Commoditizing the AI infrastructure layer prevents a competitor from controlling it and protects the core business. DeepSeek, for different reasons, produces the same effect: every competitive open-weights release lowers the price that closed labs can charge.

The numbers confirm this: according to the AI Index 2025, the cost of inference for GPT-3.5-level performance plummeted by about 280 times between late 2022 and late 2024. The price list reflects the ongoing price war (input, per million tokens): GPT-4o at $2.50, Claude Sonnet at $3.00, Gemini 2.0 Flash at $0.10, DeepSeek-V3 at $0.27. For high-volume workloads where “good enough” suffices, the price differential is an order of magnitude.
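To make the order-of-magnitude claim concrete, the sketch below prices a fixed input volume at the list prices quoted above. It covers input tokens only; output tokens are billed separately and are not modeled. The dictionary keys, function names, and the one-billion-token volume are illustrative assumptions.

```python
# USD per 1M input tokens, as quoted in this article.
PRICE_PER_M_INPUT = {
    "GPT-4o": 2.50,
    "Claude Sonnet": 3.00,
    "Gemini 2.0 Flash": 0.10,
    "DeepSeek-V3": 0.27,
}


def input_cost(model, tokens):
    """Cost in USD of sending `tokens` input tokens to `model`."""
    return PRICE_PER_M_INPUT[model] * tokens / 1_000_000


def savings_vs(baseline, candidate):
    """Fraction of input spend saved by moving from baseline to candidate."""
    return 1 - PRICE_PER_M_INPUT[candidate] / PRICE_PER_M_INPUT[baseline]


if __name__ == "__main__":
    volume = 1_000_000_000  # hypothetical monthly volume: one billion input tokens
    for model in PRICE_PER_M_INPUT:
        print(f"{model}: ${input_cost(model, volume):,.2f}")
    print(f"Saving, GPT-4o to Gemini 2.0 Flash: {savings_vs('GPT-4o', 'Gemini 2.0 Flash'):.0%}")
```

At these prices the cheapest and most expensive options differ by a factor of 30 on input tokens, so for classification or moderation workloads the routing decision matters more than any benchmark delta.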

Which model to choose: a framework

Enterprise safety-critical (healthcare, legal, finance): Claude’s price premium is justified as risk mitigation, not as marginal quality.

High volume, cost-sensitive (customer service, classification, moderation): Gemini Flash or DeepSeek. Performance is “good enough” and cost is the differentiator.

Ecosystem lock-in: those in Google Workspace use Gemini; those in Microsoft 365 use GPT. Native integration trumps the model’s marginal superiority.

Control and customization (data residency, audits, fine-tuning on proprietary data): Llama or self-hosted DeepSeek. At high volumes, the economics of self-hosting work.

Conclusion: From technology war to platform war

The 2025 competition in LLMs is no longer “which model reasons better” but “which ecosystem captures more value.” OpenAI dominates the consumer brand, Google leverages a user base of billions, Anthropic wins over risk-averse enterprises, and Meta commoditizes the infrastructure.

Forecast 2026–2027: further convergence of core performance; differentiation based on speed, cost, integrations, and vertical specialization; mainstream multi-step agents; open source closing the quality gap while maintaining the cost advantage. The ultimate winner will likely not be a single player but complementary ecosystems for clusters of different use cases—like iOS and Android, not “winner takes all” but “winner takes segment.”

For companies, the multi-model strategy becomes the standard: a generic model for common tasks, high-end reasoning where errors are costly, cost-effective models for high volume, and open-weights where control and data residency are required.

2025 is not the year of the best model. It is the year of intelligent orchestration among complementary models.

Sources: Stanford AI Index Report 2025 · DeepSeek-V3 Technical Report (arXiv) · Anthropic — Computer Use announcement, October 2024 · OpenAI — o1 system card and Operator announcement · Epoch AI — Trends in Machine Learning · Gartner — 2024 agentic AI forecasts · Similarweb · public market data (Nvidia, January 27, 2025)