27 Feb 2025 · 7 min

The Strawberry Problem

How Large Language Models read (and do not read) words.

From the “Strawberry” Problem to the o1 Model: How OpenAI (Partially) Solved the Tokenization Limitation

In the summer of 2024, a viral internet meme stumped the world’s most advanced language models: “How many ‘r’s are there in the word ‘strawberry’?” The correct answer is three, but GPT-4o stubbornly replied “two.” The seemingly trivial error revealed a fundamental limitation of language models: they struggle to analyze individual letters within words.

On September 12, 2024, OpenAI released o1—internally known by the codename “Strawberry”—the first model in a new series of “reasoning models” designed specifically to overcome this type of limitation. And yes, the name is no coincidence: as an OpenAI researcher confirmed, o1 can finally count the ‘r’s in “strawberry” correctly.

But the solution isn’t what one might expect. OpenAI didn’t “teach” the model to analyze words letter by letter. Instead, it took a completely different approach: training the model to “reason” before responding.

The Counting Problem: Why Models Get It Wrong

The problem remains rooted in tokenization, the fundamental process by which language models read text. As explained in a technical paper published on arXiv in May 2025 (“The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models”), models do not view words as sequences of letters but as “tokens”: subword chunks of text, each mapped to an integer ID.

When GPT-4 processes the word “strawberry,” its tokenizer splits it into three parts: [str][aw][berry], each with a specific numerical ID (496, 675, 15717). For the model, “strawberry” is not a sequence of 10 letters but a sequence of 3 numerical tokens. It’s as if it were reading a book where every word is replaced by a code—and then someone asked it to count the letters in a code it has never seen written.
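
To make the split concrete, here is a minimal sketch of a toy tokenizer. It is an illustration under stated assumptions, not the real GPT-4 tokenizer: the three-entry vocabulary is hypothetical and only reuses the pieces and IDs quoted above, and the greedy longest-match rule stands in for the byte pair encoding merges a real tokenizer applies.

```python
# Toy vocabulary: hypothetical, limited to the three pieces quoted in the text.
TOY_VOCAB = {"str": 496, "aw": 675, "berry": 15717}


def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split of `word` into IDs from TOY_VOCAB."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {word[pos:]!r}")
    return ids


token_ids = toy_tokenize("strawberry")
print(token_ids)  # the model receives these IDs, not the letters
print(len("strawberry"), "letters vs", len(token_ids), "tokens")
```

Nothing in the resulting list of integers says how many times a given letter occurs, which is the heart of the problem.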

The problem is exacerbated with compound words. “Timekeeper” is broken into separate tokens, so the model cannot read off the exact position of a letter without an explicit reasoning process. This fragmentation affects not only the letter count but also the understanding of the internal structure of words.

The o1 Solution: Reason Before Responding

OpenAI’s o1 solved the problem in an unexpected way: instead of modifying the tokenization—which is technically difficult and would compromise the model’s efficiency—it taught the system to “think before speaking” using a technique called “chain of thought reasoning.”

When you ask o1 how many 'r's there are in “strawberry,” the model doesn’t respond immediately. It spends several seconds—sometimes even minutes for complex questions—internally processing a “chain of reasoning” hidden from the user. This process allows it to:

Recognize that the question requires character-level analysis
Develop a strategy to break down the word
Verify the answer using different approaches
Correct any errors before providing the final answer

As OpenAI researcher Noam Brown explained in a series of posts on X: “o1 is trained using reinforcement learning to ‘think’ before responding via a private chain of thought.” The model receives rewards during training for each correct step in the reasoning process, not just for the final correct answer.

The results are impressive. In a qualifying exam for the International Mathematical Olympiad, o1 correctly solved 83% of the problems, compared to 13% for GPT-4o. On PhD-level science questions, it achieved 78% accuracy compared to GPT-4o’s 56%. But this power comes at a price: o1 takes 30 seconds or more to answer questions that GPT-4o solves in 3 seconds, and costs $15 per million input tokens compared to GPT-4o’s $5.

Chain of Thought: How It Really Works

The technique is not magic but methodical. When it receives a prompt, o1 internally generates a long sequence of “thoughts” that are not shown to the user. For the problem of the ‘r’s in “strawberry,” the internal process might be:

“First, I need to understand the word’s structure. Strawberry could be tokenized as [str][aw][berry]. To count the ‘r’s, I need to reconstruct the complete word at the character level. Str contains: s-t-r (1 ‘r’). Aw contains: a-w (0 ‘r’). Berry contains: b-e-r-r-y (2 ‘r’). Total: 1+0+2 = 3 ‘r’. Verify: strawberry = s-t-r-a-w-b-e-r-r-y. Count the ‘r’s: position 3, position 8, position 9. Confirmed: 3 ‘r’.”

This internal reasoning is hidden by design. OpenAI explicitly prohibits users from attempting to reveal o1’s thought process, monitoring prompts and potentially revoking access for those who violate this rule. The company cites AI safety and competitive advantage as reasons, but the decision has been criticized as a loss of transparency by developers working with language models.
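
The letter arithmetic in the illustrative trace earlier in this section can be reproduced deterministically. The sketch below assumes the same [str][aw][berry] split; the function names are made up for this example, and it says nothing about how o1 computes anything internally.

```python
def count_per_piece(pieces: list[str], letter: str) -> list[int]:
    """Count occurrences of `letter` inside each token piece."""
    return [piece.count(letter) for piece in pieces]


def letter_positions(word: str, letter: str) -> list[int]:
    """1-based positions of `letter` in `word`, as in the trace."""
    return [i for i, ch in enumerate(word, start=1) if ch == letter]


pieces = ["str", "aw", "berry"]  # split assumed from the article
per_piece = count_per_piece(pieces, "r")
positions = letter_positions("".join(pieces), "r")

# Two independent routes must agree, mirroring the "verify" step.
assert sum(per_piece) == len(positions)
print(per_piece, positions)
```

Counting per piece and counting by position are two different routes to the same number, which is why agreement between them works as a sanity check.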

Persistent Limitations: o1 Is Not Perfect

Despite the progress, o1 has not completely solved the problem. Research published on Language Log in January 2025 tested various models on a more complex challenge: “Write a paragraph where the second letter of each sentence spells out the word ‘CODE’.”

o1 standard ($20 per month) failed, mistakenly using the first letter of each sentence’s opening word as the “second letter.” o1-Pro ($200 per month) solved the problem, but only after 4 minutes and 10 seconds of “thinking.” DeepSeek R1, the Chinese model that shook up the market in January 2025, made the same mistake as o1 standard.
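
A constraint like this is easy to check mechanically even when it is hard for a model to satisfy. The sketch below is one possible checker, not the test harness used in the cited comparison: the sentence splitter is deliberately naive, “second letter” is taken to mean the second alphabetic character of the sentence, and both sample paragraphs are invented for illustration.

```python
import re


def second_letters(paragraph: str) -> str:
    """Return the second alphabetic character of each sentence, uppercased."""
    # Naive split on ., ! or ? followed by whitespace (an assumption;
    # abbreviations such as "e.g." would break it).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    letters = []
    for sentence in sentences:
        alpha = [ch for ch in sentence if ch.isalpha()]
        if len(alpha) < 2:
            raise ValueError(f"sentence too short: {sentence!r}")
        letters.append(alpha[1].upper())
    return "".join(letters)


def satisfies_acrostic(paragraph: str, target: str) -> bool:
    """True if the second letters of the sentences spell `target`."""
    return second_letters(paragraph) == target.upper()


# Hypothetical paragraphs: one correct, one repeating the first-letter mistake.
sample = "Scripts run daily. Code reviews help. Add tests early. Keep it simple."
first_letter_attempt = "Cats nap. Owls hoot. Dogs bark. Eels swim."

print(second_letters(sample), satisfies_acrostic(sample, "CODE"))
print(second_letters(first_letter_attempt),
      satisfies_acrostic(first_letter_attempt, "CODE"))
```

The second paragraph spells the target with first letters, the same slip attributed to o1 standard and DeepSeek R1 above, and the checker rejects it.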

The fundamental problem remains: models continue to view text through tokens, not letters. o1 has learned to “work around” this limitation through reasoning, but it hasn’t eliminated it. As one researcher noted on Language Log: “Tokenization is part of the very essence of what language models are; for any wrong answer, the explanation is precisely ‘well, tokenization’.”

Academic Research: The Emergence of Character-Level Understanding

The arXiv paper mentioned earlier (“The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models”) analyzes this phenomenon from a theoretical perspective. The researchers created 19 synthetic tasks that isolate character-level reasoning in controlled contexts, demonstrating that these capabilities emerge suddenly and only in the advanced stages of training.

The study proposes that learning character composition is not fundamentally different from learning common-sense knowledge—it emerges through processes of “conceptual percolation” when the model reaches a critical mass of examples and connections.

The researchers suggest a minor architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword-based models. However, these modifications remain experimental and have not been implemented in commercial models.

Practical Implications: When to Trust and When Not to

The “strawberry” case teaches an important lesson about the reliability of language models: they are probabilistic tools, not deterministic calculators. As Mark Liberman noted on Language Log: “You should be cautious about trusting the answer of any current AI system in tasks involving counting things.”

This does not mean that the models are useless. As one commenter noted: “The fact that a cat makes the silly mistake of being scared by a cucumber does not mean we shouldn’t trust the cat for the much more difficult task of keeping rodents out of the building.” Language models aren’t the right tool if you want to systematically count letters, but they’re excellent for automatically processing thousands of podcast transcripts and extracting guest and host names.

For tasks requiring absolute precision—landing a spacecraft on Mars, calculating drug dosages, verifying legal compliance—current language models remain inadequate without human supervision or external verification. Their probabilistic nature makes them powerful for pattern matching and creative generation, but unreliable for tasks where error is unacceptable.

The Future: Toward Models That Reason for Hours

OpenAI has stated that it intends to experiment with o1 models that “reason for hours, days, or even weeks” to further increase their reasoning capabilities. In December 2024, o3 was announced (the name o2 was skipped to avoid trademark conflicts with the mobile operator O2), and in March 2025, the o1-pro API was released—OpenAI’s most expensive AI model to date, priced at $150 per million input tokens and $600 per million output tokens.
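
To put those prices in perspective, here is a small back-of-the-envelope calculation. The per-million prices are the ones quoted above; the request size is hypothetical, and the sketch assumes that hidden reasoning tokens are billed as output tokens, which is why the output count is so much larger than the prompt.

```python
# Prices in US dollars per million tokens, as quoted in the text for o1-pro.
INPUT_PRICE_PER_MILLION = 150
OUTPUT_PRICE_PER_MILLION = 600


def request_cost_microdollars(input_tokens: int, output_tokens: int) -> int:
    """Cost of one request in millionths of a dollar (exact integer math)."""
    # tokens * (dollars per million tokens) == millionths of a dollar
    return (input_tokens * INPUT_PRICE_PER_MILLION
            + output_tokens * OUTPUT_PRICE_PER_MILLION)


# Hypothetical request: 2,000 prompt tokens and 10,000 generated tokens.
cost = request_cost_microdollars(2_000, 10_000)
print(f"${cost / 1_000_000:.2f}")  # prints $6.30
```

Under these assumptions almost all of the cost comes from generated tokens, so longer “thinking” translates directly into a higher bill.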

The direction is clear: instead of only making models ever larger (scaling up training), OpenAI is investing in making them “think” longer (test-time compute). This approach could prove more efficient, in both energy and compute, than training increasingly massive models.

But one question remains: are these models truly “reasoning” or simply simulating reasoning through more sophisticated statistical patterns? Apple research published in October 2024 suggested that models like o1 may be replicating reasoning steps from their training data. When numbers and names in math problems were changed, or when the same problem was simply run again, the models performed significantly worse. When extraneous, logically irrelevant information was added, performance dropped by as much as 65% for some models.
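
The kind of perturbation described here can be sketched in a few lines. This is not the code or the dataset from the Apple study; the template, the names, and the distractor clause are hypothetical, chosen only to show how variants with a known correct answer can be generated.

```python
import random

# Hypothetical template, name pool, and distractor, invented for illustration.
TEMPLATE = ("{name} has {a} apples and buys {b} more.{extra} "
            "How many apples does {name} have?")
NAMES = ["Ava", "Liam", "Noor", "Kenji"]
DISTRACTOR = " Some of them are slightly smaller than the rest."


def make_variant(rng: random.Random, with_distractor: bool = False) -> tuple[str, int]:
    """Return (question, correct_answer) with fresh names and numbers."""
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    # The distractor is irrelevant: it must not change the correct answer.
    extra = DISTRACTOR if with_distractor else ""
    return TEMPLATE.format(name=name, a=a, b=b, extra=extra), a + b


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng, with_distractor=(i % 2 == 1)) for i in range(4)]
for question, answer in variants:
    print(answer, "|", question)
```

A system that reasons about the problem should score the same on every variant, since only surface details change; a drop in accuracy points toward pattern matching.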

Conclusion: Powerful Tools with Fundamental Limits

The “strawberry” problem and the o1 solution reveal both the potential and the inherent limitations of current language models. OpenAI has demonstrated that through targeted training and additional processing time, models can overcome some structural limitations of tokenization. But they haven’t eliminated the underlying constraint; they have circumvented it.

For users and developers, the practical lesson is clear: understanding how these systems work—what they do well and where they fall short—is essential to using them effectively. Language models are extraordinary tools for probabilistic tasks, pattern matching, creative generation, and information synthesis. But for tasks requiring deterministic precision—counting, calculating, verifying specific facts—they remain unreliable without external supervision or complementary tools.

The name “Strawberry” will remain as an ironic reminder of this fundamental limitation: even the world’s most advanced AI systems can stumble on questions that a six-year-old would solve instantly. Not because they are stupid, but because they “think” in ways that are profoundly different from us—and perhaps we should stop expecting them to think like humans.

Sources:

OpenAI - “Learning to Reason with LLMs” (official blog post, September 2024)
Wikipedia - “OpenAI o1” (entry updated January 2025)
Cosma, Adrian et al. - “The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models”, arXiv:2505.14172 (May 2025)
Liberman, Mark - “AI systems still can't count”, Language Log (January 2025)
Yang, Yu - “Why Large Language Models Struggle When Counting Letters in a Word?”, Medium (February 2025)
Orland, Kyle - “How does DeepSeek R1 really fare against OpenAI's best reasoning models?”, Ars Technica
Brown, Noam (OpenAI) - Series of posts on X/Twitter (September 2024)
TechCrunch - “OpenAI unveils o1, a model that can fact-check itself” (September 2024)
16x Prompt - “Why ChatGPT Can't Count How Many Rs in Strawberry” (updated June 2025)
