The Bitter Lesson: Tokenization’s Future

by Chief Editor


24 Jun, 2025

A world of LLMs without tokenization is desirable and increasingly possible.



Let’s dive into the fascinating, and potentially revolutionary, world of Large Language Models (LLMs) and the quest to liberate them from the shackles of tokenization. This isn’t just about technical details; it’s about how these powerful models understand language and how we interact with them. As a journalist keenly following developments in Artificial Intelligence (AI), I’ve seen firsthand the limitations of tokenization and the exciting possibilities of a “token-free” future.

The Tokenization Tangle: Why It Matters

For those unfamiliar, tokenization is the process of breaking text into smaller units (tokens) that LLMs can process. Think of it as converting words into a form a computer can work with. While seemingly straightforward, tokenization introduces inefficiencies and limitations. An ideal tokenization strategy balances compression (fewer units per byte of text) against granularity (units small enough to preserve meaningful distinctions). Modern approaches, like Byte-Pair Encoding (BPE), attempt to achieve this, but even the most advanced systems have their shortcomings.

One significant issue is that tokenizers are effectively “lossy” from the model’s point of view. The original text can usually be reconstructed from the tokens, but the model sees only opaque token IDs, so nuances inside a token (spelling, digits, character-level structure) are hidden from it. This becomes particularly noticeable with specialized vocabulary, slang, or even emojis and numbers. Models need a way to extract the nuances that matter directly from the data, without a tokenizer in between.

Did you know? Tokenization methods like BPE are learned procedures that extract a compressed vocabulary from a dataset. These methods are not a strict requirement of transformers, but are designed to reduce the processing burden.
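To make “learned procedure” concrete, here is a toy sketch of BPE-style training in Python. It is an illustration only, not the implementation used by any production tokenizer: the function names (`merge_pair`, `learn_merges`, `tokenize`) and the tiny corpus are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace every adjacent (left, right) pair with the single symbol left + right."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(corpus: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules, starting from single bytes."""
    symbols = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        symbols = merge_pair(symbols, left, right)
    return merges


def tokenize(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply the learned merges, in order, to new text."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        symbols = merge_pair(symbols, left, right)
    return symbols


corpus = "low lower lowest"  # hypothetical toy corpus
merges = learn_merges(corpus, num_merges=2)
tokens = tokenize(corpus, merges)
print(len(corpus.encode("utf-8")), "bytes ->", len(tokens), "tokens")  # 16 bytes -> 10 tokens
```

Two merges are enough for the repeated fragment “low” to become a single unit. Joining the tokens gives back the original bytes, but a model fed the token IDs never sees the letters inside “low”.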

The Bitter Lesson and the Quest for Generality

The AI field is shaped by what Rich Sutton called “The Bitter Lesson.” This principle favors general-purpose methods that leverage vast amounts of data and computational power over carefully crafted, domain-specific approaches. LLMs are a testament to this: model ability has improved alongside better hardware, more talent, architectural advances, and the initial abundance of data.

This “scaling law” effect has been a key trend in the LLM landscape, and the most successful LLMs embrace it. It also means that effort spent optimizing tokenization, which can look like a hand-crafted approach, is a hard sell.

The Search for Alternatives: Beyond Tokens

The obvious question is: can we bypass tokenization entirely? This is where research gets exciting. One of the most promising avenues is the development of models that operate directly on bytes, the fundamental units of digital information. As the referenced article mentions, models such as ByT5 and MambaByte, which work at the character or byte level, are stepping stones.

The challenges, though, are significant. Byte-level models face computational hurdles, but not because of vocabulary size: their vocabulary is tiny (256 byte values versus 32,000 or more tokens). The problem is sequence length. The same text takes several times as many steps when read byte by byte as it does when read token by token. This increases training time and, more importantly, inference costs. But the potential rewards, in terms of a more complete understanding of language, are significant.
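A quick back-of-the-envelope calculation shows why sequence length is the sticking point. The sketch below assumes an average of 4 bytes per subword token, which is a rough illustrative figure and not a measurement from any particular tokenizer, and it relies on the fact that standard self-attention cost grows with the square of sequence length. The function name `compare_sequence_lengths` is made up for this example.

```python
def compare_sequence_lengths(text: str, bytes_per_token: float = 4.0) -> dict[str, float]:
    """Compare step counts for byte-level vs. token-level processing of the same text."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    n_bytes = len(text.encode("utf-8"))
    n_tokens = n_bytes / bytes_per_token  # assumed average, not measured
    return {
        "byte_steps": float(n_bytes),
        "token_steps": n_tokens,
        # Self-attention scales quadratically with length, so a sequence that is
        # k times longer costs roughly k**2 times more attention compute.
        "attention_cost_ratio": bytes_per_token ** 2,
    }


report = compare_sequence_lengths("a" * 4000)
print(report)  # 4000 byte steps vs. 1000 token steps, 16x attention cost

# Non-ASCII text widens the gap: one character can take several bytes.
print(len("naïve"), "characters,", len("naïve".encode("utf-8")), "bytes")  # 5 characters, 6 bytes
```

Under these assumptions, a byte-level model takes four times as many steps and pays roughly sixteen times the attention cost, even though its embedding table has only 256 rows.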

Pro tip: Research into byte-level models is ongoing, and the results are promising. However, expect these newer models to remain considerably more expensive for some time to come.

Byte Latent Transformers: A New Hope?

The Byte Latent Transformer (BLT) represents a significant leap forward. This architecture, as detailed in the referenced article, takes a multi-scale approach. It downsamples bytes into patches (dynamically sized groups of bytes), which a global transformer then processes. A local decoder then combines byte-level information with patch-level context to predict the next byte.

BLT’s advantage is that it processes more information per step while keeping computational and memory costs down. The strategy resembles speculative decoding, where a small model does cheap work so that the large model runs less often. In BLT’s case the savings come from downsampling bytes into patches, which lets the model cover more bytes per step and use a more memory-efficient global transformer.

Key features that contribute to BLT’s efficiency include:

  • Dynamic Patching: Using the entropy of a smaller, byte-level LLM to determine where to create patches of bytes.
  • Multi-Scale Approach: Combining byte-level information with a patch-level global transformer, so the expensive model runs once per patch rather than once per byte.
  • Compute-Controlled Scaling: Evaluating performance against subword-level models in a compute-controlled setting, rather than a compute-variable one, to provide a fairer comparison.
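The dynamic patching idea from the list above can be sketched in a few lines of Python. This is a simplified illustration under loud assumptions: a bigram byte-count model stands in for the small byte-level LLM, the threshold value is arbitrary, and the function names are invented for this example. It is not the patching code from the BLT work.

```python
import math
from collections import Counter, defaultdict


def fit_bigram_counts(corpus: bytes) -> dict[int, Counter]:
    """Count which byte follows which (a stand-in for a small byte-level LLM)."""
    counts: dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts


def next_byte_entropy(counts: dict[int, Counter], prev: int) -> float:
    """Shannon entropy, in bits, of the next-byte distribution after `prev`."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # unseen context: assume a uniform guess over 256 byte values
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())


def entropy_patches(data: bytes, counts: dict[int, Counter], threshold: float) -> list[bytes]:
    """Start a new patch wherever the next byte is hard to predict."""
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        # Entropy of the prediction for data[i], given the byte before it.
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


counts = fit_bigram_counts(b"abababab")        # hypothetical training data
print(entropy_patches(b"abab", counts, 0.5))   # [b'abab']
print(entropy_patches(b"abxab", counts, 0.5))  # [b'abx', b'ab']
```

Predictable text stays in one long patch. The unfamiliar byte `x` leaves the stand-in model unsure what comes next, so a new patch begins right after it.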

The Implications of BLT

One of the more interesting aspects of BLT is that it can be run in two different modes, which gives it an interesting “anti-fragile” property. By increasing or decreasing the number of bytes in each patch, it can dedicate more or less compute to the most unusual parts of a text. The more surprising sub-sequences get more compute, so BLT can actually benefit from the uncertainty of out-of-distribution (OOD) or near-OOD inputs. This opens up new possibilities for handling out-of-distribution data, complex reasoning tasks, and low-resource languages where tokenization struggles.
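A small numeric sketch shows how patch size translates into compute. It assumes the global transformer runs once per patch and takes a list of made-up per-byte entropy values as input. The numbers, the threshold, and the function names are all hypothetical.

```python
def patch_lengths(entropies: list[float], threshold: float) -> list[int]:
    """Split positions into patches; a new patch starts at each entropy above threshold."""
    lengths, current = [], 0
    for h in entropies:
        if h > threshold and current > 0:
            lengths.append(current)
            current = 0
        current += 1
    if current:
        lengths.append(current)
    return lengths


def global_steps_per_byte(entropies: list[float], threshold: float) -> float:
    """Global-model steps spent per byte, assuming one step per patch."""
    if not entropies:
        return 0.0
    return len(patch_lengths(entropies, threshold)) / len(entropies)


predictable = [0.2] * 12  # hypothetical entropies for boilerplate-like text
surprising = [3.0] * 12   # hypothetical entropies for unusual text

print(patch_lengths(predictable, 1.0))                 # [12]
print(patch_lengths(surprising, 1.0))                  # twelve patches of length 1
print(patch_lengths([0.2, 0.2, 3.0, 0.2, 0.2, 3.0], 1.0))  # [2, 3, 1]
print(global_steps_per_byte(surprising, 1.0))          # 1.0
```

In this toy setup the predictable stretch costs one global step for twelve bytes, while the surprising stretch costs one step per byte, which is the “more compute where the text is unusual” behavior described above.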

This model architecture could have far-reaching implications for the LLM landscape. It could pave the way for models that are better at understanding subtle nuances of language, handling complex tasks, and adapting to new information. BLT’s architecture opens the door for the creation of a model whose performance increases with more compute.

The evolution of LLMs will be shaped by several key trends:

  • End-to-End Learning: As BLT-style architectures become more prevalent, we can expect a shift towards models that learn the entire language processing pipeline, including tokenization, instead of relying on external, separately trained components. This is a key departure from how current state-of-the-art LLMs operate.
  • Multimodality: The integration of text, images, audio, and video into a single LLM. Models need to handle information across different modalities, which is a challenge for current tokenization methods and opens the door for end-to-end models.
  • Compute-Adaptive Architectures: We may see architectures that allocate computational resources dynamically based on the complexity of the input. This includes dynamic patches as well as other methods that automatically compress or expand the data based on how surprising its content is.

Expect these trends to bring innovation and disrupt the established order of how we use LLMs. If the industry continues to follow the path of the Bitter Lesson, then external tokenization may fall by the wayside, paving the way for more powerful LLMs.

FAQ Section

Q: What is tokenization?

A: Tokenization is the process of breaking down text into smaller units (tokens) that a computer can understand.

Q: Why is tokenization a problem?

A: Current tokenization methods can lead to information loss and inefficiency, as well as a lack of granularity when handling language.

Q: What are byte-level models?

A: Byte-level models process text as raw bytes instead of tokens. On their own they are not more efficient, because sequences become much longer. Architectures like BLT recover efficiency by grouping bytes into patches.

Q: What is a Byte Latent Transformer (BLT)?

A: BLT is an architecture that downsamples raw bytes into “patches,” runs a global transformer over those patches, and then uses a local decoder to predict the next byte.

Q: What are the benefits of BLT?

A: BLT could lead to models that better understand the nuances of language, handle more complex tasks, and adapt to new information.

Q: Is the BLT the “future” of LLMs?

A: It is too early to tell, but BLT looks like a major step toward higher-performing models. The evidence to watch is how it fares in compute-controlled comparisons against subword-level models.

Q: What are the main challenges with byte-level models?

A: Byte-level models must handle much longer sequences for the same text, even though their vocabulary is far smaller, which increases training time and computational costs.

Q: How does BLT compare to other models?

A: Under compute-controlled conditions, the BLT architecture has been shown to scale past current subword-level models. This could be a major advancement for the models of the future.

Q: What are the implications of BLT for the future?

A: If BLT continues to yield positive results, it may become a new standard for LLMs, and one that is also much more efficient at inference.

Q: How does this affect tokenization?

A: BLT operates directly on bytes and is self-contained, which removes the external tokenizer from the equation.

Q: What should I keep an eye on?

A: Watch for multimodal applications, compute-adaptive models, continued innovation in efficient language processing, and the impact of all three on existing methods.

