AI Geekly: Say Goodbye to Tokenizers
New Transformer model architecture and feature releases
Welcome back to the AI Geekly, by Brodie Woods, brought to you by usurper.ai. This week we bring you another round of fast-paced AI developments, packaged neatly in a 5-minute(ish) read.
TL;DR Meta’s BLT brings home the bacon; Gemini’s double
This week we start out a little more technical, with a look at a new paper from researchers at Meta that offers a compelling approach to obviate tokenizers, replacing them with lightweight transformers that feed “patches” of raw bytes into a large latent transformer model. In plain English, it proposes a way to reduce the inefficiency of translating text into a predefined token vocabulary before a Large Language Model (LLM) can process it; instead, text is served to the model in native raw-byte form, improving both training efficiency and downstream performance. While not quite a “free lunch” (there are still trade-offs), it is an overall net positive. Next, we look at the latest from Google’s Gemini 2.0 announcements this week and stack them up against the latest from OpenAI’s Shipmas to see how they compare and how some of the newest features might be applied in practical, real-world use cases. Read on below!
Meta’s BLT looks delicious
Byte Latent Transformer (BLT) paper promises improved performance
What it is: Meta researchers have developed a new byte-level large language model (LLM) architecture called Byte Latent Transformer (BLT) that operates without the need for a fixed vocabulary, unlike traditional tokenization-based models. BLT employs a novel method of dynamically grouping bytes into “patches” based on the entropy (predictability) of the next byte, allowing for dramatically more efficient allocation of computing resources. The model uses a combination of small byte-level models and a large global latent transformer to process these patches.
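For the curious, here’s a toy sketch (in Python) of the entropy-based patching idea: a small model scores how predictable the next byte is, and a new patch starts wherever that entropy spikes. The dummy_model below is a stand-in we invented for illustration; in the actual BLT a small byte-level transformer plays this role, and the paper describes patching schemes beyond this simple threshold rule.

```python
import math
from typing import Callable, List

def shannon_entropy(probs: List[float]) -> float:
    """Entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_patches(data: bytes,
                    next_byte_probs: Callable[[bytes], List[float]],
                    threshold: float = 2.0) -> List[bytes]:
    """Group bytes into patches, starting a new patch whenever the small
    byte-level model is 'surprised' (next-byte entropy above threshold)."""
    patches, current = [], bytearray()
    for i, b in enumerate(data):
        probs = next_byte_probs(data[:i])      # small model scores the prefix
        if current and shannon_entropy(probs) > threshold:
            patches.append(bytes(current))     # hard-to-predict region: close the patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# A made-up stand-in "model": uniform over 256 bytes at word starts (high entropy),
# sharply peaked elsewhere (low entropy). A real BLT uses a small transformer here.
def dummy_model(prefix: bytes) -> List[float]:
    if not prefix or prefix[-1:] == b" ":
        return [1 / 256] * 256                  # entropy = 8 bits -> start a new patch
    return [0.9] + [0.1 / 255] * 255            # entropy ~ 1.3 bits -> keep extending

print(entropy_patches(b"byte latent transformer", dummy_model))
# [b'byte ', b'latent ', b'transformer']  (boundaries land at the high-entropy points)
```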
What it means: BLT represents a departure from conventional LLM design, which typically relies on breaking down and converting text into predefined tokens using a “tokenizer”. By working directly with raw bytes and dynamically forming patches, BLT can adapt its computational focus to the complexity of the data. This method has demonstrated performance on par with leading token-based models in training efficiency, while using significantly fewer computational resources during inference (up to 50% fewer FLOPs). It also shows improved robustness to input noise (typos and garbled text) and a better understanding of sub-word elements.
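To make the tokenizer-versus-bytes distinction concrete, here’s a minimal illustration. The toy_vocab below is invented purely for the example (real tokenizers have vocabularies of roughly 100k+ entries); the point is that the byte view needs no vocabulary at all, but produces far more symbols per string, which is exactly why BLT groups bytes into patches.

```python
# Toy illustration of token-based vs. byte-level input. The "vocabulary" here
# is made up for the example, not any real tokenizer's vocab.

text = "Byte Latent Transformer"

# Token-based LLMs: text must be mapped onto a fixed, predefined vocabulary.
toy_vocab = {"Byte": 17, " Latent": 942, " Transformer": 305}
token_ids = [toy_vocab[piece] for piece in ["Byte", " Latent", " Transformer"]]
print(token_ids)        # [17, 942, 305] -- meaningless outside this vocabulary

# Byte-level models like BLT: the raw UTF-8 bytes are the input, no vocab needed.
byte_ids = list(text.encode("utf-8"))
print(byte_ids[:8])     # [66, 121, 116, 101, 32, 76, 97, 116]
print(len(byte_ids))    # 23 bytes vs. 3 tokens -- hence the need for patching
```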
Why it matters: This new architecture offers a path toward more efficient and adaptable AI models. By removing the need for fixed vocabularies, BLT can potentially handle a wider range of inputs and tasks without the limitations imposed by tokenization (to be fair, there is still some tokenization going on here, but at a smaller scale, and using transformers themselves as mini-tokenizers). This approach could lead to significant reductions in the computational cost of both training and using LLMs, making advanced AI capabilities more accessible (although currently measured improvements are limited to inference). Novel developments such as these are what give us confidence that alternative methods will prevail whenever limitations are reached with current methods (like the oft-quoted “wall” that AI scaling laws have supposedly hit).
David and Goliath
OpenAI and Google go toe-to-toe: Shipmas vs. Gemini 2.0
OpenAI Sora vs. Google Veo | Winner: Sora
With this week’s launch of OpenAI’s long-awaited Sora video generation model (covered in an AG Alert last week), Google’s Veo is left in the dust. Sora allows anyone with a $20 ChatGPT Plus subscription to create 720p videos of up to 20 seconds in length (1080p for Pro users). Relative to Sora, Veo comes up short on quality, speed, and length.
(above: a 20-second commercial we generated with Sora in 6 minutes [no sound as Sora is video-only. For an example with sound see this clip we made Monday])
OpenAI o1 Pro vs. Gemini 2.0 | Winner: Draw
Readers will recall that last week OpenAI took the wraps off its state-of-the-art o1-pro model (exclusive to $200/month Pro subscribers), which uses additional Test-Time Compute (i.e. additional computation power at the time of query/prompt) to “reason” using chain-of-thought logic and provide better responses. This week Google announced its Gemini 2.0 model with 2.0 Flash (its “workhorse” model), which outperforms prior models and will form the basis of the company’s Agentic tools. Really, the two models aren’t comparable. OpenAI’s model is geared towards the top-of-the-line, high-value use cases of researchers and engineers looking to eke out that extra 5-10% of accuracy from a frontier model, whereas Gemini 2.0 Flash is meant to be a bread-and-butter, day-to-day model serving as the backbone for GenAI applications that run the gamut.
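Neither OpenAI nor Google has published the internals of their reasoning models, but the general “spend more compute at query time” idea can be sketched with a well-known technique such as self-consistency: sample several chain-of-thought completions and keep the most common final answer. The generate function below is a hypothetical stand-in for any LLM call; this is an illustration of the concept, not how o1 actually works.

```python
from collections import Counter
from typing import Callable

def answer_with_test_time_compute(prompt: str,
                                  generate: Callable[[str], str],
                                  n_samples: int = 8) -> str:
    """One simple test-time-compute strategy (self-consistency): spend extra
    compute by sampling several independent chain-of-thought completions,
    then return the most common final answer."""
    final_answers = []
    for _ in range(n_samples):
        completion = generate(
            "Think step by step, then give a final answer on the last line.\n"
            + prompt
        )
        final_answers.append(completion.strip().splitlines()[-1])  # last line = answer
    return Counter(final_answers).most_common(1)[0][0]

# `generate` is a placeholder for any LLM sampling call (hosted API or local model).
```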
OpenAI Advanced Voice Mode with Live Video vs. Google AI Studio Screen Share | Winner: OpenAI
Both Google and OpenAI decided this week that LLMs should start getting a better look at what users are looking at: ChatGPT added video and screen-sharing capabilities for Plus and Pro users to its Advanced Voice Mode, which now allows users to share content with the chosen model in real time. Google unlocked the capability to share the user’s screen with its Gemini models via the Google AI Studio playground. Honestly, we’re happy with both offerings. One of the best use cases for this solution is for users to turn on screen sharing when using an unfamiliar piece of software (say Photoshop for the average person, or Salesforce in a corporate setting): the AI assistant can act as a tutor looking over the user’s shoulder and guide them through using the tool live! The cost? Effectively nil. The implications for employee training on any software from Excel to Blender and beyond are worth considering. Upskilling just became table stakes. We’ll give the win to OpenAI for its integration of live video. Sure, Google’s AI has had something similar for a couple of months, but… nobody uses it and it’s restricted to Android devices. This is cross-platform AND performant.
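For readers who want to experiment with the screen-as-context pattern today, here’s a rough one-shot approximation in Python: grab a screenshot and send it to a vision-capable model along with a question. This is not the live video streaming OpenAI and Google shipped, just a sketch of the same idea; it assumes the openai and Pillow packages are installed, an OPENAI_API_KEY is set, and that a vision-capable model (gpt-4o here) is available to you.

```python
import base64, io
from openai import OpenAI              # official OpenAI Python SDK (assumed installed)
from PIL import ImageGrab              # Pillow screenshot helper (assumed installed)

def ask_about_my_screen(question: str) -> str:
    """One-shot version of the 'AI tutor looking over your shoulder' idea:
    capture the screen and send it, with a question, to a multimodal model."""
    shot = ImageGrab.grab()                             # screenshot as a PIL image
    buf = io.BytesIO()
    shot.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()                                   # reads OPENAI_API_KEY
    resp = client.chat.completions.create(
        model="gpt-4o",                                 # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# e.g. ask_about_my_screen("I'm in Photoshop -- how do I add a layer mask here?")
```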
Coming Soon: OpenAI Agents vs. Google Gemini Agents | Winner: TBD
They’re both talking a big game about agents (Google’s Project Mariner will be limited to browser interactions), but so far only Anthropic’s Computer Use Agent has delivered (and it needs work). We don’t give out points for promises, only delivered solutions. Until agents are actually released (which may be part of OpenAI’s Shipmas) we’re in a holding pattern on this one.
Before you go… We have one quick question for you:
If this week's AI Geekly were a stock, would you:
About the Author: Brodie Woods
As CEO of usurper.ai and with over 18 years of capital markets experience as a publishing equities analyst, an investment banker, a CTO, and an AI Strategist at leading North American banks and boutiques, I bring a unique perspective to the AI Geekly. This viewpoint is informed by participation in two decades of capital market cycles from the front lines; publication of in-depth research for institutional audiences based on proprietary financial models; execution of hundreds of M&A and financing transactions; leadership roles in planning, implementing, and maintaining the tech stack for a broker dealer; and, most recently, heading the AI strategy for the Capital Markets division of the eighth-largest commercial bank in North America.