AI Geekly: TTC = AGI 2025?

Turns out thinking longer is thinking smarter

Welcome back to the AI Geekly, by Brodie Woods, brought to you by usurper.ai. This week we bring you yet another week of fast-paced AI developments packaged neatly in a 5 minute(ish) read. Note: the next AI Geekly will be published on Jan 1, 2025! See you in the future.

TL;DR o3 AGI?; Google GOAT; Apollo Moonshot

Well, well, well. The year’s just about wrapped up, so AI news should be quieting down, right? WRONG! As loyal Geekly readers will remember from last year, year-end happens to be one of the most active times on the AI calendar:

OpenAI (OAI) wrapped up their “12 days of Shipmas” with a few little stocking stuffers (1-800-ChatGPT, Search, agency and more!) and one major gift for the whole family on Day 12: the announcement of the new o3 reasoning model, the most advanced AI model announced to date and a critical step on the path toward Artificial General Intelligence (AGI). So much so, in fact, that achieving some level of AGI in 2025 may now be a possibility, at least according to benchmarks. The other consideration is cost…

Google has had an impressive few weeks as well. Not content to let OAI hog the stage, it has been dropping a series of announcements of its own, including the Gemini 2.0 family of models, Astra (an AI visual model for handsets and headsets), Mariner (AI that uses your browser), Imagen 3 (image generation), and Deep Research (AI that performs deep topic research), as well as a Thinking version of Gemini 2.0 Flash that leverages Test Time Compute (TTC) in a similar way to OAI’s o-family of reasoning models (and Alibaba’s QwQ model, based on same). Perhaps most impressive: the company dropped its Veo2 video generation model, and it absolutely knocks the socks off OAI’s Sora (granted, Veo2 is not yet widely available) in its ability to follow prompts and its understanding of physics. Google is truly righting the ship; OpenAI is going to have some very fierce competition in 2025.

Finally, we’ll take a peek at some interesting models from a collab between Meta and Stanford. Researchers announced the Apollo family of open-source Large Multimodal Models (LMMs) (look, there are lots of families, ok?), including Apollo-7B, a new State of the Art (SOTA) model that can comprehend an hour of video and run on local hardware (a first for a local model of such a small size). Notably, these models are based on Alibaba’s Qwen 2.5 rather than Meta’s own Llama 3 models.

Read on below!

5 More Minutes Until Pencils Down
Reasoning models, OpenAI’s o3 and Test Time Compute

What it is: Test Time Compute (previously covered in the Geekly) is having a bit of a moment. What is it? Quite simply, it’s the concept of giving an AI model (typically a Large Language Model [LLM]) more time to “reason through” its response to a prompt/query at test time (the moment the prompt is input), thereby producing a more accurate response. This allows a model to reflect on its proposed responses, break the problem down step by step, and introduce a Chain of Thought reasoning sequence that whittles the possible responses down to a more accurate subset. Reflexion and Chain of Thought are both concepts developed in the open-source AI research community over the past several years that have since been applied in TTC reasoning models (OAI’s o1 and o3, Alibaba’s QwQ, etc.). With that context, we note that OAI’s breakthrough is indeed impressive, but it stands on the shoulders of giants in the open-source community, without whom these latest advances would not have been possible (nor the initial transformer architecture underlying LLMs in general, but we digress).
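To make the mechanics concrete, here is a minimal sketch of one common test-time compute recipe, self-consistency: sample several Chain of Thought completions, then majority-vote on the final answers. This is illustrative only; the `generate` callable, the prompt format, and the "ANSWER:" convention are our own assumptions, not how o1/o3 or QwQ work internally.

```python
from collections import Counter
from typing import Callable

def answer_with_ttc(
    question: str,
    generate: Callable[[str], str],  # any LLM call: prompt in, completion out
    n_samples: int = 16,             # more samples = more test-time compute
) -> str:
    """Self-consistency: sample several reasoning chains, majority-vote the answers."""
    prompt = (
        "Think step by step, then give your final answer on a new line "
        f"as 'ANSWER: <answer>'.\n\nQuestion: {question}"
    )
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt)  # assumes a sampling temperature > 0
        # Keep whatever follows the last 'ANSWER:' marker (our own convention).
        answers.append(completion.rsplit("ANSWER:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```

The dial here is `n_samples`: spend more inference-time compute and (usually) get better accuracy, which is exactly the trade we discuss below.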

What it means: With the announcement of the o3 model on Friday, OpenAI showed us the most impressive AI model built to date (based on benchmarks) and the closest publicly known model yet to achieving Artificial General Intelligence (AGI), i.e., AI generally understood to be as competent as a human co-worker. Based on the ARC-AGI Semi-Private v1 Benchmark (a respected AGI test that keeps some data private to prevent overfitting/cheating) depicted above, OpenAI’s best models went from ~5% to 88% in a year. The average human off the street would score ~75% on the same test (a measure of human-like intelligence in solving problems without training). The rate of improvement on that chart is phenomenal, and we expect this progress to continue.

Why it matters: Based on this rate of progress, it seems possible that some level of AGI could be achieved in 2025, although opinions differ on exactly how to quantify AGI in the first place (indeed, the ARC-AGI test is already being rewritten in light of o3’s performance). Personally, we would describe AGI as an AI with the intelligence and competency to complete the tasks of essentially any white-collar worker; with AI agency capabilities (the ability to carry out actions) expanding in 2025, this will be all the more meaningful. That said, the current o3 model costs roughly $1,000 per task. To complete the above test, the low-compute o3 model ran up $10k in compute costs to score 76%, while the high-compute model cost $1.6 mm to score 88%. So, we’re certainly not saying we’re there today, but AGI is fast approaching, and these costs will decline.

Then what? We’ll save our predictions on the impact of AGI for a separate piece; it’s far, far too much to cover in one paragraph or news blurb. Instead, we’ll note the following. We view TTC as a reasoning-enhancing wrapper for models: you can take smaller, weaker, underperforming models and wrap them with TTC to achieve better results. The trade-off: slower responses and more inference compute in exchange for better accuracy. The result? Smaller, smarter models with less up-front training and lower VRAM usage, usable on consumer hardware. Alibaba’s QwQ applies TTC-style reasoning to a 32B model, and we’ve seen it match or outperform o1-preview and o1-mini on a handful of benchmarks; that’s a 32B parameter model going toe-to-toe with a reportedly multi-TRILLION parameter model! In terms of training cost, that’s roughly a small model matching or exceeding a larger model that cost ~1,000x more to train.
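As a sketch of what “wrapping” a small model can look like in practice, here is a best-of-N scheme: draw several candidate answers from a local model and keep the one a verifier scores highest. To be clear, this is a generic recipe under our own assumptions (the `small_model` and `score` callables are placeholders), not QwQ’s actual internals.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    small_model: Callable[[str], str],   # e.g., a local 32B-class model
    score: Callable[[str, str], float],  # verifier/reward model; higher is better
    n: int = 8,                          # the compute dial: latency grows ~linearly
) -> str:
    """Trade inference time for accuracy: sample n candidates, keep the best-scored."""
    candidates = [small_model(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

Note that nothing about the small model itself changes; all of the extra “intelligence” is bought at inference time, which is why the same trick ports to consumer hardware.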

Open-Source AGI: With the foregoing in mind, we expect TTC to act as one of the key bridging principles that gives the open-source community its own flavor of AGI. TTC has demonstrated that smaller models can punch above their weight and outperform models orders of magnitude larger. Given how quickly the open-source QwQ followed o1-preview, we expect the democratization of AGI to follow close on the heels of closed AGI (provided that regulators do not impede the progress of open source).

RIP Sora: Best AI Video Generator: Dec 9, 2024 - Dec 16, 2024
Lifespan of a top AI model

What it is: OpenAI gave it their best. And I guess their best wasn’t good enough. Sora landed with a splash just two weeks ago aaaaaand it’s already outdated. Google has demonstrated what it’s capable of when it gets all of the elements of the Google juggernaut moving in the same direction with purpose, and it’s a beautiful thing. Hats off to Sundar. He’s been accused of not being the right (wartime) CEO for the moment, but we have to admit that, with the release of Veo2 and the series of announcements over the past two weeks, he’s getting serious about bringing the fight to OAI and MSFT.

What it means: 2025 is going to be an AI bloodsport. Expect more tit-for-tat announcements between the two companies, but keep in mind: OpenAI has had a bit of a lead here, and it seems to be narrowing. Google will unquestionably release ARC-AGI scores that beat o3-high (probably at a lower cost but with longer compute time, using GOOG’s proprietary TPUv5 chips). The time between a groundbreaking OAI announcement and the Google catch-up is shrinking further and further. Realistically, there isn’t anything OpenAI has that Google doesn’t; if anything, OpenAI has less capital…

Why it matters: Competition like this is so, so good for consumers. We’ve been watching the cost of compute, the cost of inference, even the cost of intelligence itself decline with the development of better and better hardware, optimized models, tighter-integrated infrastructure, and more players of various scales getting involved in the space (from hyperscalers to researchers to start-ups). We see what happens in monopolistic spaces like GPUs, where a single player (Nvidia) dominates, and how that hampers development, creating a class system of the GPU-rich (closed source) and the GPU-poor (open source). As the likes of OAI, Google, Meta, AWS, and more vie for supremacy, they consistently undercut one another on price (Meta’s models are free, for crying out loud!) while offering better and better performance, ultimately benefitting consumers. The ultimate challenge will be how consumers use these new tools and how they offset the deleterious effects in other parts of the economy. This is where open source comes in.

Showtime at the Apollo
More open-source firsts for Meta and friends

What it is: Researchers at Meta and Stanford have published a paper introducing Apollo, a new family of open-source LMMs designed to process and understand video content with high accuracy. The authors found that design choices that work well on smaller models are often the right choices when applied to larger models as well, a property they term Scaling Consistency. This can reduce training time, save on costs, and use less power.
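In workflow terms, Scaling Consistency suggests something like the following: run your design ablations on a small, cheap model, then reuse the winning configuration for the expensive large run. A hypothetical sketch (the sizes, option names, and `train_and_eval` pipeline are our illustrative stand-ins, not the paper’s actual code):

```python
from itertools import product
from typing import Callable

def pick_design(train_and_eval: Callable[[str, dict], float]) -> dict:
    """Sweep design choices at small scale; per Scaling Consistency,
    the winning configuration should transfer to the larger model."""
    encoders = ["encoder_a", "encoder_b"]   # candidate vision encoders (illustrative)
    samplers = ["fps", "uniform"]           # frame-sampling strategies
    best_cfg, best_score = None, float("-inf")
    for enc, samp in product(encoders, samplers):
        cfg = {"vision_encoder": enc, "sampling": samp}
        s = train_and_eval("1B", cfg)       # cheap ablation on a small model
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg                          # then train the 7B with this config
```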

What it means: The Apollo models, which come in multiple sizes, are built to handle videos up to an hour long. One of the smaller models, Apollo-3B, outperforms most other models of a similar size on several benchmarks. The research highlights that frames-per-second video sampling leads to better results than uniform sampling, and that certain vision encoders (pre-trained models that help the AI "see") work better than others for video-based tasks. The application of Scaling Consistency means researchers can experiment with smaller, less resource-intensive models to find effective design choices that will translate to larger, more powerful models.
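The fps-vs-uniform distinction is easy to see in code. Below is a rough sketch under our own assumptions (the rates and caps are illustrative; the paper’s actual sampling parameters may differ): fps sampling keeps temporal density constant, while uniform sampling stretches the same frame budget over any duration.

```python
def fps_sample(total_frames: int, video_fps: float,
               target_fps: float = 2.0, max_frames: int = 256) -> list[int]:
    """Pick frames at a fixed temporal rate (e.g., 2 per second of footage),
    so a longer video yields proportionally more frames (up to a cap)."""
    step = video_fps / target_fps                  # source frames per sampled frame
    n = int(total_frames / step)
    return [int(i * step) for i in range(n)][:max_frames]

def uniform_sample(total_frames: int, n_frames: int = 64) -> list[int]:
    """Spread a fixed frame budget evenly across the clip, so a 1-minute video
    and a 1-hour video get the same frame count at very different densities."""
    step = total_frames / n_frames
    return [int(i * step) for i in range(n_frames)]
```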

Why it matters: These types of open-source studies are exactly why the AI space has been moving so quickly over the past several years and a major reason why we continue to rail against closed-source AI development. By identifying efficient design choices, the study helps in developing more powerful AI models without needing enormous amounts of computational power (save it for Test Time!). The Apollo models set a new performance bar, showing that even smaller models can achieve impressive results with the right design. We are firm believers that smaller and more efficient models will continue to outperform their older, larger, unoptimized siblings.

Before you go… We have one quick question for you:

If this week's AI Geekly were a stock, would you:


About the Author: Brodie Woods

As CEO of usurper.ai and with over 18 years of capital markets experience as a publishing equities analyst, an investment banker, a CTO, and an AI Strategist at leading North American banks and boutiques, I bring a unique perspective to the AI Geekly. This viewpoint is informed by participation in two decades of capital market cycles from the front lines; publication of in-depth research for institutional audiences based on proprietary financial models; execution of hundreds of M&A and financing transactions; leadership roles in planning, implementing, and maintaining the tech stack for a broker-dealer; and, most recently, heading the AI strategy for the Capital Markets division of the eighth-largest commercial bank in North America.