Why Buy The Cow?

When you can train for free...

Welcome back to the AI Geekly, by Brodie Woods, bringing you yet another week of fast-paced AI developments packaged neatly in a 5 minute(ish) read.

This week we’re getting verrryyy close to releasing our coveted AI 101 Primer report (coveted by us, anyways). Therefore, we keep it short and sweet once more, homing in on a specific topic we think readers will find interesting: content licensing. Our friends at OpenAI (maybe AI companies don’t have friends per se, so “associated beings of interest” may be more befitting) were in the news this week, inking a deal with the Financial Times, while its to-the-death rival, Google, me-too’ed an agreement with The Wall Street Journal. The latter is probably the more useful dataset if we’re measuring training data by the innate intelligence of the subject material (apologies to any die-hard FT fans out there).

It’s a change of pace for a sector that has been known to steal the training data it needed and beg forgiveness later. Why pay now when you’ve already trained on the data for free? Our theory is that, by setting market prices for this data now, these companies are preparing for the day the current and future lawsuits over intellectual property infringement vis-à-vis AI model training have their day in court. The price discovery happening now will anchor the prices of settlement agreements, damages, and more. All a bit of CYA (you should know this one…) insurance for the future. So if, for example, the NY Times settles its lawsuit against OpenAI, the settlement will be for a sum closer to one of the prices paid for its data, as opposed to some arbitrarily high number.

Cuckoo for Cocoa-ntent Licensing
Can’t get enough of that Data crisp…

What it is: Major tech companies continued their AI licensing bonanza this week, signing agreements with major content providers to use their content in training AI models. Google and OpenAI signed separate content licensing agreements with The Wall Street Journal and the Financial Times, respectively. Google’s WSJ deal was for $5-6 mm/annum (not bad), while OpenAI played its cards close to the vest, declining to disclose the sum agreed. Note that we’ve seen a wide range here, with other OpenAI licensing agreements running from as low as $1 mm to as high as $5 mm.

What it means: We’re squarely in the price discovery phase of AI training content licensing, and we’ve seen quite a range of prices floated for various training data assets. This includes the $60 mm/annum deal between Google and Reddit and a host of secretive deals signed by OpenAI, Google, Meta, Microsoft, Apple and Amazon with Shutterstock, ranging from $25-$50 mm each (and often later upsized). Overall, the market for AI data has been pegged at $2.5 Bn today, growing to $30 Bn within the decade.

Tell me more: Part of what we’re seeing here is the big tech companies engaged in a sort of data-prospecting/homesteading land rush, racing to stake claims on the most valuable datasets before their competitors. Huge players like MSFT, GOOG, AMZN and META can bring their enormous balance sheets to bear here. If you’re Google/Alphabet, you can afford to pay $60 mm/yr for ONE dataset: you’ve got $120 Bn sitting in cash alone. Same deal if you’re any of the other big guys. But what about OpenAI? Anthropic? Mistral, Cohere, etc.? The relatively smaller players? It’s one thing to raise a few hundred million dollars and value your company at a couple of billion, as these guys have. It’s quite another to have a couple of billion lying around in the bank, like the Big Tech players do.

What’s happening: Big Tech is trying to “buy the pot” (poker, not drugs), using their cost of capital advantage (as an ex-analyst it’s nice to use those words again) to buy up exclusivity on as much data as possible. Failing exclusivity, if they can set the market price high enough, they can effectively lock out their competitors: if the going rate for Reddit is set at $60 mm/yr, how will players who have only raised a few hundred million afford all that juicy data AND the compute (processing power) to train on it and infer from it? Spoiler: they can’t, and that’s precisely Big Tech’s game to squeeze out the little guys.

Why it matters: Mania drives down quality. We’ve seen this movie before. In our experience in the data brokerage industry, prices were sky-high in the early days, as promises of secret datasets unlocking untold alpha (stock market upside) were bandied about. Reality set in not long after. The datasets were messy. The signal was noisy. The expectations of buyers and sellers drifted further apart, and some of the truly valuable datasets were pulled off the market as buyers, disillusioned by garbage dataset after garbage dataset priced at >$1 mm, walked away, shrinking the market. What remains today in the capital markets data brokerage universe is perhaps healthier than what came before: prices have come down from the stratosphere, transactions are more frequent (a volume game), and buyers and sellers have grown up; arguably, the market has democratized a little. That said, some truly amazing datasets have been lost forever…

Before you go… We have one quick question for you:

If this week's AI Geekly were a stock, would you:


About the Author: Brodie Woods

With over 18 years of capital markets experience as a publishing equities analyst, an investment banker, a CTO, and an AI Strategist at leading North American banks and boutiques, I bring a unique perspective to the AI Geekly. This viewpoint is informed by participation in two decades of capital market cycles from the front lines; publication of in-depth research for institutional audiences based on proprietary financial models; execution of hundreds of M&A and financing transactions; leadership roles in planning, implementing, and maintaining the tech stack for a broker-dealer; and, most recently, heading AI strategy for the Capital Markets division of the eighth-largest commercial bank in North America.