I just read an AI data article. Here is what it got right, and what it missed entirely.
I came across this on Friday, the one about AI labs running out of training data.
I appreciated the urgency, but I kept stopping at the framing.
Is nobody talking about it?
Researchers have been publishing on this since 2022, with findings appearing at ICML, in Nature, and in industry reports.
The problem is not a lack of conversation.
It is a lack of consequence.
So let me give you Thor’s version that actually connects the dots.
The ceiling is real, and the math has been done
Epoch AI estimates the effective stock of quality-adjusted public human text at roughly 300 trillion tokens. That sounds enormous until you compare it with current training appetites, at which point it becomes a runway with a visible end: somewhere between 2026 and 2032, potentially earlier. What I find more interesting than the headline number is the revision itself. Epoch's 2022 paper had predicted exhaustion by 2024.
We are still here because two methodological updates bought time, not because the direction changed.
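To make the runway arithmetic concrete, here is a minimal back-of-envelope sketch in Python. It is not Epoch's methodology. The 300 trillion token stock is the figure cited above; the starting dataset size and the annual growth multipliers are hypothetical inputs chosen only for illustration, and the function name is my own.

```python
import math


def years_until_exhaustion(stock_tokens: float,
                           dataset_tokens: float,
                           annual_growth: float) -> float:
    """Years until a dataset that grows by `annual_growth`x per year
    reaches `stock_tokens`. Pure compound-growth arithmetic."""
    if dataset_tokens >= stock_tokens:
        return 0.0  # the stock is already fully used
    if annual_growth <= 1.0:
        return math.inf  # a dataset that does not grow never hits the ceiling
    return math.log(stock_tokens / dataset_tokens) / math.log(annual_growth)


STOCK = 300e12          # effective stock of public human text cited above
START_DATASET = 15e12   # ASSUMPTION: hypothetical size of a current training set

# ASSUMPTION: hypothetical yearly growth multipliers for dataset size
for growth in (1.5, 2.0, 3.0):
    years = years_until_exhaustion(STOCK, START_DATASET, growth)
    print(f"growth {growth}x per year -> about {years:.1f} years of runway")
```

The point of the sketch is the shape of the result, not the exact numbers: because the growth is compounding, even large changes in the assumed stock move the end date by only a few years.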
The synthetic data shortcut has a published failure mode
Every time someone proposes synthetic data as the answer, I think of the 2024 Nature paper. Shumailov and colleagues documented what they call model collapse: as models train on AI-generated data across successive generations, they progressively lose the rare, low-frequency events that give human language its full range. By generation nine of their experiments, the outputs had converged on semantically hollow text. The model had forgotten what humans sound like. Consider that 74.2% of newly created web content is now synthetic. The recursion is already underway.
Synthetic data as a supplement is workable. As a structural replacement for authentic human expression, it is a gradual erosion of the very thing these models were built to approximate.
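The mechanism is easy to see in a toy simulation. The sketch below is not the setup from the Nature paper, which used real language models. It assumes a hypothetical vocabulary with Zipf-like frequencies and a "model" that is nothing more than the empirical token frequencies of its training sample. Each generation trains only on text sampled from the previous generation. All names and parameter values are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size: int = 1000,
                      sample_size: int = 5000,
                      generations: int = 9,
                      seed: int = 0) -> list[int]:
    """Return the number of distinct tokens still producible after each
    generation of training on the previous generation's output."""
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # ASSUMPTION: "human" text follows a Zipf-like law (weight 1/rank),
    # so most tokens are rare.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    surviving = [vocab_size]
    for _ in range(generations):
        # Generate a finite corpus from the current model.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        # "Retrain": the next model is just the observed frequencies.
        # A token that was never sampled gets weight 0 and cannot return.
        weights = [counts.get(token, 0) for token in tokens]
        surviving.append(sum(1 for w in weights if w > 0))
    return surviving


if __name__ == "__main__":
    for generation, count in enumerate(simulate_collapse()):
        print(f"generation {generation}: {count} distinct tokens remain")
```

The count can only go down. Once a rare token fails to appear in one generation's output, no later generation can recover it, which is the tail-loss behavior described above in miniature.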
The licensing market already knows; the headlines do not
Rob Kelly has tracked 91 publicly announced AI content licensing deals as of June 2026, with estimates of 50 to 100 private deals for every public one. The telling signal is structural.
Deals involving live access and attribution, rather than one-time archive sales, grew from 2 in 2023 to a projected 34 in 2026. Content owners are not selling their past. They are monetizing their ongoing production of human thought because they understand its scarcity value better than the labs consuming it at scale.
The part nobody has said clearly
Here is the thing the headline writers keep missing about the data problem. The solution is not something you can build in response to the crisis. The window for that closed around early 2024, when the ratio of authentic to synthetic web content crossed a threshold that it will not cross back.
What the industry actually needs is an anchor to the pre-AI era, a verified, timestamped, cryptographically immutable record of human-generated content that predates the contamination.
That anchor either already exists or does not exist.
It's binary: no compromise, no "let's wait and see," no "let's hope someone fixes it for us."
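For readers wondering what "timestamped and cryptographically immutable" could mean mechanically, here is a minimal sketch of a hash-chained record in Python. It is an illustration under stated assumptions, not a description of any existing archive or product. The field names and helper functions are hypothetical. A chain like this only makes later edits detectable; proving that a timestamp is genuine would also require an independent witness, for example publishing the hashes somewhere outside the record keeper's control at the time of writing.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(content_hash: str, timestamp: str, prev_hash: str) -> str:
    """Hash the entry's fields in a canonical (sorted-key) JSON form."""
    payload = json.dumps(
        {"content_hash": content_hash,
         "timestamp": timestamp,
         "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list, text: str, timestamp: str) -> dict:
    """Append a record that commits to the text, its timestamp,
    and the hash of the previous entry."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    entry = {
        "content_hash": content_hash,
        "timestamp": timestamp,
        "prev_hash": prev_hash,
        "entry_hash": _entry_hash(content_hash, timestamp, prev_hash),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited field or reordered entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        expected = _entry_hash(entry["content_hash"],
                               entry["timestamp"],
                               entry["prev_hash"])
        if entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A usage example with made-up content: append two records with `append_record(chain, "some human-written post", "2019-05-01T12:00:00Z")`, confirm `verify_chain(chain)` returns `True`, then change a stored timestamp and watch it return `False`. Backdating after the fact breaks every hash that follows, which is why such a record cannot be created retroactively.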
I will leave it there for now.
But I would encourage you to think carefully about what it means that the most valuable dataset in the next decade of AI development is not something anyone can build today.
It had to have been built before the problem became obvious.
The question worth asking is not whether such a thing exists.
It is whether you know where to look.