Imagine you've spent a decade making the world's fastest race-car engines, and one day you wake up to find the customer doesn't want engines anymore. They want entire factories. That, in a sentence, is what just happened to the AI hardware business.

"Moore's Law can't keep up with 10x and exponential use of AI. We need to embrace extreme co-design."

That was Jensen Huang, NVIDIA's CEO, at CES 2026. "Co-design" means designing the chip, the rack, the cooling, the network, and the software as one integrated product. The official read is that GPUs keep getting better. The actual read is that frontier AI inference is no longer a chip problem. It is a whole-system problem, designed at the scale of racks, not individual silicon dies.

The next generation of frontier models lands at 10 trillion parameters. Not as a research curiosity, but as production inference that serves real users at real prices.

To understand why this is a threshold, picture how AI gets served today. Today's frontier dense models, the Claude Opus 4.6 and GPT-5.4 class, fit inside a single NVIDIA cabinet roughly the size of a tall refrigerator: 72 GPUs wired together so they behave as one giant accelerator, around 21 terabytes of fast memory, 120 kilowatts of power. The technical name is GB300 NVL72. Most of the inference economy you interact with today, every API call, every chatbot response, runs on a system that fits inside one of these cabinets.

10 trillion parameters is the first point on the roadmap where that single-cabinet assumption breaks. The model is too big for one box. It has to spread across multiple cabinets, with optical fiber doing the wiring that copper used to do, and a software stack that did not exist eighteen months ago.

This matters beyond the silicon. The cost of serving a 10T model sets the price floor for the next wave of agentic AI products, the capex envelope for hyperscalers (Amazon, Google, Microsoft, Meta), and the regional power footprint of every country trying to host it. The number on the next preview model's price sheet will tell you, in a single line, how the rest of the AI economy gets priced.

This piece covers two halves: the hardware reality, then the software stack that keeps the hardware fed.

Why 10 Trillion Is a Threshold

The math is simple and brutal. AI models are stored as long lists of numbers, and those numbers occupy memory. The size of each number depends on the precision format you use, the AI equivalent of choosing a JPEG quality setting: more bits per number means higher fidelity but bigger files.

At 16-bit precision, the format used to train most frontier models, a 10 trillion parameter model is 20 terabytes of weights before you store a single token of context. That alone all but fills the roughly 21 terabytes of an entire 72-GPU cabinet, with nothing left for working memory. At 8-bit, you get to 10 terabytes. At a new 4-bit format called NVFP4 that the industry is converging toward, you get to 5 terabytes. Five terabytes just for the weights, with no room yet budgeted for the working memory every conversation needs.
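
To make that concrete, here is the arithmetic as a short Python sketch. The inputs are exactly the numbers above; nothing else is assumed.

```python
# Back-of-envelope: weight memory for a 10 trillion parameter model.
PARAMS = 10e12                # 10T parameters
CABINET_MEMORY_TB = 21        # pooled fast memory in one GB300 NVL72 cabinet

bytes_per_param = {
    "FP16/BF16 (16-bit)": 2.0,
    "FP8 (8-bit)":        1.0,
    "NVFP4 (4-bit)":      0.5,
}

for fmt, nbytes in bytes_per_param.items():
    weights_tb = PARAMS * nbytes / 1e12
    share = weights_tb / CABINET_MEMORY_TB
    print(f"{fmt:20s} -> {weights_tb:4.0f} TB of weights "
          f"({share:.0%} of one cabinet, before any KV cache)")

# FP16/BF16 (16-bit)   ->   20 TB  (95% of one cabinet)
# FP8 (8-bit)          ->   10 TB  (48% of one cabinet)
# NVFP4 (4-bit)        ->    5 TB  (24% of one cabinet)
```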

I called this number last December. In The Memory Wars, I wrote that "GPT-4's estimated 1.76T parameters require ~3.5TB in FP16. By 2028, we're looking at 10T+ parameter models requiring 5TB minimum." That floor assumed aggressive quantization. The unsparing version of the math is what Anthropic, OpenAI, and Google are quietly engineering against right now.

Reasoning makes it worse. When a modern reasoning model handles a complex query, it generates an internal chain of thought, sometimes tens of thousands of words long, before producing a visible answer. Think of it as the model talking to itself in scratch paper before writing the final answer. Every word of that scratch paper has to live in something called the KV cache, the model's working memory of an active conversation.

When you go from a 2,000-word query to a 50,000-word reasoning chain, that working memory grows roughly 25 times. A 10T model running million-word contexts with multi-hour agentic sessions does not have a 40-gigabyte-per-user working memory like today's models. It has hundreds of gigabytes per active session, and a serving system has to hold thousands of those simultaneously.
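
Here is a rough sizing sketch for that working memory. Every architecture number below (layer count, KV heads, head dimension, an 8-bit cache) is an illustrative assumption, not a published spec for any real model; the 25x ratio falls out regardless.

```python
# Rough KV-cache sizing. Every architecture number here is an illustrative
# assumption, not a published spec for any real frontier model.
N_LAYERS = 120         # transformer layers (assumed)
N_KV_HEADS = 16        # KV heads after grouped-query attention (assumed)
HEAD_DIM = 128         # dimension per head (assumed)
BYTES_PER_VALUE = 1    # 8-bit KV cache (assumed)
TOKENS_PER_WORD = 1.3  # rough English tokenization ratio

# Keys and values, per token, across all layers.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

for words in (2_000, 50_000, 1_000_000):
    tokens = words * TOKENS_PER_WORD
    gb = tokens * kv_bytes_per_token / 1e9
    print(f"{words:>9,} words -> ~{gb:7.1f} GB of KV cache")

# 2,000 -> ~1.3 GB, 50,000 -> ~32 GB (the 25x jump), 1,000,000 -> ~640 GB:
# hundreds of gigabytes per active session, exactly as above.
```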

This is why the price sheet for the next preview model is going to be a multiple of today's pricing.

The Hardware Bill at 10T

Start with the floor. A 10T model in 4-bit precision is 5 terabytes of weights. NVIDIA's next-generation GPU, called Rubin Ultra, ships with 1 terabyte of high-bandwidth memory per chip. That is five GPUs of weight storage before you serve a single user. Add working memory for production traffic at million-word contexts, and you are at 8 to 12 GPUs of pure memory residence. Add the compute for processing prompts (called prefill) and generating responses (called decode), expert overflow if the model uses a mixture-of-experts architecture, a small "draft" model for speculative decoding, and a margin for traffic hot spots. The serving footprint per replica lands between 32 and 64 Rubin Ultra GPUs.
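
A sketch of the footprint math. The weight and HBM figures are the ones stated above; the KV-cache budget and compute multiplier are loose assumptions chosen to show how you get from 5 GPUs of weights to a 32-to-64-GPU replica.

```python
# Per-replica GPU footprint for a 10T model on Rubin Ultra-class parts.
# WEIGHTS_TB and HBM_PER_GPU_TB come from the text; the KV budget and
# compute multiplier are loose assumptions for illustration.
WEIGHTS_TB = 5.0         # 10T parameters at 4-bit (NVFP4)
HBM_PER_GPU_TB = 1.0     # Rubin Ultra HBM per GPU, per the roadmap

weight_gpus = WEIGHTS_TB / HBM_PER_GPU_TB          # 5 GPUs of pure weights

# Working memory: hundreds of GB per session, thousands of sessions in
# the fleet -- so serving caps concurrency per replica. Assumed budget:
KV_BUDGET_TB = 4.0

memory_gpus = (WEIGHTS_TB + KV_BUDGET_TB) / HBM_PER_GPU_TB

# Headroom for prefill compute, decode compute, MoE expert overflow, the
# draft model, and traffic hot spots, folded into one assumed multiplier.
COMPUTE_MULTIPLIER = 4.0

replica_gpus = memory_gpus * COMPUTE_MULTIPLIER
print(f"Weights alone:     ~{weight_gpus:.0f} GPUs")
print(f"Memory residence:  ~{memory_gpus:.0f} GPUs")
print(f"Serving replica:   ~{replica_gpus:.0f} GPUs")  # in the 32-64 range
```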

That is one replica. Production fleets run dozens in parallel.

The natural deployment unit is no longer the 72-GPU cabinet. It is something NVIDIA calls NVL576: 144 GPU packages wired into a single coherent system with around 147 terabytes of pooled fast memory. Picture a row of eight cabinets stitched together with optical fiber so they behave like one giant computer. That is the first system on the public roadmap with enough addressable memory to hold a 10T-class model plus its working set. Below NVL576, you have to split the model across separate cabinets, which means slow network traffic on the critical path of every word the model generates. A latency disaster for interactive serving.

NVL576 is also where optical interconnects become structural, not optional. At this scale, copper cables run out of bandwidth and overheat over the distances involved. The cabinet topology requires specialized lasers and connectivity silicon, the parts the optical suppliers have spent the last two years building backlog for. The fabric stops being networking. It starts being part of the compute substrate.

Power is the gate. A 72-GPU cabinet today draws around 120 kilowatts. NVL576 will draw multiples of that. A serving deployment with eight to twelve of these systems is a contiguous 5 to 10 megawatt block dedicated to one model. Five to ten megawatts is enough to power a small American town of a few thousand homes. Picture that town's worth of electricity flowing into one model, continuously, just to answer queries. That is why fuel-cell company Bloom Energy keeps appearing in my coverage and why gas-turbine lead times keep showing up in hyperscaler earnings calls. Compute is not the binding constraint at 10T. Watts are.
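
The power arithmetic, with one loud caveat: NVIDIA has not published a final NVL576 power figure, so the per-system draw below is an assumption scaled from GPU count.

```python
# Power envelope for a 10T serving deployment. The NVL72 draw is public;
# the NVL576 per-system figure is an assumption scaled from GPU count,
# not an announced spec.
NVL72_KW = 120
GPU_SCALE = 576 / 72          # 8x the GPU dies per system

# Assume density and optics claw back some of the linear scaling; the
# 0.6-1.0 efficiency band is a guess, not a datasheet number.
for efficiency in (0.6, 1.0):
    system_kw = NVL72_KW * GPU_SCALE * efficiency
    for n_systems in (8, 12):
        mw = system_kw * n_systems / 1000
        print(f"{n_systems:>2} systems at {system_kw:,.0f} kW each -> {mw:4.1f} MW")

# At a rough ~1 kW average per US home, 5-10 MW is a town of 5,000-10,000 homes.
```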

The Software Stack That Makes the Hardware Deliver

Hardware capacity is necessary. It is not sufficient. The reason a 10T model is feasible to serve in 2027 rather than 2030 is that the inference software stack matured dramatically in the last twelve months.

The clearest evidence is a benchmark called MLPerf Inference v6.0, released earlier this year. MLPerf is the AI industry's equivalent of the EPA fuel-economy test: standardized workloads, audited submissions, peer-reviewed results. Same GB300 NVL72 cabinet as the previous round. Same power envelope. Same silicon. But 2.7 times more tokens per second, achieved purely through software. Imagine your existing car suddenly getting 2.7 times the gas mileage from a software update. That 2.7x compounds with every release. Run it through two more cycles and you are pulling roughly another 7x (2.7 × 2.7 ≈ 7.3) out of the same iron. That is what closes the gap between "10T fits in our memory budget" and "10T actually serves users at acceptable latency."
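
The compounding claim is just geometric growth. A three-line check, with the optimistic assumption stated in the comments:

```python
# If the 2.7x per-release software gain holds, throughput on fixed
# hardware compounds geometrically. Holding the gain constant is the
# optimistic assumption; nothing guarantees it.
GAIN_PER_CYCLE = 2.7
for cycles in (1, 2, 3):
    print(f"{cycles} cycle(s): {GAIN_PER_CYCLE ** cycles:5.1f}x tokens/sec")
# 1 cycle(s):   2.7x
# 2 cycle(s):   7.3x  <- the "roughly 7x" in the text
# 3 cycle(s):  19.7x
```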

Four levers do the work.

Disaggregated prefill and decode. Reading a prompt and writing a response are two different jobs with different bottlenecks. Prefill is compute-heavy. Decode is memory-heavy. Putting them on the same GPUs at the same time is the original sin of inference serving, like asking the same chef to chop vegetables and plate dishes during a dinner rush. NVIDIA's Dynamo software physically separates them onto different machines. Utilization on both rises.
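
A toy sketch of the idea, nothing more: this shows the scheduling concept, not Dynamo's actual interface.

```python
# Toy model of disaggregated serving: prefill and decode live on separate
# worker pools so each stage's bottleneck resource stays saturated. This
# is the scheduling concept only, not NVIDIA Dynamo's actual interface.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    kv_cache: object = None                  # handed from prefill to decode
    output: list = field(default_factory=list)

prefill_queue = deque()   # feeds compute-heavy GPUs
decode_queue = deque()    # feeds memory-bandwidth-heavy GPUs

def prefill_step():
    """Compute-bound: read the whole prompt in one parallel pass."""
    req = prefill_queue.popleft()
    req.kv_cache = f"kv({req.prompt})"       # stand-in for real KV tensors
    decode_queue.append(req)                 # hand off across the fabric

def decode_step():
    """Memory-bound: generate one token per step from the KV cache."""
    req = decode_queue.popleft()
    req.output.append("<token>")

prefill_queue.append(Request(prompt="Explain the 10T threshold"))
prefill_step()
decode_step()
```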

Wide Expert Parallel for mixture-of-experts routing. The realistic 10T frontier model is probably not a single dense network where every word activates every parameter. It is a mixture-of-experts architecture: 10 trillion total parameters, but only 200 to 400 billion active for any given word. Think of a hospital with hundreds of specialists. A patient walks in, triage routes them to the relevant doctors, and only those doctors are consulted. Wide Expert Parallel shards the experts across hundreds of GPUs and routes each word to the GPUs holding the relevant experts.
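
A minimal sketch of top-k expert routing, the generic mechanism underneath Wide Expert Parallel. The gate weights and sizes below are toy values, not anyone's production configuration.

```python
import numpy as np

# Minimal top-k mixture-of-experts routing: a learned gate scores every
# expert for each token, and only the top-k experts run. Generic MoE
# routing, not NVIDIA's Wide Expert Parallel implementation.
rng = np.random.default_rng(0)

N_EXPERTS = 256     # total experts, sharded across hundreds of GPUs
TOP_K = 8           # experts consulted per token
D_MODEL = 64        # toy hidden dimension

gate = rng.normal(size=(D_MODEL, N_EXPERTS))   # stand-in for learned weights
token = rng.normal(size=D_MODEL)               # one token's hidden state

scores = token @ gate                          # score every expert
top_experts = np.argsort(scores)[-TOP_K:]      # triage: pick the top-k

print(f"Token routed to experts {sorted(top_experts.tolist())}")
print(f"Active parameters per token: {TOP_K / N_EXPERTS:.1%} of the total")
# 8/256 ~ 3% active -- the same ratio class as 200-400B active out of 10T.
```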

NVFP4 quantization end-to-end. Four-bit weights, four-bit activations, with mathematical recovery loops to keep accuracy on par with 8-bit. NVFP4 is what shrinks 20 terabytes of 16-bit weights down to 5 terabytes of memory residence. The compression target is now part of the architecture specification, not a post-training afterthought.
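
A sketch of block-scaled low-bit quantization, the general family NVFP4 belongs to. The block size and integer grid below are illustrative, not the NVFP4 spec, and the recovery loops that preserve accuracy are omitted.

```python
import numpy as np

# Block-scaled 4-bit quantization: store one shared scale per small block
# of weights plus a 4-bit value per weight. NVFP4 is a specific format in
# this family; the block size and integer grid here are illustrative.
BLOCK = 16

def quantize_4bit(w: np.ndarray):
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7   # map blocks to [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

weights = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_4bit(weights)
recovered = dequantize(q, scale)

err = np.abs(weights - recovered).mean() / np.abs(weights).mean()
print(f"Mean relative error: {err:.1%}")  # small, at 4x less memory than FP16
```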

KV-aware routing and speculative decoding. KV-aware routing sends incoming requests to GPUs that already hold relevant cached context, like routing a returning customer to the waiter who remembers their order. Microsoft Azure measured 20x faster time-to-first-response on this lever alone. Speculative decoding uses a small "draft" model to guess the next several words, which a large model verifies in parallel, like an autocomplete that gets corrected only when wrong. At 10T scale, the "small" draft model is itself a 70-billion-parameter model.
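
A toy speculative-decoding loop showing the draft-and-verify mechanics. Real systems accept or reject drafted tokens probabilistically; the exact-match check here is a simplification.

```python
# Toy speculative decoding: a cheap draft model proposes several tokens,
# the big model verifies them in one parallel pass, and generation only
# falls back to one-token-at-a-time when a guess is wrong.

def draft_model(prefix: list[str], k: int) -> list[str]:
    # Stand-in for the ~70B draft model: guess the next k tokens.
    canned = ["the", "model", "serves", "users", "at", "scale"]
    return canned[len(prefix):len(prefix) + k]

def target_model(prefix: list[str], n: int) -> list[str]:
    # Stand-in for the 10T target model: the ground-truth continuation.
    truth = ["the", "model", "serves", "tokens", "at", "scale"]
    return truth[len(prefix):len(prefix) + n]

output: list[str] = []
K = 4  # tokens drafted per round

while len(output) < 6:
    guesses = draft_model(output, K)
    truth = target_model(output, len(guesses))
    accepted = 0
    for g, t in zip(guesses, truth):
        if g != t:
            break
        accepted += 1
    # Accept the matching prefix, plus one corrected token from the target.
    output.extend(truth[:accepted + 1])
    print(f"accepted {accepted}/{len(guesses)} drafted tokens -> {output}")
```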

Stack the four. That is how you turn a 5 to 10 megawatt hardware deployment into a serving system that can actually answer queries at latencies users tolerate.

The Bear Case

Three things could weaken this read.

First, the frontier may not actually scale to 10 trillion dense parameters. If the labs converge on aggressive mixture-of-experts (10T total with 100 to 200 billion active), per-token serving cost stays close to current pricing and the "threshold" is a marketing event rather than an infrastructure one. The optical and memory suppliers still benefit, but the slope is shallower. Watch the active parameter count, not the total.

Second, software gains may compress the hardware bill faster than I am modeling. If the 2.7x cadence holds for two more cycles, a 10T model might serve on a single 72-GPU cabinet by late 2027 rather than requiring NVL576. Bullish for NVIDIA's installed base, neutral-to-negative for the optical scale-up names that need NVL576 volume.

Third, the framing assumes serving capacity is the binding constraint. If demand for 10T-class capability is narrower than the hyperscaler buildout assumes, and agentic AI productivity gains disappoint, tokens get cheaper but the serving fleet sits idle. This is the bear case the "AI bubble" framers reach for. Not absurd, but inconsistent with the $30 billion Anthropic run-rate and the multi-gigawatt TPU contracts already on paper.

None of these change the direction. They change the slope.

So What

The 10T threshold reframes the AI infrastructure map in three ways.

It re-anchors the memory thesis. High-bandwidth memory at 1 terabyte per GPU is the gate that decides whether the next model class ships in 2027 or 2029. The HBM oligopoly (Micron, SK Hynix, Samsung) is a direct beneficiary of every 10T deployment booked, with unit economics that scale linearly with parameter count.

It re-anchors the optical thesis. NVL576 is where copper dies and optical takes over the cabinet. The co-packaged optics supply chain captures the difference between today's single-cabinet regime and the 10T-threshold regime. That is why optical specialist Lumentum's "sold out through 2028" comment on a recent earnings call should be read as a structural signal, not a quarter-end disclosure.

And it re-anchors the power thesis. A 10T deployment is a 5 to 10 megawatt continuous block. The hyperscaler capex guides that the sell-side keeps calling a bubble are not buying chips. They are buying interconnect, packaging, and megawatts. Bloom Energy's backlog, Oracle's multi-billion-dollar power agreements, the multi-gigawatt TPU contracts: 10T-threshold infrastructure being procured before the model that requires it has a public name.

The first 10T preview will land with a price sheet. When it does, look at the dollars-per-million-tokens number and ask what it would take, physically, to produce those tokens at that price. The answer is the bill of materials I just walked you through. The price sheet is the bill of materials confessing itself in dollars. That's the trade.

About the Author

Ben Pouladian is CEO of BEP Holdings and publisher of BEP Research, an institutional-grade publication covering AI infrastructure, semiconductors, and the supply chain underneath the AI buildout. EE, UC San Diego. NVIDIA investor since 2016. The full memory-and-interconnect series this post draws on lives on the BEP Research Substack.

Disclosure: The author holds positions in NVDA, LITE, CRDO, ALAB, LSCC, TSEM, BE, and ORCL (2027 LEAPS). This is not investment advice. Do your own work.
