The numbers behind AI inference are no longer abstract. Token usage already runs in the quadrillions per month, and conservative projections put that figure at roughly 10¹⁸ tokens per month by 2030. Even under best-case efficiency assumptions, that kind of demand points to hundreds of gigawatts of infrastructure. The demand signal is clear. What remains unsettled is design.
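To see where the "hundreds of gigawatts" figure comes from, a back-of-the-envelope conversion is enough. The token volume below is the projection cited above; the energy per token is an assumed, illustrative value rather than a number from the keynote.

```python
# Back-of-the-envelope: monthly token volume -> continuous power draw.
# Tokens per month follows the projection above; energy per token is an
# assumed, illustrative value (real systems vary widely).

TOKENS_PER_MONTH = 1e18            # projected global inference volume, ~2030
JOULES_PER_TOKEN = 0.5             # assumed end-to-end energy per token
SECONDS_PER_MONTH = 30 * 24 * 3600

tokens_per_second = TOKENS_PER_MONTH / SECONDS_PER_MONTH
power_watts = tokens_per_second * JOULES_PER_TOKEN

print(f"{tokens_per_second:.2e} tokens/s -> {power_watts / 1e9:.0f} GW of continuous draw")
# ~3.86e+11 tokens/s -> ~193 GW, i.e. hundreds of gigawatts once overheads are included
```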
At a recent industry event built around the theme "AI Everywhere," Raja Koduri delivered a keynote titled "Chiplet Quilting for the Age of Inference." His argument was precise: architecture decisions made today will shape performance, cost, and energy use for years to come, and traditional design assumptions are no longer keeping pace with where inference demand is heading.
Raja Koduri Grounds the Conversation in First Principles
Before getting into architecture specifics, Raja Koduri pulled the discussion back to fundamentals. These are the variables that determine whether a silicon design actually works at scale, and they do not change regardless of how quickly the market shifts:
- Performance per dollar
- Performance per watt
- Flexibility across future workloads
- Packaging cost
- Energy to compute
- Energy to move data
- Energy to access memory
Physics defines the limits. Economics determines what scales. Compute operations now cost on the order of femtojoules per operation. Data movement costs far more. Off-chip memory access dominates the energy budget. This is not a software problem. Distance matters. Memory placement matters. Packaging matters. And the gap between getting these things right and getting them wrong is widening as inference workloads grow.
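To put rough numbers on that gap, the sketch below compares the energy of one low-precision multiply-accumulate against the energy of fetching its operands from nearby SRAM and from off-package DRAM. The femtojoule and picojoule values are illustrative assumptions in commonly cited ranges, not figures from the keynote.

```python
# Illustrative energy budget for one INT8 multiply-accumulate (MAC).
# All values are assumed, order-of-magnitude numbers, not keynote figures.

MAC_ENERGY_J = 50e-15            # ~50 fJ per low-precision MAC on an advanced node
ONCHIP_SRAM_J_PER_BIT = 0.1e-12  # ~0.1 pJ/bit from nearby SRAM
OFFCHIP_DRAM_J_PER_BIT = 5e-12   # ~5 pJ/bit from off-package DRAM
OPERAND_BITS = 16                # two 8-bit operands fetched per MAC

compute = MAC_ENERGY_J
from_sram = OPERAND_BITS * ONCHIP_SRAM_J_PER_BIT
from_dram = OPERAND_BITS * OFFCHIP_DRAM_J_PER_BIT

print(f"compute:        {compute * 1e12:.3f} pJ")
print(f"operands, SRAM: {from_sram * 1e12:.3f} pJ  ({from_sram / compute:.0f}x compute)")
print(f"operands, DRAM: {from_dram * 1e12:.3f} pJ  ({from_dram / compute:.0f}x compute)")
# Under these assumptions, pulling operands from off-package memory costs ~1600x
# the arithmetic itself, which is why placement and reuse dominate the budget.
```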
Why Post-Dennard Scaling Changes the Calculus
For decades, chipmakers could rely on transistor scaling to deliver consistent gains in power efficiency and performance. That era is effectively over. Advanced nodes now cost more per square millimeter, and power efficiency gains have flattened considerably. Not every function belongs on the most advanced process, which means architects can no longer default to a monolithic approach and expect it to hold up.
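A standard yield model shows part of why. The sketch below uses a Poisson yield approximation with an assumed defect density, die sizes, and wafer cost; it deliberately ignores the packaging and test costs that chiplet designs add back, which is the other side of the trade.

```python
import math

# Poisson yield model: Y = exp(-die_area_cm2 * defect_density).
# Defect density, die sizes, and wafer cost are illustrative assumptions.

DEFECT_DENSITY = 0.1      # defects per cm^2 on an advanced node (assumed)
WAFER_COST = 20_000.0     # dollars per wafer (assumed)
WAFER_AREA_MM2 = 70_000   # ~300 mm wafer, ignoring edge losses

def cost_per_good_die(die_area_mm2: float) -> float:
    """Rough cost of one known-good die, ignoring scribe lines and edge loss."""
    yield_fraction = math.exp(-(die_area_mm2 / 100.0) * DEFECT_DENSITY)
    dies_per_wafer = WAFER_AREA_MM2 / die_area_mm2
    return WAFER_COST / (dies_per_wafer * yield_fraction)

monolithic = cost_per_good_die(800)       # one large die
quilted = 4 * cost_per_good_die(200)      # same silicon area as four chiplets

print(f"monolithic 800 mm^2:   ${monolithic:,.0f}")
print(f"4 x 200 mm^2 chiplets: ${quilted:,.0f} (before packaging cost)")
# The chiplet option buys back yield; advanced packaging then has to cost less
# than that difference for disaggregation to win, which is why packaging cost
# sits on the first-principles list above.
```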
This is the moment where chiplets stop being optional.
What Chiplet Quilting Actually Means
Chiplet architectures are not a new idea, but Raja Koduri presented chiplet quilting as a meaningful extension of the concept. Rather than treating a system as a fixed layout, chiplet quilting approaches it as a configurable fabric. Compute, memory, and interconnect elements become modular components that architects can tune based on workload-specific requirements for cost, power, bandwidth, and latency.
The benefit, as framed in the keynote, is flexibility without sacrificing rigor. That distinction matters. Flexibility in silicon design can easily become an excuse for imprecision. Chiplet quilting, done well, is the opposite of that.
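One way to read "configurable fabric" is that the system description itself becomes data an architect can edit and re-evaluate. The sketch below is a hypothetical representation of a quilt as modular compute, memory, and interconnect components; the class and field names are invented for illustration and do not come from the keynote or any specific tool.

```python
from dataclasses import dataclass

# Hypothetical description of a "quilt" as modular components.
# Class names, field names, and units are assumptions for illustration only.

@dataclass
class ComputeTile:
    process_node: str
    tops_int8: float          # peak throughput, trillions of ops/s
    power_w: float
    cost_usd: float

@dataclass
class MemoryStack:
    capacity_gb: float
    bandwidth_gbps: float     # GB/s to the attached compute
    pj_per_bit: float         # access energy
    cost_usd: float

@dataclass
class Link:
    bandwidth_gbps: float
    pj_per_bit: float
    latency_ns: float

@dataclass
class Quilt:
    compute: list[ComputeTile]
    memory: list[MemoryStack]
    links: list[Link]

    def total_cost(self) -> float:
        return sum(t.cost_usd for t in self.compute) + sum(m.cost_usd for m in self.memory)
```

Swapping a memory stack, widening a link, or retargeting a tile to a cheaper node then becomes a change to data rather than a redesign, which is what makes workload-specific tuning tractable.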
To illustrate the point, Raja Koduri walked through a comparison between a reference system based on the NVIDIA DGX B200 flagship GPU platform and a hypothetical quilted system built around tighter memory coupling and higher-bandwidth interconnects. The results were not incremental. The comparison showed:
- Order-of-magnitude gains in throughput
- Order-of-magnitude reductions in energy per token
The presentation was careful to frame this as architectural leverage, not a product claim. When memory sits closer to compute and interconnect bottlenecks shrink, inference economics shift. That is the core of the argument.
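The keynote did not attach its full parameter set to that claim, but the direction is easy to reproduce with a toy model: when token generation is bandwidth-bound, throughput scales with aggregate memory bandwidth and memory energy per token scales with picojoules per bit. Every number below is an illustrative assumption, not the keynote's.

```python
# Toy bandwidth-bound decode model; all parameters are illustrative assumptions,
# not figures from the keynote or from any shipping product.

bytes_per_token = 70e9 / 8                 # ~70 GB of 8-bit weights, amortized over a batch of 8

for label, tb_s, pj_bit in [("baseline, off-package memory ", 8.0, 6.0),
                            ("quilted, tightly coupled mem.", 64.0, 0.6)]:
    tok_s = tb_s * 1e12 / bytes_per_token          # tokens/s at that bandwidth
    mj_tok = bytes_per_token * 8 * pj_bit * 1e-9   # memory energy per token, mJ
    print(f"{label}: {tok_s:,.0f} tok/s, {mj_tok:.0f} mJ/token")
# ~8x the throughput and ~10x less memory energy per token, purely from moving
# memory closer and widening the interfaces; that is the shape of the shift
# the keynote described.
```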
Turning Architectural Theory Into Working Systems
One of the more practical threads running through the keynote was the question of tooling. Oxmiq Labs, where Raja Koduri is involved, builds modeling tools that allow teams to evaluate chiplet-based system configurations using real parameters rather than theoretical estimates.
The inputs those tools work with include:
- Die size and geometry
- Memory bandwidth and interconnect bandwidth
- Power consumption
- Cost inputs
- Picojoules per bit and picojoules per operation
- Inference workload profiles
From Parameters to Decisions
The value of that kind of tooling is not just computational. It changes how design decisions get made. When architects can model the tradeoffs between memory placement, interconnect choices, and process node selection against actual inference workload profiles, the decisions become concrete rather than speculative. That is a different kind of design process than the one most teams are used to.
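As a rough illustration of what such an evaluation might look like, the sketch below combines the categories of inputs listed earlier into a roofline-style estimate of throughput, energy per token, and the per-watt and per-dollar figures of merit from the first-principles list. The function, its parameters, and every numeric input are hypothetical; this is not the Oxmiq Labs interface.

```python
# Hypothetical first-order model in the spirit of the tooling described above.
# Function name, parameters, and all numeric inputs are assumptions.

def evaluate_config(tops: float,             # peak compute, trillions of ops/s
                    mem_bw_tb_s: float,      # aggregate memory bandwidth, TB/s
                    power_w: float,          # system power
                    cost_usd: float,         # system cost
                    pj_per_bit: float,       # energy to move/access one bit
                    pj_per_op: float,        # energy per arithmetic operation
                    ops_per_token: float,    # workload profile: ops per token
                    bytes_per_token: float) -> dict:
    """Roofline-style estimate: throughput is the lesser of the compute-bound
    and bandwidth-bound rates; energy sums compute and data-movement terms."""
    compute_bound = tops * 1e12 / ops_per_token
    bandwidth_bound = mem_bw_tb_s * 1e12 / bytes_per_token
    tok_s = min(compute_bound, bandwidth_bound)
    j_per_token = (ops_per_token * pj_per_op + bytes_per_token * 8 * pj_per_bit) * 1e-12
    return {
        "tokens_per_s": tok_s,
        "joules_per_token": j_per_token,
        "tokens_per_s_per_watt": tok_s / power_w,
        "tokens_per_s_per_dollar": tok_s / cost_usd,
        "bound_by": "compute" if compute_bound < bandwidth_bound else "bandwidth",
    }

# Example: sweep memory placement (bandwidth and pJ/bit) against a fixed workload.
for bw, pj in [(8.0, 6.0), (32.0, 1.5), (64.0, 0.6)]:
    r = evaluate_config(tops=4000, mem_bw_tb_s=bw, power_w=1000, cost_usd=40_000,
                        pj_per_bit=pj, pj_per_op=0.05,
                        ops_per_token=1.4e11, bytes_per_token=8.75e9)
    print(f"bw={bw:>4} TB/s: {r['tokens_per_s']:,.0f} tok/s, "
          f"{r['joules_per_token'] * 1000:.0f} mJ/token, bound by {r['bound_by']}")
```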
The reference system comparison Raja Koduri presented during the keynote was an example of this in practice. It was not a theoretical exercise. It used real parameters to show what happens to throughput and energy efficiency when the architectural assumptions shift.
Three Takeaways That Frame the Moment
Raja Koduri closed the keynote with three ideas that cut through the complexity of everything discussed:
- The age of inference has arrived.
- Details at the physics level decide winners.
- Chiplets are fun and offer a practical path forward.
The third point is easy to overlook, but it reflects something real about where silicon architecture is right now. Chiplet-based design opens up options that monolithic approaches simply cannot offer. For teams working on inference systems, that is not a minor footnote. It is the central design challenge of the next several years.
The first principles Raja Koduri outlined at the start of his talk connect directly to that conclusion. Performance per watt, energy per token, memory placement, interconnect bandwidth — these are not abstract metrics. They are the inputs that will determine which inference systems scale and which ones do not.
About Raja Koduri
Raja Koduri is a silicon architect and technologist whose work focuses on the intersection of chip design, AI inference, and system-level performance. He delivered the keynote "Chiplet Quilting for the Age of Inference" at an industry event centered on AI infrastructure, where he presented architectural analysis and modeling frameworks developed in conjunction with Oxmiq Labs.

