Federated GPU Computing: Better Hardware Utilization Through Mobile Continuations
The usual story about GPU scarcity is simple: there is too much demand and not enough hardware. That story is true, but incomplete. A second problem hides underneath it: we waste expensive accelerators because our software architecture pins live computation to a single runtime instance. When memory pressure rises, when a batch gets fragmented, when a node is overloaded, or when private data cannot leave a boundary, the standard response is surprisingly crude: retry, re-batch, evict, or fail.
This is not a law of physics. It is a consequence of how current ML systems represent computation. Most stacks treat the running job as an opaque process with hidden mutable state: device buffers, KV cache, optimizer moments, communication handles, scheduler queues. Those choices are fast on the happy path, but brittle under pressure. A computation that cannot be suspended and moved is a computation that must be thrown away when its local environment becomes inconvenient.
Datom.World points toward a different answer. If code, data, and runtime state are all stream-shaped, then a live computation can be represented as a first-class continuation. That continuation can be suspended without semantic loss. It can be persisted. It can be migrated. It can resume elsewhere. Under this model, the scarce resource is not "the process" but the GPU cycles themselves. The computation becomes mobile; the hardware becomes fungible.
The waste hidden inside today's GPU stacks
Modern inference engines and training systems are impressive, but their performance model is local. They assume that the most important thing is to keep tensors resident, fuse kernels aggressively, and avoid host overhead. All of that is sensible. The trouble begins when local assumptions fail.
- Inference waste: a live session is tied to one process and one KV cache layout. If that node becomes overloaded or memory-constrained, the work in progress is difficult to relocate without rebuilding state.
- Training waste: parameter shards, gradients, and optimizer state are distributed, but the job itself is still tightly coupled to the runtime topology chosen at launch.
- Operational waste: long-running jobs are fragile because their most valuable state lives in opaque mutable memory rather than in explicit portable structures.
- Privacy waste: when data locality matters, we often move data to computation because moving computation itself is not a first-class primitive.
The result is lower effective utilization than the raw hardware numbers suggest. A cluster may show high aggregate usage while still wasting valuable work through restarts, over-provisioning, session pinning, and stranded memory.
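One way to make the gap between "aggregate usage" and useful work concrete is to subtract discarded work from busy time. The following is a minimal sketch; the function name, its parameters, and the numbers in the usage note are illustrative assumptions, not measurements or a metric defined by Datom.World.

```python
def effective_utilization(busy_hours: float,
                          discarded_hours: float,
                          allocated_hours: float) -> float:
    """Fraction of allocated accelerator time that produced work we kept.

    busy_hours:      time the GPU was executing kernels
    discarded_hours: busy time whose results were later thrown away
                     (restarts, evicted sessions, recomputed prefixes)
    allocated_hours: time the GPU was reserved for the workload
    """
    if allocated_hours <= 0:
        raise ValueError("allocated_hours must be positive")
    if not 0 <= discarded_hours <= busy_hours <= allocated_hours:
        raise ValueError("expected 0 <= discarded <= busy <= allocated")
    return (busy_hours - discarded_hours) / allocated_hours


def raw_utilization(busy_hours: float, allocated_hours: float) -> float:
    """The number a dashboard usually reports: busy time over allocated time."""
    return busy_hours / allocated_hours
```

With made-up inputs of 90 busy hours, 18 discarded hours, and 100 allocated hours, the raw figure is 0.90 while the effective figure is 0.72. The difference is the work the cluster performed and then lost.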
The architectural pivot: preserve live computation
The central trick is simple to state and surprisingly rare in practice: do not throw away a live computation just because local resources became insufficient. Suspend it instead.
In Yin.VM, a continuation is a first-class value. It captures where execution is, what environment it has closed over, and what remains to be done. Because continuations are serializable and mobile, suspension is not an exception path. It is part of the normal execution model.
That changes the economics of GPU scheduling. Instead of treating a running decode loop or training step as an inseparable process, we can treat it as a suspended computation plus explicit references to the streams it depends on. If the local machine has enough GPU memory and bandwidth, the continuation resumes there. If not, it migrates to a node that does.
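The idea of a continuation as a serializable value with a placement decision can be sketched in a few lines. This is a toy model under stated assumptions: the `Continuation` fields, the JSON wire format, and the `place` policy are all hypothetical and do not describe Yin.VM's actual representation.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Continuation:
    """Hypothetical stand-in for a parked computation."""
    program: str        # identity of the code being executed
    resume_at: int      # where execution continues
    env: dict           # closed-over values (JSON-serializable in this sketch)
    stream_refs: tuple  # names of the streams the computation depends on


def suspend(k: Continuation) -> bytes:
    """Turn the continuation into bytes that can be persisted or sent."""
    return json.dumps(asdict(k), sort_keys=True).encode("utf-8")


def resume(blob: bytes) -> Continuation:
    """Rebuild the continuation from its serialized form."""
    fields = json.loads(blob.decode("utf-8"))
    fields["stream_refs"] = tuple(fields["stream_refs"])  # JSON has no tuples
    return Continuation(**fields)


def place(required_bytes: int, free_bytes_by_node: dict,
          local_node: str) -> Optional[str]:
    """Prefer the local node; otherwise pick the fitting node with most headroom."""
    if free_bytes_by_node.get(local_node, 0) >= required_bytes:
        return local_node
    fits = [n for n, free in free_bytes_by_node.items() if free >= required_bytes]
    return max(fits, key=lambda n: free_bytes_by_node[n]) if fits else None
```

The point of the sketch is that `suspend` followed by `resume` is a lossless round trip, so the scheduler is free to call `place` between the two and resume on whichever node it returns.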
DaoStream is not the bottleneck
A common objection appears immediately: streams sound slower than direct process-local state. That objection only holds if "stream" is mistaken for "remote durable queue." DaoStream is an abstraction, not a single transport.
A DaoStream can be:
- a local ringbuffer for immediate throughput
- a shared-memory queue between scheduler and worker
- a bounded persistent stream for checkpointing
- a network transport for migration or federation
The semantics stay uniform while the physics changes. On the hot path, a GPU dispatch stream can be a local ringbuffer with minimal overhead. On the slow path, the same stream protocol can support persistence, replay, migration, or cross-node transfer. This is the advantage of using streams as the contract and implementations as projections.
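"Streams as the contract, implementations as projections" can be illustrated with one interface and two backends. This is a sketch only: the `Stream` protocol, `RingBufferStream`, `LogStream`, and `pump` are hypothetical names chosen for illustration, not DaoStream's API.

```python
from collections import deque
from typing import Any, Protocol


class Stream(Protocol):
    """The contract: every transport offers the same two operations."""
    def put(self, item: Any) -> None: ...
    def take(self) -> Any: ...


class RingBufferStream:
    """Hot path: bounded, in-process, nothing retained after take()."""
    def __init__(self, capacity: int) -> None:
        self._buf: deque = deque()
        self._capacity = capacity

    def put(self, item: Any) -> None:
        if len(self._buf) >= self._capacity:
            raise BufferError("ring buffer full")  # caller applies backpressure
        self._buf.append(item)

    def take(self) -> Any:
        return self._buf.popleft()


class LogStream:
    """Slow path: keeps every item so a consumer can replay from an offset."""
    def __init__(self) -> None:
        self._log: list = []
        self._cursor = 0

    def put(self, item: Any) -> None:
        self._log.append(item)

    def take(self) -> Any:
        item = self._log[self._cursor]
        self._cursor += 1
        return item

    def replay(self, start: int = 0) -> list:
        return list(self._log[start:])


def pump(src: Stream, dst: Stream, n: int) -> None:
    """Code written against the contract does not care which backend it gets."""
    for _ in range(n):
        dst.put(src.take())
```

`pump` moves items from a ring buffer into a replayable log without knowing either type. A network-backed implementation would slot in the same way, which is the property the paragraph above relies on.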
Offloading math without collapsing the VM into the GPU runtime
Yin.VM is not supposed to become a tensor-core simulator. Its role is to orchestrate computation, not to implement fast dense math in the interpreter itself. The right split is structural:
- Yin.VM owns: control flow, suspension, resumption, migration, provenance, policy, and composition.
- GPU runtimes own: matrix multiplication, attention kernels, quantized packing, memory layout, and device scheduling.
A Yin library for fast math therefore does not perform the math directly. It constructs explicit effects such as a GPU dispatch request. The continuation parks. A local or remote GPU interpreter consumes the request, materializes the relevant stream segments into the buffer layout it prefers, executes the kernels, and emits the result. Then the parked continuation resumes.
This is a crucial distinction. The library is not parked. The continuation is parked. The library merely emits the effect that makes the suspension point explicit.
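The split between "the library emits an effect" and "the continuation parks" can be shown with Python generators, which park at each `yield`. This is an analogy under stated assumptions: `GpuDispatch`, `matmul`, and `run` are hypothetical names, the handler here runs on the CPU, and Python generators cannot be serialized, so the sketch shows parking and resumption but not migration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuDispatch:
    """An explicit effect: a request for work, not the work itself."""
    op: str
    args: tuple


def matmul(a, b):
    """Library function. It performs no math; it only emits the effect."""
    result = yield GpuDispatch("matmul", (a, b))
    return result


def program():
    """User code. It parks at every effect emitted by the library."""
    x = [[1, 2], [3, 4]]
    identity = [[1, 0], [0, 1]]
    y = yield from matmul(x, identity)
    z = yield from matmul(y, x)
    return z


def cpu_matmul(a, b):
    """Stand-in for a GPU interpreter that consumes dispatch requests."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def run(gen, handlers):
    """Drive the program: handle each effect, then resume the parked code."""
    try:
        effect = next(gen)
        while True:
            result = handlers[effect.op](*effect.args)
            effect = gen.send(result)
    except StopIteration as done:
        return done.value
```

Calling `run(program(), {"matmul": cpu_matmul})` executes both dispatches and returns the final product. Swapping the handler table for one backed by a local or remote accelerator would not require touching `matmul` or `program`.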
Generalized datoms make tensors stream-native
Older discussions of datoms often start from the five-tuple. Datom.World has since relaxed that definition: a datom can be any N-dimensional tuple, and tuples can be bitpacked into integers. That matters enormously for machine learning.
Under this generalized model, tensors do not need to be treated as opaque foreign blobs with metadata on the side. They can be represented as streams of packed tuples. The same is true for KV cache, gradients, optimizer state, and parameter shards.
| Artifact | Canonical Representation | Execution Projection |
|---|---|---|
| Model weights | Bounded stream of bitpacked tensor-block tuples | Device buffers optimized for kernel layout |
| KV cache | Append-only stream of packed cache-page tuples | Paged attention cache in GPU memory |
| Gradients | Stream of packed update blocks with provenance | All-reduce or local accumulation buffers |
| Continuation state | Suspended computation plus refs to bounded stream regions | Resumed runtime frames and active buffers |
The key is to choose the right granularity. Scalar-level tuples are usually too fine and will drown the system in interpretation overhead. The practical unit is the block: tile, page, shard, quantization group, head block, cache segment. Streams remain explicit, but they carry structures large enough to preserve throughput.
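Block-level granularity and bitpacked tuples can be combined in a small sketch: each tile of a tensor gets a tuple key packed into a single integer. The field names and bit widths below are assumptions made for illustration; the source does not specify a layout.

```python
# Hypothetical 64-bit layout: (name, width in bits), most significant first.
FIELDS = (("tensor_id", 16), ("block_row", 20), ("block_col", 20), ("dtype", 8))


def pack(**values: int) -> int:
    """Pack one block-level tuple into a single integer."""
    word = 0
    for name, bits in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{name}={v} does not fit in {bits} bits")
        word = (word << bits) | v
    return word


def unpack(word: int) -> dict:
    """Recover the tuple fields from a packed integer."""
    out = {}
    for name, bits in reversed(FIELDS):
        out[name] = word & ((1 << bits) - 1)
        word >>= bits
    return out


def block_keys(tensor_id: int, rows: int, cols: int, tile: int, dtype: int = 0):
    """Yield one packed key per tile, rather than one per scalar."""
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            yield pack(tensor_id=tensor_id, block_row=r // tile,
                       block_col=c // tile, dtype=dtype)
```

A 256 by 384 matrix with 128 by 128 tiles produces 6 keys, compared with 98,304 if every scalar were its own tuple. The payload bytes for each tile would travel alongside the key; only the addressing is shown here.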
The projection is the cache: handling brittle hardware
A common question arises: if GPU kernels are opaque and brittle, how can we suspend them? Most accelerators do not support pausing a running dispatch and serializing its registers. If memory pressure becomes too high, the hardware state is often simply discarded.
The answer is structural. We do not attempt to suspend the hardware state. We suspend the orchestrator at the boundary of a discrete effect. When a Yin.VM program requests a matrix multiplication or a transformer block, it emits an effect datom and its continuation parks. The GPU consumes the effect, executes a discrete, non-suspendable task, and emits the result. The continuation then resumes.
Under this model, the physical memory of the GPU is not the primary home for state. It is an execution projection (a high-speed cache) of the canonical tuple streams. If a node must drop its local memory state due to pressure, it is not a catastrophic loss of work. It is merely a cache miss at the projection layer. The parked continuation still exists. It carries the identity of the required streams. It can migrate to a node with available capacity, re-materialize the necessary projection from the stream, and resume as if nothing happened.
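The claim that losing device memory is "merely a cache miss" can be modeled directly. In this sketch a plain dictionary stands in for the canonical stream and a `Projection` object stands in for one node's device memory; both, along with `resume_on`, are hypothetical simplifications.

```python
class Projection:
    """Stand-in for one node's device memory: a droppable cache of blocks."""

    def __init__(self, canonical: dict) -> None:
        self.canonical = canonical  # block id -> bytes; the source of truth
        self.resident: dict = {}    # what is currently materialized locally
        self.misses = 0

    def get(self, block_id: str) -> bytes:
        if block_id not in self.resident:
            # Cache miss: re-materialize from the canonical stream.
            self.misses += 1
            self.resident[block_id] = self.canonical[block_id]
        return self.resident[block_id]

    def drop(self) -> None:
        """Memory pressure: discard local state. Nothing canonical is lost."""
        self.resident.clear()


def resume_on(projection: Projection, block_ids: list) -> int:
    """A parked computation carries block ids, not buffers. Here it just
    touches every block it needs and reports the total bytes it read."""
    return sum(len(projection.get(b)) for b in block_ids)
```

If node A drops its resident blocks, the same list of block ids can be handed to a fresh `Projection` on node B, which pays for the misses and produces the same result. The cost of eviction becomes re-materialization time rather than lost work.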
Why this improves inference utilization
Autoregressive inference in particular is a natural fit for suspendable mobile continuations. A live session is mostly a decode position, sampling state, a continuation, model references, and KV cache references. None of these inherently needs to be trapped inside one process.
With packed stream-native cache pages and mobile continuations, a session can:
- Run locally while resources are available
- Suspend when the local node becomes constrained
- Migrate with its continuation and explicit cache references
- Resume on another GPU node without restarting from the prompt
This is much stronger than simply routing a new request to another machine. The work-in-progress itself moves. Already-computed causal structure is preserved. The system is no longer forced to choose between locality and continuity.
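The property that matters is that a migrated session produces the same output as an uninterrupted one. The toy decode loop below checks this under strong assumptions: the "model" is an arithmetic stand-in, the sampling state is a single integer, and the KV cache is represented only by page names. None of this reflects a real inference engine.

```python
import json

VOCAB = 50  # toy vocabulary size


def step(session: dict) -> dict:
    """One toy decode step, written as a pure function of explicit state."""
    seed = (session["seed"] * 1103515245 + 12345) % (2 ** 31)  # sampling state
    token = (sum(session["tokens"]) + seed) % VOCAB            # stand-in model
    position = len(session["tokens"])
    page = f"kv/{session['id']}/{position}"                    # cache page ref
    return {
        **session,
        "seed": seed,
        "tokens": session["tokens"] + [token],
        "kv_pages": session["kv_pages"] + [page],
    }


def decode(session: dict, n: int) -> dict:
    """Run n decode steps from wherever the session currently is."""
    for _ in range(n):
        session = step(session)
    return session


start = {
    "id": "s1",
    "seed": 7,
    "tokens": [3, 1, 4],  # the prompt
    "kv_pages": ["kv/s1/0", "kv/s1/1", "kv/s1/2"],
}

# Node A runs five steps, then the session is serialized and shipped.
wire = json.dumps(decode(start, 5))
# Node B resumes from the wire format and runs five more.
migrated = decode(json.loads(wire), 5)
# Reference: ten steps with no interruption.
uninterrupted = decode(start, 10)
```

Because every piece of session state is explicit, `migrated` and `uninterrupted` are identical, and node B never re-reads the prompt. In a real system the page names would resolve to cache blocks that are copied or re-materialized on the destination.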
Why this also improves training utilization
Training is harder than inference because communication overhead is harsher and the tolerance for abstraction tax is lower. But the architectural benefits are also larger. Training already distributes parameter shards and gradients; what it rarely distributes cleanly is live computation as a portable value.
Under a stream-native continuation model, a training step can emit explicit gradient or update streams, and aggregation becomes an interpreter over those streams rather than a pile of hidden mutable side channels. That gives us:
- federated provenance, because updates remain attributable
- federated policy, because different aggregation rules are just different interpreters
- fault tolerance, because suspended work can be resumed rather than replayed from coarse checkpoints
- better locality, because computation can move toward private data instead of forcing data to centralize
This does not eliminate the need for fast all-reduce, fused optimizers, or smart memory planning. It changes the layer at which those optimizations live. They become projections of the stream model rather than the canonical model itself.
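"Aggregation as an interpreter over update streams" can be sketched as a function that takes a stream of attributed updates plus a policy. The `Update` record, the two policies, and the output shape are hypothetical; a production system would operate on packed blocks and use optimized reductions.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Update:
    """One gradient or update block, with its provenance attached."""
    source: str    # who produced it
    step: int      # which training step it belongs to
    block: str     # which parameter block it applies to
    values: tuple  # the update payload (tiny here, for illustration)


def mean_policy(columns):
    """Plain averaging across contributors."""
    return tuple(sum(c) / len(c) for c in columns)


def median_policy(columns):
    """Coordinate-wise median, which is less sensitive to outliers."""
    return tuple(median(c) for c in columns)


def aggregate(stream, policy):
    """Interpret an update stream under a policy, keeping attribution."""
    by_block = {}
    for u in stream:
        by_block.setdefault(u.block, []).append(u)
    out = {}
    for block, updates in by_block.items():
        columns = list(zip(*(u.values for u in updates)))
        out[block] = {
            "values": policy(columns),
            "sources": tuple(sorted(u.source for u in updates)),
        }
    return out
```

The same stream can be run through `mean_policy` or `median_policy` without changing how updates are produced, and each aggregated block records which sources contributed to it. That is the sense in which policy and provenance become properties of the interpreter rather than of the training loop.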
Federation becomes ordinary systems engineering
The most interesting consequence is conceptual. In mainstream ML infrastructure, federated inference and federated training often feel like special products bolted onto a fundamentally centralized stack. In Datom.World they become ordinary consequences of the substrate.
If computations are mobile continuations and state is carried by bounded tuple streams, then federation is no longer exotic. It is simply the ability to place interpreters and storage across trust boundaries while preserving causality. The same primitives that support local ringbuffer dispatch also support migration across a cluster, across an enterprise boundary, or onto an edge device.
Convergent evolution: how the industry is catching up
This is not a purely theoretical vision. The high-performance inference community is already converging on these patterns, though they often describe them as specialized optimizations rather than a unified substrate.
- vLLM and Llumnix: By breaking the KV cache into discrete blocks (PagedAttention), vLLM created a natural unit for migration. Llumnix extends this by implementing iterative "context switching" for requests, copying cache blocks in the background so a session can move between servers to balance load. This is consistent with the block-page-as-tuple model.
- Petals: In decentralized inference, Petals moves the "continuation" to the client. The client orchestrates a swarm of GPUs, treating each one as a temporary interpreter for a specific layer. If a node fails, the client simply finds a new projection of that layer elsewhere.
- Mooncake and LMCache: Newer architectures are explicitly disaggregating the KV cache, treating it as a shared data stream that lives independently of any single GPU's memory. The underlying recognition is that the KV cache is not just a buffer; it is the canonical record of the conversation.
The difference in Datom.World is one of integration. In mainstream stacks, live migration and disaggregated caching are complex features bolted onto an opaque tensor runtime. In this architecture, they are ordinary consequences of the substrate. If the VM is an interpreter over streams, then migration is just re-pointing that interpreter to a new segment of the same canonical fact-stream.
Why this architecture is not mainstream yet
If the idea is so attractive, why is the industry still built on opaque tensor runtimes? The answer is not that the mainstream stack is foolish. It is that the mainstream stack optimized for a narrower objective: maximal local throughput under familiar deployment assumptions.
GPU software ecosystems reward dense contiguous buffers, static-ish kernel plans, and aggressive locality. They do not yet reward first-class mobility, explicit provenance, or suspendable computation strongly enough to dominate the design. Tooling, compilers, profilers, and operator habits all reinforce the tensor-runtime worldview.
But the pressure is changing. Long-running agentic workloads, privacy-sensitive inference, heterogeneous clusters, intermittent edge devices, and the scarcity of expensive accelerators all make resumability and mobility more valuable. The more we care about preserving live work rather than merely launching new work, the more attractive continuation-native architectures become.
The real target: better effective utilization, not prettier semantics
This architecture should not be judged by whether it sounds elegant in a whiteboard conversation. It should be judged by whether it increases effective GPU utilization: more useful work completed per scarce accelerator hour, less discarded state, fewer cold restarts, more flexible placement, better privacy boundaries, and lower operational fragility.
That is the promise of mobile continuations over stream-native tensor state. We stop treating GPUs as places where opaque jobs go to live and die. We start treating them as interpreters for explicit computation that can be suspended, relocated, and resumed.
Once that shift happens, federated inference and federated training stop looking like compromises. They start looking like the natural consequence of taking computation seriously as data.