Distributed Inference at Scale: Tensor Parallelism Across 512 GPUs
How we reduced p99 inference latency by 73% on a 70B parameter model using custom tensor sharding, NCCL topology-aware collectives, and KV-cache prefill scheduling.
Research Journal
Technical writing from the engineers building the infrastructure of the future.
We built a zero-overhead, kernel-native eBPF observability pipeline that processes 10 billion events per second, without a single userspace agent.
The network, codec, and jitter-buffer decisions that took our voice infrastructure from 180ms to under 50ms at the 99th percentile.
Formal methods caught a subtle liveness violation that 18 months of chaos engineering and 10,000 hours of simulation had missed.
Namespaces and cgroups are necessary but not sufficient for true multi-tenant isolation. We explore hardware-enforced isolation with SR-IOV and Intel TDX, backed by eBPF-based network policy.
What we learned porting a high-performance storage daemon from Linux to Windows Server, macOS, Android, and a custom RTOS, and why Rust made it possible.
Treating cellular aging as a resource leak and the human body as a hyperscale infrastructure problem. A technical analysis of metabolic throughput and cellular orchestration.