How Async Batching Boosts GPU Performance
Continuous batching helps a GPU work more efficiently by removing idle gaps in its workload. In its usual form, however, it is synchronous: the CPU and GPU take turns working, so each one sits idle for short periods while the other runs.
Asynchronous batching addresses this problem. It lets the CPU prepare the next batch of tasks while the GPU is still working on the current one. This overlap cuts total generation time by about 22% and recovers most of the roughly 24% of time the GPU would otherwise spend idle.
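To make the overlap concrete, here is a minimal timeline model in Python. It is a sketch under simplifying assumptions, not the scheduler described above: the function names and the per-batch costs are hypothetical, every batch is assumed to cost the same, and the data dependency between consecutive batches is ignored.

```python
def sync_total(n_batches, cpu_prep, gpu_compute):
    """Synchronous case: CPU and GPU take turns, so the costs add up."""
    return n_batches * (cpu_prep + gpu_compute)


def async_total(n_batches, cpu_prep, gpu_compute):
    """Asynchronous case: the CPU prepares the next batch during GPU compute."""
    cpu_free = 0.0  # time at which the CPU has finished its latest prep
    gpu_free = 0.0  # time at which the GPU has finished its latest batch
    for _ in range(n_batches):
        ready = cpu_free + cpu_prep  # this batch's inputs are ready
        cpu_free = ready             # CPU moves straight on to the next prep
        # The GPU starts once it is free AND the inputs are ready.
        gpu_free = max(gpu_free, ready) + gpu_compute
    return gpu_free


# Hypothetical costs in arbitrary time units: prep = 1, compute = 3.
print(sync_total(10, 1, 3))   # every batch pays prep + compute
print(async_total(10, 1, 3))  # only the first prep is exposed
```

When preparation is cheaper than compute, only the first preparation step stays on the critical path, so the asynchronous total approaches the pure GPU time.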
The technique is a major step toward faster inference, and it requires no new model designs, just smarter scheduling.
| Aspect | Synchronous Batching | Asynchronous Batching |
|---|---|---|
| CPU/GPU Execution | CPU and GPU take turns — while one works, the other idles, creating ~24% wasted GPU time | CPU and GPU run concurrently using CUDA streams and events, achieving ~99.4% GPU utilization |
| Synchronization Mechanism | Relies on the default CUDA stream which blocks the CPU until all GPU operations complete | Uses non-default CUDA streams (H2D, Compute, D2H) with explicit CUDA events for fine-grained ordering |
| Memory Management | Single set of input/output buffers; standard single CUDA graph allocation | Double-buffered slots (A/B) to prevent race conditions; shared memory pool across CUDA graphs to limit VRAM overhead |
| Batch Data Dependencies | Built-in sequential flow: batch N+1 preparation begins only after batch N results are available on the CPU | Uses placeholder tokens and a carry-over mechanism to speculatively prepare batch N+1 inputs while batch N is still computing |
| End-to-End Latency | 300.6s for 8K tokens (batch size 32, 8B model) — GPU idle for ~72s | 234.5s for the same workload — a ~22% speedup with no kernel or model changes required |
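
The percentages in the table follow from the three timings in its last row. The short Python check below uses only those reported numbers; the variable names are illustrative, and the throughput figure is a derived value rather than one stated in the source.

```python
# Timings reported in the table (seconds).
sync_seconds = 300.6
async_seconds = 234.5
gpu_idle_seconds = 72.0  # approximate idle time in the synchronous run

# Share of the synchronous run during which the GPU sat idle.
idle_fraction = gpu_idle_seconds / sync_seconds

# Reduction in wall-clock time, which is what the "~22% speedup" refers to.
time_reduction = (sync_seconds - async_seconds) / sync_seconds

# The same improvement expressed as a throughput gain (derived, not reported).
throughput_gain = sync_seconds / async_seconds - 1

print(f"GPU idle share:  {idle_fraction:.1%}")
print(f"Time reduction:  {time_reduction:.1%}")
print(f"Throughput gain: {throughput_gain:.1%}")
```

The idle share comes out at roughly 24% and the time reduction at roughly 22%, so the asynchronous run recovers most, though not all, of the idle time.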
Asynchronous Continuous Batching
Asynchronous batching decouples the CPU and GPU workloads so that the GPU stays active nearly all the time. It uses CUDA streams to run tasks in parallel, which shrinks the idle gaps between batches and raises GPU utilization significantly. For users, the result is faster generation on the same hardware.
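The decoupling can be illustrated without a GPU. The sketch below is an analogy only: it swaps CUDA streams and events for a Python thread and a bounded queue, and `run_pipeline`, `prepare`, and `compute` are hypothetical names standing in for CPU-side batch preparation and GPU-side execution.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the batch stream


def run_pipeline(requests, prepare, compute):
    """Overlap `prepare` (CPU stand-in) with `compute` (GPU stand-in)."""
    # A bounded queue limits how far preparation can run ahead of compute.
    ready = queue.Queue(maxsize=1)
    results = []

    def producer():
        for req in requests:
            ready.put(prepare(req))  # blocks while a prepared batch is waiting
        ready.put(_DONE)

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        batch = ready.get()          # wait only if no batch has been prepared
        if batch is _DONE:
            break
        results.append(compute(batch))
    worker.join()
    return results


# Toy stand-ins: "prepare" doubles the input, "compute" adds one.
print(run_pipeline([1, 2, 3], lambda x: x * 2, lambda b: b + 1))
```

The consumer never waits for preparation unless the queue is empty, which mirrors the goal of keeping the GPU fed. Batches are still processed in order, because the queue is first-in, first-out.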
Faster and Cheaper LLM Inference
“We moved from schedule-based dependencies to data-based dependencies and refining synchronization points, we managed to disentangle the CPU and GPU workloads, making parallel execution of both hardwares possible. This finally resulted in a large increase of generation speed… Pretty much a slam dunk.”
Asynchronous batching improves system performance by letting CPU and GPU tasks run concurrently. This reduces idle time and yields a speedup of about 22% in generation tasks.
The strategy is a cost-effective way to maximize hardware utilization in large-scale language model operations. It also provides a more efficient foundation for future applications that need high throughput and extended sequence generation.




