Google brings multi-token prediction to Gemma 4 LLMs – TechTalks
Google Brings Multi-Token Prediction to Gemma 4 LLMs
Multi-token prediction changes how quickly language models generate text, and Google has built it into its new Gemma 4 open models. Instead of producing one token at a time, the model drafts several tokens and confirms them together. The result is noticeably faster responses on consumer hardware.
Small “drafter” models propose the upcoming tokens, and the main model checks them all in a single step. The reported speedup is up to 3x with no loss in output quality. The drafters also share the main model’s KV cache, which keeps the memory overhead low.
Because the weights are open, the community can build on this work. Researchers have already released DFlash, which pushes speeds further, and open weights more generally help developers improve AI on everyday devices.
| Feature | Standard Autoregressive LLM | Gemma 4 with Multi-Token Prediction (MTP) | Gemma 4 with DFlash (Community Layer) |
|---|---|---|---|
| Token Generation Method | Predicts and generates one token at a time; each token requires a full weight-matrix fetch from memory (memory-wall bottleneck). | A lightweight “drafter” head proposes n candidate tokens in parallel; the main “target” model verifies them all in a single forward pass, accepting correct prefixes and regenerating from the first mismatch. | Builds on MTP output by applying Block Diffusion: instead of strictly sequential token refinement, the model refines entire blocks of embedding-space representations simultaneously, treating generation more like an image-diffusion process. |
| Inference Speed Gain | Baseline — throughput limited by memory bandwidth, especially on consumer-grade hardware (MacBooks, limited-VRAM GPUs). | Up to 3× acceleration reported on both on-device and larger Gemma 4 variants. | Up to 6× acceleration over baseline when DFlash optimizations are layered on top of MTP. |
| Memory & Compute Overhead | Single model weight set; KV cache used for standard attention. No additional heads required. | Overhead mitigated via three techniques: KV-cache sharing (drafters reuse main model’s cache), shared target activations (drafters piggyback on deeper-layer representations), and efficient embedders on smaller models (E2B/E4B) that cluster 260 K vocab into a smaller projection matrix. | Optimizes GPU kernels for KV-cache transfers and attention-layer computation — the bottlenecks MTP alone does not address. No additional model weights; purely a systems-level optimization pass. |
| Quality Safeguard | Output quality is deterministic per generation — no draft risk. | Drafter heads are specifically trained to align with the main model, ensuring high acceptance rates. Rejected drafts waste GPU cycles, so mis-alignment would negate the speed benefit. Final output quality is identical to standard autoregressive generation. | DFlash is described as “same quality” as MTP — it operates at the kernel/memory-transfer level, not the model’s learned parameters, so generation quality is preserved. |
| Openness & Ecosystem Impact | Closed frontier models offer no avenue for external researchers to optimize architectures or kernel code. | Gemma 4 ships with open weights (Apache 2.0), documented MTP architecture, and modular design — enabling community inspection and extension. | Developed by Z-Lab as an open-source add-on within days of Gemma 4’s release, demonstrating how open weights create a crowdsourced R&D ecosystem that accelerates real-world deployment on everyday devices. |
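The Quality Safeguard row notes that rejected drafts waste work, so the speedup depends on how often the main model accepts the drafter's tokens. A rough way to see this is the expected number of tokens produced per verification pass. The sketch below assumes, as a simplification, that each drafted token is accepted independently with the same probability `p`; the function name and the example values are illustrative and are not measured Gemma 4 figures.

```python
def expected_tokens_per_pass(p: float, n: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumption (not from the article): each of the n drafted tokens is
    accepted independently with probability p. The pass always yields at
    least one token, because the target model supplies its own token at
    the first mismatch (or one extra token if every draft is accepted).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # k-th term is the probability that the first k drafted tokens all pass.
    return sum(p ** k for k in range(n + 1))


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        print(p, round(expected_tokens_per_pass(p, 4), 2))
```

With `p = 0.8` and four drafted tokens, this gives about 3.36 tokens per pass. That is an upper bound on the speedup, since it ignores the cost of running the drafter, and it shows why a poorly aligned drafter (low `p`) erases most of the gain.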
Gemma 4 Multi-Token Prediction
Multi-token prediction in Gemma 4 lets the model produce several tokens per step instead of one. A smaller drafter proposes tokens, and the main model verifies them in a single forward pass. This eases the memory-wall bottleneck on everyday devices, because the main model’s weights are fetched from memory once per batch of drafted tokens rather than once per token. Sharing the KV cache between the drafter and the main model keeps memory costs low, so people can run faster AI locally. The community’s DFlash integration reports speedups of up to 6x, an early sign of what open weights make possible.
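The draft-and-verify loop described above can be sketched with toy stand-ins. This is a minimal greedy sketch, not Gemma 4's implementation: `target_next`, `drafter_next`, `speculative_step`, and `generate` are hypothetical names, the "models" are simple functions over integers, and verification runs as a Python loop where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A toy "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    # Toy target model: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def drafter_next(seq: List[int]) -> int:
    # Toy drafter: matches the target, except it guesses wrong after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def speculative_step(seq: List[int], target: NextToken,
                     drafter: NextToken, n_draft: int) -> List[int]:
    # 1. The drafter proposes n_draft tokens, one after another.
    draft: List[int] = []
    ctx = list(seq)
    for _ in range(n_draft):
        tok = drafter(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position in order.
    accepted: List[int] = []
    ctx = list(seq)
    for tok in draft:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prompt: List[int], target: NextToken, drafter: NextToken,
             n_new: int, n_draft: int = 3) -> Tuple[List[int], int]:
    seq = list(prompt)
    steps = 0  # number of target verification passes
    while len(seq) - len(prompt) < n_new:
        seq.extend(speculative_step(seq, target, drafter, n_draft))
        steps += 1
    return seq[:len(prompt) + n_new], steps


def generate_baseline(prompt: List[int], target: NextToken,
                      n_new: int) -> List[int]:
    # Standard autoregressive decoding: one target call per token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


if __name__ == "__main__":
    out, steps = generate([0], target_next, drafter_next, n_new=12)
    print(out == generate_baseline([0], target_next, 12), steps)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain autoregressive decoding token for token. The drafter only changes how many target passes are needed to get there.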
Implications for On-Device AI
In practice, multi-token prediction makes Gemma 4 up to 3x faster on local devices, while shared resources keep the extra memory cost low. The community-built DFlash optimization raises the reported speedup to as much as 6x. Unlike closed models, open weights let outside researchers contribute improvements like this, which makes AI more accessible and efficient.
“DFlash for Gemma 4: Up to 6x Faster. ⚡⚡ Great to see MTP land natively in Gemma 4 today. If you want to push it further, try DFlash — open source, same quality, more speed!!”
Google’s multi-token prediction gives Gemma 4 a major speed boost for anyone running AI locally, making advanced language models more responsive and practical for daily tasks. The key advantage is faster generation without more powerful hardware.
The open-weight design of Gemma 4 is equally important. It lets the global community build on the technology and create further improvements, so that progress in AI benefits more than a few large companies.



