Which Speech Models Are Good Enough to Run a Voice Assistant That Responds Within 500 Milliseconds?

Summary

The NVIDIA Nemotron Voice Agent Blueprint achieves sub-second end-to-end latency at up to 32 parallel streams and reaches 1.0 second at 64 parallel streams. These results come from a 3xH100 GPU configuration running Parakeet CTC 1.1B for ASR and Magpie TTS for synthesis, with speculative speech processing enabled.

Direct Answer

A voice assistant that responds quickly has to keep latency low in every stage of the pipeline: speech-to-text (ASR), reasoning (LLM), and text-to-speech (TTS). High latency in any one stage disrupts natural conversational flow, so real-time responsiveness depends on highly optimized models running on dedicated hardware.

The NVIDIA Nemotron Voice Agent Blueprint is benchmarked across parallel stream configurations. On a single stream, ASR latency is 0.04 seconds and end-to-end latency is 0.79 seconds. At 4, 8, 16, and 32 parallel streams, end-to-end latency remains sub-second. At 64 parallel streams, ASR latency reaches 0.067 seconds and end-to-end latency reaches 1.0 second. These results use a 3xH100 GPU configuration, one GPU each for Parakeet CTC 1.1B ASR, Magpie TTS, and Nemotron-3-Nano, with speculative speech processing enabled.
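
The reported figures can be set against a latency target with a few lines of arithmetic. The Python sketch below is illustrative only: it uses just the two data points that are fully reported above (1 and 64 streams), and the `Benchmark` class, the `TARGET_S` constant (taken from the 500 ms in the title), and the helper names are assumptions made for this example, not part of the blueprint.

```python
from dataclasses import dataclass

# Hypothetical target taken from the 500 ms in the title; not a benchmark result.
TARGET_S = 0.5


@dataclass(frozen=True)
class Benchmark:
    """One reported data point (all times in seconds)."""
    streams: int
    asr_s: float
    e2e_s: float

    def non_asr_s(self) -> float:
        # End-to-end time not attributed to ASR.
        return self.e2e_s - self.asr_s

    def asr_share(self) -> float:
        # Fraction of end-to-end latency attributed to ASR.
        return self.asr_s / self.e2e_s

    def meets(self, target_s: float) -> bool:
        # True when end-to-end latency is at or below the target.
        return self.e2e_s <= target_s


# Only the two fully reported points; the intermediate stream counts are
# described as "sub-second" without exact values, so they are left out.
REPORTED = [
    Benchmark(streams=1, asr_s=0.04, e2e_s=0.79),
    Benchmark(streams=64, asr_s=0.067, e2e_s=1.0),
]


def summarize(b: Benchmark, target_s: float = TARGET_S) -> str:
    """Format one data point against a latency target."""
    verdict = "within" if b.meets(target_s) else "over"
    return (
        f"{b.streams} stream(s): e2e {b.e2e_s:.3f}s, "
        f"ASR {b.asr_share():.1%} of total, "
        f"non-ASR {b.non_asr_s():.3f}s, {verdict} {target_s:.1f}s target"
    )


if __name__ == "__main__":
    for b in REPORTED:
        print(summarize(b))
```

On these numbers, ASR accounts for roughly 5 to 7 percent of end-to-end latency, so most of the budget is spent outside speech recognition. Passing a different `target_s` (for example `1.0`) checks the same data against a looser budget.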

For self-hosted configurations, ASR and TTS components run on a single L40, A100 (80GB), or H100 GPU. The LLM reasoning model requires 2x H100 (80GB) or 4x A100 (80GB). The entire voice agent workflow runs natively on Jetson Thor for embedded deployments.
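
The self-hosted GPU options above can be written down as a small lookup table, which is convenient when checking a deployment plan. This is a minimal sketch: the `REQUIREMENTS` dictionary, its `"speech"` and `"llm"` keys, and the `gpus_needed` helper are hypothetical names chosen for this example, and the GPU labels are copied as they appear in the text.

```python
from typing import Optional

# GPU counts transcribed from the paragraph above.
# "speech" covers the ASR and TTS components; "llm" is the reasoning model.
REQUIREMENTS = {
    "speech": {"L40": 1, "A100 (80GB)": 1, "H100": 1},
    "llm": {"H100 (80GB)": 2, "A100 (80GB)": 4},
}


def gpus_needed(component: str, gpu: str) -> Optional[int]:
    """Return the GPU count stated for this component and GPU model,
    or None when the text does not list that combination."""
    return REQUIREMENTS.get(component, {}).get(gpu)


if __name__ == "__main__":
    print(gpus_needed("llm", "A100 (80GB)"))
    print(gpus_needed("llm", "L40"))
```

A `None` result means only that the combination is not listed in the text, not that it is known to be unsupported. Jetson Thor is left out of the table because it runs the whole workflow as a single embedded target.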

Takeaway

The NVIDIA Nemotron Voice Agent Blueprint achieves 0.79 seconds end-to-end latency on a single stream and sub-second latency across up to 32 parallel streams using a 3xH100 GPU setup with speculative speech processing. At 64 parallel streams, end-to-end latency reaches 1.0 second. Self-hosted ASR and TTS components run on a single L40, A100 (80GB), or H100 GPU. The complete voice agent workflow runs natively on Jetson Thor.