
Grok Opens Its Voice APIs with Aggressive Pricing That Undercuts Deepgram and ElevenLabs

AI News Stories of the Week

Annie Neal

Growth Advisor


xAI released standalone Speech-to-Text (STT) and Text-to-Speech (TTS) APIs on April 17, 2026, and the pricing is designed to take market share from incumbents. Grok STT costs $0.10 per hour in batch mode or $0.20 per hour in streaming, while Grok TTS runs $4.20 per million characters. For context, ElevenLabs TTS pricing reaches around $50 per million characters and OpenAI’s TTS offering is priced near $30 per million characters, which puts Grok TTS roughly 86% below OpenAI and about 92% below ElevenLabs. The pricing alone would be a story, but the accuracy benchmarks make it more compelling.
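To make the pricing gap concrete, here is a minimal Python sketch of the arithmetic. The prices are the approximate figures quoted above; the function and dictionary names are illustrative and not part of any vendor SDK.

```python
# Approximate list prices quoted in this article (USD). These are the
# article's figures, not values fetched from any provider's pricing page.
TTS_PRICE_PER_MILLION_CHARS = {
    "Grok": 4.20,
    "OpenAI": 30.00,
    "ElevenLabs": 50.00,
}
GROK_STT_PRICE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def tts_cost(provider: str, characters: int) -> float:
    """Estimated TTS cost in USD for a given number of characters."""
    return TTS_PRICE_PER_MILLION_CHARS[provider] * characters / 1_000_000


def stt_cost(mode: str, audio_hours: float) -> float:
    """Estimated Grok STT cost in USD for a given number of audio hours."""
    return GROK_STT_PRICE_PER_HOUR[mode] * audio_hours


def discount_vs(provider: str, baseline: str) -> float:
    """Percentage by which `provider` undercuts `baseline` on TTS price."""
    price = TTS_PRICE_PER_MILLION_CHARS[provider]
    base = TTS_PRICE_PER_MILLION_CHARS[baseline]
    return (1 - price / base) * 100


if __name__ == "__main__":
    for rival in ("OpenAI", "ElevenLabs"):
        print(f"Grok TTS vs {rival}: {discount_vs('Grok', rival):.1f}% cheaper")
    print(f"1,000 hours of batch STT: ${stt_cost('batch', 1000):.2f}")
```

Swapping in negotiated or tiered rates for the dictionary values gives a quick estimate for a specific workload.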

On phone call entity recognition, a notoriously difficult task where noise, accents, and proper nouns regularly trip up transcription models, Grok STT claims a 5.0% error rate. ElevenLabs posts 12.0% on the same benchmark, Deepgram 13.5%, and AssemblyAI 21.3%. For video and podcast transcription, Grok and ElevenLabs tie at a 2.4% error rate, with Deepgram at 3.0% and AssemblyAI at 3.2%. If these benchmarks hold up in production, xAI has delivered a product that is cheaper than every major competitor in the enterprise voice API space and at least as accurate on both reported benchmarks.
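Teams that want to check whether numbers like these hold on their own recordings can score transcripts themselves. The sketch below computes word error rate (WER), a common transcription metric. This is an assumption for illustration: the article does not say exactly how the benchmark's error rate is defined, and the entity-recognition figure in particular may be measured differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Counts substitutions, insertions, and deletions needed to turn the
    hypothesis into the reference, after lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Running every candidate provider over the same human-verified reference transcripts and comparing scores gives a like-for-like view that does not depend on vendor-reported figures.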

The feature set is built for real-world use cases. Grok STT supports 25 languages, speaker diarization, word-level timestamps, Inverse Text Normalization (converting spoken forms such as "twenty dollars" into written forms such as "$20"), and 12 audio formats. It works in both batch and real-time streaming modes, which matters for developers building voice agents, transcription services, meeting assistants, or accessibility features. The word-level timestamps are particularly useful for applications that need to align transcripts with audio playback or identify the exact moment a specific word was said.
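As a rough illustration of what word-level timestamps enable, the Python sketch below looks up when a word was spoken and which word is playing at a given moment. The `TimedWord` shape and the sample data are hypothetical; the article does not describe xAI's actual response schema, so a real integration would first map the API response into something like this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TimedWord:
    """One transcribed word with start/end times in seconds (assumed shape)."""
    word: str
    start: float
    end: float


_PUNCTUATION = ".,!?;:\"'"


def find_occurrences(words: List[TimedWord], target: str) -> List[TimedWord]:
    """Return every timed word matching `target`, ignoring case and punctuation."""
    needle = target.casefold().strip(_PUNCTUATION)
    return [w for w in words if w.word.casefold().strip(_PUNCTUATION) == needle]


def word_at(words: List[TimedWord], seconds: float) -> Optional[TimedWord]:
    """Return the word being spoken at `seconds`, or None during silence."""
    for w in words:
        if w.start <= seconds < w.end:
            return w
    return None


# Made-up sample data for illustration only.
sample = [
    TimedWord("Refund", 0.0, 0.4),
    TimedWord("the", 0.4, 0.5),
    TimedWord("order,", 0.5, 0.9),
    TimedWord("refund", 1.2, 1.6),
]
```

`find_occurrences(sample, "refund")` supports jumping to each mention in a player, while `word_at` supports highlighting the current word during playback.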

Grok TTS includes expressive controls that push it beyond standard synthetic speech. Inline tags like [laugh], [sigh], and [breath] let developers insert human-like non-verbal sounds mid-sentence. Wrapping tags like <whisper>text</whisper> and <emphasis>text</emphasis> provide prosodic control without requiring custom audio engineering. Five expressive voices are available at launch, far fewer than ElevenLabs' voice library offers but likely enough for many commercial applications.
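A small helper can keep tagged input consistent when scripts are assembled in code. The sketch below only builds the text string using the tags named above; it does not call any API, the helper names are hypothetical, and the tag sets cover only the examples mentioned in this article rather than a complete list.

```python
# Tags mentioned in this article. The full set supported by the API may be
# larger; extend these sets after checking the provider's documentation.
INLINE_TAGS = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis"}


def inline(name: str) -> str:
    """Return an inline non-verbal tag such as [laugh]."""
    if name not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {name}")
    return f"[{name}]"


def wrap(name: str, text: str) -> str:
    """Wrap text in a prosody tag such as <whisper>...</whisper>."""
    if name not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {name}")
    return f"<{name}>{text}</{name}>"


def build_script(*parts: str) -> str:
    """Join script fragments with single spaces, skipping empty fragments."""
    return " ".join(p.strip() for p in parts if p.strip())


script = build_script(
    "Thanks for calling.",
    inline("breath"),
    "Your order ships",
    wrap("emphasis", "tomorrow"),
    "and",
    wrap("whisper", "there is a small surprise inside."),
)
```

Validating tag names up front catches typos before they reach the synthesis step, where an unrecognized tag might otherwise be read aloud as literal text.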

The infrastructure story is also notable. xAI says the voice APIs are built on the same systems that currently power Grok Voice, Tesla vehicles, and Starlink customer support. If so, the infrastructure has already been tested at scale across consumer cars, satellite internet customer service, and a major consumer AI product before being opened to external developers. For enterprise buyers evaluating voice API reliability, a provider that already runs production voice workloads for Tesla and Starlink has a credibility advantage over smaller specialists.


There are caveats. On Spanish and other non-English voices, early testing indicates the voice quality still trails ElevenLabs, which has invested heavily in multilingual TTS for years. For LATAM applications where Spanish voice output quality directly affects user experience, teams should benchmark Grok TTS on their specific use case before standardizing. The English product appears genuinely competitive; the non-English story is more nuanced.

For developers and AI companies in Latin America and globally, the Grok voice API launch creates immediate pricing pressure on the incumbent market. Deepgram and ElevenLabs have both built strong enterprise businesses on premium pricing, and a credible competitor at a fraction of the price (about one-twelfth of ElevenLabs' quoted TTS rate) is likely to force contract renegotiations and margin compression across the industry. Voice agents, call center automation, podcast transcription services, and accessibility applications all stand to benefit directly from cheaper, more accurate voice APIs.

The broader xAI strategy is becoming clearer. Elon Musk’s AI company is bundling voice, vision, reasoning, and now standalone APIs into a competitive offering that spans consumer (Grok chatbot), vehicle (Tesla), connectivity (Starlink), and developer (APIs) surfaces. While OpenAI focuses on enterprise infrastructure through its Pentagon and AWS deals and Anthropic emphasizes safety-focused research previews, xAI is aggressively commoditizing the voice layer of the AI stack. If the benchmarks hold in production, every voice-first AI product built in 2026 will have to at least evaluate Grok as a primary provider.
