Real-time AI models, a price comparison

Hello Asterisk Community!

I was updating pricing information for real-time AI models (speech-to-speech interaction with the model, with no intermediate STT or TTS conversion and low latency) and comparing them to keep the data up to date. Sample: a 1-hour call. Created with Grok 4 Fast. Here it is, updated:

| Model | Economic Version | Cost 1 Hour (USD) | Notes |
|---|---|---|---|
| Google Gemini | 2.5 Flash Live | ~0.68 | Based on 45k audio tokens in/out (25/sec); $3/1M input audio, $12/1M output audio. Source: https://cloud.google.com/vertex-ai/generative-ai/pricing |
| OpenAI Realtime | gpt-realtime-mini | ~1.35 | Based on 45k audio tokens in/out (25/sec); $10/1M input audio, $20/1M output audio. Source: https://openai.com/api/pricing/ |
| Microsoft Azure Speech | Standard Real-time | ~1.92 | STT $1.20/h + TTS ~$0.72/h. Source: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/ |
| Hume.ai EVI | Starter | ~4.40 | $3 for 40 min + $0.07/min additional. Source: https://www.hume.ai/pricing |
| ElevenLabs Agents | Starter | ~6.00 | $5 for 50 min, equivalent to $0.10/min. Source: https://elevenlabs.io/pricing |
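For the token-billed rows, the per-hour figures can be reproduced with a few lines. A minimal sketch, assuming (as in the table's notes, not as official figures) that each side speaks ~30 minutes of a 1-hour call and that audio is billed at ~25 tokens per second of speech, i.e. 45,000 audio tokens each way:

```python
# Rough cost-per-hour estimate for token-billed real-time voice models.
# Assumptions: 30 minutes of speech per side, 25 audio tokens per second.
TOKENS_PER_SEC = 25
SPEECH_SECONDS = 30 * 60                          # 30 min per side
AUDIO_TOKENS = TOKENS_PER_SEC * SPEECH_SECONDS    # 45,000 tokens each way

def token_billed_cost(in_price_per_m, out_price_per_m, tokens=AUDIO_TOKENS):
    """USD for one hour, with `tokens` audio tokens each direction."""
    return tokens * (in_price_per_m + out_price_per_m) / 1_000_000

gemini = token_billed_cost(3, 12)    # $3/1M in, $12/1M out  -> ~0.68
openai = token_billed_cost(10, 20)   # $10/1M in, $20/1M out -> ~1.35

print(f"Gemini 2.5 Flash Live: ~${gemini:.2f}/h")
print(f"gpt-realtime-mini:     ~${openai:.2f}/h")
```

Changing the assumed talk time or tokens-per-second rate scales the result linearly, so the estimate is only as good as those two assumptions.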

Today, October 11, 2025, Google Gemini 2.5 Flash Live could be the most affordable option for real-time agents. Good to know!

Greetings!

Looking for Asterisk to AI Realtime Agents Integration? Let's talk: hello@infinitocloud.com

You appear to have used the pricing for text input and output for gpt-realtime-mini, but the prices for audio input and output are much higher. Reading between the lines, the text pricing implies queued processing, and therefore some delay. I assume the audio pricing accounts both for high priority and for the costs, above the core model, of converting speech to tokens and tokens to speech.

You seem to have assumed 100 tokens per minute, but the number of tokens is normally greater than the number of words, and the average rate for conversational American English is about 150 words per minute. On the other hand, you have assumed that one side is talking over the other (and, with the reverse effect, that nothing is done to truncate the response).

Also, there are different models because they are good at different things. A model is not economical if it isn't up to the job required.
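To make the talk-time point concrete, here is a quick sensitivity sketch (my own, with hypothetical splits of who speaks how much) using the gpt-realtime-mini audio prices and the same 25-tokens/sec assumption from the table:

```python
# Sensitivity of the hourly estimate to the assumed talk-time split.
def hourly_cost(user_min, agent_min, in_price_per_m, out_price_per_m,
                tokens_per_sec=25):
    """USD per hour given minutes of user speech (input) and agent speech (output)."""
    in_tokens = user_min * 60 * tokens_per_sec
    out_tokens = agent_min * 60 * tokens_per_sec
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# gpt-realtime-mini audio pricing: $10/1M input, $20/1M output
for user_min, agent_min in [(30, 30), (20, 40), (40, 20)]:
    cost = hourly_cost(user_min, agent_min, 10, 20)
    print(f"user {user_min} min / agent {agent_min} min -> ${cost:.2f}/h")
```

Because output audio is billed at twice the input rate here, a chattier agent (20/40 split) costs noticeably more per hour than a chattier user (40/20 split), which is why truncating long responses matters.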

Hi! Yes, you’re right.
The comparison considered the text output for OpenAI Realtime Mini. I've updated the table with additional information about the particular metrics.
I'll also run a couple of 10-minute calls to OpenAI Realtime and Google Dialogflow to check their billing and compare again.
Of course, everyone is free to use whatever tool they want.
Greetings!