AGI for Speech Synthesis

Hello,

I have come across some APIs like AEAP for Speech-to-Text development in Asterisk. Since AEAP likely doesn’t include Text-to-Speech capabilities, could I use AGI to implement TTS alongside AEAP’s STT? If that is feasible, what complications might arise from combining these APIs, and what are the primary drawbacks of opting for AGI over AEAP for TTS/STT? Is there a better option than these two?

Thanks

The most you could do in AGI is play back a generated file, which works perfectly fine for many people.
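A minimal sketch of that file-based approach over AGI. The `synthesize_to_wav()` helper is a hypothetical stand-in for a real TTS engine (the stub below just writes silence so the script runs end to end), and the sounds directory is assumed, not taken from the thread:

```python
#!/usr/bin/env python3
# AGI sketch of the file-based approach: synthesize the whole utterance
# to a sound file first, then play it back with STREAM FILE.

import sys
import wave

SOUND_DIR = "/var/lib/asterisk/sounds"  # adjust to your installation

def synthesize_to_wav(text, path):
    # Hypothetical stand-in for a real TTS engine: writes one second of
    # silence as 8 kHz 16-bit mono PCM so the sketch runs end to end.
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00" * 16000)

def agi_command(cmd):
    # Send one AGI command on stdout, read Asterisk's reply on stdin.
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()
    return sys.stdin.readline().strip()

# Drain the AGI environment header Asterisk sends first (ends with a blank line).
while sys.stdin.readline().strip():
    pass

synthesize_to_wav("Hello, caller!", f"{SOUND_DIR}/tts-reply.wav")

# STREAM FILE takes the path without the file extension.
agi_command('STREAM FILE tts-reply ""')
```

The obvious drawback is latency: the caller hears nothing until the whole file has been synthesized.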

If you can generate the audio fast enough, you could feed it into a named pipe and tell Asterisk to play it back in real time from there.
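For concreteness, here is a rough sketch of that named-pipe approach. It assumes a hypothetical streaming TTS generator yielding raw 8 kHz 16-bit signed-linear chunks (the stub below just yields silence), and that Asterisk can read files in /tmp:

```python
#!/usr/bin/env python3
# AGI sketch: stream TTS audio to Asterisk through a named pipe so playback
# starts before the full utterance has been synthesized.

import os
import sys
import threading

FIFO = "/tmp/tts-stream.sln"  # .sln = raw 8 kHz 16-bit signed linear

def hypothetical_tts_stream(text):
    # Placeholder for a real streaming TTS engine: yields two seconds of
    # silence in 20 ms chunks so the sketch is runnable as-is.
    for _ in range(100):
        yield b"\x00" * 320

def agi_command(cmd):
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()
    return sys.stdin.readline().strip()

def feed_pipe(chunks):
    # open() blocks here until Asterisk opens the FIFO for reading.
    with open(FIFO, "wb") as pipe:
        for chunk in chunks:
            pipe.write(chunk)

# Drain the AGI environment header.
while sys.stdin.readline().strip():
    pass

os.mkfifo(FIFO)
try:
    writer = threading.Thread(
        target=feed_pipe, args=(hypothetical_tts_stream("Hello!"),))
    writer.start()
    # STREAM FILE takes the path without the extension; Asterisk reads the
    # FIFO as it fills, so the caller hears audio while it is generated.
    agi_command('STREAM FILE /tmp/tts-stream ""')
    writer.join()
finally:
    os.remove(FIFO)
```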

It depends on how ambitious you are. If you are considering advanced use cases with a human-like bot, you will likely run into many of AGI's limitations. A few examples:

  1. Modern LLMs generate responses pretty slowly, on the order of 3-5 tokens per second. If you want to work with an LLM, you likely need a streaming TTS, not a file-based TTS. This means the simple Playback application is not going to work (the named-pipe approach sketched above is one workaround).

  2. You likely need to implement barge-in to interrupt TTS properly, and intelligent barge-in at that, so a simple cough won't stop the TTS. That requires tight integration between TTS and ASR; see the sketch after this list.

  3. AEAP is also quite limited and doesn't cover important use cases like in-conference assistance.
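On point 2, AGI has no clean way to interrupt an in-progress playback from outside the call, so a barge-in sketch is easier to show with ARI, which can stop a playback by id (DELETE /playbacks/{playbackId}). This assumes a local Asterisk with ARI enabled, placeholder credentials, and a hypothetical `asr_events` iterable yielding recognition results with text and a confidence score:

```python
import requests

ARI = "http://localhost:8088/ari"
AUTH = ("user", "secret")  # placeholder ari.conf credentials

def barge_in(playback_id, asr_events):
    # "Intelligent" barge-in: ignore coughs and short noises by requiring
    # a minimum confidence and at least one recognized word before
    # stopping the TTS playback via ARI.
    for event in asr_events:
        if event["confidence"] > 0.8 and event["text"].split():
            requests.delete(f"{ARI}/playbacks/{playback_id}", auth=AUTH)
            return event["text"]
    return None
```

The confidence threshold is exactly the kind of TTS/ASR coupling point 2 refers to: the ASR decides when the TTS stops.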

That is why many advanced dialog systems have ended up with a custom module for ASR/TTS instead of the existing UniMRCP, AGI, ARI, or AEAP implementations. But for simple systems they are perfectly fine.

Thanks for your explanation @nshmyrev !
Setting aside that UniMRCP, AGI, ARI, and AEAP implementations all lack real-time translation support, what are the primary advantages of choosing a UniMRCP implementation over, for example, AGI or ARI? Is it the ease of implementation, given that there's no need to build the system from scratch, or the capacity to efficiently handle a large number of calls?

Thank you.
