Tap incoming audio to softphone/speaker

Hello,

We are trying to convert English speech to text in real time but are facing issues. We have an IP-based softphone installed on a PC, and calls received by the softphone need to be converted to text. I assume the incoming audio is played through the PC speaker, so that output needs to be tapped and converted to text. At present we are using “switchVoc” as our VoIP PBX. How do I tap the incoming audio stream? Do softphones provide any APIs? Or should we use the Asterisk Node.js options? Or should we use WebRTC (I am still exploring this)?

Most of the WebRTC calls I have found (e.g. getUserMedia) only capture the microphone, not the speaker output.
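
To show what I have in mind for the WebRTC route, here is a rough sketch. It assumes a browser-based softphone where we can get at the call's RTCPeerConnection, which I have not confirmed for our setup. As far as I understand, the remote party's audio arrives as a "track" event on the peer connection rather than through getUserMedia. The names tapRemoteAudio, TrackSource and RemoteTrackEvent are made up for illustration; the two interfaces are only there so the snippet compiles without browser typings and are meant to mirror the parts of RTCPeerConnection and RTCTrackEvent that it uses.

```typescript
// Minimal structural types (hypothetical names) so the sketch compiles
// without DOM typings. They are intended to mirror the parts of
// RTCPeerConnection / RTCTrackEvent used below.
interface RemoteTrackEvent {
  track: { kind: string };          // "audio" or "video"
  streams: ReadonlyArray<unknown>;  // MediaStream objects in a browser
}

interface TrackSource {
  addEventListener(
    type: "track",
    listener: (ev: RemoteTrackEvent) => void
  ): void;
}

// Hypothetical helper: invokes `onAudio` with the first stream of every
// incoming (remote) audio track. Video tracks and tracks that carry no
// stream are ignored.
function tapRemoteAudio(
  pc: TrackSource,
  onAudio: (stream: unknown) => void
): void {
  pc.addEventListener("track", (ev) => {
    if (ev.track.kind === "audio" && ev.streams.length > 0) {
      onAudio(ev.streams[0]);
    }
  });
}
```

The stream handed to the callback would then have to be fed to whatever speech-to-text engine we choose; that part is not shown and is still open.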

Kindly help.

Thanks