Hello all,
I have spent the last two full weeks trying to get Asterisk AudioSocket to work with Pipecat. I’m trying to build an agentic voicebot, so my flow looks something like this:
[Voice from User] -> STT -> LLM (with tools) -> TTS -> [Voice to User]
The problem is that there is no clear or defined way to actually transmit the real-time voice from Asterisk AudioSocket to Pipecat that I could find. Even after downsampling/upsampling to the proper rates, it did not work: I can’t hear anything at all.
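For reference, the wire format I am bridging from: AudioSocket is a plain TCP protocol where every packet is a 1-byte kind, a 2-byte big-endian payload length, then the payload; audio packets (kind 0x10) carry 16-bit signed linear PCM at 8 kHz mono, 320 bytes per 20 ms. A minimal sketch of that framing (helper names are my own, not from any library):

```python
import struct

# AudioSocket packet kinds, per the Asterisk AudioSocket protocol docs.
KIND_HANGUP = 0x00   # terminate the call
KIND_UUID = 0x01     # 16-byte call UUID, sent once at connect
KIND_AUDIO = 0x10    # 16-bit signed linear PCM, 8 kHz, mono
KIND_ERROR = 0xFF

def parse_audiosocket_frames(buf: bytes):
    """Yield (kind, payload) pairs from a buffer of AudioSocket packets.

    Each packet is: 1-byte kind, 2-byte big-endian payload length, payload.
    Audio payloads are 320 bytes (20 ms at 8 kHz slin16).
    """
    offset = 0
    while offset + 3 <= len(buf):
        kind, length = struct.unpack_from(">BH", buf, offset)
        if offset + 3 + length > len(buf):
            break  # incomplete packet; wait for more bytes
        yield kind, buf[offset + 3:offset + 3 + length]
        offset += 3 + length

def packetize_audiosocket(pcm8k: bytes, frame_bytes: int = 320):
    """Split outbound 8 kHz slin16 PCM into 320-byte audio packets.

    A short final chunk is zero-padded here for simplicity.
    """
    packets = []
    for i in range(0, len(pcm8k), frame_bytes):
        chunk = pcm8k[i:i + frame_bytes].ljust(frame_bytes, b"\x00")
        packets.append(struct.pack(">BH", KIND_AUDIO, frame_bytes) + chunk)
    return packets
```

This is only a sketch of the framing itself, to anchor the attempts below; everything downstream of the TCP socket is where my problems start.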
This is what I have done thus far:
## **Complete Summary of All Attempts and Results**
### **Initial Setup**
- **What we had**: AudioSocket (Asterisk) → Bridge → WebSocket → Pipecat Server
- **Initial problem**: Frame processing errors, no audio flow
### **Attempt 1: Custom Transport Classes**
**What we tried**: Created `WebSocketAudioTransport` extending `BaseTransport`
```python
class WebSocketAudioTransport(BaseTransport):
    def __init__(self, websocket: WebSocket, params: TransportParams):
        super().__init__(params)
```
**Result**:
Failed with `TypeError: BaseTransport.__init__() takes 1 positional argument but 2 were given`
### **Attempt 2: Fixed Transport Methods**
**What we tried**: Changed from `input_processor()`/`output_processor()` to `input()`/`output()`
**Result**:
Failed with `TypeError: Can't instantiate abstract class WebSocketAudioTransport with abstract methods input, output`
### **Attempt 3: Simplified Without Custom Transport**
**What we tried**: Removed custom transport, used simple `FrameProcessor` classes
```python
class AudioInputProcessor(FrameProcessor):
    async def run_input_loop(self):
        # Process queued audio
```
**Result**:
Failed with `AttributeError: 'AudioRawFrame' object has no attribute 'id'`
### **Attempt 4: Fixed Frame Types**
**What we tried**: Changed to `InputAudioRawFrame` with proper ID attributes
```python
audio_frame = InputAudioRawFrame(
    audio=audio_data,
    sample_rate=16000,
    num_channels=1
)
audio_frame.id = str(uuid.uuid4())
```
**Result**:
Greeting plays,
No STT transcription
### **Attempt 5: Direct Frame Queuing**
**What we tried**: Queue frames directly to task instead of through processors
```python
await task.queue_frame(audio_frame)
```
**Result**:
Greeting plays,
No STT transcription
### **Attempt 6: Fixed Sample Rates**
**What we tried**:
- Set `audio_out_sample_rate=24000` for OpenAI TTS requirement
- Updated bridge to downsample from 24kHz to 8kHz
**Result**:
Greeting plays,
No STT transcription
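To make the rate-conversion step concrete, here is a dependency-free sketch of linear-interpolation resampling for mono 16-bit PCM (naive, no anti-aliasing filter, so only suitable for debugging the audio path; `resample_pcm16` is my own helper name, a production bridge should use a proper resampler):

```python
import struct

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive linear-interpolation resampler for mono 16-bit little-endian PCM.

    E.g. converts 24 kHz TTS output down to the 8 kHz AudioSocket expects,
    or 8 kHz caller audio up to 16 kHz for STT.
    """
    n = len(data) // 2
    samples = struct.unpack(f"<{n}h", data[:n * 2])
    if not samples:
        return b""
    out_len = int(n * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate   # position in the source signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, n - 1)]
        out.append(int(a + (b - a) * frac))
    return struct.pack(f"<{len(out)}h", *out)
```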
### **Attempt 7: PipelineRunner Signal Handler Fix**
**What we tried**: Added `handle_sigint=False` to fix Windows signal handler error
```python
runner = PipelineRunner(handle_sigint=False)
```
**Result**:
No more signal handler error,
Greeting plays,
No STT transcription
### **Attempt 8: Service Validation**
**What we tried**: Added startup validation to test each OpenAI service
```python
async def validate_openai_services(api_key: str):
    # Test STT, LLM, TTS with actual API calls
```
**Result**:
All services validated successfully,
Greeting plays,
No STT transcription
### **Attempt 9: Pipeline Order Fix**
**What we tried**: Moved audio output handler after TTS in pipeline
```python
pipeline = Pipeline([
    stt,
    context_aggregator.user(),
    llm,
    tts,
    output_handler,  # moved here
    context_aggregator.assistant()
])
```
**Result**:
Greeting plays,
No STT transcription
### **Attempt 10: Disabled VAD and Added Transcription Logging**
**What we tried**:
- Set `vad_enabled=False` to ensure audio isn’t filtered
- Added `TranscriptionLogger` to log all transcription events
**Result**:
Greeting plays,
No transcriptions logged at all
### **Attempt 11: WebSocketInputProcessor with Separate Async Task**
**What we tried**: Created `WebSocketInputProcessor` with `_audio_receiver_loop()` running as separate async task
```python
class WebSocketInputProcessor(FrameProcessor):
    async def _audio_receiver_loop(self):
        # Receive audio and push frames
        await self.push_frame(audio_frame)
```
**Result**:
Frames created in the disconnected async task never reached the STT service
### **Attempt 12: Direct Frame Queuing with audio_receiver Function**
**What we tried**: Separate `audio_receiver()` function with direct `task.queue_frame()`
```python
async def audio_receiver(websocket: WebSocket, task: PipelineTask):
    await task.queue_frame(audio_frame)
```
**Result**:
Greeting plays,
Audio received (300+ chunks),
No STT transcription
### **Attempt 13: Audio Accumulation for Whisper (500ms chunks)**
**What we tried**: Accumulated audio into 8000-byte chunks (500ms at 16kHz) for Whisper
```python
BUFFER_SIZE = 8000 # 500ms at 16kHz
audio_buffer.extend(audio_data)
while len(audio_buffer) >= BUFFER_SIZE:
    # Send accumulated chunk
```
**Result**:
Greeting plays,
30 frames sent to STT,
No transcriptions
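One arithmetic note on this attempt: at 16 kHz, 16-bit mono, 8000 bytes is only 250 ms of audio; 500 ms would be 16000 bytes. A self-contained sketch of the accumulation logic (class name is mine):

```python
class PCMChunker:
    """Accumulate raw PCM bytes and emit fixed-size chunks.

    At 16 kHz, 16-bit mono: 32000 bytes/s, so 500 ms = 16000 bytes.
    Leftover bytes stay buffered for the next feed() call.
    """

    def __init__(self, chunk_bytes: int = 16000):
        self.chunk_bytes = chunk_bytes
        self.buf = bytearray()

    def feed(self, data: bytes):
        # Append new audio, then drain as many full chunks as possible.
        self.buf.extend(data)
        chunks = []
        while len(self.buf) >= self.chunk_bytes:
            chunks.append(bytes(self.buf[:self.chunk_bytes]))
            del self.buf[:self.chunk_bytes]
        return chunks
```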
### **Attempt 14: VAD Enabled with Wrong Parameters**
**What we tried**: Tried `SileroVADAnalyzer` with non-existent parameters
```python
vad_analyzer = SileroVADAnalyzer(
    min_speech_duration=0.3,  # doesn't exist
    max_silence_duration=0.5  # doesn't exist
)
```
**Result**:
Failed with `TypeError: unexpected keyword argument 'min_speech_duration'`
### **Attempt 15: VAD with Correct VADParams**
**What we tried**: Used proper `VADParams` class with correct parameters
```python
vad_params = VADParams(
    confidence=0.7,
    start_secs=0.3,
    stop_secs=0.5,
    min_volume=0.6
)
vad_analyzer = SileroVADAnalyzer(sample_rate=16000, params=vad_params)
```
**Result**:
Greeting plays,
Audio received (400+ chunks),
Still no STT transcription
### **Attempt 16: Deepgram Streaming STT with Audio Accumulation**
**What we tried**: Suggested using Deepgram for true streaming STT with 200ms accumulation
```python
if DEEPGRAM_AVAILABLE and os.getenv("DEEPGRAM_API_KEY"):
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2",
        interim_results=True
    )
```
**Result**:
Not tested; I don’t have a Deepgram API key
### **Attempt 17: Debug Processor with Comprehensive Logging**
**What we tried**: Added `DebugProcessor` to log all frames and audio levels
```python
class DebugProcessor(FrameProcessor):
    def __init__(self, label: str):
        super().__init__()
        self.label = label
        # Log RMS, MAX amplitude for each audio frame
```
**Result**:
Confirmed audio flowing (RMS=5961, MAX=13948),
No VAD events,
No transcriptions
### **Attempt 18: Manual VAD Processing**
**What we tried**: Manually call `vad_analyzer.analyze()` on audio chunks
```python
vad_result = await vad_analyzer.analyze(audio_frame)
if vad_result.speech_probability > 0.5:
    # Accumulate speech
```
**Result**:
Failed with `'SileroVADAnalyzer' object has no attribute 'analyze'`
### **Attempt 19: Fixed StartFrame Timing Issues**
**What we tried**: Removed `AudioInputProcessor` from pipeline, used direct frame queuing with proper timing
```python
# Audio receiver waits 0.5s before starting
await asyncio.sleep(0.5)
# Pipeline waits 1.0s before greeting
await asyncio.sleep(1.0)
```
**Result**:
Greeting plays,
700+ chunks received,
No VAD events,
No transcriptions
### **Attempt 20: FastAPIWebsocketTransport with Custom Serializer**
**What we tried**: Created `RawAudioSerializer` with proper `type` property
```python
class RawAudioSerializer(FrameSerializer):
    @property
    def type(self) -> str:
        return "raw_audio"
```
**Result**:
Failed with `exception receiving data: KeyError('text')` - transport expects text messages
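The `KeyError` is consistent with the transport indexing `message['text']` on the raw ASGI message. Assuming a Starlette/FastAPI WebSocket, whose `receive()` returns ASGI dicts that may carry either a `text` or a `bytes` key, a bridge-side helper could normalize both cases before anything reaches the transport (`receive_any` is a hypothetical name, not part of any library):

```python
async def receive_any(websocket):
    """Normalize a Starlette/FastAPI WebSocket message to (kind, payload).

    Assumes websocket.receive() returns ASGI dicts like
    {"type": "websocket.receive", "bytes": b"..."} or {..., "text": "..."},
    so binary audio and text control messages can share one connection.
    """
    message = await websocket.receive()
    if message["type"] == "websocket.disconnect":
        return "disconnect", None
    if message.get("bytes") is not None:
        return "binary", message["bytes"]
    return "text", message.get("text", "")
```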
### **Attempt 21: WebsocketServerTransport**
**What we tried**: Used `WebsocketServerTransport` instead of `FastAPIWebsocketTransport`
```python
transport = WebsocketServerTransport(
    websocket=websocket,
    params=WebsocketServerParams(...)
)
```
**Result**:
Failed with `unexpected keyword argument 'websocket'` - this transport creates its own server
### **Attempt 22: CustomWebSocketWrapper with FastAPIWebsocketTransport**
**What we tried**: Created wrapper to handle mixed text/binary WebSocket messages
```python
class CustomWebSocketWrapper:
    def __init__(self, websocket: WebSocket):
        self.websocket = websocket
        self.config_handled = False
```
**Result**:
Failed with `'CustomWebSocketWrapper' object has no attribute 'client_state'`
### **Attempt 23: Simple Solution with AudioAccumulator**
**What we tried**: Created `AudioAccumulator` processor to accumulate chunks
```python
class AudioAccumulator(FrameProcessor):
    def __init__(self, chunk_size: int = 16000):
        # Accumulate 1-second chunks
```
**Result**:
Audio received (500+ chunks),
AudioAccumulator never processed frames (bypassed by `task.queue_frame()`)
### **Attempt 24: Final Solution with WebSocketInputProcessor in Pipeline**
**What we tried**: Put `WebSocketInputProcessor` IN the pipeline with its own receive loop
```python
class WebSocketInputProcessor(FrameProcessor):
    async def _receive_loop(self):
        # First, send StartFrame
        await self.push_frame(StartFrame())
        # Then receive and push audio frames
```
I am not looking for a theoretical fix, because I have tried almost all of them. I am looking for help in the form of a concrete, minimal code example that connects Asterisk AudioSocket to Pipecat and would then let me swap out parts of the Pipecat pipeline with ease.