How to sync dual-channel transcripts via OpenAI Whisper (VAD silence stripping destroys absolute timestamps)

I am building an automated call transcription pipeline for a PBX system. The goal is to generate a perfectly chronological, multi-speaker transcript (Caller vs. Callee) from standard 8kHz telephony audio.
My Attempted Solution
Because the OpenAI API downmixes stereo files to mono (which destroys speaker separation and causes heavy hallucination on 8kHz audio), I built a split-channel architecture:

  1. Asterisk: I use MixMonitor with the b,r(),t() flags to record the call legs into two separate, mathematically synchronized files (_caller.wav and _callee.wav).

  2. PHP Worker: A background script converts the files and fires two separate cURL requests to the Whisper API, requesting verbose_json to get exact timestamps.

  3. The Merge: The PHP script parses both JSON arrays, tags the speakers, merges the arrays, and sorts them chronologically by their start times to reconstruct the conversation.
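For reference, the merge in step 3 can be sketched roughly as below. This is a minimal Python sketch, not the actual PHP worker. It assumes each decoded verbose_json response carries a `segments` list whose entries have `start`, `end` and `text` fields; the names `merge_segments` and `format_line` are hypothetical.

```python
from typing import List


def merge_segments(caller: dict, callee: dict) -> List[dict]:
    """Tag each segment with its speaker and sort everything by start time."""
    tagged = []
    for speaker, response in (("Caller", caller), ("Callee", callee)):
        # A response without a "segments" key contributes nothing.
        for seg in response.get("segments", []):
            tagged.append({
                "speaker": speaker,
                "start": float(seg["start"]),
                "end": float(seg["end"]),
                "text": seg["text"].strip(),
            })
    # Sort by start time; ties are broken by end time.
    tagged.sort(key=lambda s: (s["start"], s["end"]))
    return tagged


def format_line(seg: dict) -> str:
    """Render one merged segment as '[MM:SS] Speaker: text'."""
    minutes, seconds = divmod(int(seg["start"]), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}"
```

A sort like this is only as good as the `start` values it is given: it orders whole segments, so one long segment that spans the other speaker's turn stays in one piece, and any timestamp that no longer refers to the original call clock puts the segment in the wrong place.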

The Specific Issue I am Facing
The merged transcript comes out jumbled.
The transcription I am getting:
[00:00] Caller: Hello, this is a Policy Test, my name is John Miller, today is Wednesday, May 27th, the

[00:00] Callee: Hi, if you record your name and reason for calling, I’ll see if this person is available.

[00:15] Caller: reference number is 473169, can you hear me clearly?

[00:24] Callee: Yes, I can hear you clearly.

[00:26] Callee: This is the Kohli site test.

[00:28] Callee: My name is Sarah Johnson.

[00:30] Callee: The audio quality sounds good from my side.

[00:33] Callee: Please continue with the verification
[00:35] Caller: I will now test timestamps and speaker changes, the amount is $125, the meeting is scheduled

[00:43] Caller: for 10.30am, please confirm the details.

[00:48] Callee: Confirmed.

[00:49] Callee: $125.

[00:51] Callee: Meeting at 10.30 AM.

[00:53] Callee: I am also testing punctuation, pauses, and pronunciation.

[00:58] Caller: Now testing short interruptions, can you just say the color blue while I continue speaking?

[01:05] Callee: Blue.

[01:07] Caller: Thank you, now testing phone numbers 9876543210, final verification test, this call recording

[01:17] Callee: Received.

[01:18] Callee: Now testing email pronunciation.

[01:20] Callee: john.miller at example.com

[01:26] Caller: should contain timestamps, speaker labels and accurate English transcriptions, ending

[01:32] Caller: test now.

The actual script of the test call I made:
Caller

Hello, this is the caller side test.

My name is John Miller.

Today is Wednesday, May twenty seventh.

The reference number is four seven three one six nine.

Can you hear me clearly?

Callee

Yes, I can hear you clearly.

This is the callee side test.

My name is Sarah Johnson.

The audio quality sounds good from my side.

Please continue with the verification.

Caller

I will now test timestamps and speaker changes.

The amount is one hundred twenty five dollars.

The meeting is scheduled for ten thirty AM.

Please confirm the details.

Callee

Confirmed.

One hundred twenty five dollars.

Meeting at ten thirty AM.

I am also testing punctuation, pauses, and pronunciation.

Caller

Now testing short interruptions.

Can you say the color blue while I continue speaking?

Callee (interrupt slightly)

Blue.

Caller

Thank you.

Now testing phone numbers.

Nine eight seven six five four three two one zero.

Callee

Received.

Now testing email pronunciation.

john dot miller at example dot com.

Caller

Final verification test.

This call recording should contain timestamps,

speaker labels,

and accurate English transcription.

Ending test now.

Asterisk and AGI experts, please help: is there any solution to this problem on the AGI side?

Assuming that VAD doesn’t also break this, I’d record one stereo file, and then split the file. I also seem to recall there are options to infill silence.
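A rough sketch of the split step, using only the Python standard library. It assumes the recording is an uncompressed 2-channel PCM WAV; which channel holds which call leg is an assumption you would need to check against your own recordings, and `split_stereo` is a hypothetical name.

```python
import wave


def split_stereo(stereo_path: str, left_path: str, right_path: str) -> int:
    """Split a 2-channel PCM WAV into two mono WAVs; return frames per channel."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel WAV file")
        width = src.getsampwidth()   # bytes per sample (2 for 16-bit PCM)
        rate = src.getframerate()    # e.g. 8000 for telephony audio
        frames = src.readframes(src.getnframes())

    # Stereo PCM is interleaved: one left sample, one right sample, repeated.
    frame_size = 2 * width
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
    return len(left) // width
```

Because both mono files come from the same interleaved stream, they have the same length and share one time origin, so sample N in one file lines up with sample N in the other.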

However, whilst requests to transcribe separated speakers seem common here, high accuracy transcription would require the context window to include both sides, and it sounds like these AI services don’t actually meet the requirements of people wanting to transcribe phone calls.

I put the following into Google and got some suggestions of services that do this, but I have not verified the results and definitely haven’t tried them:

“AI voice transcription service that supports more than one speaker, on different channels, but using input from all the channels to form the context window.”