Telephony Controller
The Telephony Controller is the most complex subsystem within the Core API. It acts as the bridge between the traditional SIP/VoIP world (managed by FreeSWITCH) and the modern AI world (managed by LLMs).
All logic for handling voice interactions is located in src/routes/telephony.ts and the src/controllers/telephony/ directory.
1. The Call Initialization Flow
When a user dials a phone number associated with Rendez-vous.ai, the following sequence occurs:
- SIP Trunk to FreeSWITCH: The call arrives at our FreeSWITCH instance via Twilio or VoIP.ms.
- Webhook Trigger: FreeSWITCH executes its dialplan and sends an HTTP POST request to the Core API’s /api/telephony/incoming-call endpoint (src/controllers/telephony/incoming-call.ts).
- Agent Resolution: The API extracts the destination_number (the number the user dialed) and uses GenQL to query the Dashboard (Payload CMS) for the associated Agent configuration.
- Validation: The API checks:
  - Is the telephony service active for this client?
  - Are we currently within the configured business hours?
- Instruction Response: The API responds to FreeSWITCH with XML instructions (or via Event Socket) telling it to answer the call and immediately bridge the audio stream to a WebSocket opened by the API.
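The resolution and validation steps above can be sketched as a small pure function. This is only an illustration: the AgentConfig and BusinessHours shapes, the field names, and the decideIncomingCall helper are hypothetical, since the real Agent schema in Payload CMS is not shown here. The sketch also assumes the caller has already converted the current time to the client's local day and minute, and it leaves out translating the decision into FreeSWITCH instructions.

```typescript
// Hypothetical shapes: the real Agent schema in the Dashboard is not documented here.
interface BusinessHours {
  openMinute: number;  // minutes after local midnight, e.g. 9 * 60 for 09:00
  closeMinute: number; // minutes after local midnight, exclusive
  days: number[];      // 0 = Sunday ... 6 = Saturday
}

interface AgentConfig {
  id: string;
  telephonyActive: boolean;
  businessHours: BusinessHours;
}

type RejectReason = "unknown_number" | "service_inactive" | "outside_business_hours";

type CallDecision =
  | { action: "bridge"; agentId: string }
  | { action: "reject"; reason: RejectReason };

// True when the given local day and minute fall inside the configured window.
function isWithinBusinessHours(hours: BusinessHours, day: number, minuteOfDay: number): boolean {
  return (
    hours.days.indexOf(day) !== -1 &&
    minuteOfDay >= hours.openMinute &&
    minuteOfDay < hours.closeMinute
  );
}

// Mirrors the order described above: agent resolution, service check, business hours.
// `agent` is undefined when no Agent is associated with the dialed destination_number.
function decideIncomingCall(
  agent: AgentConfig | undefined,
  day: number,
  minuteOfDay: number
): CallDecision {
  if (!agent) {
    return { action: "reject", reason: "unknown_number" };
  }
  if (!agent.telephonyActive) {
    return { action: "reject", reason: "service_inactive" };
  }
  if (!isWithinBusinessHours(agent.businessHours, day, minuteOfDay)) {
    return { action: "reject", reason: "outside_business_hours" };
  }
  return { action: "bridge", agentId: agent.id };
}
```

Keeping the decision separate from the HTTP handler means the validation rules can be unit-tested without a running FreeSWITCH instance.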
2. Real-Time Audio Streaming
Unlike text-based chatbots where we wait for a user to press “Send”, telephony requires continuous, real-time audio processing.
Bridging the Audio (src/lib/audio-streamer.ts)
Once the call is established, FreeSWITCH begins streaming the raw RTP audio packets over a WebSocket connection to the Core API.
The API uses a custom audio streamer buffer to handle this byte stream. The LLM APIs expect a specific audio format (typically 16 kHz, 16-bit mono PCM), while telephony codecs such as mu-law and A-law (G.711) carry 8 kHz audio, so the API often has to transcode and re-chunk the incoming packets on the fly.
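As an illustration of the transcoding step, here is a minimal sketch that decodes G.711 mu-law bytes to 16-bit PCM and doubles the sample rate from 8 kHz to 16 kHz. The function names are hypothetical and are not taken from src/lib/audio-streamer.ts. The upsampler uses naive linear interpolation, chosen here for brevity; a production resampler would normally apply a proper low-pass filter.

```typescript
// Decode one G.711 mu-law byte into a signed 16-bit PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole mu-law frame (one byte per sample).
function decodeMuLaw(frame: Uint8Array): Int16Array {
  const pcm = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    pcm[i] = muLawToPcm16(frame[i]);
  }
  return pcm;
}

// Double the sample rate by inserting the midpoint between neighbouring samples.
// The last sample is repeated because it has no right-hand neighbour.
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    const next = i + 1 < input.length ? input[i + 1] : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1;
  }
  return out;
}

// Convenience wrapper: 8 kHz mu-law bytes in, 16 kHz 16-bit PCM samples out.
function muLawFrameTo16kPcm(frame: Uint8Array): Int16Array {
  return upsample2x(decodeMuLaw(frame));
}
```

A 20 ms mu-law frame at 8 kHz holds 160 bytes, so muLawFrameTo16kPcm would return 320 samples for it.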
Voice Activity Detection (VAD) & Interruption
A critical component of a natural voice agent is knowing when to listen and when to stop talking if the user interrupts.
- Listening: The API pipes the incoming audio chunks directly into the LLM’s WebSocket connection (e.g., using the Gemini Multimodal Live API or OpenAI Realtime API).
- Interruption (Barge-in): If the LLM is currently speaking (playing audio back to the caller) and the user suddenly speaks, the API detects the incoming audio spike, immediately halts the playback buffer, sends a “cancel” signal to the LLM, and begins processing the user’s new input.
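One simple way to detect the "audio spike" described above is an energy threshold over consecutive frames. The sketch below is an assumption about how such a detector could look, not the API's actual implementation: BargeInDetector, PlaybackSession, the threshold value, and the frame count are all hypothetical, and the frames are assumed to be 16-bit PCM.

```typescript
// Root-mean-square energy of a 16-bit PCM frame.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Fires once the caller has been loud for `framesRequired` consecutive frames
// while the agent is speaking. Requiring several frames avoids reacting to clicks.
class BargeInDetector {
  private loudFrames = 0;
  private readonly threshold: number;
  private readonly framesRequired: number;

  constructor(threshold: number = 1500, framesRequired: number = 3) {
    this.threshold = threshold;           // illustrative default, needs tuning
    this.framesRequired = framesRequired; // illustrative default, needs tuning
  }

  process(frame: Int16Array, agentSpeaking: boolean): boolean {
    if (!agentSpeaking) {
      this.loudFrames = 0;
      return false;
    }
    this.loudFrames = rms(frame) >= this.threshold ? this.loudFrames + 1 : 0;
    if (this.loudFrames >= this.framesRequired) {
      this.loudFrames = 0;
      return true;
    }
    return false;
  }
}

// Hypothetical hooks into the playback buffer and the LLM connection.
interface PlaybackSession {
  agentSpeaking: boolean;
  stopPlayback(): void;   // halt the audio currently being played to the caller
  cancelResponse(): void; // tell the LLM to stop generating the current reply
}

// On barge-in: halt playback first, then cancel the LLM response.
function handleCallerFrame(
  detector: BargeInDetector,
  frame: Int16Array,
  session: PlaybackSession
): void {
  if (detector.process(frame, session.agentSpeaking)) {
    session.stopPlayback();
    session.cancelResponse();
    session.agentSpeaking = false;
  }
}
```

The frame itself would still be forwarded to the LLM after this check, so the user's new input is not lost.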
3. Post-Call Processing & Recordings
When the user hangs up, the conversation lifecycle is not over. FreeSWITCH triggers a hangup webhook, which initiates the teardown process:
- Summary Generation: The complete transcript of the call is sent to a secondary, lightweight LLM prompt (src/services/llm/summary/telephony.ts) to generate a concise summary and determine the final call status (e.g., completed, voicemail, failed).
- Recording Upload (src/controllers/telephony/upload-recording.ts): FreeSWITCH automatically records the entire bridged call. Upon hangup, it sends this .wav file to the API. The API processes the file, uploads it to our S3-compatible storage (MinIO/S3), and retrieves the public URL.
- Database Logging: Finally, the API uses GenQL to create a new Call record in the Dashboard’s PostgreSQL database, saving the caller ID, duration, transcript, summary, and S3 recording URL for the client to review.
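The teardown steps can be sketched as two small helpers: one that validates the summary LLM's reply and one that assembles the data saved on the Call record. Everything here is an assumption made for illustration: the JSON reply format, the fallback to failed when the reply cannot be parsed, the field names, and the helper names are hypothetical and do not come from the real GenQL schema. The upload and the GenQL mutation themselves are left out.

```typescript
type CallStatus = "completed" | "voicemail" | "failed";
const CALL_STATUSES: string[] = ["completed", "voicemail", "failed"];

interface CallSummary {
  summary: string;
  status: CallStatus;
}

// Assumes the summary prompt asks the LLM to reply with JSON such as
// {"summary": "...", "status": "completed"}. Anything else is treated as "failed".
function parseSummaryResponse(raw: string): CallSummary {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const { summary, status } = parsed as { summary?: unknown; status?: unknown };
      if (
        typeof summary === "string" &&
        typeof status === "string" &&
        CALL_STATUSES.indexOf(status) !== -1
      ) {
        return { summary, status: status as CallStatus };
      }
    }
  } catch {
    // Not valid JSON: fall through to the default below.
  }
  return { summary: "", status: "failed" };
}

// Hypothetical field names; the real Call collection may differ.
interface CallRecordData {
  callerId: string;
  durationSeconds: number;
  transcript: string;
  summary: string;
  status: CallStatus;
  recordingUrl: string;
}

// Assemble the data saved at the end of the teardown process.
function buildCallRecord(
  callerId: string,
  startedAtMs: number,
  endedAtMs: number,
  transcript: string,
  summary: CallSummary,
  recordingUrl: string
): CallRecordData {
  return {
    callerId,
    durationSeconds: Math.max(0, Math.round((endedAtMs - startedAtMs) / 1000)),
    transcript,
    summary: summary.summary,
    status: summary.status,
    recordingUrl,
  };
}
```

Validating the LLM reply before it reaches the database keeps an unexpected status string from being stored on the Call record.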