Telephony Controller

The Telephony Controller is the most complex subsystem within the Core API. It bridges traditional SIP/VoIP call handling (managed by FreeSWITCH) and real-time AI voice processing (handled by LLMs).

All logic for handling voice interactions is located in src/routes/telephony.ts and the src/controllers/telephony/ directory.

1. The Call Initialization Flow

When a user dials a phone number associated with Rendez-vous.ai, the following sequence occurs:

SIP Trunk to FreeSWITCH: The call arrives at our FreeSWITCH instance via Twilio or VoIP.ms.
Webhook Trigger: FreeSWITCH executes its dialplan and sends an HTTP POST request to the Core API’s /api/telephony/incoming-call endpoint (src/controllers/telephony/incoming-call.ts).
Agent Resolution: The API extracts the destination_number (the number the user dialed) and uses GenQL to query the Dashboard (Payload CMS) for the associated Agent configuration.
Validation: The API checks:

Is the telephony service active for this client?
Are we currently within the configured business hours?

Instruction Response: The API responds to FreeSWITCH with XML instructions (or sends the equivalent commands over the Event Socket), telling it to answer the call and immediately bridge the audio stream to a WebSocket opened by the API.
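The validation step above can be sketched as a small pure function. This is only an illustration: the `AgentConfig` and `BusinessHours` shapes, the `decideIncomingCall` name, and the rejection reasons are hypothetical and are not taken from `incoming-call.ts`. The sketch also assumes the caller has already converted the current time to the agent's local day and minute of day.

```typescript
// Hypothetical shape of one business-hours window (not the real Dashboard schema).
interface BusinessHours {
  day: number;         // 0 = Sunday ... 6 = Saturday
  openMinute: number;  // minutes since local midnight, inclusive
  closeMinute: number; // minutes since local midnight, exclusive
}

// Hypothetical subset of the Agent configuration resolved from destination_number.
interface AgentConfig {
  phoneNumber: string;
  telephonyActive: boolean;
  businessHours: BusinessHours[];
}

type RejectReason = "unknown_number" | "service_inactive" | "outside_business_hours";
type CallDecision = { accept: true } | { accept: false; reason: RejectReason };

// True when the given local day/minute falls inside any configured window.
function isWithinBusinessHours(hours: BusinessHours[], day: number, minuteOfDay: number): boolean {
  return hours.some(
    (h) => h.day === day && minuteOfDay >= h.openMinute && minuteOfDay < h.closeMinute,
  );
}

// Runs the checks in the order described above: agent found, service active, hours open.
function decideIncomingCall(
  agent: AgentConfig | undefined,
  day: number,
  minuteOfDay: number,
): CallDecision {
  if (!agent) return { accept: false, reason: "unknown_number" };
  if (!agent.telephonyActive) return { accept: false, reason: "service_inactive" };
  if (!isWithinBusinessHours(agent.businessHours, day, minuteOfDay)) {
    return { accept: false, reason: "outside_business_hours" };
  }
  return { accept: true };
}
```

Keeping the decision separate from the HTTP handler means the same checks can be unit-tested without a running FreeSWITCH instance; the handler would then translate the decision into the instructions it returns.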

2. Real-Time Audio Streaming

Unlike text-based chatbots, where we wait for the user to press “Send”, telephony requires continuous, real-time audio processing.

Bridging the Audio (`src/lib/audio-streamer.ts`)

Once the call is established, FreeSWITCH begins streaming the call audio, taken from the RTP media stream, over a WebSocket connection to the Core API.

The API uses a custom audio streamer buffer to handle this byte stream. Because LLM voice APIs expect specific audio formats (typically 16 kHz, 16-bit PCM mono), while telephony audio usually arrives as μ-law or A-law (G.711, sampled at 8 kHz), the API often has to transcode, resample, and re-chunk the incoming audio packets on the fly.
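To make the buffering and transcoding concrete, here is a minimal sketch, not the contents of `audio-streamer.ts`. It assumes 8 kHz μ-law input and a 16 kHz PCM target; the `FrameChunker`, `decodeMuLaw`, and `upsample2x` names are hypothetical. The μ-law expansion follows the standard G.711 decoding; the upsampler uses plain linear interpolation as a stand-in for a proper filtered resampler.

```typescript
// Re-chunks arbitrarily sized WebSocket messages into fixed-size frames.
class FrameChunker {
  private pending = new Uint8Array(0);
  private readonly frameBytes: number;

  constructor(frameBytes: number) {
    if (!Number.isInteger(frameBytes) || frameBytes <= 0) {
      throw new Error("frameBytes must be a positive integer");
    }
    this.frameBytes = frameBytes;
  }

  // Appends a chunk and returns every complete frame now available.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.pending.length + chunk.length);
    merged.set(this.pending, 0);
    merged.set(chunk, this.pending.length);
    const frames: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= this.frameBytes) {
      frames.push(merged.slice(offset, offset + this.frameBytes));
      offset += this.frameBytes;
    }
    this.pending = merged.slice(offset); // keep the incomplete tail for the next push
    return frames;
  }
}

// Expands one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function muLawToPcm16(byte: number): number {
  const BIAS = 0x84;
  const u = ~byte & 0xff; // mu-law bytes are transmitted bit-inverted
  const t = (((u & 0x0f) << 3) + BIAS) << ((u & 0x70) >> 4);
  return (u & 0x80) !== 0 ? BIAS - t : t - BIAS;
}

// Decodes a whole mu-law frame (one byte per sample) to PCM16.
function decodeMuLaw(frame: Uint8Array): Int16Array {
  const out = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) out[i] = muLawToPcm16(frame[i]);
  return out;
}

// Doubles the sample rate (e.g. 8 kHz to 16 kHz) by linear interpolation.
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    const next = i + 1 < input.length ? input[i + 1] : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1; // midpoint between neighbours
  }
  return out;
}
```

At 8 kHz with one byte per sample, a 20 ms frame is 160 bytes, so `new FrameChunker(160)` would yield frames that `upsample2x(decodeMuLaw(frame))` turns into 320 PCM samples each. A-law input would need its own decoder, and production code would likely low-pass filter when resampling.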

Voice Activity Detection (VAD) & Interruption

A critical component of a natural voice agent is knowing when to listen and when to stop talking if the user interrupts.

Listening: The API pipes the incoming audio chunks directly into the LLM’s WebSocket connection (e.g., using the Gemini Multimodal Live API or OpenAI Realtime API).
Interruption (Barge-in): If the LLM is currently speaking (playing audio back to the caller) and the user suddenly speaks, the API detects the incoming audio spike, immediately halts the playback buffer, sends a “cancel” signal to the LLM, and begins processing the user’s new input.
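As an illustration of the barge-in logic, the sketch below uses a simple energy threshold on the caller's PCM frames. It is one possible approach under stated assumptions, not the controller's actual implementation: the `BargeInDetector` class, the `stopPlayback` and `cancelResponse` handlers, and the default threshold values are all hypothetical, and a real deployment might instead rely on the voice activity detection offered by the LLM provider.

```typescript
// Root-mean-square amplitude of a PCM16 frame; 0 for an empty frame.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical callbacks: halt local playback and tell the LLM to stop generating.
interface BargeInHandlers {
  stopPlayback: () => void;
  cancelResponse: () => void;
}

class BargeInDetector {
  // Set to true by the caller of this class while agent audio is being played back.
  agentSpeaking = false;
  private loudFrames = 0;
  private readonly handlers: BargeInHandlers;
  private readonly threshold: number;
  private readonly minLoudFrames: number;

  // threshold and minLoudFrames are illustrative defaults, not tuned values.
  constructor(handlers: BargeInHandlers, threshold = 1000, minLoudFrames = 3) {
    this.handlers = handlers;
    this.threshold = threshold;
    this.minLoudFrames = minLoudFrames;
  }

  // Feed every caller frame; returns true on the frame that triggers a barge-in.
  onCallerFrame(frame: Int16Array): boolean {
    if (!this.agentSpeaking) {
      this.loudFrames = 0;
      return false;
    }
    // Require several consecutive loud frames so a single click does not interrupt.
    this.loudFrames = rms(frame) >= this.threshold ? this.loudFrames + 1 : 0;
    if (this.loudFrames < this.minLoudFrames) return false;
    this.handlers.stopPlayback();
    this.handlers.cancelResponse();
    this.agentSpeaking = false;
    this.loudFrames = 0;
    return true;
  }
}
```

Requiring a few consecutive loud frames trades a little latency (60 ms with 20 ms frames and the default of three) for fewer false interruptions caused by line noise or echo of the agent's own voice.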

3. Post-Call Processing & Recordings

When the user hangs up, the conversation lifecycle is not over. FreeSWITCH triggers a hangup webhook, which initiates the teardown process:

Summary Generation: The complete transcript of the call is sent to a secondary, lightweight LLM prompt (src/services/llm/summary/telephony.ts) to generate a concise summary and determine the final call status (e.g., completed, voicemail, failed).
Recording Upload (src/controllers/telephony/upload-recording.ts): FreeSWITCH automatically records the entire bridged call. Upon hangup, it sends this .wav file to the API. The API processes the file, uploads it to our S3-compatible storage (MinIO/S3), and retrieves the public URL.
Database Logging: Finally, the API uses GenQL to create a new Call record in the Dashboard, which persists it in its PostgreSQL database, saving the caller ID, duration, transcript, summary, and S3 recording URL for the client to review.
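The data gathered during teardown can be pictured as a single record assembled just before the GenQL mutation. The sketch below is hypothetical: the `CallRecord` field names, the `HangupInfo` shape, and the `buildCallRecord` helper are assumptions for illustration and do not reflect the Dashboard's actual Call collection schema. Only the status values come from the examples listed above.

```typescript
// Status values taken from the examples above; the real list may be longer.
type CallStatus = "completed" | "voicemail" | "failed";

// Hypothetical output of the summary step.
interface SummaryResult {
  summary: string;
  status: CallStatus;
}

// Hypothetical data available when the hangup webhook fires.
interface HangupInfo {
  callerId: string;
  answeredAt: Date;
  endedAt: Date;
  transcript: string;
}

// Hypothetical shape of the record saved for the client to review.
interface CallRecord {
  callerId: string;
  durationSeconds: number;
  transcript: string;
  summary: string;
  status: CallStatus;
  recordingUrl: string;
}

// Combines hangup data, the LLM summary, and the uploaded recording URL.
function buildCallRecord(
  hangup: HangupInfo,
  summary: SummaryResult,
  recordingUrl: string,
): CallRecord {
  const elapsedMs = hangup.endedAt.getTime() - hangup.answeredAt.getTime();
  return {
    callerId: hangup.callerId,
    // Clamp to zero in case the timestamps arrive out of order.
    durationSeconds: Math.max(0, Math.round(elapsedMs / 1000)),
    transcript: hangup.transcript,
    summary: summary.summary,
    status: summary.status,
    recordingUrl,
  };
}
```

Because the function takes the summary and the recording URL as inputs, it only runs once both the summary step and the upload step have finished, which matches the order described above.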