Overview

Audio Streaming

Audio streaming gives you access to the raw audio stream of your voice calls by transmitting live audio to your backend using WebSockets.

With audio streaming, you can build solutions such as:

  • Voice-based AI applications that respond to incoming support queries
  • Real-time transcription or sentiment analysis for sales outreach
  • IVR systems with conversational support for handling customer requests

Prerequisites

  • If you haven’t worked with Plivo’s voice APIs before, familiarize yourself with Plivo’s voice platform before getting started.
  • You’ll also need to purchase phone numbers through the Plivo console or API before proceeding.

Some of the use cases mentioned above may require integration with third-party software, such as OpenAI (for LLM), Deepgram (for Speech-to-Text), and ElevenLabs (for Text-to-Speech).

WebSockets

Stream Direction

Decide whether you need unidirectional or bidirectional streaming, and set the bidirectional parameter accordingly.

  • Unidirectional: Use a unidirectional stream when audio is sent to your WebSocket server, but the server does not need to send audio back to Plivo.
  • Bidirectional: You should use a bidirectional stream when your WebSocket application needs to both receive and send audio. A common use case is enabling two-way communication between your customer and an AI agent.

Stream your audio

You can stream audio from a voice call using the Stream XML element:

<Response>
  <Stream streamTimeout="30" keepCallAlive="true" bidirectional="true" contentType="audio/x-mulaw;rate=8000" statusCallbackUrl="https://yourdomain.com/callbacks">
    wss://yourstream.websocket.io/audiostream
  </Stream>
</Response>

See our XML reference documentation for complete details.

  • When using audio streaming, ensure that keepCallAlive is set to true. This will prevent calls from being disconnected immediately after connection due to the absence of an XML execution element.
  • For a bidirectional stream, set bidirectional to true.
  • On a bidirectional stream, use the playAudio event to send audio back to Plivo over the WebSocket.
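
If you generate this XML from code rather than serving a static file, a minimal sketch using only the Python standard library might look like the following. The function name build_stream_xml and its parameters are illustrative, not part of any Plivo SDK; the attribute names and values are copied from the sample above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_stream_xml(ws_url: str,
                     bidirectional: bool = True,
                     callback_url: Optional[str] = None) -> str:
    """Return a <Response><Stream> document as a string.

    Hypothetical helper: attribute names mirror the XML sample above.
    """
    response = ET.Element("Response")
    attrs = {
        "streamTimeout": "30",
        # Keeps the call up even though no other XML element follows.
        "keepCallAlive": "true",
        "bidirectional": "true" if bidirectional else "false",
        "contentType": "audio/x-mulaw;rate=8000",
    }
    if callback_url:
        attrs["statusCallbackUrl"] = callback_url
    stream = ET.SubElement(response, "Stream", attrs)
    # The WebSocket URL is the text content of the Stream element.
    stream.text = ws_url
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_stream_xml("wss://yourstream.websocket.io/audiostream"))
```

Building the document with an XML library rather than string concatenation means URLs containing characters such as & are escaped correctly.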

Additional features

During the lifecycle of a stream, Plivo transmits certain events to your backend, which can be used to build additional features:

  • start: Indicates a successful connection between Plivo and your WebSocket server.
  • media: Indicates actual audio chunks being sent to your WebSocket server.
  • stop: Indicates that there is no more audio to be streamed.
  • dtmf: Sends the keys pressed during the live call, which you can use to build AI-powered IVR solutions.

Example media event:

{
  "event": "media",
  "sequenceNumber": "3",
  "media": {
    "track": "outbound",
    "chunk": "1",
    "timestamp": "5",
    "payload": ".."
  },
  "streamId": "MZ18ad3ab5a668481ce02b83e7395059f0",
  "extra_headers": "{X-PH-key1: value1}"
}
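
On the receiving side, each event arrives as a JSON text message that your server can route by its event field. The sketch below is one way to do that with the Python standard library. It assumes the media payload is base64-encoded audio, which the sample does not show because the payload is elided, and handle_message is a hypothetical helper name. Only the fields visible in the sample are read.

```python
import base64
import json


def handle_message(raw: str) -> dict:
    """Parse one JSON message and summarise it by event type."""
    msg = json.loads(raw)
    event = msg.get("event")

    if event == "media":
        media = msg["media"]
        # Assumption: payload is base64-encoded audio bytes.
        audio = base64.b64decode(media["payload"])
        return {
            "event": event,
            "stream_id": msg.get("streamId"),
            # Numeric fields arrive as strings in the sample.
            "chunk": int(media["chunk"]),
            "audio_bytes": len(audio),
        }

    # start, stop, dtmf: passed through untouched, since their
    # field layouts are not shown in this overview.
    return {"event": event}
```

In a real server, the decoded audio bytes would be forwarded to a transcription or AI service instead of only being counted.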

Additionally, you can instruct Plivo to take specific actions:

  • clearAudio: Clears any buffered audio previously sent to Plivo. This is useful in cases where playback has been interrupted.
  • checkPoint: Applicable only for bidirectional streams. When this event is received, Plivo responds with a played event to confirm that all audio up to the checkpoint has been successfully played.
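
The events your server sends to Plivo are also JSON messages. The sketch below shows how such messages could be assembled in Python. The event names come from this page, but the field layout (media.contentType, media.sampleRate, media.payload, streamId, name) is an assumption for illustration, as are the helper names; check the exact spelling and fields against the audio streaming reference before relying on them.

```python
import base64
import json


def play_audio(audio: bytes,
               content_type: str = "audio/x-mulaw",
               sample_rate: int = 8000) -> str:
    """Build a playAudio message (assumed field layout)."""
    return json.dumps({
        "event": "playAudio",
        "media": {
            "contentType": content_type,
            "sampleRate": sample_rate,
            # Assumption: audio is sent base64-encoded.
            "payload": base64.b64encode(audio).decode("ascii"),
        },
    })


def clear_audio(stream_id: str) -> str:
    """Build a clearAudio message (assumed field layout)."""
    return json.dumps({"event": "clearAudio", "streamId": stream_id})


def checkpoint(stream_id: str, name: str) -> str:
    """Build a checkPoint message (assumed field layout)."""
    return json.dumps({
        "event": "checkPoint",
        "streamId": stream_id,
        "name": name,
    })
```

A typical pattern is to send a checkpoint after the last playAudio message of a reply, and to send clearAudio when the caller starts speaking over the bot.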

Sample applications

We have created demo applications to help you get started with integration for the following use cases:

  • Voice bots: A demo application for building a conversational AI bot using Plivo’s audio streaming feature, Deepgram (for Speech-to-Text), OpenAI (for LLM), and ElevenLabs (for Text-to-Speech). You can substitute alternative tools for any of these components.
  • Real-time transcription: A demo application for creating a live transcription solution using Plivo’s audio streaming and Deepgram (for Speech-to-Text).

Pricing

Audio streaming is priced at $0.003 per minute, per stream, in addition to the standard charges for voice minutes associated with the call.
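
As a rough illustration of the per-minute, per-stream arithmetic, the sketch below estimates the streaming surcharge only. It does not include voice-minute charges, and it applies no rounding, since how partial minutes are billed is not stated here. The function name is hypothetical.

```python
from decimal import Decimal

# USD per minute, per stream, from the pricing stated above.
STREAM_RATE_PER_MINUTE = Decimal("0.003")


def streaming_cost(minutes, streams: int = 1) -> Decimal:
    """Estimate the audio streaming surcharge in USD.

    Excludes standard voice-minute charges; no rounding applied.
    """
    return STREAM_RATE_PER_MINUTE * Decimal(str(minutes)) * streams
```

For example, a 10-minute call with two streams adds 0.003 × 10 × 2 = $0.06 on top of the voice charges.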