🐍 Answer the phone! with Python

On kevincoder.co.za, I write about my journey as a developer working across Django, Go, and everything in between, from large-scale systems to small, useful tools. Programming has been my passion for over 15 years now. I love learning new skills and am thrilled I can share these with you! A big part of my career was built on knowledge shared by others; in the main, open-source projects, forums, and communities like Stack Overflow. This blog is my way of contributing back and sharing what I’ve learned along the way.

Python is a powerful language that can do many things. Most people use it for machine learning or web development, but did you know that it can also interact with hardware and other services like SIP?

💡 Check out my newest article on how to build a WebRTC phone in Python here. With WebRTC, you no longer need PyVoIP or a SIP provider; you can accept calls directly from the browser.

In this article, I am going to walk you through a step-by-step guide on how to build a basic SIP phone and connect that to AI, so that you can have a 2-way conversation with any LLM.

What is SIP Anyway?

Similar to how we have HTTP for the web, voice-over-internet systems usually run on a protocol named SIP, which provides guidelines on how to establish, modify, and terminate sessions over the network.

A SIP session can then carry voice, text, or even video. In the case of our application, SIP is just a signaling protocol and, therefore, is responsible for connecting and disconnecting the call to our Python script.

Once the call is answered and established, we then use the "RTP" or Real-time Transport Protocol to handle the audio stream.
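To make the RTP part a little more concrete, here is a minimal sketch of decoding the fixed 12-byte header that sits in front of the audio in every RTP packet. PyVoIP does this for you, so the script in this article does not need it. The function name "parse_rtp_header" and the hand-built sample packet are illustrative assumptions, and the sketch assumes a packet with no CSRC entries or header extension.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header described in RFC 3550."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,            # top two bits, always 2 for RTP
        "marker": bool(second & 0x80),    # top bit of the second byte
        "payload_type": second & 0x7F,    # 0 = PCMU (mu-law), 8 = PCMA (A-law)
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        # Assumes no CSRC list and no header extension follow the header.
        "payload": packet[12:],
    }

# A hand-built packet: version 2, payload type 0, 160 bytes of mu-law audio.
sample_packet = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234) + b"\xff" * 160
header = parse_rtp_header(sample_packet)
print(header["version"], header["payload_type"], len(header["payload"]))  # 2 0 160
```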

Thankfully, PyVoIP takes care of all the SIP and streaming mechanics, so we don't have to worry too much about how SIP or RTP works.

Let's build something cool!

💡 Are you new to Python? Learn Python with LearnPython.com courses. Fun interactive courses that are based on real-life business scenarios, meaning you’ll be writing Python code and seeing the results instantly. No need to install Python or other tools on your device, everything happens through your favorite web browser (Sponsored content).

In this guide, I will show you how to build a simple phone answering service with Python.

The script will do the following:

  1. Register as a SIP VOIP phone and wait for calls.

  2. Accept incoming calls.

  3. Transcribe the audio using OpenAI's Whisper.

Installing pip packages

We are going to need a few pip packages as follows:

pip install pyVoIP
pip install pywav
pip install openai

Be sure to also add your OpenAI key to your environment. In bash, you can do this as follows:

nano ~/.bashrc

# Add to the end of the file
export OPENAI_API_KEY="sk-xxx"

You will need to restart your terminal (or run "source ~/.bashrc") for this to take effect.
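If you want to confirm that Python can actually see the key before starting the phone, a small check like the following can help. This is only a sketch; the helper name "has_openai_key" is hypothetical and not part of the OpenAI SDK.

```python
import os

def has_openai_key(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before starting the phone.")
```

The OpenAI client reads this variable when it is created without an explicit key, so checking it up front gives a clearer error than a failed request in the middle of a call.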

Setting up a VOIP virtual phone

PyVoIP is a nifty little library that can easily help you set up a virtual phone with just a few lines of code.

ℹ️ You probably want to use something like Twilio instead for a real-world application. PyVoIP audio quality isn't the best and needs quite a bit of modification to work correctly.

To get started, let's set up a basic phone:

from pyVoIP.VoIP import VoIPPhone, CallState

def answer(call):
    # pyVoIP calls this function whenever an incoming call arrives.
    try:
        call.answer()

    except Exception as e:
        print(e)
    finally:
        call.hangup()

# Replace the placeholders with the details from your SIP provider.
vp = VoIPPhone(
    'sip domain', 5060, 'sipuser',
    'sippassword', callCallback=answer
)
vp.start()
print(vp._status)
input("Press Enter to exit the VOIP phone session.")
vp.stop()

In this example, we create a virtual phone using the "VoIPPhone" class. This class takes in a few arguments as follows:

  • SIP Credentials: When you purchase a SIP account from a VOIP provider, you should receive a username, a password, and an IP or domain name that will be connected to a phone number (3CX.com is an example of a SIP provider). The second argument, 5060, is the standard SIP port.

  • callCallback: This is the function that will handle answering the phone call.

The callback function receives one argument, the "call" object, which contains the relevant information about the caller and provides methods to answer the call, read the caller's audio, and send audio back to the caller.

ℹ️ Did you know that you can build your own VOIP server as well? Asterisk is a powerful open-source VOIP server that you can use to set up your own SIP accounts, phone numbers, extensions, and so forth.

Transcribing audio

To convert audio into text, we can use OpenAI's Whisper service. Here's a simple example of how to convert our audio into text:

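import os
import uuid

from openai import OpenAI
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the raw chunks and write them out as a WAV file:
    # 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law).
    data_bytes = b"".join(audio)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # Make sure the temp directory exists before writing to it.
    os.makedirs("/tmp/audio", exist_ok=True)
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it goes inside the try.
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)

    return ""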

The "transcribe_to_text" function takes in a list of raw audio byte chunks (despite its name, the "audio_file" parameter is that list). We need to convert those chunks into an actual audio file because the OpenAI SDK expects a file object, not raw audio bytes.

We therefore use "pywav" in our "convert_to_wav" function to convert the raw audio bytes into a ".wav" audio file.

⚠️ This logic is simplified so that it's easier to understand. It can be optimized to remove the need for a temp file, since disk IO on a slow drive might cause issues.
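One possible way to skip the temp file is to build the WAV in memory, sketched below. It uses Python's built-in "wave" module instead of "pywav", so it writes plain 8-bit PCM rather than the mu-law format used above; treat that format choice as an assumption to verify against your own audio. The function name "chunks_to_wav_bytes" is hypothetical.

```python
import io
import wave

def chunks_to_wav_bytes(chunks, sample_rate=8000):
    """Pack 8-bit mono audio chunks into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(1)          # 1 byte (8 bits) per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(chunks))
    return buffer.getvalue()

# 50 chunks of 160 bytes each is one second of audio at 8000 Hz.
wav_bytes = chunks_to_wav_bytes([b"\x80" * 160] * 50)
print(wav_bytes[:4])  # b'RIFF'
```

The OpenAI Python SDK should accept these bytes as a "(filename, bytes)" tuple for the "file" argument, for example "file=("audio.wav", wav_bytes)", but check this against the SDK version you have installed.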

Updating our answer method to chunk the audio

In our "answer" method, we read the audio from the call in small chunks of roughly 20ms each. We cannot send a 20ms chunk of audio to Whisper because the minimum length is 100ms.

Thus, we append the audio to a buffer and only send it to Whisper once we have collected 1000ms (1 second) of audio.
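The arithmetic behind those numbers can be sketched as follows. It assumes 8-bit samples at 8000 Hz and a chunk size of 160 bytes; both are assumptions about the audio coming out of "read_audio", so adjust them if your setup differs.

```python
SAMPLE_RATE_HZ = 8000   # samples per second
BYTES_PER_SAMPLE = 1    # 8-bit audio

def duration_ms(num_bytes):
    """Convert a number of raw audio bytes into milliseconds of audio."""
    return num_bytes * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)

chunk_ms = duration_ms(160)
chunks_per_second = int(1000 / chunk_ms)

print(chunk_ms)            # 20.0
print(chunks_per_second)   # 50
```

In other words, 8 bytes equal 1ms of audio, which is where the division by 8 in the "answer" function comes from.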

Here is the updated "answer" function:

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Always keep the chunk so that no audio is dropped
            buffer.append(audio)
            # At 8000 Hz with 1 byte per sample, 8 bytes equal 1ms of audio
            buff_length += len(audio) / 8 # or simply 20

            # Once we have a full second of audio, transcribe it and reset
            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

💡 You can also send back audio to the caller by calling "call.write_audio(raw_audio_bytes_here)"
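As a rough sketch of what those raw bytes could look like, the helper below generates a one-second test tone as 8-bit unsigned mono samples at 8000 Hz. The function name "make_tone" and its parameters are hypothetical, and the 8-bit unsigned format is an assumption about what "write_audio" expects, so check it against the PyVoIP documentation.

```python
import math

def make_tone(frequency_hz=440, duration_s=1.0, sample_rate=8000, volume=0.5):
    """Generate a sine tone as 8-bit unsigned mono samples."""
    total_samples = round(duration_s * sample_rate)
    return bytes(
        # Samples are centred on 128, which is silence for unsigned 8-bit audio.
        int(128 + 127 * volume * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(total_samples)
    )

tone = make_tone()
print(len(tone))  # 8000

# Inside the "answer" callback, after call.answer():
# call.write_audio(tone)
```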

The full code

import os
import uuid

from openai import OpenAI
from pyVoIP.VoIP import VoIPPhone, CallState
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the raw chunks and write them out as a WAV file:
    # 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law).
    data_bytes = b"".join(audio)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # Make sure the temp directory exists before writing to it.
    os.makedirs("/tmp/audio", exist_ok=True)
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it goes inside the try.
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)
    return ""

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Always keep the chunk so that no audio is dropped
            buffer.append(audio)
            # At 8000 Hz with 1 byte per sample, 8 bytes equal 1ms of audio
            buff_length += len(audio) / 8

            # Once we have a full second of audio, transcribe it and reset
            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

vp = VoIPPhone('xxx', 5060, 'xxx', 'xxx', callCallback=answer)
vp.start()
print(vp._status)
input("Press Enter to exit")
vp.stop()

Conclusion

There you have it! A simple phone answering system that can stream live audio and transcribe that audio into text.

As mentioned earlier, PyVoIP is not the best tool for the job since it doesn't handle background noise and static very well. You would need to write some logic to strip out the bad audio samples first before transcribing for an actual real-world application, but hopefully, this is a good start.
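As a starting point for that kind of filtering, here is a very simple energy-based check that skips chunks which are (almost) silent. It is only a sketch, assuming 8-bit unsigned samples centred on 128; the function names and the threshold value are hypothetical and would need tuning on real call audio. It does not remove static or background noise on its own.

```python
import math

def rms_level(chunk: bytes) -> float:
    """Root-mean-square level of 8-bit unsigned audio, measured around 128."""
    if not chunk:
        return 0.0
    return math.sqrt(sum((sample - 128) ** 2 for sample in chunk) / len(chunk))

def is_probably_silence(chunk: bytes, threshold: float = 2.0) -> bool:
    """Treat a chunk as silence when its level falls below the threshold."""
    return rms_level(chunk) < threshold

quiet_chunk = b"\x80" * 160            # every sample sits at the midpoint
loud_chunk = bytes([28, 228] * 80)     # swings 100 steps either side of 128

print(is_probably_silence(quiet_chunk))  # True
print(is_probably_silence(loud_chunk))   # False
```

In the "answer" loop, you could skip appending chunks for which "is_probably_silence" returns True, so that Whisper only receives audio that contains some signal.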

John (1y ago)

Amazing article! Thanks a lot for the insights.

I am currently contemplating building an in-house client for SIP virtual phone numbers.

We are building an AI conversational agent on top of OpenAI realtime audio, and after diving into how VOIP works I realized it was not difficult to cut out Twilio. Since it would represent most of our costs in the future, it makes sense to remove it.

My issue is that I am not sure if the client alone would be a good enough replacement. In your article you mention audio quality, background noise, silence being handled properly by Twilio.

I can't find relevant information online on the behind-the-scenes added value of using Twilio as a SIP client.

The worst would be to build the client ourselves (which is not difficult) and underestimate what really goes on in the backend at Twilio.

Do you have any insights or resources on the matter?

Your help would be greatly appreciated!

Happy new year

Kevin Naidoo

Hi John, Happy 2025, hope you have an awesome year ahead :-)

Thanks for reading, and glad this was useful to you.

Twilio is not a SIP client; they provide a managed VOIP solution. Basically, you can buy telephone numbers from them and programmatically connect code to a real-time call.

With the media streams, they fork the audio from the real-time call and stream it to your WebSocket server.

The WebSocket server then receives the audio and you can transcribe or pass that on to a real-time model and then send the audio back to the client.

You can achieve the same by setting up Asterisk, but that is a bit more complicated and a pain to manage.

One problem with Twilio is that the audio is streamed at 8 kHz and is base64 encoded, so there is a conversion step that might slow down your call. Here's some documentation on the Twilio approach: https://www.twilio.com/docs/voice/tutorials/consume-real-time-media-stream-using-websockets-python-and-flask#create-a-socket-decorator
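As a rough illustration of that conversion step, the sketch below decodes one Media Streams "media" message: it base64-decodes the payload and expands each 8-bit mu-law byte into a 16-bit PCM sample. The function names are hypothetical, the example message is hand-built rather than captured from Twilio, and a real handler would also deal with the other event types.

```python
import base64
import json

def ulaw_byte_to_pcm16(value: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit PCM sample."""
    value = ~value & 0xFF
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def decode_media_message(message: str) -> list[int]:
    """Return the PCM samples carried by a 'media' event, or an empty list."""
    event = json.loads(message)
    if event.get("event") != "media":
        return []
    ulaw_bytes = base64.b64decode(event["media"]["payload"])
    return [ulaw_byte_to_pcm16(byte) for byte in ulaw_bytes]

# Hand-built example message with three mu-law bytes.
example = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff\x00\x80").decode()},
})
print(decode_media_message(example))  # [0, -32124, 32124]
```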

A better approach would be WebRTC (I built a simple client here: https://github.com/kevincoder-co-za/zazu-voiceai), or you could just build a SIP WebSocket server and use a provider like 3CX.

I did play with a few libraries, before going with Twilio (it was an MVP so shipping fast was essential). Maybe they'll be of use:

  • https://github.com/sipsorcery-org/sipsorcery

  • https://www.mizu-voip.com/Software/SIPSDK/JavaSIPSDK.aspx

  • https://www.pjsip.org/

  • https://github.com/emiago/sipgo

  • https://www.linphone.org/en/voip-unified-communications-software

Mizu worked nicely; I just didn't have much time to fully implement the SDK, and it's a commercial product, so there are licensing costs.

Hope this helps.

Zaid Aman (1y ago)

Hello, I just have one question. Were you able to play audio files like MP3 and WAV over the phone? I have a real problem with that; plus, it is in Europe, so it supports PCMA (A-law).

Kevin Naidoo

Hi Zaid, thanks for the question. Yes, you can, but you need to format the audio into a WAV and ensure that the sample rate is 8000 Hz (mono channel).

Then you can send that WAV file back to the user:

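from pydub import AudioSegment
import wave

# Convert the MP3 into an 8000 Hz mono WAV file.
audio = AudioSegment.from_mp3("your_file.mp3")
audio = audio.set_frame_rate(8000).set_channels(1)
# PyVoIP's documentation asks for 8-bit audio, so set 1 byte per sample too.
audio = audio.set_sample_width(1)
audio.export("output.wav", format="wav")

# Read the raw frames back and send them to the caller.
f = wave.open('output.wav', 'rb')
frames = f.getnframes()
data = f.readframes(frames)
f.close()

call.write_audio(data)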

I suggest looking into my web socket tutorial over at: https://kevincoder.co.za/how-i-used-voice-ai-to-bring-imaginary-characters-to-life

You don't need to use WebRTC if you need the phone system, you can use Twilio media streams or forward the call from a PBX server like Asterisk.

Hope this helps.

Zaid Aman (1y ago)

Kevin Naidoo, thank you for the response. Actually, I have tried that before and it makes the audio very noisy (painful for the ear) and unbearable. I got slightly better audio by processing the raw audio into G.711 (A-law) format, but the RTP couldn't send the packets or the stream in a non-choppy manner. I have a perfectly working pipeline for voice conversation on a laptop, but not over the phone.

Kevin Naidoo

I see. Yeah, PyVoIP is more for educational purposes; it's not efficient for production. You should probably use pjsua/pjsip or Twilio media streams for your use case.
