Kevin Coder | tutorials, thought experiments & tech ramblings

🐍 Answer the phone! with Python

Updated April 23, 2025

On kevincoder.co.za, I write about my journey as a developer working across Django, Go, and everything in between, from large-scale systems to small, useful tools. Programming has been my passion for over 15 years now. I love learning new skills and am thrilled I can share these with you! A big part of my career was built on knowledge shared by others; in the main, open-source projects, forums, and communities like Stack Overflow. This blog is my way of contributing back and sharing what I’ve learned along the way.

Python is a powerful language that can do many things. Most people use it for machine learning or web development, but did you know that it can also interact with hardware and speak telephony protocols like SIP?

💡 Check out my newest article on how to build a WebRTC phone in Python. With WebRTC, you no longer need PyVoIP or a SIP provider; you can accept calls directly from the browser.

In this article, I am going to walk you through a step-by-step guide on how to build a basic SIP phone and connect that to AI, so that you can have a 2-way conversation with any LLM.

What is SIP Anyway?

Similar to how we have HTTP for the web, voice-over-internet systems usually run on a protocol named SIP, which provides guidelines on how to establish, modify, and terminate sessions over the network.

A SIP session can then carry voice, text, or even video. In the case of our application, SIP is just a signaling protocol and, therefore, is responsible for connecting and disconnecting the call to our Python script.
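To make the HTTP comparison concrete, here is a minimal sketch of what a SIP request looks like on the wire and how its text-based layout can be split apart. The "parse_sip_request" helper and the example addresses are made up for illustration; PyVoIP does all of this for you, so this is only meant to show the shape of the protocol.

```python
def parse_sip_request(raw):
    """Split a raw SIP request into its request line, headers, and body."""
    # Like HTTP, the head and the body are separated by an empty line
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")

    # The first line holds the method, the target URI, and the protocol version
    method, uri, version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return {
        "method": method,
        "uri": uri,
        "version": version,
        "headers": headers,
        "body": body,
    }

# A hand-written INVITE using placeholder addresses (not captured from a real call)
EXAMPLE_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:alice@example.com>;tag=1928301774\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.10\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

parsed = parse_sip_request(EXAMPLE_INVITE)
print(parsed["method"], parsed["uri"])
```

An INVITE is the request that starts a call, which is why an incoming call shows up on our side as something to "answer".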

Once the call is answered and established, we then use the "RTP" or Real-time Transport Protocol to handle the audio stream.

Thankfully, PyVoIP takes care of all the SIP and streaming mechanics, so we don't have to worry too much about how SIP or RTP works.
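If you are curious what the library is handling on the audio side, here is a rough sketch of reading the fixed 12-byte header at the start of an RTP packet. The "parse_rtp_header" helper is hypothetical, and it assumes a plain packet with no CSRC entries or header extension, which is the simple case for a basic phone call.

```python
import struct

# Two of the static payload types you will typically see on phone calls
PAYLOAD_TYPES = {0: "PCMU (G.711 mu-law)", 8: "PCMA (G.711 A-law)"}

def parse_rtp_header(packet):
    """Decode the fixed RTP header and return it with the audio payload."""
    if len(packet) < 12:
        raise ValueError("An RTP packet needs at least a 12-byte header")

    # Network byte order: two single bytes, a 16-bit sequence number,
    # a 32-bit timestamp, and a 32-bit stream identifier (SSRC)
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])

    return {
        "version": first >> 6,           # top two bits, always 2 for RTP
        "marker": bool(second >> 7),     # top bit of the second byte
        "payload_type": second & 0x7F,   # lower seven bits
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        # Assumes no CSRC list and no extension, so audio starts at byte 12
        "payload": packet[12:],
    }

# Build a fake packet carrying 160 bytes of audio, then read it back
fake_packet = struct.pack("!BBHII", 0x80, 0, 1, 160, 0xDEADBEEF) + b"\xff" * 160
header = parse_rtp_header(fake_packet)
print(header["version"], PAYLOAD_TYPES[header["payload_type"]])
```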

Let's build something cool!?

💡 Are you new to Python? Learn Python with LearnPython.com courses. Fun interactive courses that are based on real-life business scenarios, meaning you’ll be writing Python code and seeing the results instantly. No need to install Python or other tools on your device, everything happens through your favorite web browser (Sponsored content).

In this guide, I will show you how to build a simple phone answering service with Python.

The script will do the following:

Register as a SIP VOIP phone and wait for calls.
Accept incoming calls.
Transcribe the audio using OpenAI's Whisper.

Installing pip packages

We are going to need a few pip packages:

pip install pyVoIP
pip install pywav
pip install openai

Be sure to also add your OpenAI key to your environment. In bash, you can do this as follows:

nano ~/.bashrc

# Add to the end of the file
export OPENAI_API_KEY="sk-xxx"

You will need to restart your terminal for this to take effect.
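Since a missing key only shows up later as a failed API call, it can help to check for it when the script starts. This is a small sketch; the "get_openai_key" helper is my own name and not part of the OpenAI SDK.

```python
import os

def get_openai_key(env=os.environ):
    """Return the OpenAI key from the environment or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to ~/.bashrc and restart your terminal."
        )
    return key
```

Calling "get_openai_key()" at the top of the script fails fast, before the phone registers and starts accepting calls.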

Setting up a VOIP virtual phone

PyVoIP is a nifty little library that can easily help you set up a virtual phone with just a few lines of code.

ℹ️ You probably want to use something like Twilio instead for a real-world application. PyVoIP audio quality isn't the best and needs quite a bit of modification to work correctly.

To get started, let's set up a basic phone:

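from pyVoIP.VoIP import VoIPPhone, CallState

def answer(call):
    try:
        # Pick up the incoming call
        call.answer()

    except Exception as e:
        print(e)
    finally:
        # Always release the line, even if answering failed
        call.hangup()

# Replace the placeholders with the details from your SIP provider
vp = VoIPPhone(
    'sip domain', 5060, 'sipuser',
    'sippassword', callCallback=answer
)
vp.start()
print(vp._status)
# input() only returns once Enter is pressed
input("Press Enter to exit the VOIP phone session.")
vp.stop()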

In this example, we create a virtual phone using the "VoIPPhone" class. This class takes in a few arguments as follows:

SIP Credentials: When you purchase a SIP account from a VOIP provider, you should receive a username, password, and an IP or domain name that will be connected to a phone number. (3cx.com is an example of a SIP provider).
callCallback: This is the function that will handle answering the phone call.

The callback function will receive one argument, i.e. the "call" object which will contain all the relevant information relating to the caller and provide various methods for you to accept and receive or send audio back to the caller.
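If you don't have a SIP account yet, you can still exercise the callback logic with a stand-in object. The "FakeCall" and "FakeCallState" classes below are hypothetical and only mimic the two methods our callback uses; they are not part of PyVoIP.

```python
from enum import Enum

class FakeCallState(Enum):
    RINGING = "RINGING"
    ANSWERED = "ANSWERED"
    ENDED = "ENDED"

class FakeCall:
    """Hypothetical stand-in for the call object passed to the callback."""

    def __init__(self):
        self.state = FakeCallState.RINGING
        self.log = []

    def answer(self):
        self.state = FakeCallState.ANSWERED
        self.log.append("answer")

    def hangup(self):
        self.state = FakeCallState.ENDED
        self.log.append("hangup")

def answer(call):
    # Same flow as the callback above: pick up, then always hang up
    try:
        call.answer()
    except Exception as e:
        print(e)
    finally:
        call.hangup()

call = FakeCall()
answer(call)
print(call.log)  # ['answer', 'hangup']
```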

ℹ️ Did you know that you can build your own VOIP server as well? Asterisk is a powerful open-source VOIP server that you can use to set up your own SIP accounts, phone numbers, extensions, and so forth.

Transcribing audio

To convert audio into text, we can use OpenAI's Whisper service. Here's a simple example of how to convert our audio into text:

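import uuid

from openai import OpenAI
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the buffered chunks into a single byte string
    data_bytes = b"".join(audio)
    # 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # NOTE: the /tmp/audio directory must already exist
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it belongs inside the try block
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)

    return ""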

The "transcribe_to_text" function takes in a list of raw audio byte chunks. We then need to convert those chunks into an actual audio file because the OpenAI SDK is expecting a file object, not raw audio bytes.

We therefore use "pywav" in our "convert_to_wav" function to convert the raw audio bytes into a ".wav" audio file.

⚠️ This logic is simplified so that it's easier to understand, but essentially it can be optimized to remove the need for saving to a temp file since disk IO on a slow drive might cause issues.
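As a rough idea of what that optimization could look like, here is a sketch that builds the WAV in memory with Python's built-in "wave" module instead of pywav. Note the swap: "wave" only writes plain PCM, so this assumes the buffered chunks are 8-bit linear PCM at 8000 Hz, which you should verify against what your phone library actually returns. The "chunks_to_wav_buffer" name is my own.

```python
import io
import wave

def chunks_to_wav_buffer(chunks, sample_rate=8000):
    """Wrap raw 8-bit mono audio chunks in a WAV container held in memory."""
    raw = b"".join(chunks)
    buf = io.BytesIO()

    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(1)        # 1 byte per sample (8-bit PCM, an assumption)
        wav.setframerate(sample_rate)
        wav.writeframes(raw)

    # Rewind so the next reader starts at the WAV header
    buf.seek(0)
    # Give the buffer a file name so consumers can infer the format from it
    buf.name = "audio.wav"
    return buf
```

The returned buffer behaves like an open file, so in principle it can be handed to the transcription call in place of the temp file. I have not benchmarked this, so treat it as a starting point.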

Updating our answer method to chunk the audio

In our "answer" method we receive the audio as a continuous stream of bytes, and each chunk returned by "read_audio" holds about 20ms of audio (160 bytes at 8000 Hz, one byte per sample). We cannot send a 20ms chunk of audio to Whisper because the minimum length is 100ms.

Thus, we need to append the audio to a buffer and we'll only send the audio to Whisper once we reach 1000ms (or 1 second).
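The arithmetic behind those numbers is easy to get wrong, so here it is spelled out. This sketch assumes one byte per sample and 160-byte chunks, matching the 20ms figure above; both helper names are my own.

```python
import math

def chunk_duration_ms(num_bytes, sample_rate=8000, bytes_per_sample=1):
    """Duration of a raw audio chunk in milliseconds."""
    # bytes / (bytes per second) gives seconds, multiplied by 1000 for milliseconds
    return num_bytes * 1000 / (sample_rate * bytes_per_sample)

def chunks_needed(target_ms, chunk_bytes=160, sample_rate=8000):
    """How many chunks must be buffered to cover at least target_ms of audio."""
    return math.ceil(target_ms / chunk_duration_ms(chunk_bytes, sample_rate))

print(chunk_duration_ms(160))  # 20.0
print(chunks_needed(1000))     # 50
```

So one second of audio is 50 chunks, and Whisper's 100ms minimum would already be met after 5.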

Here is the updated "answer" function:

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Always keep the chunk, otherwise we lose 20ms every time we flush
            buffer.append(audio)
            # 8000 samples per second at 1 byte per sample is 8 bytes per millisecond
            buff_length += len(audio) / 8 # or simply 20

            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

💡 You can also send back audio to the caller by calling "call.write_audio(raw_audio_bytes_here)"
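To have something to send back without recording a file first, you can generate a short test tone. This sketch assumes the phone expects 8-bit unsigned mono audio at 8000 Hz, the same format we have been buffering; "make_tone" is a made-up helper name.

```python
import math

def make_tone(frequency_hz=440, duration_ms=1000, sample_rate=8000):
    """Generate a sine tone as 8-bit unsigned mono samples."""
    total_samples = sample_rate * duration_ms // 1000
    return bytes(
        # 128 is silence for unsigned 8-bit audio; swing 100 steps either side
        int(128 + 100 * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(total_samples)
    )

tone = make_tone()
print(len(tone))  # 8000

# Inside the answer callback you could then try:
# call.write_audio(tone)
```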

The full code

from pyVoIP.VoIP import VoIPPhone, CallState
import uuid
from openai import OpenAI
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the buffered chunks into a single byte string
    data_bytes = b"".join(audio)
    # 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # NOTE: the /tmp/audio directory must already exist
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it belongs inside the try block
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)
    return ""

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Always keep the chunk, otherwise we lose 20ms every time we flush
            buffer.append(audio)
            # 8000 samples per second at 1 byte per sample is 8 bytes per millisecond
            buff_length += len(audio) / 8

            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

vp = VoIPPhone('xxx', 5060, 'xxx', 'xxx', callCallback=answer)
vp.start()
print(vp._status)
input("Press Enter to exit")
vp.stop()

Conclusion

There you have it! A simple phone answering system that can stream live audio and transcribe that audio into text.

As mentioned earlier, PyVoIP is not the best tool for the job since it doesn't handle background noise and static very well. You would need to write some logic to strip out the bad audio samples first before transcribing for an actual real-world application, but hopefully, this is a good start.
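As a very rough starting point for that cleanup, here is a sketch of an energy gate that drops chunks that are close to silent before they reach the transcriber. It assumes 8-bit unsigned audio (silence sits at 128), and the threshold is an arbitrary value you would need to tune; real noise suppression is a much bigger topic. Both function names are my own.

```python
import math

def rms_level(chunk):
    """Root-mean-square loudness of an 8-bit unsigned audio chunk."""
    if not chunk:
        return 0.0
    # Re-center each sample around zero before squaring
    return math.sqrt(sum((b - 128) ** 2 for b in chunk) / len(chunk))

def drop_quiet_chunks(chunks, threshold=4.0):
    """Keep only the chunks whose loudness reaches the threshold."""
    return [c for c in chunks if rms_level(c) >= threshold]
```

You could call "drop_quiet_chunks(buffer)" just before "transcribe_to_text(buffer)" and skip the transcription entirely when nothing is left.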

#machine-learning #python #sip #voip #ai

Comments (2)


John 1y ago

Amazing article! Thanks a lot for the insights.

I am currently contemplating building an in-house client for SIP virtual phone numbers.

We are building an AI conversational agent on top of OpenAI realtime audio, and after diving into how VOIP works I realized it was not difficult to cut out Twilio. Since it would represent most of our costs in the future, it makes sense to remove it.

My issue is that I am not sure if the client alone would be a good enough replacement. In your article you mention audio quality, background noise, silence being handled properly by Twilio.

I can't find relevant information online on the behind-the-scenes added value of using Twilio as a SIP client.

The worst would be to build the client ourselves (which is not difficult) and underestimate what really goes on in the backend at Twilio.

Do you have any insights or resources on the matter?

Your help would be greatly appreciated!

Happy new year

Kevin Naidoo 1y ago

Hi John, Happy 2025, hope you have an awesome year ahead :-)

Thanks for reading, and glad this was useful to you.

Twilio is not a SIP client. They provide a managed VOIP solution, so basically, you can buy telephone numbers from them and programmatically connect code to a real-time call.

With the media streams, they fork the audio from the real-time call and stream it to your WebSocket server.

The WebSocket server then receives the audio and you can transcribe or pass that on to a real-time model and then send the audio back to the client.

You can achieve the same by setting up Asterisk, but that is a bit more complicated and a pain to manage.

One problem with Twilio is that the audio is streamed at 8 kHz and is base64 encoded, so there is a conversion step that might slow down your call. Here's some documentation on the Twilio approach: https://www.twilio.com/docs/voice/tutorials/consume-real-time-media-stream-using-websockets-python-and-flask#create-a-socket-decorator

A better approach would be WebRTC (I built a simple client here: https://github.com/kevincoder-co-za/zazu-voiceai) or just build a SIP WebSocket server and use a provider like 3cx.

I did play with a few libraries before going with Twilio (it was an MVP so shipping fast was essential). Maybe they'll be of use:

https://github.com/sipsorcery-org/sipsorcery
https://www.mizu-voip.com/Software/SIPSDK/JavaSIPSDK.aspx
https://www.pjsip.org/
https://github.com/emiago/sipgo
https://www.linphone.org/en/voip-unified-communications-software

Mizu worked nicely, I just didn't have much time to fully implement the SDK and it's a commercial product so there are licensing costs.

Hope this helps.

Zaid Aman 1y ago

Hello, I just have one question. Were you able to play audio files like mp3 and wav over the phone? I have a real problem with that, plus it is in Europe so it supports PCMA (alaw).

Kevin Naidoo 1y ago

Hi Zaid, thanks for the question. Yes, you can, but you need to convert the audio to WAV and ensure that the sample rate is 8000 Hz (mono channel).

Then you can send that WAV file back to the user:

import wave
from pydub import AudioSegment

audio = AudioSegment.from_mp3("your_file.mp3")
audio = audio.set_frame_rate(8000).set_channels(1)
audio.export("output.wav", format="wav")

f = wave.open('output.wav', 'rb')
frames = f.getnframes()
data = f.readframes(frames)
f.close()

call.write_audio(data)

I suggest looking into my web socket tutorial over at: https://kevincoder.co.za/how-i-used-voice-ai-to-bring-imaginary-characters-to-life

You don't need to use WebRTC if you need the phone system, you can use Twilio media streams or forward the call from a PBX server like Asterisk.

Hope this helps.

Zaid Aman 1y ago

Kevin Naidoo Thank you for the response. Actually, I have tried that before and it makes the audio very noisy (painful for the ear) and unbearable. I get slightly better audio by processing the raw audio into g711 (alaw) format, but the RTP couldn't send the packets or the stream without it being choppy. I have a perfectly working pipeline for voice conversation on a laptop, but not over the phone.

Kevin Naidoo 1y ago

I see. Yeah, PyVoIP is more for educational purposes; it's not efficient for production. You probably should use pjsua/pjsip or Twilio media streams for your use case.
