🐍 Answer the phone! with Python

Python is a powerful language that can do many things. Most people use it for machine learning or web development, but did you know that Python can also talk to hardware and to telephony protocols like SIP?

What is SIP Anyway?

Similar to how we have HTTP for the web, voice-over-internet systems usually run on a protocol named SIP, which provides guidelines on how to establish, modify, and terminate sessions over the network.

A SIP session can then carry voice, text, or even video. In our application, SIP is just the signaling protocol and is therefore responsible for connecting and disconnecting the call to our Python script.

Once the call is answered and established, we then use the "RTP" or Real-time Transport Protocol to handle the audio stream.

Thankfully, the PyVoIP library takes care of all the SIP and streaming mechanics, so we don't have to worry too much about how SIP or RTP works.
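
For the curious, here is a rough sketch of what SIP signaling looks like on the wire. Like HTTP, SIP is text-based: a request starts with a method, a target, and a version, followed by headers. The message below is trimmed down and the addresses are made up for illustration (a real INVITE carries more headers), and "parse_sip_request" is a hypothetical helper, not part of PyVoIP.

```python
# A trimmed-down SIP request. The addresses are invented for illustration,
# and a real INVITE needs more headers (Via, Contact, and so on).
SAMPLE_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "From: <sip:alice@example.com>\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: abc123@example.com\r\n"
    "CSeq: 1 INVITE\r\n"
    "\r\n"
)

def parse_sip_request(message):
    """Split a SIP request into its start line parts and a dict of headers."""
    # Everything before the first empty line is the start line plus headers
    head = message.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")

    # Start line: METHOD <space> REQUEST-URI <space> VERSION
    method, uri, version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        # Only split on the first colon, since SIP URIs contain colons too
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    return method, uri, version, headers

method, uri, version, headers = parse_sip_request(SAMPLE_INVITE)
print(method, uri, version)
print(headers["From"])
```

An INVITE like this is what starts a call. PyVoIP handles these messages for us, so none of this parsing is needed in our script.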

Let's build something cool!?

In this guide, I will show you how to build a simple phone answering service with Python.

The script will do the following:

  1. Register as a SIP VOIP phone and wait for calls.

  2. Accept incoming calls.

  3. Transcribe the audio using OpenAI's Whisper.

Installing pip packages

We're going to need a few pip packages:

pip install pyVoIP
pip install pywav
pip install openai

Be sure to also add your OpenAI key to your environment. In bash, you can do this as follows:

nano ~/.bashrc

# Add to the end of the file
export OPENAI_API_KEY="sk-xxx"

You will need to restart your terminal (or run "source ~/.bashrc") for this to take effect.
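
If you want to confirm that the key is actually visible to Python before wiring up the phone, a small check like the one below can help. This is only a sketch: "get_openai_key" is a hypothetical helper name, and the OpenAI SDK reads "OPENAI_API_KEY" from the environment on its own, so the check is purely for an earlier and clearer error message.

```python
import os

def get_openai_key(env=os.environ):
    """Return the OpenAI key from the environment, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to ~/.bashrc and restart your terminal."
        )
    return key
```

Calling "get_openai_key()" at the top of your script makes a missing key fail immediately instead of in the middle of a phone call.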

Setting up a VOIP virtual phone

PyVoIP is a nifty little library that can easily help you set up a virtual phone with just a few lines of code.

ℹ️ You probably want to use something like Twilio instead for a real-world application. PyVoIP audio quality isn't the best and needs quite a bit of modification to work correctly.

To get started, let's set up a basic phone:

from pyVoIP.VoIP import VoIPPhone, CallState

def answer(call):
    try:
        # Pick up the incoming call
        call.answer()

    except Exception as e:
        print(e)
    finally:
        # Always end the call, even if answering failed
        call.hangup()

# Replace the placeholders with the details from your SIP provider
vp = VoIPPhone(
    'sip domain', 5060, 'sipuser',
    'sippassword', callCallback=answer
)
vp.start()
print(vp._status)
input("Press Enter to exit the VOIP phone session.")
vp.stop()

In this example, we create a virtual phone using the "VoIPPhone" class. This class takes in a few arguments as follows:

  • SIP Credentials: When you purchase a SIP account from a VOIP provider, you should receive a username, password, and an IP or domain name that will be connected to a phone number (3CX is an example of a SIP provider). The second argument, 5060, is the standard SIP port.

  • callCallback: This is the function that will handle answering the phone call.

The callback function receives one argument, the "call" object, which contains the relevant information about the caller and provides methods to accept the call and to receive audio from, or send audio back to, the caller.

ℹ️ Did you know that you can build your own VOIP server as well? Asterisk is a powerful open-source VOIP server that you can use to set up your own SIP accounts, phone numbers, extensions, and so forth.

Transcribing audio

To convert audio into text, we can use OpenAI's Whisper service. Here's a simple example of how to convert our audio into text:

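import os
import uuid

from openai import OpenAI
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the list of raw audio chunks into a single byte string
    data_bytes = b"".join(audio)
    # Arguments: 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # Make sure the temp directory exists before writing to it
    os.makedirs("/tmp/audio", exist_ok=True)
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it goes inside the try
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)

    return ""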

The "transcribe_to_text" function takes in a list of raw audio byte chunks. We need to convert those chunks into an actual audio file because the OpenAI SDK expects a file object, not raw audio bytes.

We therefore use "pywav" in our "convert_to_wav" function to convert the raw audio bytes into a ".wav" audio file.

⚠️ This logic is simplified so that it's easier to understand, but essentially it can be optimized to remove the need for saving to a temp file since disk IO on a slow drive might cause issues.
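
As a rough idea of what skipping the temp file could look like, here is a sketch that builds the WAV in memory. Note the swap: it uses Python's built-in "wave" module instead of "pywav", which means it writes a plain PCM header rather than the mu-law format code (7) used above. It assumes the chunks are 8-bit, 8000 Hz, mono linear audio, which is how PyVoIP's documentation describes the output of "read_audio()". The function name "convert_to_wav_in_memory" is hypothetical, and I'm assuming the OpenAI SDK accepts a named file-like object here, so treat this as a starting point rather than a drop-in replacement.

```python
import io
import wave

def convert_to_wav_in_memory(audio):
    """Build a WAV file in memory from a list of raw 8-bit, 8000 Hz mono chunks."""
    data_bytes = b"".join(audio)

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(1)      # 1 byte per sample (8-bit WAV audio is unsigned)
        wav.setframerate(8000)   # 8000 samples per second
        wav.writeframes(data_bytes)

    # Rewind so the next reader starts at the WAV header
    buf.seek(0)
    # Assumption: a ".wav" name helps the receiving API detect the file type
    buf.name = "audio.wav"
    return buf
```

The returned buffer can then be passed wherever the file object from "convert_to_wav" was used.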

Updating our answer method to chunk the audio

In our "answer" function, we receive the audio as a continuous stream of bytes. By default, each call to "call.read_audio()" returns 160 bytes, which is 20ms of audio at 8000 Hz with one byte per sample. We cannot send a 20ms chunk of audio to Whisper because the minimum length is 100ms.

Thus, we need to append the audio to a buffer and we'll only send the audio to Whisper once we reach 1000ms (or 1 second).
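
The arithmetic behind those numbers can be sketched as follows. The constants assume 8000 Hz audio with one byte per sample, and the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000    # samples per second
BYTES_PER_SAMPLE = 1     # 8-bit audio

def chunk_duration_ms(chunk):
    """Return how many milliseconds of audio a raw chunk holds."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
    return len(chunk) * 1000 / bytes_per_second

def chunks_needed(target_ms, chunk_bytes=160):
    """Return how many chunks of chunk_bytes are needed to cover target_ms."""
    per_chunk_ms = chunk_duration_ms(b"\x00" * chunk_bytes)
    return math.ceil(target_ms / per_chunk_ms)

print(chunk_duration_ms(b"\x80" * 160))  # 20.0
print(chunks_needed(1000))               # 50
```

In other words, one second of audio is 50 chunks of 160 bytes, or 8000 bytes in total.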

Here is the updated "answer" function:

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Keep every chunk, including the one that fills the buffer
            buffer.append(audio)
            # 8000 samples per second at 1 byte per sample is 8 bytes per ms,
            # so len(audio) / 8 is the chunk length in ms (20 for 160 bytes)
            buff_length += len(audio) / 8

            # Transcribe once we have buffered about 1 second of audio
            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

💡 You can also send back audio to the caller by calling "call.write_audio(raw_audio_bytes_here)"
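
To give you something to send, here is a sketch that generates a simple beep. It assumes "write_audio" expects the same format we read from the call (8-bit unsigned, 8000 Hz, mono), which is how PyVoIP's documentation describes it. The "make_tone" helper is hypothetical.

```python
import math

def make_tone(freq_hz=440, duration_ms=1000, sample_rate=8000, amplitude=100):
    """Generate a sine tone as 8-bit unsigned mono audio bytes."""
    total_samples = sample_rate * duration_ms // 1000
    return bytes(
        # 128 is the midpoint (silence) for unsigned 8-bit audio
        128 + int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(total_samples)
    )

beep = make_tone()
print(len(beep))  # 8000 bytes, i.e. 1 second of audio
```

Inside "answer", you could then call "call.write_audio(beep)" right after "call.answer()" to play the beep to the caller.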

The full code

from pyVoIP.VoIP import VoIPPhone, CallState
import uuid
from openai import OpenAI
import os
import pywav

def convert_to_wav(audio, tmpFileName):
    # Join the list of raw audio chunks into a single byte string
    data_bytes = b"".join(audio)
    # Arguments: 1 channel, 8000 Hz, 8 bits per sample, audio format 7 (mu-law)
    wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
    wave_write.write(data_bytes)
    wave_write.close()

    return open(tmpFileName, "rb")

def transcribe_to_text(audio_file):
    # Make sure the temp directory exists before writing to it
    os.makedirs("/tmp/audio", exist_ok=True)
    tmpFileName = f"/tmp/audio/_audio_buffer_{uuid.uuid4()}.wav"
    client = OpenAI()

    try:
        # The API call is the part that can fail, so it goes inside the try
        with convert_to_wav(audio_file, tmpFileName) as wav_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=wav_file
            )
        return transcription.text
    except Exception as ex:
        print(ex)
    return ""

def answer(call):
    try:
        call.answer()
        buffer = []
        buff_length = 0

        while call.state == CallState.ANSWERED:
            audio = call.read_audio()
            # Keep every chunk, including the one that fills the buffer
            buffer.append(audio)
            # 8 bytes per ms at 8000 Hz, 1 byte per sample
            buff_length += len(audio) / 8

            # Transcribe once we have buffered about 1 second of audio
            if buff_length >= 1000:
                print(transcribe_to_text(buffer))
                buffer = []
                buff_length = 0

    except Exception as e:
        print(e)
    finally:
        call.hangup()

vp = VoIPPhone('xxx', 5060, 'xxx', 'xxx', callCallback=answer)
vp.start()
print(vp._status)
input("Press Enter to exit")
vp.stop()

Conclusion

There you have it! A simple phone answering system that can stream live audio and transcribe that audio into text.

As mentioned earlier, PyVoIP is not the best tool for the job since it doesn't handle background noise and static very well. For a real-world application, you would need to write some logic to strip out the bad audio samples before transcribing, but hopefully this is a good start.
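
As a very rough first step in that direction, you could skip buffers that are basically silence before sending them to Whisper. The sketch below is not real noise reduction: it only measures how far the samples stray from the midpoint. It assumes 8-bit unsigned audio where 128 means silence, the threshold is an arbitrary value you would need to tune, and both helper names are hypothetical.

```python
def average_level(data):
    """Return the mean distance of 8-bit unsigned samples from the 128 midpoint."""
    if not data:
        return 0.0
    return sum(abs(sample - 128) for sample in data) / len(data)

def is_mostly_silent(chunks, threshold=4.0):
    """Return True when the joined chunks are quieter than the threshold."""
    return average_level(b"".join(chunks)) < threshold

quiet = [b"\x80" * 160]           # pure silence
loud = [bytes([28, 228]) * 80]    # samples swinging far from the midpoint

print(is_mostly_silent(quiet))  # True
print(is_mostly_silent(loud))   # False
```

In the "answer" function, you could wrap the "transcribe_to_text(buffer)" call in "if not is_mostly_silent(buffer):" to avoid transcribing dead air.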