Building an AI Assistant Using gpt-4o Audio-Preview API

srikantan
Dec 19, 2024

This sample application demonstrates the use of the gpt-4o audio-preview API to build an AI Assistant that handles audio input and output directly, i.e., without requiring separate speech-to-text (STT) or text-to-speech (TTS) steps.

Before getting into the details of using this API, I want to call out that it is different from the gpt-4o Realtime API.

 
| Feature | GPT-4o Realtime API | GPT-4o Audio-Preview API |
| --- | --- | --- |
| Purpose | Designed for low-latency, real-time conversational interactions with speech input and output. | Supports audio inputs and outputs in the Chat Completions API, suitable for asynchronous interactions. |
| Use Cases | Ideal for live interactions such as customer support agents, voice assistants, and real-time translators. | Suitable for applications that handle text and audio inputs/outputs without the need for real-time processing. |
| Integration Method | Uses a persistent WebSocket connection for streaming communication. | Operates via standard API calls within the Chat Completions framework. |
| Latency | Offers low-latency responses, enabling natural conversational experiences. | Not optimized for low latency; better suited for non-real-time interactions. |
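To make the integration difference concrete, here is a minimal sketch of the two calling styles. The audio-preview call mirrors what this post does below; the Realtime endpoint URL, headers, and model name are assumptions based on OpenAI's documentation at the time of writing, so verify them against the current docs before use.

import asyncio
import os

import websockets
from openai import OpenAI

# Audio-Preview: a stateless Chat Completions call that returns audio
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Say hello."}],
)

# Realtime: a persistent WebSocket session streaming JSON events both ways
async def realtime_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer releases of the websockets package name this kwarg
    # 'additional_headers' instead of 'extra_headers'
    async with websockets.connect(url, extra_headers=headers) as ws:
        async for event in ws:
            print(event)

# asyncio.run(realtime_session())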
 
 

The steps to use this API are:

1. Capture user audio input

Accept audio input from the user and add it to the request payload. While the System Message is of type 'text', the user input is of type 'input_audio'.

import base64

import streamlit as st

# Record the user's question from the microphone; st.audio_input returns
# a file-like object containing WAV data
audio_value = st.audio_input("Ask your question!")

encoded_audio_string = None
if audio_value:
    # Base64-encode the raw WAV bytes so they can travel in the JSON payload
    audio_data = audio_value.read()
    encoded_audio_string = base64.b64encode(audio_data).decode("utf-8")

    # Append the recording to the conversation as an 'input_audio' content part
    st.session_state.messages.append(
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded_audio_string, "format": "wav"},
                }
            ],
        }
    )

2. Invoke the Chat Completions endpoint:

Specify the modalities the response should support. For example, with the configuration shown below, both text and audio output are enabled, using the specified neural voice. Also include the function definitions to use during tool calling.

completion = None
try:
    # Ask for both text and audio output; the audio comes back base64-encoded,
    # rendered with the 'alloy' neural voice in WAV format
    completion = client.chat.completions.create(
        model=config.model,
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},
        # Function definitions registered by the app, available for tool calling
        functions=st.session_state["connection"].functions,
        function_call="auto",
        messages=st.session_state.messages,
    )
except Exception as e:
    print("Error in completion", e)
    st.write("Error in completion", e)
    st.stop()
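Between steps 2 and 3, the application has to actually execute the function the model selected and capture its output as function_response. A minimal sketch of that dispatch is below; get_player_progress is a hypothetical stand-in for the app's real Azure SQL, Azure AI Search, and Jira integrations.

import json

def get_player_progress(player_id: str) -> dict:
    """Hypothetical stand-in for the app's real data lookups."""
    return {"player_id": player_id, "points": 1200, "achievements": ["First Win"]}

# Map function names (as declared to the model) to local callables
available_functions = {"get_player_progress": get_player_progress}

# If the model decided to call a function, run it locally
message = completion.choices[0].message
function_response = None
if message.function_call:
    fn_args = json.loads(message.function_call.arguments or "{}")
    function_response = available_functions[message.function_call.name](**fn_args)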

3. Pass the output from tool calling to gpt-4o to generate an audio response:

Pass the tool calling response along with the audio input from the user to gpt-4o to generate the audio response.

# Send the tool output (as text context) together with the user's original
# audio question, so the model can generate a grounded spoken answer
l_completion = client.chat.completions.create(
    model=config.model,
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt_response}],
        },
        {
            "role": "user",
            "content": [
                {
                    # The tool output is injected as plain-text context
                    "type": "text",
                    "text": "---- context -----\n" + str(function_response) + "\n --- User Query----:\n",
                },
                {
                    # The user's original spoken question, unchanged
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_audio_string,
                        "format": "wav",
                    },
                },
            ],
        },
    ],
)
# The audio answer is base64-encoded; decode it to raw WAV bytes
wav_bytes = base64.b64decode(l_completion.choices[0].message.audio.data)

4. Extract the text transcript of the response

In addition to playing the audio response over the speaker, we can also populate the chat conversation with its text transcript.

# The API also returns a text transcript of the generated audio
transcript_out = l_completion.choices[0].message.audio.transcript
st.session_state.messages.append(
    {
        "role": "assistant",
        "content": transcript_out,
    }
)
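To play the answer over the speaker, the decoded wav_bytes from step 3 can be passed to Streamlit's built-in audio widget. A minimal sketch (the autoplay flag requires a recent Streamlit release):

# Render an audio player for the spoken answer and start playback
st.audio(wav_bytes, format="audio/wav", autoplay=True)
# Show the transcript as an assistant chat bubble
with st.chat_message("assistant"):
    st.markdown(transcript_out)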

Note: The gpt-4o audio-preview API is not available in Azure as of this writing. This sample uses the API from OpenAI directly.

The sample application below, powered by the gpt-4o audio-preview API, showcases a customer querying their gaming progress and grievance status. Tool integration (an example function definition is sketched after the list) is used to:

  • Pull information from an Azure SQL database for points and achievements.
  • Perform Azure AI Search-driven Q&A over documents and user manuals.
  • Register and retrieve grievances via API integration with Jira Cloud.
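For illustration, one entry in the functions list passed in step 2 might look like the following for the points-and-achievements lookup. The function name and parameters here are hypothetical, not taken from the sample code.

# Hypothetical function definition in the legacy 'functions' schema
functions = [
    {
        "name": "get_player_progress",
        "description": "Fetch a player's points and achievements from the game database.",
        "parameters": {
            "type": "object",
            "properties": {
                "player_id": {
                    "type": "string",
                    "description": "Unique identifier of the player",
                }
            },
            "required": ["player_id"],
        },
    },
]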

See a demo of this application in action below.

 

The code for this application is available here.

Updated Dec 19, 2024
Version 1.0