Setting up LLM voice conversations

Tools like ollama make running LLMs locally trivial, but a complete pipeline for near real-time spoken conversations with them is still comparatively rare. With some care in component selection and configuration, however, even consumer hardware is up to the task.

Goals

Before we get started, let's define the scope of what this article will aim for (and what it won't):

  1. All components must run locally. No external services or APIs.

  2. Total VRAM usage must fit low- to mid-tier graphics cards (the stack was verified on a single GTX 1660 Ti with 6 GB VRAM).

  3. The setup targets modern Linux distributions (verified on Debian 13; a few adjustments may make it work on macOS/Windows, but that's out of scope).

  4. Conversations must be near real-time, meaning some optimizations are made for latency at the expense of quality.

  5. The stack must be LLM-agnostic. The setup should work with any LLM, regardless of additional features like tool calling, audio processing, etc.

With these goals in mind, let's plan out the stack.

Design

The basic pipeline works as follows: open-webui takes audio input from the user and converts it to text using the embedded faster-whisper model. The transcribed text is then passed to a model hosted on a local ollama instance. This example uses gemma4:e2b to strike a balance between output quality and VRAM requirements. The resulting text response is turned into speech by a fastapi container hosting kokoro-82m and played back to the user.
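
Schematically, one round trip through the stack looks like this:

microphone -> open-webui (faster-whisper STT) -> ollama (gemma4:e2b) -> kokoro-fastapi (TTS) -> speakers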


The components are chosen for simplicity: faster-whisper is available by default in open-webui, and kokoro-82m offers impressive voices like af_heart while being tiny enough to run near-instantly on CPU alone, leaving the GPU to the LLM. gemma4:e2b is a cutting-edge model that effectively behaves like a 2B model during inference, so most consumer graphics cards have a good chance of loading it into VRAM entirely for low-latency responses. If you have better hardware, you can swap in any other model you like.


Most of the plumbing (listening for audio, passing input/output text around) is already handled by open-webui; we just need to set up the dependency services and wire everything together.

Installing dependencies

For simplicity, ollama will be installed on the host. Follow the official installation instructions.

Once installed, pull the gemma4:e2b model for later:

ollama pull gemma4:e2b
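
One caveat: the containers set up in the next step will reach ollama through the Docker network's gateway address, but the official installer configures ollama to listen on 127.0.0.1 only. On a systemd-based host such as Debian, an override makes it listen on all interfaces (a minimal sketch, assuming the standard ollama.service installed by the official script):

# opens an editor for an override file:
#   sudo systemctl edit ollama.service
# then add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Apply the change with sudo systemctl restart ollama.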

While the download runs, the open-webui and kokoro-fastapi services can be defined in a docker-compose file:

networks:
  private_net:
    ipam:
      config:
        - subnet: 10.10.10.0/24
          gateway: 10.10.10.1

services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      ENABLE_OLLAMA_API: "true"
      OLLAMA_BASE_URL: http://10.10.10.1:11434
      AUDIO_TTS_ENGINE: openai
      AUDIO_TTS_OPENAI_API_BASE_URL: http://10.10.10.11:8880/v1
      AUDIO_TTS_API_KEY: not-needed
      AUDIO_TTS_MODEL: kokoro
      AUDIO_TTS_VOICE: af_heart
    networks:
      private_net:
        ipv4_address: 10.10.10.10
  kokoro-fastapi:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    networks:
      private_net:
        ipv4_address: 10.10.10.11
  caddy:
    image: caddy:alpine
    container_name: caddy
    ports:
      - "443:443"
    networks:
      private_net:
        ipv4_address: 10.10.10.12
    command: >
      caddy reverse-proxy
      --from https://localhost
      --to http://10.10.10.10:8080
      --internal-certs

This is intentionally a very barebones compose file for a proof of concept. If you want to use this long-term, you will likely want to enable GPU acceleration for the containers, mount storage volumes and set restart policies.
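
For example, persisting open-webui's data and granting it GPU access could look like the following sketch (not a complete file; the GPU reservation assumes the NVIDIA container toolkit is installed on the host, and the data path matches the current open-webui image):

services:
  openwebui:
    # ...configuration from above, plus:
    restart: unless-stopped
    volumes:
      # keeps accounts, settings and chat history across container recreations
      - open-webui:/app/backend/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  open-webui: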


Start the docker containers:

docker compose up -d

Once the images are downloaded, the containers should start automatically. The static IP addresses and the environment variables shown above already wire the services together.

Verify connectivity
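
Before opening the browser, a quick smoke test from the host confirms both backends respond. On a Linux host, the container IPs on the bridge network are reachable directly, and the kokoro request follows the OpenAI-style speech API that open-webui will use (the first check assumes the OLLAMA_HOST change from the installation step):

# ollama answering on the bridge gateway address
curl http://10.10.10.1:11434/api/version

# kokoro-fastapi synthesizing a test phrase into a local file
curl -s -X POST http://10.10.10.11:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "voice": "af_heart", "input": "Hello there"}' \
  -o /tmp/kokoro-test.mp3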

Modern browsers only allow microphone access in a secure context (https), so audio mode only works when the open-webui instance is accessed through the caddy reverse proxy included in the docker compose setup.


Therefore, only access the interface through https://localhost/. On first load, the browser will warn about a "Potential security risk" or that "Your connection is not private", because caddy signs the certificate itself. Open the advanced options on the warning page and confirm an exception to proceed "unsafely" to the web address.

If you cannot connect right after starting the compose project, allow a few minutes for open-webui to start: it downloads some required model files from huggingface before serving the web interface.
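
You can follow this startup phase (including the model download) in the container logs:

docker compose logs -f openwebui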


After confirming the exception, you should see the initial signup UI from open-webui, where you have to create your admin account.

Optimizing an LLM for low-latency conversations

Now that all services are working, you could jump straight into a voice conversation with a model, but the experience may be quite disappointing for two reasons:

  1. Typical LLM responses are long and verbose, unlike normal spoken responses.

  2. The gemma4:e2b model has "thinking" enabled by default, which adds a long delay before each response.

Both of these can be fixed by making an adjusted copy of the model in the settings. Navigate to "Admin Settings" > "Models" and select "clone" in the options of the "gemma4:e2b" model.


In the new form, change the name to something you recognize, like "gemma4-voice".

Set a system prompt for shorter and more conversation-like responses:

You are a real-time voice conversation assistant. Speak naturally and keep replies short, usually 1-3 sentences. Prioritize conversational flow over detailed explanations. No markdown, bullet points, emojis, or formal structure. Avoid tutorials unless asked. Answer directly, sound human, and leave space for the user to respond.

Finally, find the "think (Ollama)" setting in the "Advanced Params" section and click the value until it says "Off" (not "low" or "false"!).
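
If you want to verify the effect outside the UI, ollama exposes the same switch in its chat API (a quick sketch, assuming a recent ollama version that supports the think option):

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "think": false,
  "stream": false,
  "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
}'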


With these adjustments in place, you should be set to have a much more pleasant conversation. Don't forget to save your changes before leaving the page!

Having a conversation

To get talking, start a new chat and select the newly cloned "gemma4-voice" model in the top bar. Then click the "Voice Mode" button at the bottom right of the prompt input box and allow access to your microphone in the browser popup.


Simply speak into the microphone. A short pause in speaking is enough to forward your voice prompt to the processing pipeline, and you should hear a female voice responding shortly after.


The first response may take a little longer while ollama loads the model into VRAM; subsequent responses should arrive with around one second of delay, depending on how many tokens the answer contains.
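
By default, ollama unloads an idle model after about five minutes, so a conversation resumed later pays the loading cost again. If that bothers you, the keep-alive window can be extended through the same systemd override used during installation (an optional tweak):

[Service]
Environment="OLLAMA_KEEP_ALIVE=1h"

Restart ollama afterwards for the change to take effect.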


Once you get tired of talking to a bare model, you can add MCP tool servers or upload documents for a more productive conversation. open-webui offers a lot of possibilities to customize your experience.
