---
title: Overview
weight: 2
layout: learningpathall
---
Voice-based LLM applications often rely primarily on transcribed text from speech input, such as in interactions with non-player characters in games or voice assistants. This approach can overlook vocal cues—like tone, pitch, and emotion—present in a speaker’s voice. As a result, responses may feel less natural and may not fully capture the user’s underlying intent.

To address this, voice-based sentiment classification analyzes audio input to determine the user’s emotional state, which is then incorporated into the LLM prompt to enable more context-aware responses. In this Learning Path, you will build a sentiment-aware voice assistant that runs entirely on-device. The application records audio, transcribes speech into written text using Whisper, classifies sentiment directly from the voice signal, and combines the transcript and voice-based sentiment to guide responses from a local LLM running with llama.cpp.

![Voice sentiment classification pipeline#center](1_vsapipeline2.png "Voice sentiment classification pipeline")

You will start by building a baseline voice-to-LLM pipeline—capturing audio, transcribing it into text, and using it to generate responses with an LLM. You will then extend this pipeline with a voice-based sentiment classification model. This involves training the model, optimizing it for efficient on-device inference, and integrating it into a unified application.
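
The combination step described above can be sketched as a simple prompt builder. The template and sentiment label below are illustrative only; the exact labels and wording used later in the Learning Path may differ.

```python
def build_prompt(transcript: str, sentiment: str) -> str:
    """Combine a Whisper transcript with a sentiment label derived
    from the audio signal, so the LLM can react to how something was
    said as well as what was said. Template is illustrative.
    """
    return (
        f"The user sounds {sentiment}. "
        f"Respond appropriately to their message: {transcript}"
    )

print(build_prompt("My order still hasn't arrived.", "frustrated"))
```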
---
title: Set up your environment
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Before building the voice assistant, create a project workspace and set up an isolated `UV` environment. This keeps project dependencies separate from your system installation and makes it easier to reproduce the steps in the rest of the Learning Path.

These instructions support Ubuntu, macOS, and Windows, with Python 3.9 or later and a working microphone.

Check your Python version before continuing:

**Ubuntu or macOS**

```bash
python3 --version
```

**Windows PowerShell**

```powershell
py -3 --version
```

## Set up the Python environment with UV

`UV` is a fast Python package and environment manager that you will use throughout this Learning Path to create the project environment and install dependencies. Install it from PyPI using `pip`, then create and activate a virtual environment:

**Ubuntu or macOS**

```bash
mkdir -p ~/voice-sentiment-assistant
cd ~/voice-sentiment-assistant
python3 -m pip install uv
uv venv .venv
source .venv/bin/activate
```

**Windows PowerShell**

```powershell
mkdir $HOME\voice-sentiment-assistant -Force
cd $HOME\voice-sentiment-assistant
py -3 -m pip install uv
uv venv .venv
.\.venv\Scripts\Activate.ps1
```

Keep this virtual environment activated while you complete the rest of the Learning Path.

Create a `requirements.txt` file for the packages used across the rest of the Learning Path:

```txt
gradio
openai-whisper
requests
torch
transformers
pandas
numpy
librosa
scikit-learn
onnx
onnxruntime
```

Install the dependencies into your active `UV` virtual environment:

```console
uv pip install -r requirements.txt
```

This installs the libraries needed for the Gradio interface, Whisper transcription, model training, and ONNX Runtime inference. Some packages in this list are used later in the Learning Path when you optimize and export the sentiment model.

## Download, build, and run llama.cpp

Next, clone the [llama.cpp GitHub repository](https://github.com/ggml-org/llama.cpp), build the local inference server, and start it. This server exposes an OpenAI-compatible API that the Python application will call later in the Learning Path.

**Ubuntu or macOS**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

**Windows PowerShell**

```powershell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

When the build completes, the `llama-server` executable should be available in the build output directory. This Learning Path uses a quantized [Gemma 3 1B instruction-tuned model](https://huggingface.co/google/gemma-3-1b-it) served locally through `llama.cpp`.

The first time you run this command, `llama.cpp` will download the model from Hugging Face. This can take several minutes depending on your network connection.

**Ubuntu or macOS**

Run the following command from the `llama.cpp` directory:

```bash
./build/bin/llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

**Windows PowerShell**

```powershell
.\build\bin\Release\llama-server.exe -hf ggml-org/gemma-3-1b-it-GGUF
```

Leave this terminal running while you test the application in later steps. The server listens on a local OpenAI-compatible endpoint that your app will call to generate responses.
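
Before moving on, you can optionally confirm the server is reachable from Python. This sketch uses only the standard library and assumes `llama-server`'s default port of 8080 and its `/health` endpoint; it returns `False` instead of raising if the server is not running.

```python
import urllib.request
import urllib.error

def server_is_up(url: str = "http://127.0.0.1:8080/health",
                 timeout: float = 2.0) -> bool:
    """Return True if an HTTP service answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not up"
        return False

print(server_is_up())
```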

At this point, your development environment is ready. You have created a `UV` environment, installed the Python dependencies, built `llama.cpp`, and started a local `llama-server`. In the next section, you will use this setup to build the baseline voice-to-LLM pipeline by creating a simple Gradio interface, transcribing microphone input with Whisper, and sending the transcript to the local LLM.
---
title: Build the voice-to-LLM pipeline
weight: 4
layout: learningpathall
---

In this section, you will build an end-to-end pipeline that:

1. Records audio from your microphone
2. Transcribes it to text using Whisper
3. Sends the text to a locally hosted LLM
4. Displays the model's response

This forms the foundation of your voice assistant.

![Baseline voice-to-LLM pipeline#center](3_vsapipeline1.png "Baseline voice-to-LLM pipeline")

Before you begin, make sure you have completed the environment setup in the previous section and that your `llama-server` is still running.

### Step 1.1 - Create a basic Gradio UI

Start by creating a simple web interface that captures microphone input.
Gradio is a Python library for building simple browser-based interfaces. Here, you use it to create a small front end that records audio from your microphone.

This is a good first step because it lets you confirm that microphone capture works before you add transcription and model inference.

Create a file called `app.py`:

```python
import gradio as gr

with gr.Blocks() as demo:
    mic = gr.Audio(sources="microphone", type="filepath")

demo.launch()
```

Run the app:

```bash
python app.py
```

Open your browser at:

`http://127.0.0.1:7860`

You should now see a simple interface that allows you to record audio.
At this stage, the app only captures audio. It does not yet transcribe speech or send anything to the LLM.

### Step 1.2 - Add speech-to-text with Whisper

Next, add transcription using the Whisper model.
Whisper is a speech-to-text model. It takes audio as input and returns a text transcript. In this pipeline, it converts spoken input into text before anything is sent to the LLM.

Update `app.py` with the following code:

```python
import whisper

# Load a small Whisper model for local transcription
model = whisper.load_model("base")

def transcribe(audio):
    return model.transcribe(audio)["text"]
```

The first time you run this, Whisper will download the model, which may take a few minutes.

At this stage, your app can convert recorded audio into text.
The output of this step is a text transcript that represents what the user said.

### Step 1.3 - Connect to the local LLM

Define the OpenAI-compatible endpoint exposed by `llama-server`.
An endpoint is the URL your program uses to talk to another service. In this case, `llama-server` exposes a local API on your machine, and your app sends the transcript there to get a response.

Because the server is OpenAI-compatible, the request format looks like a standard chat completions API.
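
For reference, a successful response from the server has roughly this shape (values illustrative); the generated text is at `choices[0].message.content`, which the app reads in a later step:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      }
    }
  ]
}
```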

Update `app.py` with the following import and endpoint definition:

```python
import requests

LOCAL_LLM_URL = "http://127.0.0.1:8080/v1/chat/completions"
```

Make sure your `llama-server` from the previous section is running before continuing.
Without the local server running, the next step will not be able to generate an answer.

### Step 1.4 - Build the full pipeline

Now combine transcription and LLM interaction into a single function.
This function becomes the core of the application. Audio goes in, text is extracted, that text is sent to the model, and the response comes back out.

Keeping the logic in one function makes it easier to connect the pipeline to the user interface in the next step.

Update `app.py` by adding the following function:

```python
def handle_audio(audio):
    # Step 1: Transcribe audio
    text = transcribe(audio)

    # Step 2: Send transcript to local LLM
    response = requests.post(
        LOCAL_LLM_URL,
        json={
            "model": "local-model",
            "messages": [{"role": "user", "content": text}],
        },
        timeout=120,  # avoid hanging indefinitely if the server stalls
    )

    if response.status_code != 200:
        return text, "Error: LLM request failed"

    data = response.json()
    answer = data["choices"][0]["message"]["content"]

    return text, answer
```

### Step 1.5 - Connect the UI to the pipeline

Update your UI so that recorded audio triggers the full pipeline and displays results.
This is the final integration step. You now connect the interface, transcription, and model request so the app behaves like a real voice assistant.

When the user records audio, Gradio calls your pipeline function. The app then shows both the transcript and the assistant response in the browser.

Update `app.py` so it contains the following complete version:

```python
import gradio as gr
import whisper
import requests

model = whisper.load_model("base")

LOCAL_LLM_URL = "http://127.0.0.1:8080/v1/chat/completions"

def transcribe(audio):
    return model.transcribe(audio)["text"]

def handle_audio(audio):
    text = transcribe(audio)

    response = requests.post(
        LOCAL_LLM_URL,
        json={
            "model": "local-model",
            "messages": [{"role": "user", "content": text}],
        },
        timeout=120,  # avoid hanging indefinitely if the server stalls
    )

    if response.status_code != 200:
        return text, "Error: LLM request failed"

    data = response.json()
    answer = data["choices"][0]["message"]["content"]

    return text, answer

with gr.Blocks() as demo:
    mic = gr.Audio(sources="microphone", type="filepath")
    transcript = gr.Textbox(label="Transcript")
    response = gr.Textbox(label="LLM Response")

    mic.change(fn=handle_audio, inputs=mic, outputs=[transcript, response])

demo.launch()
```

## What you should see

After recording audio in the browser:

- Your speech is transcribed into text
- The transcript is sent to the local LLM
- The LLM response is displayed in the interface

## Troubleshooting

- No response from LLM: ensure `llama-server` is still running.
- Whisper is slow on first run: this is expected due to model download and initialization.
- Microphone not working: check browser permissions for microphone access.

At this point, you have a working voice-to-LLM pipeline. In the next section, you will extend this pipeline by adding a voice sentiment classification model.