Building a Speech Recognition System with VOSK: A Step-by-Step Guide ~ RRJ

In a world driven by voice interfaces, from smart assistants to transcription tools, speech recognition is a key component of modern AI. VOSK is an open-source speech recognition toolkit that makes it easy to build fast, accurate offline voice systems, even on low-resource devices like the Raspberry Pi.

Whether you're a beginner or looking to integrate voice into your next project, this post walks you step by step through using VOSK effectively.

What is VOSK?

VOSK is a lightweight, offline-capable speech recognition engine based on Kaldi. It supports:

20+ languages (English, Hindi, Spanish, etc.)
Python, Java, JavaScript, C# APIs
Offline recognition (no internet required)
Real-time transcription

GitHub: VOSK GitHub Repo (https://github.com/alphacep/vosk-api)

Python Docs: Python Docs

Prerequisites

Before getting started, make sure you have the following:

Python 3.6+
pip package manager
A microphone (only needed for live recognition)
OS: Linux (including Raspberry Pi OS), Windows, or macOS

Step 1: Install VOSK API

pip install vosk

Optionally, install PyAudio for microphone input:


pip install pyaudio

On Linux, you may need:
sudo apt install portaudio19-dev python3-pyaudio
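
To confirm that both packages landed in the active Python environment, a small check like the one below can help. It is only a sketch that uses the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "vosk" and "pyaudio" are the import names of the two packages installed above.
    for name in ("vosk", "pyaudio"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either package is reported as missing, check that pip and python point to the same environment.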

Step 2: Download a Pretrained Model

You can find models here: VOSK Models (https://alphacephei.com/vosk/models)

Example for English (small):

wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

For Raspberry Pi, use vosk-model-small-en-us-0.15
For high accuracy, try vosk-model-en-us-0.22 (~1.8GB)

Step 3: Transcribe from Audio File

Here's a basic Python script to transcribe audio:

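import json
import sys
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")

# VOSK expects mono, 16-bit, uncompressed PCM input
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    sys.exit("Audio file must be a mono, 16-bit PCM WAV file.")

model = Model("vosk-model-small-en-us-0.15")  # path to the unzipped model folder
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)  # read the audio in chunks of 4000 frames
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # An utterance is complete; Result() returns a JSON string
        print(json.loads(rec.Result())["text"])

# Flush whatever audio is still buffered in the recognizer
print(json.loads(rec.FinalResult())["text"])
wf.close()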
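
The recognizer hands back each result as a JSON string with the transcript in a "text" field, so a long file produces several of these strings. The sketch below shows one way to stitch them into a single transcript. The helper names and the sample payloads are invented for illustration; the payloads are hand-written, not real recognizer output.

```python
import json


def extract_text(result_json: str) -> str:
    """Return the recognized text from one result string, or '' if absent."""
    return json.loads(result_json).get("text", "").strip()


def join_segments(results) -> str:
    """Join the non-empty texts of several result strings into one transcript."""
    texts = (extract_text(r) for r in results)
    return " ".join(t for t in texts if t)


# Hand-written sample payloads in the same shape as rec.Result() strings.
sample_results = [
    '{"text": "hello world"}',
    '{"text": ""}',
    '{"text": "turn on light"}',
]
print(join_segments(sample_results))
```

In the script above, you would collect every rec.Result() and the last rec.FinalResult() in a list and pass that list to join_segments.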

Make sure your audio file is:

Mono
16-bit PCM
16000 Hz sample rate

Use ffmpeg to convert:

ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav
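
If you are not sure whether a file already meets these requirements, Python's built-in wave module can check it before you run the recognizer. This is a minimal sketch; the function name wav_problems is an assumption made for this example.

```python
import wave


def wav_problems(path, rate=16000):
    """Return a list of format problems; an empty list means the file is fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
        if wf.getcomptype() != "NONE":
            problems.append("expected uncompressed PCM")
        if wf.getframerate() != rate:
            problems.append(f"expected {rate} Hz, got {wf.getframerate()} Hz")
    return problems
```

Calling wav_problems("test.wav") returns an empty list when the file is ready to use; otherwise each entry tells you what to fix with ffmpeg.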

Step 4: Real-Time Microphone Transcription

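import json

import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, 16000)

# Open the default microphone: mono, 16-bit, 16 kHz to match the recognizer
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

print("Speak now... (press Ctrl+C to stop)")

try:
    while True:
        data = stream.read(4000, exception_on_overflow=False)
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])
except KeyboardInterrupt:
    # Print whatever was recognized before the interruption
    print(json.loads(rec.FinalResult())["text"])
finally:
    # Release the audio device
    stream.stop_stream()
    stream.close()
    p.terminate()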

Step 5: Multilingual Support

Want Odia, Hindi, or Spanish?

Just download the corresponding model:

Hindi: vosk-model-small-hi-0.22
Odia: [Custom training required]
Spanish: vosk-model-small-es-0.42

Usage remains the same—just switch the model path.
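
If your app needs to switch languages at runtime, one simple approach is a lookup table from language code to model folder. The sketch below assumes the model folders sit next to your script; the table and the function name model_dir_for are illustrative, and you would add an entry for each model you download.

```python
# Language code -> unzipped model folder (folder names taken from this guide).
MODEL_DIRS = {
    "en": "vosk-model-small-en-us-0.15",
    "es": "vosk-model-small-es-0.42",
}


def model_dir_for(lang: str) -> str:
    """Map a language tag such as 'es', 'es-MX' or 'EN_us' to a model folder."""
    code = lang.strip().lower().split("-")[0].split("_")[0]
    try:
        return MODEL_DIRS[code]
    except KeyError:
        raise ValueError(f"no model configured for {lang!r}") from None


print(model_dir_for("es-MX"))
```

The returned folder name is what you would pass to Model(...) in the earlier scripts.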

Step 6: Custom Vocabulary (Limited Grammar)

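To improve accuracy on known phrases, pass them as a JSON list when creating the recognizer, which restricts recognition to those phrases: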

rec = KaldiRecognizer(model, 16000, '["hello", "world", "turn on light"]')

This helps for command-based interfaces or limited-domain apps. Note that a runtime grammar like this is generally supported only by the small models.
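
Writing the JSON list by hand gets error-prone as the phrase list grows, so it can be generated with json.dumps instead. The sketch below also appends "[unk]", the token VOSK's own examples use as a catch-all for out-of-grammar speech. The function name build_grammar is an assumption, and lowercasing is done on the assumption that the model vocabulary is lowercase, as it is for the English models.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON list of cleaned, de-duplicated phrases for KaldiRecognizer."""
    cleaned = []
    for phrase in phrases:
        phrase = " ".join(phrase.lower().split())  # lowercase, collapse spaces
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    if allow_unknown:
        cleaned.append("[unk]")  # lets the recognizer reject unmatched speech
    return json.dumps(cleaned)


grammar = build_grammar(["Hello", "world", "turn on  light", "hello"])
print(grammar)
```

The resulting string is passed as the third argument: KaldiRecognizer(model, 16000, grammar).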

Step 7: Integrate into Applications

Home Automation: Use recognized text to trigger GPIO or MQTT
Chatbot: Convert voice to text for chatbot input
Transcriber: Save output to .txt or .json
Call Monitor: Analyze recorded phone calls, or live calls in real time
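
For the home automation case, the glue between recognized text and an action can be as small as a dictionary of phrases and handler functions. The sketch below is one possible shape, not a fixed recipe. The handlers only flip a value in a dictionary; in a real project they would toggle a GPIO pin or publish an MQTT message, and every name here is hypothetical.

```python
def normalize(text):
    """Lowercase the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def make_dispatcher(commands):
    """Build a function that runs the handler matching a recognized phrase."""
    table = {normalize(phrase): action for phrase, action in commands.items()}

    def dispatch(text):
        action = table.get(normalize(text))
        return action() if action else None

    return dispatch


# Hypothetical device state standing in for a GPIO pin or an MQTT publish.
light = {"on": False}


def turn_on_light():
    light["on"] = True
    return "light is on"


def turn_off_light():
    light["on"] = False
    return "light is off"


dispatch = make_dispatcher({
    "turn on light": turn_on_light,
    "turn off light": turn_off_light,
})
print(dispatch("Turn on  light"))  # runs turn_on_light
print(dispatch("hello world"))     # no matching command
```

Feeding dispatch with the "text" field of each final result is enough to turn the microphone loop from Step 4 into a simple voice command handler.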

Raspberry Pi Setup

On a Raspberry Pi (Zero 2 W or 4):

pip install vosk
sudo apt install ffmpeg portaudio19-dev python3-pyaudio

Use a small model (<50MB) for optimal performance.

What’s Next?

Fine-tune or train a model with Kaldi (advanced)
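Try Whisper or DeepSpeech if you need larger models (both are open source and can run locally as well as in the cloud, but typically need more compute than a small VOSK model)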
Use G2P + phonemizer for custom languages like Odia

VOSK is a simple, powerful way to bring speech recognition into your projects without the internet. Its cross-platform support, Python-first approach, and offline models make it perfect for embedded and edge AI systems.

Whether you're building a smart assistant, transcription tool, or creative audio app—VOSK is a brilliant starting point.

RRJ

(RAKESH RANJAN JENA)


Wednesday, 23 July 2025
