Wednesday, 23 July 2025

Building a Speech Recognition System with VOSK: A Step-by-Step Guide



In a world driven by voice interfaces, from smart assistants to transcription tools, speech recognition is a key component of modern AI. VOSK is an open-source speech recognition toolkit for building fast, accurate offline voice systems, even on low-resource devices like the Raspberry Pi.

Whether you're a beginner or looking to integrate voice into your next project, this blog will guide you step-by-step on using VOSK effectively.


What is VOSK?

VOSK is a lightweight, offline-capable speech recognition engine based on Kaldi. It supports:

  • 20+ languages (English, Hindi, Spanish, etc.)
  • Python, Java, JavaScript, C# APIs
  • Offline recognition (no internet required)
  • Real-time transcription

GitHub: VOSK GitHub Repo

Python Docs: Python Docs



Prerequisites

Before getting started, make sure you have the following:

  • Python 3.6+
  • pip package manager
  • A microphone (only needed for live recognition)
  • OS: Linux, Windows, macOS or Raspberry Pi

Step 1: Install VOSK API

bash

pip install vosk

Optionally, install PyAudio for microphone input:


pip install pyaudio

On Linux, you may need:

sudo apt install portaudio19-dev python3-pyaudio
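A quick way to confirm that the packages landed in the interpreter you intend to use is to look them up without importing them. This is a minimal sketch; the helper name is_installed is made up for illustration and is not part of VOSK.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "vosk" is required; "pyaudio" is only needed for microphone input.
    for name in ("vosk", "pyaudio"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If vosk shows up as missing, the usual cause is that pip installed it into a different Python environment than the one running the script.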


Step 2: Download a Pretrained Model

You can find models here: https://alphacephei.com/vosk/models

Example for English (small):

wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

For Raspberry Pi, use vosk-model-small-en-us-0.15
For high accuracy, try vosk-model-en-us-0.22 (~1.8GB)
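On systems without wget and unzip (Windows, for example), the same download can be done from Python's standard library. The sketch below assumes the archive unpacks into a directory with the same name as the model, which matches the English small model shown above; fetch_model and model_url are illustrative helper names.

```python
import urllib.request
import zipfile
from pathlib import Path

BASE_URL = "https://alphacephei.com/vosk/models"


def model_url(name: str) -> str:
    """Build the download URL for a model name such as vosk-model-small-en-us-0.15."""
    return f"{BASE_URL}/{name}.zip"


def fetch_model(name: str, dest: Path = Path(".")) -> Path:
    """Download and unpack a model, skipping the download if it is already there."""
    target = dest / name
    if target.is_dir():
        return target  # already unpacked, nothing to do
    archive = dest / f"{name}.zip"
    urllib.request.urlretrieve(model_url(name), archive)
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the zip contains a single top-level folder named after the model.
        zf.extractall(dest)
    return target
```

Calling fetch_model("vosk-model-small-en-us-0.15") returns the path that can then be passed to Model(...).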


 Step 3: Transcribe from Audio File

Here's a basic Python script to transcribe audio:

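import wave

from vosk import Model, KaldiRecognizer

# Load the model once; this is the slow part.
model = Model("vosk-model-small-en-us-0.15")

with wave.open("test.wav", "rb") as wf:
    # The recognizer must be told the sample rate of the audio it receives.
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # AcceptWaveform returns True when a complete utterance has been decoded.
        if rec.AcceptWaveform(data):
            print(rec.Result())  # JSON string for the finished utterance

# Flush whatever audio is still buffered in the recognizer.
print(rec.FinalResult())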
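Result() and FinalResult() return JSON strings, so in practice you will usually want to pull the text out of them. The sketch below shows one way to do that; the helper names and the sample string are illustrative. Word-level timing entries only appear if rec.SetWords(True) was called on the recognizer beforehand.

```python
import json


def extract_text(result_json: str) -> str:
    """Return the recognized text from a Result()/FinalResult() JSON string."""
    return json.loads(result_json).get("text", "")


def extract_words(result_json: str) -> list:
    """Return (word, start, end) tuples, or an empty list if word info is absent."""
    entries = json.loads(result_json).get("result", [])
    return [(w["word"], w["start"], w["end"]) for w in entries]


if __name__ == "__main__":
    # Illustrative sample shaped like a VOSK result with word timing enabled.
    sample = '{"result": [{"conf": 1.0, "start": 0.36, "end": 1.02, "word": "hello"}], "text": "hello"}'
    print(extract_text(sample))
    print(extract_words(sample))
```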

 Make sure your audio file is:

  • Mono
  • 16-bit PCM
  • 16000 Hz sample rate

Use ffmpeg to convert:

ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav
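Before transcribing, it can save some debugging time to verify that a file really meets the three requirements above. Here is a small sketch using only the standard wave module; check_wav is an illustrative name, and the 16000 Hz default simply mirrors the sample rate this guide uses.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000) -> list:
    """Return a list of format problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, found {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:  # 2 bytes per sample == 16-bit
            problems.append(f"expected 16-bit samples, found {wf.getsampwidth() * 8}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, found {wf.getframerate()} Hz")
    return problems
```

If check_wav("test.wav") returns anything other than an empty list, rerun the ffmpeg command above on the source audio.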


Step 4: Real-Time Microphone Transcription

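import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, 16000)

# Open the default microphone as 16 kHz, 16-bit mono to match the recognizer.
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=8000)
stream.start_stream()

print("Speak now... (Ctrl+C to stop)")
try:
    while True:
        # 4000 frames at 16 kHz is a quarter of a second of audio.
        data = stream.read(4000, exception_on_overflow=False)
        if rec.AcceptWaveform(data):
            print(rec.Result())
except KeyboardInterrupt:
    print(rec.FinalResult())
finally:
    # Release the audio device cleanly.
    stream.stop_stream()
    stream.close()
    p.terminate()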
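For a more responsive display, the recognizer's PartialResult() can be polled between completed utterances. The sketch below wraps that logic in a generator that works with any iterable of audio chunks, so the same loop can serve both files and the microphone. The function name stream_transcripts is illustrative; the recognizer methods it calls (AcceptWaveform, Result, PartialResult, FinalResult) are the ones VOSK's KaldiRecognizer provides.

```python
import json


def stream_transcripts(rec, chunks):
    """Yield ("partial", text) and ("final", text) events for a stream of audio chunks."""
    last_partial = ""
    for data in chunks:
        if rec.AcceptWaveform(data):
            # A full utterance was decoded.
            text = json.loads(rec.Result()).get("text", "")
            last_partial = ""
            if text:
                yield ("final", text)
        else:
            # Still mid-utterance: report the running guess only when it changes.
            partial = json.loads(rec.PartialResult()).get("partial", "")
            if partial and partial != last_partial:
                last_partial = partial
                yield ("partial", partial)
    # Flush any audio left in the recognizer once the stream ends.
    text = json.loads(rec.FinalResult()).get("text", "")
    if text:
        yield ("final", text)
```

With a microphone, chunks could be a small generator that repeatedly yields stream.read(4000, exception_on_overflow=False).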


Step 5: Multilingual Support

Want Odia, Hindi, or Spanish?

Just download the corresponding model:

  • Hindi: vosk-model-small-hi-0.4
  • Odia: [Custom training required]
  • Spanish: vosk-model-small-es-0.42

Usage remains the same—just switch the model path.
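Since only the model path changes, language selection can be reduced to a lookup table. This is a minimal sketch using the model names listed above; the language codes, the MODELS table, and the model_path_for helper are illustrative choices rather than anything VOSK defines.

```python
from pathlib import Path

# Model directory names as listed in this guide; extend as you download more.
MODELS = {
    "en": "vosk-model-small-en-us-0.15",
    "hi": "vosk-model-small-hi-0.4",
    "es": "vosk-model-small-es-0.42",
}


def model_path_for(lang: str, models_dir: Path = Path(".")) -> Path:
    """Return the directory of the model for a language code, or raise ValueError."""
    try:
        return models_dir / MODELS[lang]
    except KeyError:
        known = ", ".join(sorted(MODELS))
        raise ValueError(f"no model configured for {lang!r} (known: {known})") from None
```

The returned path can be passed to Model(str(path)) exactly as in the earlier steps.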

Step 6: Custom Vocabulary (Limited Grammar)

To improve accuracy on known phrases:

rec = KaldiRecognizer(model, 16000, '["hello", "world", "turn on light"]')

This helps for command-based interfaces or limited-domain apps.
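The third argument is a JSON array encoded as a string, so it is safer to build it with json.dumps than to type it by hand. The sketch below also appends "[unk]", the token VOSK grammars use to absorb speech that matches none of the phrases. build_grammar is an illustrative helper, and as far as I know grammar restriction works with the small models rather than the large ones, so treat that as an assumption to verify for your model.

```python
import json


def build_grammar(phrases, allow_unknown: bool = True) -> str:
    """Turn a list of phrases into the JSON grammar string KaldiRecognizer accepts."""
    # Normalize case and whitespace, drop empties, and remove duplicates in order.
    cleaned = [" ".join(p.lower().split()) for p in phrases]
    items = list(dict.fromkeys(p for p in cleaned if p))
    if allow_unknown:
        items.append("[unk]")  # catch-all for out-of-grammar speech
    return json.dumps(items)


if __name__ == "__main__":
    print(build_grammar(["hello", "world", "turn on light"]))
```

The resulting string would be passed as KaldiRecognizer(model, 16000, build_grammar([...])).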


Step 7: Integrate into Applications

  • Home Automation: Use recognized text to trigger GPIO or MQTT
  • Chatbot: Convert voice to text for chatbot input
  • Transcriber: Save output to .txt or .json
  • Call Monitor: Analyze recorded phone calls, or live calls in real time

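As one illustration of the home-automation idea above, recognized text can be matched against a table of phrases and routed to handler functions. This is a sketch only: make_dispatcher and the handlers are hypothetical, and a real setup would replace the handler bodies with GPIO or MQTT calls.

```python
def make_dispatcher(handlers):
    """Build a function that runs the first handler whose phrase appears in the text."""
    def dispatch(text: str):
        normalized = " ".join(text.lower().split())
        for phrase, handler in handlers.items():
            if phrase in normalized:
                return handler()
        return None  # nothing matched
    return dispatch


if __name__ == "__main__":
    # Hypothetical handlers; swap in real device control here.
    dispatch = make_dispatcher({
        "turn on light": lambda: "light:on",
        "turn off light": lambda: "light:off",
    })
    print(dispatch("please turn on light now"))
```

Plain substring matching is deliberately simple; combined with a restricted grammar it is often enough for a small set of commands.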

Raspberry Pi Setup



On a Raspberry Pi (Zero 2 W or 4):

pip install vosk
sudo apt install ffmpeg portaudio19-dev python3-pyaudio

Use a small model (<50MB) for optimal performance.


 What’s Next?

  •  Fine-tune or train a model with Kaldi (advanced)
  •  Try Whisper or DeepSpeech as alternative engines (both also run locally; the larger Whisper models need considerably more compute)
  •  Use G2P + phonemizer for custom languages like Odia

VOSK is a simple, powerful way to bring speech recognition into your projects without the internet. Its cross-platform support, Python-first approach, and offline models make it perfect for embedded and edge AI systems.

Whether you're building a smart assistant, transcription tool, or creative audio app—VOSK is a brilliant starting point.

