Speech recognition made easy

There are numerous how-tos for voice applications on the Raspberry Pi and other portable computers, like this one, but they generally rely on Google’s voice API. That may work when the Wi-Fi or network is up, but it sends your voice to Google and requires payment for usage over a certain amount. Users of your robotic application may not be thrilled to learn that it sends audio samples to Google, or that it stops working at the first Wi-Fi hiccup. Instead, let’s go through a simple on-device installation using Julius that works fairly accurately with no external dependencies.

Julius has extensive documentation (the book is here), and it has a Python library for interfacing with it. However, I wasn’t able to get that library working on Python 3, and there is actually no need to run Julius in server mode as it does: we can call the julius command directly with the subprocess module and read the results, reacting to each line as needed. To start with, let’s keep a config file for our script so that we don’t have to change code every time we try different models or modes. Most likely you will be using the English models (ENVR-v5.4.Dnn.Bin, the archive downloaded by the setup script in this post), unzipped into your directory. The julius command always needs a .jconf file, and optionally some other arguments. We will begin with a “hear.config” that Python 3 can read with configparser:

[julia]
juliabinary = julius/julius/julius
dnnconffile = ENVR-v5.4.Dnn.Bin/dnn.jconf
jconffile = ENVR-v5.4.Dnn.Bin/mic.jconf
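As a quick check that the file parses, here is a minimal sketch of reading those settings with configparser. It assumes the file begins with a [julia] section header (configparser refuses files without one, and main.py looks the settings up as config['julia']); the helper name load_settings and the inline SAMPLE string are only illustrative stand-ins for reading hear.config from disk.

```python
import configparser

# Sample contents standing in for hear.config. The [julia] section name is an
# assumption chosen to match the config['julia'] lookup used in main.py.
SAMPLE = """
[julia]
juliabinary = julius/julius/julius
dnnconffile = ENVR-v5.4.Dnn.Bin/dnn.jconf
jconffile = ENVR-v5.4.Dnn.Bin/mic.jconf
"""

def load_settings(text):
    """Parse config text and return the [julia] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["julia"])

settings = load_settings(SAMPLE)
for key, value in settings.items():
    print(f"{key} = {value}")
```

For the real file you would call parser.read('hear.config') instead of read_string.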

The code will call the Julius binary with various arguments, which must be passed as a list with one element per argument; for example, “ls -lah” becomes [“ls”, “-lah”] for the subprocess module.
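To make the list form concrete, here is a small sketch using the standard shlex module to split a shell-style string the same way, followed by a harmless subprocess call. The “hello” child command is just a placeholder so the example runs anywhere; it is not part of the Julius setup.

```python
import shlex
import subprocess
import sys

# shlex.split turns a shell-style string into the list form subprocess expects.
args = shlex.split("ls -lah")
print(args)

# The same idea applied to the Julius command line used later in this post.
julius_args = shlex.split(
    "julius/julius/julius -C ENVR-v5.4.Dnn.Bin/mic.jconf "
    "-dnnconf ENVR-v5.4.Dnn.Bin/dnn.jconf"
)
print(julius_args)

# Run a placeholder command in list form and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Passing a list avoids going through a shell, so paths containing spaces need no extra quoting.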

So, for starters, create three files: the hear.config mentioned above, main.py, and this setup.sh, which you can run to install Julius and the model it needs:

echo "Downloading and compiling required files..."
sudo apt-get update
sudo apt-get install -y git build-essential zlib1g-dev libsdl2-dev libasound2-dev
wget https://sourceforge.net/projects/juliusmodels/files/ENVR-v5.4.Dnn.Bin.zip/download
unzip download && rm ./download
git clone https://github.com/julius-speech/julius.git
cd julius
./configure --enable-words-int && make -j4 && echo "ALL DONE"

Run that script, entering your admin password for the installation when prompted, and wait for the downloading and compiling to finish; it should take 10 minutes or so. Now put this mic.jconf file in the ENVR-v5.4.Dnn.Bin folder:

-input mic
-filelist test.dbl
-htkconf wav_config
-h ENVR-v5.3.am
-hlist ENVR-v5.3.phn
-d ENVR-v5.3.lm
-v ENVR-v5.3.dct
-b 4000 
-lmp 12 -6
-lmp2 12 -6
-iwcd1 max
-spmodel sp
-gprune none
-sepnum 150
-b2 360 
-n 40 
-s 2000 
-m 8000 
-lookuprange 5 
-sb 80

And the revised dnn.jconf:

feature_type MFCC_E_D_A_Z
feature_options -htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic
num_threads 1
feature_len 48
context_len 11
input_nodes 528
output_nodes 7461
hidden_nodes 1536
hidden_layers 5
W1 ENVR-v5.3.layer2_weight.npy
W2 ENVR-v5.3.layer3_weight.npy
W3 ENVR-v5.3.layer4_weight.npy
W4 ENVR-v5.3.layer5_weight.npy
W5 ENVR-v5.3.layer6_weight.npy
B1 ENVR-v5.3.layer2_bias.npy
B2 ENVR-v5.3.layer3_bias.npy
B3 ENVR-v5.3.layer4_bias.npy
B4 ENVR-v5.3.layer5_bias.npy
B5 ENVR-v5.3.layer6_bias.npy
output_W ENVR-v5.3.layerout_weight.npy
output_B ENVR-v5.3.layerout_bias.npy
state_prior_factor 1.0
state_prior ENVR-v5.3.prior
state_prior_log10nize false

Start out with a basic Python script that reads all output from the process and prints the results. With the config above, it runs “julius/julius/julius -C ENVR-v5.4.Dnn.Bin/mic.jconf -dnnconf ENVR-v5.4.Dnn.Bin/dnn.jconf” and receives the results line by line as you speak clearly into the mic:

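#!/usr/bin/env python3

from threading import Thread
import subprocess
import configparser

class Recognizer(Thread):
    def __init__(self, config):
        super().__init__()  # Thread must be initialised before start()
        self.config = config
        self.proc = None

    def run(self):
        cfg = self.config['julia']
        # Build the command as a list, one element per argument
        cmd = [cfg['juliabinary'], '-C', cfg['jconffile']]
        if 'dnnconffile' in cfg:
            cmd += ['-dnnconf', cfg['dnnconffile']]
        # Capture the output so it can be read line by line; merging stderr
        # into stdout is an assumption so that warnings show up here too
        self.proc = subprocess.Popen(cmd,
                                     stdout=subprocess.PIPE,
                                     stderr=subprocess.STDOUT)
        for line in iter(self.proc.stdout.readline, b''):
            line = line.decode('utf-8', errors='replace')
            if line.startswith('sentence1: '):
                found = line.replace('sentence1: <s>','').replace('</s>','').strip()
                print(found)
                #Your logic here...
            #elif line.startswith('Warning: strip: '): # = Warning: strip: sample 0-666 is invalid, stripped
                #print('no audio')
        #Binary finished
        print('Julia quit')

config = configparser.ConfigParser()
config.read('hear.config')  # expects a [julia] section
recognizeThread = Recognizer(config)
recognizeThread.start()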

Now if you run the above Python script, you will see lines of text as they are spoken. Say “One Two Three” and watch the words appear in the terminal; you can respond by performing some action in the “Your logic here” section above. Because the recognizer runs in its own thread, it would also be suitable for running in the background of a GUI project.
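As one way to fill in the “Your logic here” section, here is a minimal sketch that turns a “sentence1:” line into a list of words and looks up an action for it. The ACTIONS table, the action names, and the function names are hypothetical, and lowercasing the words is an assumption made so that matching does not depend on how the model capitalises its output.

```python
def parse_sentence(line):
    """Return the recognized words from a 'sentence1:' line, or None."""
    prefix = "sentence1:"
    if not line.startswith(prefix):
        return None
    text = line[len(prefix):]
    # Drop the sentence start/end markers that surround the words.
    text = text.replace("<s>", "").replace("</s>", "")
    return text.strip().lower().split()

# Hypothetical phrase-to-action table; replace with your own commands.
ACTIONS = {
    ("one", "two", "three"): "count",
    ("stop",): "halt",
}

def react(line):
    """Map one line of recognizer output to an action name, or None."""
    words = parse_sentence(line)
    if words is None:
        return None
    return ACTIONS.get(tuple(words))

print(react("sentence1: <s> one two three </s>\n"))
print(react("pass1_best: <s> one two"))
```

Keeping the parsing in a plain function like this makes it easy to test without a microphone attached.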

Now let’s say you want to transcribe various audio files instead of your microphone. This is the default config in the example, julius.jconf:

-input file
-filelist test.dbl
-htkconf wav_config

Point test.dbl at a different file if you want. Unfortunately, this only supports 16-bit raw audio.
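If your recordings are WAV files, one possible workaround is to strip the header with Python’s standard wave module and list the result in test.dbl. This is only a sketch under a few assumptions: the file list is taken to be one path per line, the 16 kHz mono demo format is a guess at what the model expects, and the byte order Julius wants for raw input should be checked against its documentation (the byteswap flag is there in case it differs from WAV’s little-endian samples). The function names are illustrative.

```python
import array
import os
import tempfile
import wave

def wav_to_raw(wav_path, raw_path, byteswap=False):
    """Write the samples of a 16-bit WAV file to a headerless raw file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    if byteswap:
        # Swap the two bytes of every 16-bit sample.
        samples = array.array("h")
        samples.frombytes(frames)
        samples.byteswap()
        frames = samples.tobytes()
    with open(raw_path, "wb") as raw:
        raw.write(frames)
    return frames

def write_filelist(list_path, audio_paths):
    """Write one audio path per line (assumed file-list format)."""
    with open(list_path, "w") as handle:
        for path in audio_paths:
            handle.write(path + "\n")

# Demo with a tiny generated WAV so the sketch runs without real recordings.
with tempfile.TemporaryDirectory() as tmp:
    wav_path = os.path.join(tmp, "demo.wav")
    raw_path = os.path.join(tmp, "demo.raw")
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x01\x02" * 4)
    plain = wav_to_raw(wav_path, raw_path)
    swapped = wav_to_raw(wav_path, raw_path, byteswap=True)
    write_filelist(os.path.join(tmp, "test.dbl"), [raw_path])
    print(len(plain), len(swapped))
```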

For other libraries you may find useful, see the list that Mycroft put together; there are many other speech-to-text libraries, open source or paid.
