There are numerous how-tos for voice applications on the Raspberry Pi and other portable computers, like this one, but they generally use Google’s voice API. That may work… when the wifi or network is up, but it sends your voice to Google and requires payment for usage over a certain amount. Users of your robotic application may not be thrilled to learn that it sends audio samples to Google, or that it stops working at the first wifi hiccup! Instead, let’s go through a simple on-device installation that works fairly accurately with no external dependencies!
Julius has extensive documentation (the book is here) and a Python library for talking to it. However, I wasn’t able to get that library working on Python 3, and there is actually no need to run Julius in server mode as that library does: we can call the julius command directly with the subprocess module and react to each line of output as it arrives. To start with, let’s keep a config file for our script so that we don’t have to change code every time we try different models or modes. Most likely you will be using the English models (ENVR-v5.4.Dnn.Bin), unzipped into your project directory. The julius command always needs a .jconf file, plus some optional arguments. We will begin with a “hear.config” that Python 3 can read with configparser:
[julia]
juliabinary = julius/julius/julius
dnnconffile = ENVR-v5.4.Dnn.Bin/dnn.jconf
jconffile = ENVR-v5.4.Dnn.Bin/mic.jconf
The code will call the Julius binary with various arguments, which must be passed as a list with one element per argument. For the subprocess module, “ls -lah” would be [“ls”, “-lah”], for example.
So for starters, create three files: the hear.config above, main.py, and this setup.sh, which you can run to install Julius and download the model it needs:
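To make the list form concrete, here is a minimal sketch using only the standard library. `shlex.split` splits a command string the way a shell would, and `build_command` is a hypothetical helper (not part of Julius) that assembles the argument list from the `[julia]` section of hear.config. The config text is inlined so the sketch runs without any files on disk.

```python
import configparser
import shlex

# The example from the text: one command string becomes a list of arguments.
LS_ARGS = shlex.split("ls -lah")  # ['ls', '-lah']

# Inline copy of hear.config, so nothing needs to exist on disk.
SAMPLE_CONFIG = """
[julia]
juliabinary = julius/julius/julius
dnnconffile = ENVR-v5.4.Dnn.Bin/dnn.jconf
jconffile = ENVR-v5.4.Dnn.Bin/mic.jconf
"""


def build_command(section):
    """Return the Julius command as a list (hypothetical helper)."""
    cmd = [section["juliabinary"], "-C", section["jconffile"]]
    # -dnnconf is only added when the config names a DNN config file.
    if "dnnconffile" in section:
        cmd += ["-dnnconf", section["dnnconffile"]]
    return cmd


if __name__ == "__main__":
    parser = configparser.ConfigParser()
    parser.read_string(SAMPLE_CONFIG)
    print(LS_ARGS)
    print(build_command(parser["julia"]))
```

Passing a list like this to subprocess avoids going through a shell, so paths containing spaces do not need any quoting.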
#!/bin/bash
#Prerequisites
echo "Downloading and compiling required files..."
sudo apt-get update
sudo apt-get install -y git build-essential zlib1g-dev libsdl2-dev libasound2-dev
wget https://sourceforge.net/projects/juliusmodels/files/ENVR-v5.4.Dnn.Bin.zip/download
unzip download && rm ./download
git clone https://github.com/julius-speech/julius.git
cd julius
./configure --enable-words-int && make -j4 && echo "ALL DONE"
Run that script, entering your admin password for the installation when prompted, and wait for the downloading and compiling to finish. It should take 10 minutes or so. Now put the mic.jconf file in the ENVR-v5.4.Dnn.Bin folder: mic.jconf:
-input mic
-filelist test.dbl
-htkconf wav_config
-h ENVR-v5.3.am
-hlist ENVR-v5.3.phn
-d ENVR-v5.3.lm
-v ENVR-v5.3.dct
-b 4000
-lmp 12 -6
-lmp2 12 -6
-walign
-fallback1pass
-multipath
-iwsp
-norealtime
-iwcd1 max
-spmodel sp
-spsegment
-gprune none
-no_ccd
-sepnum 150
-b2 360
-n 40
-s 2000
-m 8000
-lookuprange 5
-sb 80
-forcedict
-cutsilence
And the revised dnn.jconf:
feature_type MFCC_E_D_A_Z
feature_options -htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic
num_threads 1
feature_len 48
context_len 11
input_nodes 528
output_nodes 7461
hidden_nodes 1536
hidden_layers 5
W1 ENVR-v5.3.layer2_weight.npy
W2 ENVR-v5.3.layer3_weight.npy
W3 ENVR-v5.3.layer4_weight.npy
W4 ENVR-v5.3.layer5_weight.npy
W5 ENVR-v5.3.layer6_weight.npy
B1 ENVR-v5.3.layer2_bias.npy
B2 ENVR-v5.3.layer3_bias.npy
B3 ENVR-v5.3.layer4_bias.npy
B4 ENVR-v5.3.layer5_bias.npy
B5 ENVR-v5.3.layer6_bias.npy
output_W ENVR-v5.3.layerout_weight.npy
output_B ENVR-v5.3.layerout_bias.npy
state_prior_factor 1.0
state_prior ENVR-v5.3.prior
state_prior_log10nize false
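The mic.jconf above points at several model files by relative path, so a file dropped into the wrong folder only shows up when Julius starts. As a rough pre-flight check, the sketch below scans a jconf for a handful of file options and reports any that are missing. It assumes relative paths are resolved against the folder containing the jconf and that `#` starts a comment; `missing_files` and the `FILE_OPTIONS` set are hypothetical names chosen for this example, not anything Julius provides.

```python
import os
import tempfile

# Options from mic.jconf whose value is a file path (an illustrative subset).
FILE_OPTIONS = {"-h", "-hlist", "-d", "-v", "-htkconf"}


def missing_files(jconf_path):
    """Return referenced files that do not exist next to the jconf file."""
    base = os.path.dirname(os.path.abspath(jconf_path))
    tokens = []
    with open(jconf_path, encoding="utf-8") as handle:
        for raw in handle:
            # Drop anything after '#', then split on whitespace.
            tokens.extend(raw.split("#", 1)[0].split())
    missing = []
    # Walk (option, value) pairs; works whether options share a line or not.
    for option, value in zip(tokens, tokens[1:]):
        if option in FILE_OPTIONS:
            if not os.path.exists(os.path.join(base, value)):
                missing.append(value)
    return missing


if __name__ == "__main__":
    # Demo in a throwaway folder: one model file present, one absent.
    with tempfile.TemporaryDirectory() as folder:
        open(os.path.join(folder, "demo.am"), "w").close()
        jconf = os.path.join(folder, "mic.jconf")
        with open(jconf, "w", encoding="utf-8") as handle:
            handle.write("-input mic\n-h demo.am\n-hlist demo.phn\n")
        print(missing_files(jconf))
```

Running it against ENVR-v5.4.Dnn.Bin/mic.jconf before starting the recognizer should print an empty list if everything is in place.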
Start out with a basic Python script that reads all output from the process and prints each recognized sentence. With the config above, it runs “julius/julius/julius -C ENVR-v5.4.Dnn.Bin/mic.jconf -dnnconf ENVR-v5.4.Dnn.Bin/dnn.jconf” and picks up the results line by line as you speak clearly into the mic:
#!/usr/bin/env python3
from threading import Thread
import subprocess
import configparser
import json
import re

class Recognizer(Thread):
    def __init__(self, config):
        Thread.__init__(self)
        self.config = config

    def run(self):
        cfg = self.config['julia']
        cmd = [cfg['juliabinary'], '-C', cfg['jconffile']]
        if 'dnnconffile' in cfg:
            cmd.append('-dnnconf')
            cmd.append(cfg['dnnconffile'])
        print(cmd)
        self.proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        for line in iter(self.proc.stdout.readline, b''):
            line = line.decode('utf-8')
            if line.find('sentence1: ')==0:
                found = line.replace('sentence1: <s>','').replace('</s>','').strip()
                print(found)
                #Your logic here...
            #elif line.find('Warning: strip: ')==0:
                # = Warning: strip: sample 0-666 is invalid, stripped
                #print('no audio')
            #else:
                #print('audio')
        #Binary finished
        print('Julia quit')

config = configparser.ConfigParser()
config.read('hear.config')
recognizeThread = Recognizer(config)
recognizeThread.start()
Now if you run the above Python script, you will see lines of text as they are spoken. Say “one two three” and watch the words appear in the terminal. You can respond by doing some action in the “Your logic here” section above. Because the recognizer runs in its own thread, it is also suitable for running in the background of a GUI project.
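One way to fill in the logic section is a small lookup table that maps recognized phrases to actions. This is only a sketch: `extract_sentence` mirrors the string handling in the script above, while `ACTIONS` and `react` are hypothetical names, and the phrases are placeholders for whatever your application listens for.

```python
PREFIX = "sentence1: "


def extract_sentence(line):
    """Return the recognized text from a 'sentence1:' line, else None."""
    if not line.startswith(PREFIX):
        return None
    text = line[len(PREFIX):]
    # Remove the sentence start/end markers that Julius wraps around the words.
    return text.replace("<s>", "").replace("</s>", "").strip()


# Hypothetical phrase-to-action table; each action returns a status string.
ACTIONS = {
    "one two three": lambda: "counting",
    "stop": lambda: "stopping",
}


def react(line):
    """Run the action for a recognized phrase; return its result or None."""
    sentence = extract_sentence(line)
    if sentence is None:
        return None
    action = ACTIONS.get(sentence.lower())
    return action() if action else None


if __name__ == "__main__":
    print(react("sentence1: <s> one two three </s>\n"))
```

Inside the thread, calling `react(line)` in place of `print(found)` keeps the parsing and the application logic separate, which makes the logic easy to test without a microphone.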
Now let’s say you want to read the text out of various audio files instead of your microphone… This is the default config in the example: julius.jconf:
-input file
-filelist test.dbl
-htkconf wav_config
...
Edit test.dbl to list a different file if you want… unfortunately Julius only reads 16-bit PCM audio here (headerless raw, or uncompressed WAV), so anything else needs converting first.
For other libraries you may find useful, see the list that Mycroft put together. There are many other speech-to-text libraries, open source or paid.
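If your source recordings are WAV files, the standard wave module can tell you whether they are 16-bit before you hand them to Julius or convert them. The sketch below is hedged on two assumptions: that test.dbl is a plain text file with one audio path per line, and that you want to inspect WAV input (the wave module cannot read headerless raw files). `describe_wav` and `write_filelist` are hypothetical helper names.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return bit depth, channel count and sample rate of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "bits": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "rate": wav.getframerate(),
        }


def write_filelist(paths, list_path):
    """Write one audio path per line (assumed test.dbl layout)."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Create a short silent 16-bit mono clip so the demo needs no input.
        clip = os.path.join(folder, "silence.wav")
        with wave.open(clip, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)      # 2 bytes per sample = 16-bit
            wav.setframerate(16000)  # assumed rate, check your model
            wav.writeframes(b"\x00\x00" * 1600)
        info = describe_wav(clip)
        print(info)
        if info["bits"] == 16:
            write_filelist([clip], os.path.join(folder, "test.dbl"))
```

The 16000 Hz rate in the demo is only a placeholder; the rate the model actually expects is determined by its own configuration.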