In this Arduino machine learning tutorial we're going to use a microphone to identify the word you speak.
This is going to run on an Arduino Nano (old generation), equipped with 32 KB of flash and only 2 KB of RAM.

(image from https://www.udemy.com/course/learn-audio-processing-complete-engineers-course/)

In this project the features are going to be the Fast Fourier Transform of 32 analog readings from a microphone, taken starting from when a loud sound is detected, sampled at 5 millisecond intervals.

This tutorial is not about "wake word" detection: the system can't tell a known word from any other word. What it can do is classify the word you speak among the ones you trained it to recognize!

1. Features definition

The microphone we're going to use is a super simple device: it produces an analog signal (0-1023) based on the sound it detects.

(image from http://arduinolearning.com/code/ky038-microphone-module-and-arduino-example.php)

When working with audio you almost never want to use the raw readings, since by themselves they're hardly useful. Instead, you usually apply a Fourier Transform, which extracts the frequency information from a time signal. That frequency spectrum is going to become our feature vector: let's see how in the next steps.
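To get an intuition of what this means, here's a minimal demo you can run on your PC (plain Python with numpy, nothing Arduino-specific): a 50 Hz sine wave buried in noise shows up as a clear peak in its spectrum.

# frequency extraction demo (runs on your PC, not on the board)
import numpy as np

SAMPLE_RATE = 200          # samples per second, like our 5 ms interval
N = 64                     # number of samples (a power of 2)

t = np.arange(N) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 50 * t) + 0.3 * np.random.randn(N)

spectrum = np.abs(np.fft.rfft(signal))
frequencies = np.fft.rfftfreq(N, d=1 / SAMPLE_RATE)

# the bin with the largest magnitude sits at ~50 Hz
print('Dominant frequency:', frequencies[np.argmax(spectrum[1:]) + 1], 'Hz')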

2. Record sample data

First of all, we start with raw audio data. The following plot shows me saying random words.

Raw audio stream

#define MIC A0
#define INTERVAL 5

void setup() {
    Serial.begin(115200);
    pinMode(MIC, INPUT);
}

void loop() {
    Serial.println(analogRead(MIC));
    delay(INTERVAL);
}
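
If you want to reproduce such a plot yourself, here's a minimal companion script for your PC (assuming you have pyserial and matplotlib installed; the serial port name will depend on your system):

# plot_stream.py - reads the values printed by the sketch above and plots them
import serial
import matplotlib.pyplot as plt

PORT = '/dev/ttyUSB0'  # e.g. COM3 on Windows

readings = []

with serial.Serial(PORT, 115200, timeout=5) as port:
    while len(readings) < 2000:
        line = port.readline().strip()
        if line:
            readings.append(int(line))

plt.plot(readings)
plt.title('Raw audio stream')
plt.show()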

2.1 Translate the raw values

For the Fourier Transform to work well, we need to feed it a signal centered around zero, with both positive and negative values. analogRead() returns only positive values, though, so we need to translate them.

int16_t readMic() {
    // translate the analog value to a zero-centered interval:
    // subtract the ADC midpoint (512), then divide by 4 to scale it down
    return (analogRead(MIC) - 512) >> 2;
}

2.2 Detect sound

As in the tutorial about gesture classification, we'll start recording the features when a word begins to be pronounced. In this project, too, we'll use a threshold to detect the start of a word.

To do this, we first record a "background" sound level, that is the value produced by the sensor when we're not talking at all.

float backgroundSound = 0;

void setup() {
    Serial.begin(115200);
    pinMode(MIC, INPUT);
    calibrate();
}

void calibrate() {
    for (int i = 0; i < 200; i++)
        backgroundSound += readMic();

    backgroundSound /= 200;

    Serial.print("Background sound level is ");
    Serial.println(backgroundSound);
}

At this point we can detect the start of a word when the sound level exceeds the background one by a given threshold.

// adjust as per your need
// it will depend on the sensitivity of your microphone
#define SOUND_THRESHOLD 3

void loop() {
    if (!soundDetected()) {
        delay(10);
        return;
    }
}

bool soundDetected() {
    return abs(readMic() - backgroundSound) >= SOUND_THRESHOLD;
}

2.3 Record the words

As for the gestures, we'll record a fixed number of readings at a fixed interval.
Here a tradeoff arises: you want enough readings to accurately describe the words you want to classify, but not too many, otherwise your model is going to be too large to fit on your board.

I made some experiments, and I got good results with 32 samples at a 5 millisecond interval, which covers ~160 ms of speech (32 samples × 5 ms).

The dilemma here is that the Fast Fourier Transform, to work, needs a number of samples that is a power of 2. So, if you think 32 features are not enough for you, you're forced to go with at least 64: this has a REALLY bad impact on the model size.

#define NUM_SAMPLES 32
#define INTERVAL 5

float features[NUM_SAMPLES];
double featuresForFFT[NUM_SAMPLES];

void loop() {
    if (!soundDetected()) {
        delay(10);
        return;
    }

    captureWord();
    printFeatures();
    delay(1000);
}

void captureWord() {
    for (uint16_t i = 0; i < NUM_SAMPLES; i++) {
        features[i] = readMic();
        delay(INTERVAL);
    }
}

void printFeatures() {
    const uint16_t numFeatures = sizeof(features) / sizeof(float);
    
    for (int i = 0; i < numFeatures; i++) {
        Serial.print(features[i]);
        Serial.print(i == numFeatures - 1 ? '\n' : ',');
    }
}

2.4 Fast Fourier Transform

Here we are at the Fourier Transform. When implemented in software, the most widely used implementation of the FT is actually called the Fast Fourier Transform (FFT), which is - as you may guess - a fast implementation of the FT.

Luckily for us, there exists a library for Arduino that does FFT.

And it is so easy to use that we only need a single line to get usable results!

#include <arduinoFFT.h>

arduinoFFT fft;

void captureWord() {
    // sample the microphone at a fixed interval
    for (uint16_t i = 0; i < NUM_SAMPLES; i++) {
        featuresForFFT[i] = readMic();
        delay(INTERVAL);
    }

    // apply a Hamming window to the samples, in place
    fft.Windowing(featuresForFFT, NUM_SAMPLES, FFT_WIN_TYP_HAMMING, FFT_FORWARD);

    // copy the processed values into the features vector for the classifier
    for (int i = 0; i < NUM_SAMPLES; i++)
        features[i] = featuresForFFT[i];
}

You don't need to know what the Windowing function actually does (I don't either): what matters is that it extracts meaningful information from our signal. Since it works in place on the array we pass it, after that line (and the copy that follows) we have what we need to feed our classifier.
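For the curious: a Hamming window is just a bell-shaped curve that smoothly tapers the edges of the signal. Here's a minimal Python sketch of (roughly) what the library computes under the hood:

import numpy as np

N = 32
n = np.arange(N)

# the classic Hamming curve: close to 0 at the edges, 1 in the middle
weights = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# windowing = multiplying each sample by its weight
samples = np.random.randn(N)
windowed = samples * weights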

At this point, record 10-15 samples for each word and save them to a file, one for each word.
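Here's a minimal script to do the recording from your PC (assuming pyserial again; the port name, the word and the dataset folder layout are just examples, adjust them to your setup):

# record_word.py - saves the feature vectors printed by the board to a CSV file
import serial

PORT = '/dev/ttyUSB0'
WORD = 'arduino'        # the word you're about to pronounce
NUM_RECORDINGS = 15

with serial.Serial(PORT, 115200, timeout=60) as port:
    with open('dataset/%s.csv' % WORD, 'w') as file:
        recorded = 0

        while recorded < NUM_RECORDINGS:
            line = port.readline().decode().strip()

            # keep only the comma-separated feature lines
            if ',' in line:
                file.write(line + '\n')
                recorded += 1
                print('%d/%d' % (recorded, NUM_RECORDINGS))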

After you have recorded the samples for a word, I suggest you manually check them. It is sufficient to look at the first 3 values of each sample: if one of them seems clearly out of range, I suggest you delete that sample. You may lose some accuracy, but your model will be smaller.
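If eyeballing the files feels tedious, a quick numpy sketch can flag the suspicious rows for you (purely illustrative; tune the threshold to your data):

import numpy as np

samples = np.loadtxt('dataset/arduino.csv', delimiter=',')
firsts = samples[:, :3]

# flag rows whose first 3 values are far from the column medians
deviation = np.abs(firsts - np.median(firsts, axis=0))
suspicious = np.any(deviation > 3 * firsts.std(axis=0), axis=1)

print('Rows to double-check:', np.nonzero(suspicious)[0])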

3. Train and export the classifier

For a detailed guide, refer to the dedicated tutorial.

from sklearn.svm import SVC
from micromlgen import port

# put your samples in the dataset folder
# one class per file
# one feature vector per line, in CSV format
# load_features() is the helper described in the tutorial linked above
features, classmap = load_features('dataset/')
X, y = features[:, :-1], features[:, -1]
classifier = SVC(kernel='linear').fit(X, y)
c_code = port(classifier, classmap=classmap)
print(c_code)

At this point you have to copy the printed code and import it in your Arduino project, in a file called model.h.
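If you prefer to skip the copy-paste step, you can also write the generated code straight to a file from the same Python script (a tiny sketch; run it from your Arduino project folder):

with open('model.h', 'w') as file:
    file.write(c_code)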

In this machine learning project we're not going to achieve 100% accuracy easily. Audio is quite noisy, so you should experiment with a few parameters for the classifier and choose the ones that perform best. I'll showcase a few examples:

Decision boundaries of 2 PCA components of Word classification, linear kernel

Decision boundaries of 2 PCA components of Word classification, poly-3 kernel

Decision boundaries of 2 PCA components of Word classification, rbf kernel

3.1 Select a suitable model

Here's an overview table of the 3 tests I did.

Kernel | No. support vectors | Avg. accuracy
Linear | 22                  | 87%
Poly 3 | 29                  | 91%
RBF    | 36                  | 94%

Of course the one with the RBF kernel would be the most desirable since it has a very high accuracy: 36 support vectors, though, will produce a model too large to fit on an Arduino Nano.

So you're forced to pick the one with the highest accuracy that fits on your board: in my case it was the linear kernel one.
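If you want to run the same comparison on your own dataset, here's a quick sketch using scikit-learn's cross-validation (X and y are the arrays from the training snippet above; your numbers will certainly differ from mine):

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for kernel in ('linear', 'poly', 'rbf'):
    classifier = SVC(kernel=kernel, degree=3)  # degree only affects the poly kernel
    accuracy = cross_val_score(classifier, X, y, cv=3).mean()
    classifier.fit(X, y)

    print('%s: %d support vectors, %.0f%% accuracy' % (
        kernel, len(classifier.support_vectors_), accuracy * 100))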

4. Run the inference

#include "model.h"

void loop() {
    if (!soundDetected()) {
        delay(10);
        return;
    }

    captureWord();
    Serial.print("You said ");
    Serial.println(classIdxToName(predict(features)));

    delay(1000);
}

And that's it: word classification through machine learning on your Arduino board! Say one of the words you trained and see the classification result on the Serial Monitor.

Here's me testing the system (English is not my native language, so forgive my pronunciation). The video quality is very low, I know, but you get the point.

As you can hear from the video, you should be quite consistent when pronouncing the words. I have to admit there are cases where the system totally fails to classify the words correctly. Restarting helps most of the time, so I suspect there could be some kind of memory leak that "corrupts" the inference procedure.

Troubleshooting

It can happen that when running micromlgen.port(clf) you get a TemplateNotFound error. To solve the problem, first of all uninstall micromlgen.

pip uninstall micromlgen

Then head to GitHub, download the package as a zip and extract the micromlgen folder into your project.


Did you find this tutorial useful? Was it easy to follow or did I miss something? Let me know in the comments so I can keep improving the blog.



Check the full project code on GitHub

Help the blog grow