Let's revamp the post I wrote about word classification using Machine Learning on Arduino, this time using a proper microphone (the MP34DT05 mounted on the Arduino Nano 33 BLE Sense) instead of a cheap analog one: will the results improve?

from https://www.udemy.com/course/learn-audio-processing-complete-engineers-course/

Updated on 16 October 2020: step by step explanation of the process with ready-made sketch code

What you'll learn

This tutorial will teach you how to capture audio from the Arduino Nano 33 BLE Sense microphone and classify it: at the end of this post, you will have a trained model able to detect in real time the word you say, among the ones you trained it to recognize. The classification will occur directly on your Arduino board.

This is not a general-purpose speech recognizer able to convert speech-to-text: it works only on the words you train it on.

What you'll need

To install the software, open your terminal and install the Python libraries used to train and export the model.

pip install -U scikit-learn
pip install -U micromlgen

Step 1. Capture audio samples

First of all, we need to capture a bunch of examples of the words we want to recognize.

In the original post, we used an analog microphone to record the audio. It is for sure the easiest way to interact with audio on a microcontroller since you only need to analogRead() the selected pin to get a value from the sensor.

This simplicity, however, comes at the cost of nearly nonexistent signal pre-processing from the sensor itself: most of the time, you will get junk - I don't want to be rude, but that's it.

Theory: Pulse-density modulation (a.k.a. PDM)

The microphone mounted on the Arduino Nano 33 BLE Sense (the MP34DT05) is fortunately much better than this: it gives you access to a modulated signal much more suitable for our processing needs.

The modulation used is pulse-density: I won't try to explain how it works, since I'm not a DSP expert and it is outside the main scope of this article (refer to Wikipedia for more information).
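If you're curious anyway, the core idea fits in a few lines of Python. What follows is only a toy simulation of the textbook first-order scheme, written on the assumption that samples are floats in the range [-1, 1]; the function names are made up for this example, and it is not meant to describe what the MP34DT05 or the PDM library do internally. The point is simply that the louder (more positive) the signal, the denser the stream of 1s.

```python
def pdm_encode(samples):
    """Encode samples in [-1, 1] as a stream of 1/0 pulses (first-order scheme)."""
    bits = []
    error = 0.0  # running quantization error carried from pulse to pulse
    for s in samples:
        error += s
        level = 1.0 if error > 0 else -1.0  # emit a high or a low pulse
        bits.append(1 if level > 0 else 0)
        error -= level
    return bits


def pulse_density(bits):
    """Fraction of 1s in the pulse stream."""
    return sum(bits) / len(bits)


def pdm_decode(bits):
    """Map the pulse density back to an average amplitude in [-1, 1]."""
    return 2.0 * pulse_density(bits) - 1.0


if __name__ == "__main__":
    # constant inputs make the relation between amplitude and density easy to see
    for amplitude in (-0.5, 0.0, 0.5):
        bits = pdm_encode([amplitude] * 1000)
        print(amplitude, pulse_density(bits), pdm_decode(bits))
```

Averaging the pulses over a window is the crudest possible decoder; a real decoder would use a proper low-pass filter.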

What matters to us is that we can grab an array of bytes from the microphone and extract its Root Mean Square (a.k.a. RMS) to be used as a feature for our Machine Learning model.
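As a quick refresher, the RMS of a block of samples is the square root of the mean of their squares, so it grows with how loud the block is regardless of the sign of the individual samples. Here is a minimal sketch in Python; the function name and the sample values are invented for illustration, and on the board this job is left to the Mic class, so the snippet only shows what the number means.

```python
import math


def rms(samples):
    """Root Mean Square of a block of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    quiet_block = [3, -3, 3, -3]          # small oscillation around zero
    loud_block = [300, -300, 300, -300]   # same shape, 100 times larger
    print(rms(quiet_block), rms(loud_block))
```

A sequence of such values over time describes how the loudness of a word evolves, which is the kind of feature vector the classifier is trained on.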

I had some difficulty finding examples of how to access the microphone on the Arduino Nano 33 BLE Sense board: fortunately, there's a GitHub repo from DelaGia that shows how to access all the sensors of the board.

I extracted the microphone part and encapsulated it in an easy-to-use class, so you don't really need to dig into the implementation details if you're not interested.

Practice: the code to capture the samples

When loaded on your Arduino Nano 33 BLE Sense, the following sketch will wait for you to speak in front of the microphone: once it detects a sound, it will record 64 audio values and print them to the serial monitor.

From my experience, 64 samples are sufficient to cover short words such as yes, no, play, stop: if you plan to classify longer words, you may need to increase this number.

I suggest you keep the words short: longer words will probably decrease the accuracy of the model. If you nonetheless want a longer duration, at least keep the number of words as low as possible.

Download the Arduino Nano 33 BLE Sense - Capture audio samples sketch, open it in the Arduino IDE and flash it to your board.

Here's the main code.

#include "Mic.h"

// tune as per your needs
#define SAMPLES 64
#define GAIN (1.0f/50)
#define SOUND_THRESHOLD 2000

float features[SAMPLES];
Mic mic;

void setup() {
    Serial.begin(115200);
    PDM.onReceive(onAudio);
    mic.begin();
    delay(3000);
}

void loop() {
    // wait for a word to be pronounced
    if (recordAudioSample()) {
        // print features to serial monitor
        for (int i = 0; i < SAMPLES; i++) {
            Serial.print(features[i], 6);
            Serial.print(i == SAMPLES - 1 ? '\n' : ',');
        }

        delay(1000);
    }

    delay(20);
}

/**
 * PDM callback to update mic object
 */
void onAudio() {
    mic.update();
}

/**
 * Read given number of samples from mic
 */
bool recordAudioSample() {
    if (mic.hasData() && mic.data() > SOUND_THRESHOLD) {

        for (int i = 0; i < SAMPLES; i++) {
            while (!mic.hasData())
                delay(1);

            features[i] = mic.pop() * GAIN;
        }

        return true;
    }

    return false;
}

Now that we have the acquisition logic in place, it's time for you to record some samples of the words you want to classify.

Action: capture the word examples

Now you have to capture as many samples of the words you want to classify as possible.

Open the serial monitor and pronounce a word near the microphone: a line of numbers will be printed on the monitor.

This is the description of your word.

You need many lines like this for an accurate prediction, so keep repeating the same word 15-30 times.

**My advice**: while recording the samples, vary both the distance of your mouth from the mic and the intensity of your voice: this will produce a more robust classification model later on.

After you have repeated the same word many times, copy the content of the serial monitor and save it in a CSV file named after the word, for example yes.csv.

Then clear the serial monitor and repeat the process for each word.

Keep all these files in a folder because we need them to train our classifier.
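Copy-pasting from the serial monitor can easily leave a truncated first line or some stray text in the file, and np.loadtxt refuses to load rows with a different number of columns. As a safety net, here is a small sketch of a cleanup step that keeps only the lines holding exactly 64 numeric values. The helper names are made up for this example, and 64 is assumed to match the SAMPLES constant of the capture sketch.

```python
import csv

EXPECTED_VALUES = 64  # assumption: same as SAMPLES in the capture sketch


def parse_serial_dump(text, expected=EXPECTED_VALUES):
    """Return the rows of `text` that hold exactly `expected` numeric values."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.strip().split(',')]
        if len(parts) != expected:
            continue  # empty, truncated or otherwise malformed line
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            continue  # right length, but not all numbers
    return rows


def save_rows(rows, path):
    """Write the cleaned rows to a CSV file, one word sample per line."""
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    good = ",".join(["12.5"] * EXPECTED_VALUES)
    truncated = ",".join(["12.5"] * 10)
    rows = parse_serial_dump("\n".join([truncated, good, good]))
    print(len(rows), "valid samples")
```

For a real recording you would read the pasted text from a file of your choice (say, a hypothetical yes_raw.txt) and call save_rows(rows, 'data/yes.csv'). The number of valid rows also tells you at a glance whether you reached the 15-30 repetitions suggested above.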

Step 2. Train the machine learning model

Now that we have the samples, it's time to train the classifier.

Create a Python project in your favourite IDE or use your favourite text editor, if you don't have one.

As described in my post about how to train a classifier, we create a Python script that reads all the files inside a folder and concatenates them in a single array you feed to the classifier model.

Be sure your folder structure is like the following:

ArduinoWordClassification
  |-- train_classifier.py
  |-- data/
  |---- yes.csv
  |---- no.csv
  |---- play.csv
  |---- any other .csv file you recorded

# file: train_classifier.py

import numpy as np
from os.path import basename
from glob import glob
from sklearn.svm import SVC
from micromlgen import port
from sklearn.model_selection import train_test_split

def load_features(folder):
    dataset = None
    classmap = {}
    for class_idx, filename in enumerate(glob('%s/*.csv' % folder)):
        class_name = basename(filename)[:-4]
        classmap[class_idx] = class_name
        samples = np.loadtxt(filename, dtype=float, delimiter=',')
        labels = np.ones((len(samples), 1)) * class_idx
        samples = np.hstack((samples, labels))
        dataset = samples if dataset is None else np.vstack((dataset, samples))
    return dataset, classmap

np.random.seed(0)
dataset, classmap = load_features('data')
X, y = dataset[:, :-1], dataset[:, -1]
# this line is for testing your accuracy only: once you're satisfied with the results, set test_size to 1
# (an integer, i.e. a single test sample) so that practically all the data is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = SVC(kernel='poly', degree=2, gamma=0.1, C=100)
clf.fit(X_train, y_train)

print('Accuracy', clf.score(X_test, y_test))
print('Exported classifier to plain C')
print(port(clf, classmap=classmap))

Among the classifiers I tried, SVM produced the best accuracy at 96% with 32 support vectors: it's not a super-tiny model, but it's quite small nevertheless.

If you're not satisfied with SVM, you can use Decision Tree, Random Forest, Gaussian Naive Bayes, Relevance Vector Machines. See my other posts for a detailed description of each.

In your console, after the accuracy score, you will have the plain C implementation of the classifier you trained. The following reports my SVM model.

// File: Classifier.h

#pragma once
#include <cstdarg>
namespace Eloquent {
    namespace ML {
        namespace Port {
            class SVM {
            public:
                /**
                * Predict class for features vector
                */
                int predict(float *x) {
                    float kernels[35] = { 0 };
                    float decisions[6] = { 0 };
                    int votes[4] = { 0 };
                    kernels[0] = compute_kernel(x,   33.0  , 41.0  , 47.0  , 54.0  , 59.0  , 61.0  , 56.0  , 51.0  , 50.0  , 51.0  , 44.0  , 32.0  , 23.0  , 15.0  , 12.0  , 8.0  , 5.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 5.0  , 3.0  , 5.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0 );
                    kernels[1] = compute_kernel(x,   40.0  , 50.0  , 51.0  , 60.0  , 56.0  , 57.0  , 58.0  , 53.0  , 50.0  , 45.0  , 42.0  , 34.0  , 23.0  , 16.0  , 10.0  , 7.0  , 3.0  , 3.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 14.0  , 3.0  , 8.0  , 0.0  , 0.0  , 3.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 3.0  , 0.0  , 0.0  , 5.0  , 3.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 3.0  , 0.0  , 5.0  , 3.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 0.0  , 3.0  , 0.0  , 0.0  , 0.0  , 3.0 );
                    kernels[2] = compute_kernel(x,   56.0  , 68.0  , 78.0  , 91.0  , 84.0  , 84.0  , 84.0  , 74.0  , 69.0  , 64.0  , 57.0  , 44.0  , 33.0  , 18.0  , 12.0  , 8.0  , 5.0  , 9.0  , 15.0  , 12.0  , 12.0  , 9.0  , 12.0  , 7.0  , 3.0  , 10.0  , 12.0  , 6.0  , 3.0  , 0.0  , 0.0  , 0.0  , 0.0  , 6.0  , 3.0  , 6.0  , 10.0  , 10.0  , 8.0  , 3.0  , 9.0  , 9.0  , 9.0  , 8.0  , 9.0  , 9.0  , 11.0  , 3.0  , 8.0  , 9.0  , 8.0  , 8.0  , 8.0  , 6.0  , 7.0  , 3.0  , 3.0  , 8.0  , 5.0  , 3.0  , 0.0  , 3.0  , 0.0  , 0.0 );

                    // ...many other kernels computations...

                    decisions[0] = 0.722587775297
                                   + kernels[1] * 3.35855e-07
                                   + kernels[2] * 1.64612e-07
                                   + kernels[4] * 6.00056e-07
                                   + kernels[5] * 3.5195e-08
                                   + kernels[7] * -4.2079e-08
                                   + kernels[8] * -4.2843e-08
                                   + kernels[9] * -9.994e-09
                                   + kernels[10] * -5.11065e-07
                                   + kernels[11] * -5.979e-09
                                   + kernels[12] * -4.4672e-08
                                   + kernels[13] * -1.5606e-08
                                   + kernels[14] * -1.2941e-08
                                   + kernels[15] * -2.18903e-07
                                   + kernels[17] * -2.31635e-07
                            ;
                    decisions[1] = -1.658344586719
                                   + kernels[0] * 2.45018e-07
                                   + kernels[1] * 4.30223e-07
                                   + kernels[3] * 1.00277e-07
                                   + kernels[4] * 2.16524e-07
                                   + kernels[18] * -4.81187e-07
                                   + kernels[20] * -5.10856e-07
                            ;
                    decisions[2] = -1.968607562265
                                   + kernels[0] * 3.001833e-06
                                   + kernels[3] * 4.5201e-08
                                   + kernels[4] * 1.54493e-06
                                   + kernels[5] * 2.81834e-07
                                   + kernels[25] * -5.93581e-07
                                   + kernels[26] * -2.89779e-07
                                   + kernels[27] * -1.73958e-06
                                   + kernels[28] * -1.09552e-07
                                   + kernels[30] * -3.09126e-07
                                   + kernels[31] * -1.294219e-06
                                   + kernels[32] * -5.37961e-07
                            ;
                    decisions[3] = -0.720663029823
                                   + kernels[6] * 1.4362e-08
                                   + kernels[7] * 6.177e-09
                                   + kernels[9] * 1.25e-08
                                   + kernels[10] * 2.05478e-07
                                   + kernels[12] * 2.501e-08
                                   + kernels[15] * 4.363e-07
                                   + kernels[16] * 9.147e-09
                                   + kernels[18] * -1.82182e-07
                                   + kernels[20] * -4.93707e-07
                                   + kernels[21] * -3.3084e-08
                            ;
                    decisions[4] = -1.605747746589
                                   + kernels[6] * 6.182e-09
                                   + kernels[7] * 1.3853e-08
                                   + kernels[8] * 2.12e-10
                                   + kernels[9] * 1.1243e-08
                                   + kernels[10] * 7.80681e-07
                                   + kernels[15] * 8.347e-07
                                   + kernels[17] * 1.64985e-07
                                   + kernels[23] * -4.25014e-07
                                   + kernels[25] * -1.134803e-06
                                   + kernels[34] * -2.52038e-07
                            ;
                    decisions[5] = -0.934328303475
                                   + kernels[19] * 3.3529e-07
                                   + kernels[20] * 1.121946e-06
                                   + kernels[21] * 3.44683e-07
                                   + kernels[22] * -6.23056e-07
                                   + kernels[24] * -1.4612e-07
                                   + kernels[28] * -1.24025e-07
                                   + kernels[29] * -4.31701e-07
                                   + kernels[31] * -9.2146e-08
                                   + kernels[33] * -3.8487e-07
                            ;
                    votes[decisions[0] > 0 ? 0 : 1] += 1;
                    votes[decisions[1] > 0 ? 0 : 2] += 1;
                    votes[decisions[2] > 0 ? 0 : 3] += 1;
                    votes[decisions[3] > 0 ? 1 : 2] += 1;
                    votes[decisions[4] > 0 ? 1 : 3] += 1;
                    votes[decisions[5] > 0 ? 2 : 3] += 1;
                    int val = votes[0];
                    int idx = 0;

                    for (int i = 1; i < 4; i++) {
                        if (votes[i] > val) {
                            val = votes[i];
                            idx = i;
                        }
                    }

                    return idx;
                }

                /**
                * Convert class idx to readable name
                */
                const char* predictLabel(float *x) {
                    switch (predict(x)) {
                        case 0:
                            return "no";
                        case 1:
                            return "stop";
                        case 2:
                            return "play";
                        case 3:
                            return "yes";
                        default:
                            return "Houston we have a problem";
                    }
                }

            protected:
                /**
                * Compute kernel between feature vector and support vector.
                * Kernel type: poly
                */
                float compute_kernel(float *x, ...) {
                    va_list w;
                    // va_start expects the last named parameter, not the number of features
                    va_start(w, x);
                    float kernel = 0.0;

                    for (uint16_t i = 0; i < 64; i++) {
                        kernel += x[i] * va_arg(w, double);
                    }

                    va_end(w);
                    return pow((0.1 * kernel) + 0.0, 2);
                }
            };
        }
    }
}
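If the generated code looks intimidating, it may help to see the same logic in a few lines of Python. The sketch below mirrors the two steps of the listing: a polynomial kernel between the input and each support vector (with gamma=0.1 and degree=2 from the training script, and the 0.0 offset that appears in compute_kernel), followed by one-vs-one voting among the four classes. The function names are mine, and this is a reading aid under those assumptions, not the code micromlgen produces.

```python
from itertools import combinations

# class names in the order used by predictLabel() in the listing
LABELS = ["no", "stop", "play", "yes"]


def poly_kernel(x, support_vector, gamma=0.1, coef0=0.0, degree=2):
    """Polynomial kernel: (gamma * <x, sv> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, support_vector))
    return (gamma * dot + coef0) ** degree


def one_vs_one_vote(decisions, n_classes):
    """Pick the class that wins the most pairwise duels.

    `decisions` holds one value per pair of classes, in the order
    (0,1), (0,2), (0,3), (1,2), ...: a positive value is a vote for
    the first class of the pair, anything else for the second.
    """
    votes = [0] * n_classes
    pairs = combinations(range(n_classes), 2)
    for decision, (first, second) in zip(decisions, pairs):
        votes[first if decision > 0 else second] += 1
    return votes.index(max(votes))  # on a tie, the lowest index wins


if __name__ == "__main__":
    print(poly_kernel([1.0, 2.0], [3.0, 4.0]))
    # made-up decision values, only to exercise the voting step
    winner = one_vs_one_vote([0.3, 1.2, 0.8, -0.5, -0.1, -2.0], len(LABELS))
    print(LABELS[winner])
```

Each decisions[i] in the C code is a bias plus a weighted sum of kernel values, so the whole prediction boils down to dot products, a handful of multiplications and a vote count.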

Step 3. Deploy to your microcontroller

Now we have all the pieces we need to perform word classification on our Arduino board.

Download the Arduino Nano 33 BLE Sense - Audio classification sketch, open it in the Arduino IDE and paste the plain C code you got in the console inside the Classifier.h file (delete its existing contents first!).

Fine: it's time to deploy!

Hit the upload button: if everything went fine, open the serial monitor and pronounce one of the words you recorded during Step 1.

Hopefully, you will read the word on the serial monitor.

Here's a quick demo (please forgive me for the bad video quality).

Troubleshooting

It can happen that when running micromlgen.port(clf) you get a TemplateNotFound error. To solve the problem, first of all uninstall micromlgen.

pip uninstall micromlgen

Then head to GitHub, download the package as a zip and extract the micromlgen folder into your project.


If you liked this tutorial and it helped you successfully implement word classification on your Arduino Nano 33 BLE Sense, please share it on your social media so others can benefit too.

If you run into trouble or have questions, don't hesitate to leave a comment: I will be happy to help you.
