Painless TinyML Convolutional Neural Network on your Arduino and STM32 boards: the MNIST dataset example!

Are you fascinated by TinyML and Tensorflow for microcontrollers?

Do you want to run a CNN (Convolutional Neural Network) on your Arduino and STM32 boards?

Do you want to do it without pain?

EloquentTinyML is the library for you!

CNN topology

EloquentTinyML, my library to easily run Tensorflow Lite neural networks on Arduino microcontrollers, is gaining some popularity so I think it's time for a good tutorial on the topic.

If you're a seasoned follower of my blog, you may know that I don't really like Tensorflow on microcontrollers, because it is often "over-sized" for the project at hand and there are leaner, faster alternatives.

Nonetheless, Tensorflow is gaining much popularity in the embedded world so I'll try to give my contribute too.

In this tutorial, I'm going to show you step by step how to train a CNN in Tensorflow and deploy it to you board: I tested the code both on the Arduino Nano 33 BLE Sense and the STM32 Nucleus L432KC.

How to train a CNN in Tensorflow

I'm not an expert either in Tensorflow nor Convolutional Neural Networks, so I kept the project as simple as possible. I used an image-like dataset to create a setup where CNN should perform well: the dataset is the MNIST handwritten digits one.

MNIST dataset example

It is composed by 8x8 images of handwritten digits, from 0 to 9 and can be easily imported via the scikit-learn Python package.

Regarding the CNN topology, I wanted to stay as lean as possible: the goal of this tutorial is to teach you how to deploy your own network, not about achieving 100% accuracy.

Let's see step by step how to produce a usable model.

Step 1. Import the libraries

We will need numpy and Tensorflow, of course, plus scikit-learn to load the dataset and tinymlgen to port the CNN to plain C.

import numpy as np
from sklearn.datasets import load_digits
import tensorflow as tf
from tensorflow.keras import layers
from tinymlgen import port

Step 2. Generate train, validation and test data

To train the network, we need:

  • training data: this is the data the network uses to learn its weights
  • validation data: this is the data the network uses to understand if it's doing well during learning
  • test data: this is the data we use to test the network accuracy once it's done learning
def get_data():
    x_values, y_values = load_digits(return_X_y=True)
    x_values /= x_values.max()
    # reshape to (8 x 8 x 1)
    x_values = x_values.reshape((len(x_values), 8, 8, 1))

    # split into train, validation, test
    TRAIN_SPLIT = int(0.6 * len(x_values))
    TEST_SPLIT = int(0.2 * len(x_values) + TRAIN_SPLIT)
    x_train, x_test, x_validate = np.split(x_values, [TRAIN_SPLIT, TEST_SPLIT])
    y_train, y_test, y_validate = np.split(y_values, [TRAIN_SPLIT, TEST_SPLIT])

    return x_train, x_test, x_validate, y_train, y_test, y_validate

Step 3. Create and train the model

Now we have to create our network topology.

As I stated earlier, I wanted to keep this as simple as possible (also considering that we're using a toy dataset): I added a single convolution layer (without even max pooling) followed by the output layer.

def get_model():
    x_train, x_test, x_validate, y_train, y_test, y_validate = get_data()

    # create a CNN
    model = tf.keras.Sequential()
    model.add(layers.Conv2D(8, (3, 3), activation='relu', input_shape=(8, 8, 1)))
    # model.add(layers.MaxPooling2D((2, 2)))

    model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy']), y_train, epochs=50, batch_size=16,
                        validation_data=(x_validate, y_validate))
    return model, x_test, y_test

Step 4. Testing the model accuracy

Do you think this topology is too simple to learn something useful in so few epochs?

Think again: it achieved 97% accuracy!

Not bad.

def test_model(model, x_test, y_test):
    x_test = (x_test / x_test.max()).reshape((len(x_test), 8, 8, 1))
    y_pred = model.predict(x_test).argmax(axis=1)

    print('ACCURACY', (y_pred == y_test).sum() / len(y_test))

Step 5. Exporting the model

Once we have a trained model that performs well, we want to deploy it to our microcontroller. Thanks to the tinymlgen packages, is as easy as a one-liner.

if __name__ == '__main__':
    model, x_test, y_test = get_model()
    test_model(model, x_test, y_test)
    c_code = port(model, variable_name='digits_model', pretty_print=True)

How to run a CNN on Arduino and STM32 boards with EloquentTinyML

Ok, now we have the content we need to create an Arduino sketch to run the CNN on our microcontroller.

We will use the EloquentTinyML library to do this without pain.

This is a library to run TinyML models on your microcontroller without messing around with complex compilation procedures and esoteric errors.

You must first install the library at its latest version (0.0.5 or 0.0.4 if not available), either via the Library Manager or directly from Github.

#include <EloquentTinyML.h>

// copy the printed code from tinymlgen into this file
#include "digits_model.h"

#define TENSOR_ARENA_SIZE 8*1024


void setup() {

void loop() {
    // a random sample from the MNIST dataset (precisely the last one)
    float x_test[64] = { 0., 0. , 0.625 , 0.875 , 0.5   , 0.0625, 0. , 0. ,
                    0. , 0.125 , 1. , 0.875 , 0.375 , 0.0625, 0. , 0. ,
                    0. , 0. , 0.9375, 0.9375, 0.5   , 0.9375, 0. , 0. ,
                    0. , 0. , 0.3125, 1. , 1. , 0.625 , 0. , 0. ,
                    0. , 0. , 0.75  , 0.9375, 0.9375, 0.75  , 0. , 0. ,
                    0. , 0.25  , 1. , 0.375 , 0.25  , 1. , 0.375 , 0. ,
                    0. , 0.5   , 1. , 0.625 , 0.5   , 1. , 0.5   , 0. ,
                    0. , 0.0625, 0.5   , 0.75  , 0.875 , 0.75  , 0.0625, 0. };
    // the output vector for the model predictions
    float y_pred[10] = {0};
    // the actual class of the sample
    int y_test = 8;

    // let's see how long it takes to classify the sample
    uint32_t start = micros();

    ml.predict(x_test, y_pred);

    uint32_t timeit = micros() - start;

    Serial.print("It took ");
    Serial.println(" micros to run inference");

    // let's print the raw predictions for all the classes
    // these values are not directly interpretable as probabilities!
    Serial.print("Test output is: ");
    Serial.print("Predicted proba are: ");

    for (int i = 0; i < 10; i++) {
        Serial.print(i == 9 ? '\n' : ',');

    // let's print the "most probable" class
    // you can either use probaToClass() if you also want to use all the probabilities
    Serial.print("Predicted class is: ");
    // or you can skip the predict() method and call directly predictClass()
    Serial.print("Sanity check: ");


That's it: if everything went fine, you should see that the predicted class is 8.

CNN on Arduino and STM32 figures

I'll report the figures I get for compiling and running this project on the two boards I used.

BoardFlashRAMInference time
Nucleus L432KC154560not available*7187
Arduino Nano 33 BLE Sense197656561609400

I used the Grumpyoldpizza compiler for the Nucleus, which doesn't report back the RAM usage

And you?

Were you able to deploy a CNN to your microcontroller thanks to this tutorial? Or are you having troubles?

Let me know in the comment and I will help you or share your experience with us.

You can find the whole code on Github.

Help the blow grow