Ever wondered how fast the major microcontroller boards can run TensorFlow Lite neural networks? In this post we'll find out for the case of fully connected networks.

Fully connected benchmarks

** 11 April 2021: added Raspberry Pi Pico with Arduino Mbed Core **

In a previous post about TinyML benchmarks for traditional Machine Learning models I benchmarked many different classifiers from the scikit-learn package in terms of resources and execution speed.

In this post I'm going to do something very similar, except that I'll compare different boards on the task of running TensorFlow Lite neural networks.

The boards on the list are:

  • Arduino Nano 33 BLE Sense (Cortex M4 @ 64 MHz)
  • ESP32 (Xtensa dual-core @ 240 MHz)
  • Feather M4 Express (Cortex M4F @ 200 MHz)
  • STM32 Nucleo H743ZI2 (Cortex M7 @ 480 MHz)
  • Arduino Portenta (Cortex M7 @ 480 MHz)
  • Teensy 4.0 (Cortex M7 @ 600 MHz)
  • Raspberry Pi Pico (RP2040 / Cortex M0+ @ 125 MHz)

As you can see, they differ in terms of CPU and clock frequency. I discarded less powerful boards for now (Cortex M0 based), but maybe I'll add them in the future.

** update: I added the Raspberry Pi Pico to the benchmark because of the hype it created. **

I benchmarked 3 fully connected network topologies:

  • 1 layer with 10 neurons
  • 2 layers, one with 10 neurons, the other with 50 neurons
  • 10 layers, each with 10 neurons
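To get a feel for how much work each topology involves, the number of multiply-accumulate operations (MACs) of a dense network can be estimated from the layer sizes alone. The sketch below is a rough estimate and is not taken from the benchmark code. It assumes that the four datasets in the raw data are the scikit-learn versions (Iris: 4 features, 3 classes; Wine: 13 and 3; Breast cancer: 30 and 2; Digits: 64 and 10), that each network ends with an output layer of one neuron per class, and that "10+50" means 10 neurons followed by 50. The names `DATASETS`, `TOPOLOGIES` and `dense_macs` are made up for this illustration.

```python
# Rough MAC count for a fully connected network (biases and activations ignored).
# ASSUMPTION: dataset shapes are those of the scikit-learn toy datasets,
# given as (number of input features, number of classes).
DATASETS = {
    "Iris": (4, 3),
    "Wine": (13, 3),
    "Breast cancer": (30, 2),
    "Digits": (64, 10),
}

# Hidden layer sizes of the three benchmarked topologies.
# ASSUMPTION: "10+50" is read as 10 neurons followed by 50 neurons.
TOPOLOGIES = {
    "FC 1 x 10": [10],
    "FC 10+50": [10, 50],
    "FC 10 x 10": [10] * 10,
}


def dense_macs(n_inputs, hidden_layers, n_outputs):
    """Sum of (inputs * outputs) over every pair of consecutive layers."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


for dataset, (n_in, n_out) in DATASETS.items():
    for name, hidden in TOPOLOGIES.items():
        macs = dense_macs(n_in, hidden, n_out)
        print(f"{dataset:14s} {name:10s} {macs:5d} MACs")
```

Under these assumptions the Digits networks do far more work in the first layer than the Iris ones (64 inputs against 4), which is in line with Digits being the slowest dataset on every board. MAC counts are only part of the picture, though: per-layer overhead in the interpreter also contributes to the measured times.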

Inference times

The following charts show the inference time (in microseconds) of the different networks for each board, on both linear and logarithmic scales.

Fully connected benchmarks Inference time linear scale w Rpi Pico

Fully connected benchmarks Inference time linear scale w/o Rpi Pico

Fully connected benchmarks Inference time linear scale w/o slowest

Fully connected benchmarks Inference time log scale w Rpi Pico

What's the verdict?

  • Teensy 4.0 is the fastest, as you would expect from its higher clock speed
  • Arduino Portenta and Nucleo H743ZI2 are roughly on par, since both use a Cortex M7 at 480 MHz, but the Nucleo is faster on every topology
  • ESP32 still has a great performance / price ratio, considering that I paid less than $4 for mine
  • Raspberry Pi Pico is the slowest, even though its clock is not the slowest: the Arduino Nano 33 BLE Sense runs at only 64 MHz, but its Cortex M4 CPU still beats the Pico's Cortex M0+
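One way to separate the effect of the clock from the effect of the CPU architecture is to convert each inference time into an approximate number of clock cycles (microseconds times MHz). The sketch below does this for the "Breast cancer, FC 1 x 10" measurements from the raw data section of this post, using the clock speeds from the board list. It is only a back-of-the-envelope estimate: it assumes every board really ran at its listed frequency and ignores caches, memory wait states and compiler settings. The function name `cycles_per_inference` is made up for this example.

```python
# Clock frequencies in MHz, taken from the board list in this post
# (the Feather M4 is labelled speed=200 in the raw data).
CLOCK_MHZ = {
    "Arduino Nano 33": 64,
    "Arduino Portenta M7": 480,
    "ESP32 Dev Module": 240,
    "Feather M4 Express": 200,
    "NUCLEO H743ZI2": 480,
    "Raspberry Pi Pico": 125,
    "Teensy 4.0": 600,
}

# Inference times in microseconds for "Breast cancer, FC 1 x 10" (raw data).
TIME_US = {
    "Arduino Nano 33": 138.71,
    "Arduino Portenta M7": 13.75,
    "ESP32 Dev Module": 36.31,
    "Feather M4 Express": 31.81,
    "NUCLEO H743ZI2": 8.5,
    "Raspberry Pi Pico": 872.85,
    "Teensy 4.0": 5.16,
}


def cycles_per_inference(time_us, clock_mhz):
    # 1 MHz is one cycle per microsecond, so microseconds * MHz = cycles.
    return time_us * clock_mhz


cycles = {
    board: cycles_per_inference(TIME_US[board], CLOCK_MHZ[board])
    for board in TIME_US
}

# Print the boards from fewest to most cycles per inference.
for board, n in sorted(cycles.items(), key=lambda kv: kv[1]):
    print(f"{board:20s} {n:10.0f} cycles")
```

Seen this way, the Pico needs more than ten times as many cycles per inference as the Arduino Nano 33 on this network, so the gap between the two comes from the CPU rather than from the clock.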

Do these benchmarks match your expectations?

Are you surprised by any of these numbers?

Would you like to see other boards benchmarked?

Let me know in the comments.

Raw data

board_name | dataset | clf | inference_time (µs)
--- | --- | --- | ---
Arduino Nano 33 | Breast cancer | FC 1 x 10 | 138.71
Arduino Nano 33 | Breast cancer | FC 10 x 10 | 472.11
Arduino Nano 33 | Breast cancer | FC 10+50 | 286.86
Arduino Nano 33 | Digits | FC 1 x 10 | 390.25
Arduino Nano 33 | Digits | FC 10 x 10 | 719.08
Arduino Nano 33 | Digits | FC 10+50 | 589.75
Arduino Nano 33 | Iris | FC 1 x 10 | 113.61
Arduino Nano 33 | Iris | FC 10 x 10 | 442.75
Arduino Nano 33 | Iris | FC 10+50 | 266.54
Arduino Nano 33 | Wine | FC 1 x 10 | 130.1
Arduino Nano 33 | Wine | FC 10 x 10 | 460.02
Arduino Nano 33 | Wine | FC 10+50 | 283.82
Arduino Portenta M7 | Breast cancer | FC 1 x 10 | 13.75
Arduino Portenta M7 | Breast cancer | FC 10 x 10 | 55.16
Arduino Portenta M7 | Breast cancer | FC 10+50 | 31.72
Arduino Portenta M7 | Digits | FC 1 x 10 | 26.96
Arduino Portenta M7 | Digits | FC 10 x 10 | 69.54
Arduino Portenta M7 | Digits | FC 10+50 | 51.56
Arduino Portenta M7 | Iris | FC 1 x 10 | 8.71
Arduino Portenta M7 | Iris | FC 10 x 10 | 49.85
Arduino Portenta M7 | Iris | FC 10+50 | 27.35
Arduino Portenta M7 | Wine | FC 1 x 10 | 10.94
Arduino Portenta M7 | Wine | FC 10 x 10 | 52.11
Arduino Portenta M7 | Wine | FC 10+50 | 29.55
ESP32 Dev Module | Breast cancer | FC 1 x 10 | 36.31
ESP32 Dev Module | Breast cancer | FC 10 x 10 | 125.03
ESP32 Dev Module | Breast cancer | FC 10+50 | 74.86
ESP32 Dev Module | Digits | FC 1 x 10 | 77.25
ESP32 Dev Module | Digits | FC 10 x 10 | 172.94
ESP32 Dev Module | Digits | FC 10+50 | 130.61
ESP32 Dev Module | Iris | FC 1 x 10 | 20.83
ESP32 Dev Module | Iris | FC 10 x 10 | 109.23
ESP32 Dev Module | Iris | FC 10+50 | 61.17
ESP32 Dev Module | Wine | FC 1 x 10 | 28.89
ESP32 Dev Module | Wine | FC 10 x 10 | 117.95
ESP32 Dev Module | Wine | FC 10+50 | 69.28
Feather M4 Express {opt=fastest,speed=200} | Breast cancer | FC 1 x 10 | 31.81
Feather M4 Express {opt=fastest,speed=200} | Breast cancer | FC 10 x 10 | 132.66
Feather M4 Express {opt=fastest,speed=200} | Breast cancer | FC 10+50 | 79.13
Feather M4 Express {opt=fastest,speed=200} | Digits | FC 1 x 10 | 69.89
Feather M4 Express {opt=fastest,speed=200} | Digits | FC 10 x 10 | 167.29
Feather M4 Express {opt=fastest,speed=200} | Digits | FC 10+50 | 132.14
Feather M4 Express {opt=fastest,speed=200} | Iris | FC 1 x 10 | 17.79
Feather M4 Express {opt=fastest,speed=200} | Iris | FC 10 x 10 | 118.9
Feather M4 Express {opt=fastest,speed=200} | Iris | FC 10+50 | 67.17
Feather M4 Express {opt=fastest,speed=200} | Wine | FC 1 x 10 | 23.84
Feather M4 Express {opt=fastest,speed=200} | Wine | FC 10 x 10 | 124.46
Feather M4 Express {opt=fastest,speed=200} | Wine | FC 10+50 | 72.93
NUCLEO H743ZI2 {opt=o3} | Breast cancer | FC 1 x 10 | 8.5
NUCLEO H743ZI2 {opt=o3} | Breast cancer | FC 10 x 10 | 34.19
NUCLEO H743ZI2 {opt=o3} | Breast cancer | FC 10+50 | 20.18
NUCLEO H743ZI2 {opt=o3} | Digits | FC 1 x 10 | 18.08
NUCLEO H743ZI2 {opt=o3} | Digits | FC 10 x 10 | 44.16
NUCLEO H743ZI2 {opt=o3} | Digits | FC 10+50 | 33.8
NUCLEO H743ZI2 {opt=o3} | Iris | FC 10 x 10 | 31.51
NUCLEO H743ZI2 {opt=o3} | Iris | FC 10+50 | 17.8
NUCLEO H743ZI2 {opt=o3} | Wine | FC 10 x 10 | 32.57
NUCLEO H743ZI2 {opt=o3} | Wine | FC 10+50 | 19.06
Raspberry Pi Pico | Breast cancer | FC 1 x 10 | 872.85
Raspberry Pi Pico | Breast cancer | FC 10 x 10 | 3369.54
Raspberry Pi Pico | Breast cancer | FC 10+50 | 2413.44
Raspberry Pi Pico | Digits | FC 1 x 10 | 1982.31
Raspberry Pi Pico | Digits | FC 10 x 10 | 4503.25
Raspberry Pi Pico | Digits | FC 10+50 | 4314.19
Raspberry Pi Pico | Iris | FC 1 x 10 | 313.77
Raspberry Pi Pico | Iris | FC 10 x 10 | 2801.82
Raspberry Pi Pico | Iris | FC 10+50 | 1953.96
Raspberry Pi Pico | Wine | FC 1 x 10 | 509.76
Raspberry Pi Pico | Wine | FC 10 x 10 | 3021.03
Raspberry Pi Pico | Wine | FC 10+50 | 2176.92
Teensy 4.0 | Breast cancer | FC 1 x 10 | 5.16
Teensy 4.0 | Breast cancer | FC 10 x 10 | 20.15
Teensy 4.0 | Breast cancer | FC 10+50 | 12.32
Teensy 4.0 | Digits | FC 10 x 10 | 26.09
Teensy 4.0 | Digits | FC 10+50 | 21.01
Teensy 4.0 | Iris | FC 1 x 10 | 3.14
Teensy 4.0 | Iris | FC 10 x 10 | 18.12
Teensy 4.0 | Iris | FC 10+50 | 11.13
Teensy 4.0 | Wine | FC 1 x 10 | 3.86
Teensy 4.0 | Wine | FC 10 x 10 | 18.92
Teensy 4.0 | Wine | FC 10+50 | 11.43
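
If you want to work with these numbers yourself, a small script is enough to rank the boards and see how far apart they are. The sketch below uses only the "Breast cancer, FC 10 x 10" rows copied from the raw data above. The helper name `rank_boards` is made up for this example, and picking this particular dataset and topology is an arbitrary choice.

```python
import math

# Inference times in microseconds for "Breast cancer, FC 10 x 10",
# copied from the raw data above.
TIMES_US = {
    "Arduino Nano 33": 472.11,
    "Arduino Portenta M7": 55.16,
    "ESP32 Dev Module": 125.03,
    "Feather M4 Express": 132.66,
    "NUCLEO H743ZI2": 34.19,
    "Raspberry Pi Pico": 3369.54,
    "Teensy 4.0": 20.15,
}


def rank_boards(times_us):
    """Return (board, time, speedup over the slowest board), fastest first."""
    slowest = max(times_us.values())
    ordered = sorted(times_us.items(), key=lambda kv: kv[1])
    return [(board, t, slowest / t) for board, t in ordered]


ranking = rank_boards(TIMES_US)
for board, t, speedup in ranking:
    print(f"{board:20s} {t:8.2f} us  {speedup:6.1f}x the speed of the slowest")

# Number of orders of magnitude between the slowest and the fastest board.
span = math.log10(ranking[-1][1] / ranking[0][1])
print(f"span: {span:.1f} orders of magnitude")
```

The gap between the fastest and the slowest board is more than two orders of magnitude, which is why the charts are easier to read on a logarithmic scale or with the slowest board left out.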
