A few days ago I ran a poll on Twitter asking which board you thought would be the fastest for TinyML among Arduino Portenta H7, Teensy 4.0 and STM32 Nucleo H743ZI2. Portenta and Nucleo tied for first place, leaving Teensy behind.
This post answers that poll with real-world numbers: all three boards share an ARM Cortex-M7 CPU, but which one is the winner?
Let's check it out (it includes a lot of charts)!
This post is mainly plot-based: for each of the following 3 categories, there's a collage of plots that illustrates the benchmark.
Benchmark #1: Benchmarks per Dataset
For each dataset in the benchmarking suite, you see a bar plot: each bar represents the inference time (in microseconds, on a log scale) that a given classifier takes to predict samples from the dataset. The lower the bar, the better.
Benchmark #2: Rankings per Classifier
For each classifier type in the benchmarking suite, you see 3 bar plots: each plot shows how many times the given board ranked at each position with respect to the other two.
Let's look at the very first plot, "Ranking of Teensy for Decision Tree": it displays that the Teensy board ranked 9 times at first position for the Decision Tree classifier, 0 times at second and 0 times at third position.
So Teensy should be the clear winner here! Well, not really, because Arduino Portenta H7 and STM32 Nucleo H743ZI2 ranked exactly the same.
How is that possible?
It's possible because all inference times are rounded to a 5-microsecond resolution: since Decision Tree took 1-2 microseconds to run on each board, their inference times all get rounded to the same value, resulting in a tie.
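To make the tie concrete, here is a minimal Python sketch of how rounding can erase small differences. The timings are made up for illustration (they are not measurements from the benchmark), the helper names are hypothetical, and I'm assuming rounding to the nearest multiple of 5 microseconds; rounding up would produce the same tie.

```python
def round_to_resolution(micros, resolution=5):
    """Round a time in microseconds to the nearest multiple of `resolution`
    (assumption: round half up; the benchmark may round differently)."""
    return int(micros / resolution + 0.5) * resolution


def rank_with_ties(times):
    """Map each board to its position (1 = fastest).
    Boards with equal times share the same position."""
    return {
        board: 1 + sum(other < t for other in times.values())
        for board, t in times.items()
    }


# Hypothetical raw Decision Tree timings, in microseconds
raw = {"Teensy 4.0": 1.2, "Portenta H7": 1.6, "Nucleo H743ZI2": 1.9}
rounded = {board: round_to_resolution(t) for board, t in raw.items()}

print(rank_with_ties(raw))      # positions 1, 2, 3
print(rank_with_ties(rounded))  # every board at position 1: a tie
```

With the raw values there is a strict order, but once rounded all three times collapse to the same value, so every board "ranks first".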
A more informative case is "XGBoost": you can clearly see that Teensy always ranked first, Arduino Portenta H7 always ranked second and STM32 Nucleo H743ZI2 always ranked third.
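If you want to reproduce this kind of plot yourself, the tally behind each one is simple. The sketch below is not the code used for this post: the function name is hypothetical, and the data just mimics the XGBoost case described above, assuming 9 datasets.

```python
from collections import Counter


def position_counts(rankings, board, n_boards=3):
    """Count how many times `board` finished 1st, 2nd, ... across datasets.
    `rankings` is a list of dicts mapping board name -> position."""
    counts = Counter(result[board] for result in rankings)
    return [counts[position] for position in range(1, n_boards + 1)]


# Illustrative data: one entry per dataset (9 datasets assumed),
# with the same order every time, as in the XGBoost plots
xgboost_like = [
    {"Teensy 4.0": 1, "Portenta H7": 2, "Nucleo H743ZI2": 3}
    for _ in range(9)
]

for board in ("Teensy 4.0", "Portenta H7", "Nucleo H743ZI2"):
    print(board, position_counts(xgboost_like, board))
```

Each returned list holds the heights of the three bars in a board's ranking plot: times at first, second and third position.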
Benchmark #3: Global rankings
At the end of the day, summing all the positions for all datasets for all classifiers, we get a global ranking of the boards.
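One way to read "summing all the positions" is sketched below in Python: add up each board's position over every (dataset, classifier) pair, then sort so that the lowest total comes first. The function name and the sample positions are hypothetical, and the actual plot may aggregate the results differently.

```python
from collections import defaultdict


def global_ranking(results):
    """Sum each board's positions over all results.
    Returns (board, total) pairs, lowest total (best) first."""
    totals = defaultdict(int)
    for result in results:
        for board, position in result.items():
            totals[board] += position
    return sorted(totals.items(), key=lambda item: item[1])


# Illustrative positions for three (dataset, classifier) pairs
results = [
    {"Teensy 4.0": 1, "Portenta H7": 2, "Nucleo H743ZI2": 3},
    {"Teensy 4.0": 1, "Portenta H7": 1, "Nucleo H743ZI2": 1},  # a tie
    {"Teensy 4.0": 1, "Portenta H7": 2, "Nucleo H743ZI2": 3},
]

print(global_ranking(results))
```

A lower total means the board finished near the top more often.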
What conclusions can we draw from these benchmarks?
- Teensy 4.0 is the clear winner, since it's the fastest in most cases. Is it surprising? Not at all, if you consider that it is clocked at 600 MHz vs the 480 MHz of Arduino Portenta H7 and STM32 Nucleo H743ZI2
- STM32 Nucleo H743ZI2 is fast on Decision Tree, Random Forest and Gaussian NB, but not so fast on the other classifiers
- Arduino Portenta H7 is slower than Teensy 4.0, but generally faster than STM32 Nucleo H743ZI2, ranking 1st or 2nd most of the time.
All in all, I can say that all three boards are good at TinyML and serve different purposes:
- Arduino Portenta H7 is a dual-core board with nice integrations with some very sweet pieces of hardware (think the Vision Shield), but it's priced at almost $100
- Teensy 4.0 is probably the fastest MCU out there, considering you can overclock it up to 1 GHz!
- STM32 Nucleo H743ZI2 is a dev board with many IO pins and dev-related features (STLink debugger, for example)
Did you expect these results?
Do you want to suggest other boards you'd like to see in the benchmarks?
Let me know in the comments.