camera – Eloquent Arduino Blog http://eloquentarduino.github.io/ Machine learning on Arduino, programming & electronics Wed, 16 Dec 2020 20:29:52 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.6 Easy ESP32 camera HTTP video streaming server https://eloquentarduino.github.io/2020/06/easy-esp32-camera-http-video-streaming-server/ Wed, 24 Jun 2020 17:27:33 +0000 https://eloquentarduino.github.io/?p=1203 This will be a short post where I introduce a new addition to the Arduino Eloquent library aimed to make video streaming from an ESP32 camera over HTTP super easy. It will be the first component of a larger project I'm going to implement. If you Google "esp32 video streaming" you will get a bunch […]

L'articolo Easy ESP32 camera HTTP video streaming server proviene da Eloquent Arduino Blog.

]]>
This will be a short post where I introduce a new addition to the Arduino Eloquent library aimed to make video streaming from an ESP32 camera over HTTP super easy. It will be the first component of a larger project I'm going to implement.

If you Google "esp32 video streaming" you will get a bunch of results that are essentialy copy-pasted from the official Espressif repo: many of them neither copy-pasted the code, just tell you to load the example sketch.

And if you try to read it and try to modify just a bit for your own use-case, you won't understand much.

This is the exact environment for an Eloquent component to live: make it painfully easy what's messy.

I still have to find a good naming scheme for my libraries since Arduino IDE doesn't allow nested imports, so forgive me if "ESP32CameraHTTPVideoStreamingServer.h" was the best that came to mind.

How easy is it to use?

1 line of code if used in conjuction with my other library EloquentVision.

#define CAMERA_MODEL_M5STACK_WIDE
#include "WiFi.h"
#include "EloquentVision.h"
#include "ESP32CameraHTTPVideoStreamingServer.h"

using namespace Eloquent::Vision;
using namespace Eloquent::Vision::Camera;

ESP32Camera camera;
HTTPVideoStreamingServer server(81);

/**
 *
 */
void setup() {
    Serial.begin(115200);
    WiFi.softAP("ESP32", "12345678");

    camera.begin(FRAMESIZE_QVGA, PIXFORMAT_JPEG);
    server.start();

    Serial.print("Camera Ready! Use 'http://");
    Serial.print(WiFi.softAPIP());
    Serial.println(":81' to stream");
}

void loop() {
}

HTTPVideoStreamingServer assumes you already initialized your camera. You can achieve this task in the way you prefer: ESP32Camera class makes this a breeze.

81 in the server constructor is the port you want the server to be listening to.

Once connected to WiFi or started in AP mode, all you have to do is call start(): that's it!

Finding this content useful?

What else is it good for?

The main reason I wrote this piece of library is because one of you reader commented on the motion detection post asking if it would be possible to start the video streaming once motion is detected.

Of course it is.

It's just a matter of composing the Eloquent pieces.

// not workings AS-IS, needs refactoring

#define CAMERA_MODEL_M5STACK_WIDE
#include "WiFi.h"
#include "EloquentVision.h"
#include "ESP32CameraHTTPVideoStreamingServer.h"

#define FRAME_SIZE FRAMESIZE_QVGA
#define SOURCE_WIDTH 320
#define SOURCE_HEIGHT 240
#define CHANNELS 1
#define DEST_WIDTH 32
#define DEST_HEIGHT 24
#define BLOCK_VARIATION_THRESHOLD 0.3
#define MOTION_THRESHOLD 0.2

// we're using the Eloquent::Vision namespace a lot!
using namespace Eloquent::Vision;
using namespace Eloquent::Vision::Camera;
using namespace Eloquent::Vision::ImageProcessing;
using namespace Eloquent::Vision::ImageProcessing::Downscale;
using namespace Eloquent::Vision::ImageProcessing::DownscaleStrategies;

ESP32Camera camera;
HTTPVideoStreamingServer server(81);
// the buffer to store the downscaled version of the image
uint8_t resized[DEST_HEIGHT][DEST_WIDTH];
// the downscaler algorithm
// for more details see https://eloquentarduino.github.io/2020/05/easier-faster-pure-video-esp32-cam-motion-detection
Cross<SOURCE_WIDTH, SOURCE_HEIGHT, DEST_WIDTH, DEST_HEIGHT> crossStrategy;
// the downscaler container
Downscaler<SOURCE_WIDTH, SOURCE_HEIGHT, CHANNELS, DEST_WIDTH, DEST_HEIGHT> downscaler(&crossStrategy);
// the motion detection algorithm
MotionDetection<DEST_WIDTH, DEST_HEIGHT> motion;

/**
 *
 */
void setup() {
    Serial.begin(115200);
    WiFi.softAP("ESP32", "12345678");

    camera.begin(FRAMESIZE_QVGA, PIXFORMAT_GRAYSCALE);
    motion.setBlockVariationThreshold(BLOCK_VARIATION_THRESHOLD);

    Serial.print("Camera Ready! Use 'http://");
    Serial.print(WiFi.softAPIP());
    Serial.println(":81' to stream");
}

void loop() {
    camera_fb_t *frame = camera.capture();

    // resize image and detect motion
    downscaler.downscale(frame->buf, resized);
    motion.update(resized);
    motion.detect();

    if (motion.ratio() > MOTION_THRESHOLD) {
        Serial.print("Motion detected");
        // start the streaming server when motion is detected
        // shutdown after 20 seconds if no one connects
        camera.begin(FRAMESIZE_QVGA, PIXFORMAT_JPEG);
        delay(2000);
        Serial.print("Camera Server ready! Use 'http://");
        Serial.print(WiFi.softAPIP());
        Serial.println(":81' to stream");
        server.start();
        delay(20000);
        server.stop();
        camera.begin(FRAMESIZE_QVGA, PIXFORMAT_GRAYSCALE);
        delay(2000);
    }

    // probably we don't need 30 fps, save some power
    delay(300);
}

Does it look good?

Now the rationale behind Eloquent components should be starting to be clear to you: easy to use objects you can compose the way it fits to achieve the result you want.

Would you suggest me more piece of functionality you would like to see wrapped in an Eloquent component?


You can find the class code and the example sketch on the Github repo.

L'articolo Easy ESP32 camera HTTP video streaming server proviene da Eloquent Arduino Blog.

]]>
Easier, faster pure video ESP32 cam motion detection https://eloquentarduino.com/projects/esp32-arduino-motion-detection Sun, 10 May 2020 19:26:08 +0000 https://eloquentarduino.github.io/?p=1110 If you liked my post about ESP32 cam motion detection, you'll love this updated version: it's easier to use and blazing fast! The post about pure video ESP32 cam motion detection without an external PIR is my most successful post at the moment. Many of you are interested about this topic. One of my readers, […]

L'articolo Easier, faster pure video ESP32 cam motion detection proviene da Eloquent Arduino Blog.

]]>
If you liked my post about ESP32 cam motion detection, you'll love this updated version: it's easier to use and blazing fast!

Faster motion detection

The post about pure video ESP32 cam motion detection without an external PIR is my most successful post at the moment. Many of you are interested about this topic.

One of my readers, though, pointed out my implementation was quite slow and he only achieved bare 5 fps in his project. So he asked for a better alternative.

Since the post was of great interest for many people, I took the time to revisit the code and make improvements.

I came up with a 100% re-writing that is both easier to use and faster. Actually, it is blazing fast!.

Let's see how it works.

Downsampling

In the original post I introduced the idea of downsampling the image from the camera for a faster and more robust motion detection. I wrote the code in the main sketch to keep it self-contained.

Looking back now it was a poor choice, since it cluttered the project and distracted from the main purpose, which is motion detection.

Moreover, I thought that scanning the image buffer in sequential order would be the fastest approach.

It turns out I was wrong.

This time I scan the image buffer following the blocks that will compose the resulting image and the results are... much faster.

Also, I decided to inject some more efficiency that will further speedup the computation: using different strategies for downsampling.

The idea of downsampling is that you have to "collapse" a block of NxN from the original image to just one pixel of the resulting image.

Now, there are a variety of ways you can accomplish this. The first two I present here are the most obvious, the other two are of my "invention": nothing fancy nor new, but they're fast and serve the purpose well.

Nearest neighbor

You can just pick the center of the NxN block and use its value for the output.
Of course it is fast (possibly the fastest approach), but wouldn't be very accurate. One pixel out of NxN wouldn't be representative of the overall region and will heavily suffer from noise.

Nearest diagram

Nearest neighbor block averaging

Full block average

This is the most intuitive alternative: use the average of all the pixels in the block as the ouput value. This is arguabily the "proper" way to do it, since you're using all the pixels in the source image to compute the new one.

Full diagram
Full block averaging

Core block average

As a faster alternative, I thought that averaging only the "core" (the most internal part) of the block would have been a good-enough solution. It has no theoretical proof that this yields true, but our task here is to create a smaller representation of the original image, not producing an accurate smaller version.

Core diagram
Core block averaging

I'll stress this point: the only reason we do downsampling is to compare two sequential frame and detect if they differ above a certain threshold. This downsampling doesn't have to mimic the actual image: it can transform the source in any fancy way, as long as it stays consistent and captures the variations over time.

Cross block average

This time we consider all the pixels along the vertical and horizontal central axes. The idea is that you will capture a good portion of the variation along both the axis, given quite accurate results.

Cross diagram
Cross block averaging

Diagonal block average

This alternative too came to my mind from nowhere, really. I just think it is a good alternative to capture all the block's variation, probably even better than vertical and horizontal directions.

Diagonal diagram
Diagonal block averaging

Implement your own

Not satisfied from the methods above? No problem, you can still implement your own.

The ones presented above are just some algorithms that came to my mind: I'm not telling you they're the best.

They worked for me, that's it.

If you think you found a better solution, I encourage you implement it and even share it with me and the other readers, so we can all make progress on this together.

Finding this content useful?

Benchmarks

So, at the very beginning I said this new implementation is blazingly fast.

How much fast?

As fast as it can be, arguably.

I mean, so fast it won't alter your fps.

Look at the results I got on my M5Stack camera.

Algorithm Time to execute (micros) FPS
None 0 25
Nearest neighbor 160 25
Cross block 700 25
Core block 800 25
Diagonal block 950 25
Full block 4900 12

As you can see, only the full block creates a delay in the process (quite a bit of delay even): the other methods won't slow down your program in any noticeable way.

If you test Nearest neighbor and it works for you, then you'll be extremely light on computation resources with only 160 microseconds of delay.

This is what I mean by blazing fast.

Motion detection

The motion detection part hasn't changed, so I point you to the original post to read more about the Block difference threshold and the Image difference threshold.

Full code

#define CAMERA_MODEL_M5STACK_WIDE
#include "EloquentVision.h"

#define FRAME_SIZE FRAMESIZE_QVGA
#define SOURCE_WIDTH 320
#define SOURCE_HEIGHT 240
#define BLOCK_SIZE 10
#define DEST_WIDTH (SOURCE_WIDTH / BLOCK_SIZE)
#define DEST_HEIGHT (SOURCE_HEIGHT / BLOCK_SIZE)
#define BLOCK_DIFF_THRESHOLD 0.2
#define IMAGE_DIFF_THRESHOLD 0.1
#define DEBUG 0

using namespace Eloquent::Vision;

ESP32Camera camera;
uint8_t prevFrame[DEST_WIDTH * DEST_HEIGHT] = { 0 };
uint8_t currentFrame[DEST_WIDTH * DEST_HEIGHT] = { 0 };

// function prototypes
bool motionDetect();
void updateFrame();

/**
 *
 */
void setup() {
    Serial.begin(115200);
    camera.begin(FRAME_SIZE, PIXFORMAT_GRAYSCALE);
}

/**
 *
 */
void loop() {
    /**
     * Algorithm:
     *  1. grab frame
     *  2. compare with previous to detect motion
     *  3. update previous frame
     */

    time_t start = millis();
    camera_fb_t *frame = camera.capture();

    downscaleImage(frame->buf, currentFrame, nearest, SOURCE_WIDTH, SOURCE_HEIGHT, BLOCK_SIZE);

    if (motionDetect()) {
        Serial.print("Motion detected @ ");
        Serial.print(floor(1000.0f / (millis() - start)));
        Serial.println(" FPS");
    }

    updateFrame();
}

/**
 * Compute the number of different blocks
 * If there are enough, then motion happened
 */
bool motionDetect() {
    uint16_t changes = 0;
    const uint16_t blocks = DEST_WIDTH * DEST_HEIGHT;

    for (int y = 0; y < DEST_HEIGHT; y++) {
        for (int x = 0; x < DEST_WIDTH; x++) {
            float current = currentFrame[y * DEST_WIDTH + x];
            float prev = prevFrame[y * DEST_WIDTH + x];
            float delta = abs(current - prev) / prev;

            if (delta >= BLOCK_DIFF_THRESHOLD)
                changes += 1;
        }
    }

    return (1.0 * changes / blocks) > IMAGE_DIFF_THRESHOLD;
}

/**
 * Copy current frame to previous
 */
void updateFrame() {
    memcpy(prevFrame, currentFrame, DEST_WIDTH * DEST_HEIGHT);
}

Check the full project code on Github and remember to star!

Finding this content useful?

L'articolo Easier, faster pure video ESP32 cam motion detection proviene da Eloquent Arduino Blog.

]]>
Handwritten digit classification with Arduino and MicroML https://eloquentarduino.github.io/2020/02/handwritten-digit-classification-with-arduino-and-microml/ Sun, 23 Feb 2020 10:53:03 +0000 https://eloquentarduino.github.io/?p=931 We continue exploring the endless possibilities on the MicroML (Machine Learning for Microcontrollers) framework on Arduino and ESP32 boards: in this post we're back to image classification. In particular, we'll distinguish handwritten digits using an ESP32 camera. If this is the first time you're reading my blog, you may have missed that I'm on a […]

L'articolo Handwritten digit classification with Arduino and MicroML proviene da Eloquent Arduino Blog.

]]>
We continue exploring the endless possibilities on the MicroML (Machine Learning for Microcontrollers) framework on Arduino and ESP32 boards: in this post we're back to image classification. In particular, we'll distinguish handwritten digits using an ESP32 camera.

Arduino handwritten digit classification

If this is the first time you're reading my blog, you may have missed that I'm on a journey to push the limits of Machine learning on embedded devices like the Arduino boards and ESP32.

I started with accelerometer data classification, then did Wifi indoor positioning as a proof of concept.

In the last weeks, though, I undertook a more difficult path that is image classification.

Image classification is where Convolutional Neural Networks really shine, but I'm here to question this settlement and demostrate that it is possible to come up with much lighter alternatives.

In this post we continue with the examples, replicating a "benchmark" dataset in Machine learning: the handwritten digits classification.

If you are curious about a specific image classification task you would like to see implemented, let me know in the comments: I'm always open to new ideas

The task

The objective of this example is to be able to tell what an handwritten digit is, taking as input a photo from the ESP32 camera.

In particular, we have 3 handwritten numbers and the task of our model will be to distinguish which image is what number.

Handwritten digits example

I only have a single image per digit, but you're free to draw as many samples as you like: it should help improve the performance of you're classifier.

1. Feature extraction

When dealing with images, if you use a CNN this step is often overlooked: CNNs are made on purpose to handle raw pixel values, so you just throw the image in and it is handled properly.

When using other types of classifiers, it could help add a bit of feature engineering to help the classifier doing its job and achieve high accuracy.

But not this time.

I wanted to be as "light" as possible in this demo, so I only took a couple steps during the feature acquisition:

  1. use a grayscale image
  2. downsample to a manageable size
  3. convert it to black/white with a threshold

I would hardly call this feature engineering.

This is an example of the result of this pipeline.

Handwritten digit feature extraction

The code for this pipeline is really simple and is almost the same from the example on motion detection.

#include "esp_camera.h"

#define PWDN_GPIO_NUM     -1
#define RESET_GPIO_NUM    15
#define XCLK_GPIO_NUM     27
#define SIOD_GPIO_NUM     22
#define SIOC_GPIO_NUM     23
#define Y9_GPIO_NUM       19
#define Y8_GPIO_NUM       36
#define Y7_GPIO_NUM       18
#define Y6_GPIO_NUM       39
#define Y5_GPIO_NUM        5
#define Y4_GPIO_NUM       34
#define Y3_GPIO_NUM       35
#define Y2_GPIO_NUM       32
#define VSYNC_GPIO_NUM    25
#define HREF_GPIO_NUM     26
#define PCLK_GPIO_NUM     21

#define FRAME_SIZE FRAMESIZE_QQVGA
#define WIDTH 160
#define HEIGHT 120
#define BLOCK_SIZE 5
#define W (WIDTH / BLOCK_SIZE)
#define H (HEIGHT / BLOCK_SIZE)
#define THRESHOLD 127

double features[H*W] = { 0 };

void setup() {
    Serial.begin(115200);
    Serial.println(setup_camera(FRAME_SIZE) ? "OK" : "ERR INIT");
    delay(3000);
}

void loop() {
    if (!capture_still()) {
        Serial.println("Failed capture");
        delay(2000);
        return;
    }

    print_features();
    delay(3000);
}

bool setup_camera(framesize_t frameSize) {
    camera_config_t config;

    config.ledc_channel = LEDC_CHANNEL_0;
    config.ledc_timer = LEDC_TIMER_0;
    config.pin_d0 = Y2_GPIO_NUM;
    config.pin_d1 = Y3_GPIO_NUM;
    config.pin_d2 = Y4_GPIO_NUM;
    config.pin_d3 = Y5_GPIO_NUM;
    config.pin_d4 = Y6_GPIO_NUM;
    config.pin_d5 = Y7_GPIO_NUM;
    config.pin_d6 = Y8_GPIO_NUM;
    config.pin_d7 = Y9_GPIO_NUM;
    config.pin_xclk = XCLK_GPIO_NUM;
    config.pin_pclk = PCLK_GPIO_NUM;
    config.pin_vsync = VSYNC_GPIO_NUM;
    config.pin_href = HREF_GPIO_NUM;
    config.pin_sscb_sda = SIOD_GPIO_NUM;
    config.pin_sscb_scl = SIOC_GPIO_NUM;
    config.pin_pwdn = PWDN_GPIO_NUM;
    config.pin_reset = RESET_GPIO_NUM;
    config.xclk_freq_hz = 20000000;
    config.pixel_format = PIXFORMAT_GRAYSCALE;
    config.frame_size = frameSize;
    config.jpeg_quality = 12;
    config.fb_count = 1;

    bool ok = esp_camera_init(&config) == ESP_OK;

    sensor_t *sensor = esp_camera_sensor_get();
    sensor->set_framesize(sensor, frameSize);

    return ok;
}

bool capture_still() {
    camera_fb_t *frame = esp_camera_fb_get();

    if (!frame)
        return false;

    // reset all the features
    for (size_t i = 0; i < H * W; i++)
      features[i] = 0;

    // for each pixel, compute the position in the downsampled image
    for (size_t i = 0; i < frame->len; i++) {
      const uint16_t x = i % WIDTH;
      const uint16_t y = floor(i / WIDTH);
      const uint8_t block_x = floor(x / BLOCK_SIZE);
      const uint8_t block_y = floor(y / BLOCK_SIZE);
      const uint16_t j = block_y * W + block_x;

      features[j] += frame->buf[i];
    }

    // apply threshold
    for (size_t i = 0; i < H * W; i++) {
      features[i] = (features[i] / (BLOCK_SIZE * BLOCK_SIZE) > THRESHOLD) ? 1 : 0;
    }

    return true;
}

void print_features() {
    for (size_t i = 0; i < H * W; i++) {
        Serial.print(features[i]);

        if (i != H * W - 1)
          Serial.print(',');
    }

    Serial.println();
}

2. Samples recording

To create your own dataset, you need a collection of handwritten digits.

You can do this part as you like, by using pieces of paper or a monitor. I used a tablet because it was well illuminated and I could open a bunch of tabs to keep a record of my samples.

As in the apple vs orange, keep in mind that you should be consistent during both the training phase and the inference phase.

This is why I used tape to fix my ESP32 camera to the desk and kept the tablet in the exact same position.

If you desire, you could experiment varying slightly the capturing setup during the training and see if your classifier still achieves good accuracy: this is a test I didn't make.

3. Train and export the classifier

For a detailed guide refer to the tutorial

from sklearn.ensemble import RandomForestClassifier
from micromlgen import port

# put your samples in the dataset folder
# one class per file
# one feature vector per line, in CSV format
features, classmap = load_features('dataset/')
X, y = features[:, :-1], features[:, -1]
classifier = RandomForestClassifier(n_estimators=30, max_depth=10).fit(X, y)
c_code = port(classifier, classmap=classmap)
print(c_code)

At this point you have to copy the printed code and import it in your Arduino project, in a file called model.h.

4. The result

Okay, at this point you should have all the working pieces to do handwritten digit image classification on your ESP32 camera. Include your model in the sketch and run the classification.

#include "model.h"

void loop() {
    if (!capture_still()) {
        Serial.println("Failed capture");
        delay(2000);

        return;
    }

    Serial.print("Number: ");
    Serial.println(classIdxToName(predict(features)));
    delay(3000);
}

Done.

You can see a demo of my results in the video below.

Project figures

My dataset is composed of 25 training samples in total and the SVM with linear kernel produced 17 support vectors.

On my M5Stick camera board, the overhead for the model is 6.8 Kb of flash and the inference takes 7ms: not that bad!


Check the full project code on Github

L'articolo Handwritten digit classification with Arduino and MicroML proviene da Eloquent Arduino Blog.

]]>
Apple or Orange? Image recognition with ESP32 and Arduino https://eloquentarduino.github.io/2020/01/image-recognition-with-esp32-and-arduino/ Sun, 12 Jan 2020 10:32:08 +0000 https://eloquentarduino.github.io/?p=820 Do you have an ESP32 camera? Want to do image recognition directly on your ESP32, without a PC? In this post we'll look into a very basic image recognition task: distinguish apples from oranges with machine learning. Image recognition is a very hot topic these days in the AI/ML landscape. Convolutional Neural Networks really shines […]

L'articolo Apple or Orange? Image recognition with ESP32 and Arduino proviene da Eloquent Arduino Blog.

]]>
Do you have an ESP32 camera?

Want to do image recognition directly on your ESP32, without a PC?

In this post we'll look into a very basic image recognition task: distinguish apples from oranges with machine learning.

Apple vs Orange

Image recognition is a very hot topic these days in the AI/ML landscape. Convolutional Neural Networks really shines in this task and can achieve almost perfect accuracy on many scenarios.

Sadly, you can't run CNN on your ESP32, they're just too large for a microcontroller.

Since in this series about Machine Learning on Microcontrollers we're exploring the potential of Support Vector Machines (SVMs) at solving different classification tasks, we'll take a look into image classification too.

What we're going to do

In a previous post about color identification with Machine learning, we used an Arduino to detect the object we were pointing at with a color sensor (TCS3200) by its color: if we detected yellow, for example, we knew we had a banana in front of us.

Of course such a process is not object recognition at all: yellow may be a banane, or a lemon, or an apple.

Object inference, in that case, works only if you have exactly one object for a given color.

The objective of this post, instead, is to investigate if we can use the MicroML framework to do simple image recognition on the images from an ESP32 camera.

This is much more similar to the tasks you do on your PC with CNN or any other form of NN you are comfortable with. Sure, we will still apply some restrictions to fit the problem on a microcontroller, but this is a huge step forward compared to the simple color identification.

In this context, image recognition means deciding which class (from the trained ones) the current image belongs to. This algorithm can't locate interesting objects in the image, neither detect if an object is present in the frame. It will classify the current image based on the samples recorded during training.

As any beginning machine learning project about image classification worth of respect, our task will be to distinguish an orange from an apple.

Features definition

I have to admit that I rarely use NN, so I may be wrong here, but from the examples I read online it looks to me that features engineering is not a fundamental task with NN.

Those few times I used CNN, I always used the whole image as input, as-is. I didn't extracted any feature from them (e.g. color histogram): the CNN worked perfectly fine with raw images.

I don't think this will work best with SVM, but in this first post we're starting as simple as possible, so we'll be using the RGB components of the image as our features. In a future post, we'll introduce additional features to try to improve our results.

I said we're using the RGB components of the image. But not all of them.

Even at the lowest resolution of 160x120 pixels, a raw RGB image from the camera would generate 160x120x3 = 57600 features: way too much.

We need to reduce this number to the bare minimum.

How much pixels do you think are necessary to get reasonable results in this task of classifying apples from oranges?

You would be surprised to know that I got 90% accuracy with an RGB image of 8x6!

You actually need very few pixels to do image classification

Yes, that's all we really need to do a good enough classification.


You can distinguish apples from oranges on ESP32 with 8x6 pixels only!
Click To Tweet


Of course this is a tradeoff: you can't expect to achieve 99% accuracy while mantaining the model size small enough to fit on a microcontroller. 90% is an acceptable accuracy for me in this context.

You have to keep in mind, moreover, that the features vector size grows quadratically with the image size (if you keep the aspect ratio). A raw RGB image of 8x6 generates 144 features: an image of 16x12 generates 576 features. This was already causing random crashes on my ESP32.

So we'll stick to 8x6 images.

Now, how do you compact a 160x120 image to 8x6? With downsampling.

This is the same tecnique we've used in the post about motion detection on ESP32: we define a block size and average all the pixels inside the block to get a single value (you can refer to that post for more details).

Image downsampling example

This time, though, we're working with RGB images instead of grayscale, so we'll repeat the exact same process 3 times, one for each channel.

This is the code excerpt that does the downsampling.

uint16_t rgb_frame[HEIGHT / BLOCK_SIZE][WIDTH / BLOCK_SIZE][3] = { 0 };

void grab_image() {
    for (size_t i = 0; i < len; i += 2) {
        // get r, g, b from the buffer
        // see later

        const size_t j = i / 2;
        // transform x, y in the original image to x, y in the downsampled image
        // by dividing by BLOCK_SIZE
        const uint16_t x = j % WIDTH;
        const uint16_t y = floor(j / WIDTH);
        const uint8_t block_x = floor(x / BLOCK_SIZE);
        const uint8_t block_y = floor(y / BLOCK_SIZE);

        // average pixels in block (accumulate)
        rgb_frame[block_y][block_x][0] += r;
        rgb_frame[block_y][block_x][1] += g;
        rgb_frame[block_y][block_x][2] += b;
    }
}

Finding this content useful?

Extracting RGB components

The ESP32 camera can store the image in different formats (of our interest — there are a couple more available):

  1. grayscale: no color information, just the intensity is stored. The buffer has size HEIGHT*WIDTH
  2. RGB565: stores each RGB pixel in two bytes, with 5 bit for red, 6 for green and 5 for blue. The buffer has size HEIGHT * WIDTH * 2
  3. JPEG: encodes (in hardware?) the image to jpeg. The buffer has a variable length, based on the encoding results

For our purpose, we'll use the RGB565 format and extract the 3 components from the 2 bytes with the following code.

taken from https://www.theimagingsource.com/support/documentation/ic-imaging-control-cpp/PixelformatRGB565.htm

config.pixel_format = PIXFORMAT_RGB565;

for (size_t i = 0; i < len; i += 2) {
    const uint8_t high = buf[i];
    const uint8_t low  = buf[i+1];
    const uint16_t pixel = (high << 8) | low;

    const uint8_t r = (pixel & 0b1111100000000000) >> 11;
    const uint8_t g = (pixel & 0b0000011111100000) >> 6;
    const uint8_t b = (pixel & 0b0000000000011111);
}

Record samples image

Now that we can grab the images from the camera, we'll need to take a few samples of each object we want to racognize.

Before doing so, we'll linearize the image matrix to a 1-dimensional vector, because that's what our prediction function expects.

#define H (HEIGHT / BLOCK_SIZE)
#define W (WIDTH / BLOCK_SIZE)

void linearize_features() {
  size_t i = 0;
  double features[H*W*3] = {0};

  for (int y = 0; y < H; y++) {
    for (int x = 0; x < W; x++) {
      features[i++] = rgb_frame[y][x][0];
      features[i++] = rgb_frame[y][x][1];
      features[i++] = rgb_frame[y][x][2];
    }
  }

  // print to serial
  for (size_t i = 0; i < H*W*3; i++) {
    Serial.print(features[i]);
    Serial.print('\t');
  }

  Serial.println();
}

Now you can setup your acquisition environment and take the samples: 15-20 of each object will do the job.

Image acquisition is a very noisy process: even keeping the camera still, you will get fluctuating values.
You need to be very accurate during this phase if you want to achieve good results.
I suggest you immobilize your camera with tape to a flat surface or use some kind of photographic easel.

Training the classifier

To train the classifier, save the features for each object in a file, one features vector per line. Then follow the steps on how to train a ML classifier for Arduino to get the exported model.

You can experiment with different classifier configurations.

My features were well distinguishable, so I had great results (100% accuracy) with any kernel (even linear).

One odd thing happened with the RBF kernel: I had to use an extremely low gamma value (0.0000001). Does anyone can explain me why? I usually go with a default value of 0.001.

The model produced 13 support vectors.

I did no features scaling: you could try it if classifying more than 2 classes and having poor results.

Apple vs Orange decision boundaries

Real world example

If you followed all the steps above, you should now have a model capable of detecting if your camera is shotting an apple or an orange, as you can see in the following video.

The little white object you see at the bottom of the image is the camera, taped to the desk.

Did you think it was possible to do simple image classification on your ESP32?

Disclaimer

This is not full-fledged object recognition: it can't label objects while you walk as Tensorflow can do, for example.

You have to carefully craft your setup and be as consistent as possible between training and inferencing.

Still, I think this is a fun proof-of-concept that can have useful applications in simple scenarios where you can live with a fixed camera and don't want to use a full Raspberry Pi.

In the next weeks I settled to finally try TensorFlow Lite for Microcontrollers on my ESP32, so I'll try to do a comparison between them and this example and report my results.

Now that you can do image classification on your ESP32, can you think of a use case you will be able to apply this code to?

Let me know in the comments, we could even try realize it together if you need some help.


Check the full project code on Github

L'articolo Apple or Orange? Image recognition with ESP32 and Arduino proviene da Eloquent Arduino Blog.

]]>
Motion detection with ESP32 cam only (Arduino version) https://eloquentarduino.com/projects/esp32-arduino-motion-detection Sun, 05 Jan 2020 11:08:08 +0000 https://eloquentarduino.github.io/?p=779 Do you have an ESP32 camera? Do you want to do motion detection WITHOUT ANY external hardware? Here's a tutorial made just for you: 30 lines of code and you will know when something changes in your video stream 🎥 ** See the updated version of this project: it's easier to use and waaay faster: […]

L'articolo Motion detection with ESP32 cam only (Arduino version) proviene da Eloquent Arduino Blog.

]]>
Do you have an ESP32 camera? Do you want to do motion detection WITHOUT ANY external hardware?

Here's a tutorial made just for you: 30 lines of code and you will know when something changes in your video stream 🎥

ESP32 camera motion detection example

** See the updated version of this project: it's easier to use and waaay faster: Easier, faster, pure video ESP32 cam motion detection **

What is (naive) motion detection?

Quoting from Wikipedia

Motion detection is the process of detecting a change in the position of an object relative to its surroundings or a change in the surroundings relative to an object

In this project, we're implementing what I call naive motion detection: that is, we're not focusing on a particular object and following its motion.

We'll only detect if any considerable portion of the image changed from one frame to the next.

We won't identify the location of motion (that's the subject for a next project), neither what caused it. We will analyze video stream in (almost) real-time and compare frame by frame: if lots of pixels changed, we'll call it motion.

Can't I use an external PIR?

Several projects on the internet about motion detection with an ESP32 cam use an external PIR sensor to trigger the video recording.

What's the problem with that approach?

1. External hardware

First of all, you need external hardware. If you're using a breadboard, no problem, you just need a couple more wires and you're good to go. But I have a nice M5stick camera (no affiliate link), that's already well packaged, so it won't be that easy to add a PIR sensor.

2. Field of View

PIR sensors have a limited FOV (field of view), so you will need more than one to cover the whole range of the camera.

My camera, for example, has fish-eye lens which give me 160° of view. Most cheap PIR sensors have a 120° field of view, so one will not suffice. This adds even more space to my project.

3. Cold objects

PIR sensors gets triggered by infrared light. Infrared light gets emitted by hot bodies (like people and animals).

But motion in a video stream can happen for a variety of reasons, not necessarily due to hot bodies, for example if you want to monitor a street for cars passing by.

A PIR sensor can't do this: video motion detection can.


ESP32 cam pure video motion detection can detect motion due to cold objects
Click To Tweet


Do you like the motion effect at the beginning of the post? Check it out on Github

What do you need?

All you need for this project is a board with a camera sensor. As I said, I have a M5Stick Camera with fish-eye lens, but any ESP32 based camera should work out of the box:

  • ESP32 cam
  • ESP32 eye
  • TTGO camera
  • ... any other flavor of ESP32 camera

ESP32 camera models

How does it work?

Ok, let's go to the "technical" stuff.

Simply put, the algorithm counts the number of different pixels from one frame to the next: if many pixels changed, it will detect motion.

Well, it's almost like this.

Of course such an algorithm will be very sensitive to noise (which is quite high on these low-cost cameras). We need to mitigate false-positive triggers.

Downsampling

One super-simple and super-effective way of doing this is to work with blocks, instead of pixels. A block is simply an N x N square, whose value is the average of the pixels it contains.

This greatly reduces sensitivity to noise, providing a more robust detection. Here's an example of what the the "block-ing" operation does to an image.

Image downsampling example

It's really a "pixelating" effect: you take the orginal image (let's say 320x240 pixels) and resize it to 10x smaller, 32x24.

This has the added benefit that it's much more lightweight to work with 32x24 matrix instead of 320x240 matrix: if you want to do real-time detection, this is a MUST.

How should you choose the scale factor?

Well, it depends.

It depends on the sensitivity you want to achieve. The higher the downsampling, the less sensitive your detection will be.

If you want to detect a person passing 50cm away from the camera, you can increase this number without any problem. If you want to detect a dog 10m away, you should keep it in the 5-10 range.

Experiment with your own use case a tweak with trial-and-error.

Blocks difference threshold

Once we've defined the block size, we need to detect if a block changed from one frame to the next.

Of course, just testing for difference (current != prev) would be again too sensitive to noise. A block can change for a variety of reasons, the first of which is the bad camera quality.

So we instead define a percent threshold above which we can say for sure the block actually changed. A good starting point could be 10-20%, but again you need to tweak this to your needs.

The higher the threshold, the less sensitive the algorithm will be.

In code it is calculated as

float delta = abs(currentBlockValue - prevBlockValue) / prevBlockValue;

which indicates the relative increment/decrement from the previous value.

Image difference threshold

Now that we can detect if a block changed from one frame to the next, we can actually detect if the image changed.

You could decide to trigger motion even if a single block changed, but I suggest you to set an higher value here.

Let's return to the 320x240 image example. With a 10x10 block, you'll be working with 32x24 = 768 blocks: will you call it "motion" if 1 out of 768 blocks changed value?

I don't think so. You want something more robust. You want 50 blocks to change. Or at least 20 blocks. If you do the math, 20 blocks out of 768 is only the 2.5% of change, which is hardly noticeable.

If you want to be robust, don't set this threshold to a too low value. Again, tweak with real world experimenting.

In code it is calculated as:

float changedBlocksPercent = changedBlocks / totalBlocks

Combining all together

Recapping: when running the motion detection algorithm you have 3 parameters to set:

  1. the block size
  2. the block difference threshold
  3. the image differerence threshold

Let's pick 3 sensible defaults: block size = 10, block threshold = 15%, image threshold = 20%.

What does these parameters translate to in the practice?

They mean that motion will be detected if 20% of the image, averaged in blocks of 10x10, changed its value by at least 15% from one frame to the next.

ESP32 camera motion example

As you can see, you don't need high-definition images to (naively) detect if something happened to the image. Large area of motion will be easily detectable, even at very low resolution.

Real world example

Now the fun part. I'll show you how it performs on a real-world scenario.

To keep it simple, I wrote a sketch that does only motion detection, not video streaming over HTTP.

This means you won't be able to see the original image recorded from the camera. Nevertheless, I have kept the block size to a minimum to allow for the best quality possible.

This is me passing my arm in front of the camera a few times.

The grid you see represents the actual pixels used for the computation. Each cell corresponds to one pixel of the downscaled image.

The orange cells highlight the pixels that the algorithm sees as "different" from one frame to the next. As you can see, some pixels are detected even if no motion is happening. That's the noise I talked about multiple times during the post.

When I move my arm in the frame, you see lots of pixels become activated, so the "Motion" text appears.

While moving the arm, you may notice what I call the "ghost" effect. You actually see 2 regions of motion: one is where my arm is now, which of course changed. The other is the region where my arm was in the previous frame, which returned to its original content.

This is why I suggest you keep the image difference threshold to a high value: if some real motion happens, you will notice it for sure because the activated region of the image will be actually bigger than the actual object moving.

Do you like the grid effect of the sample video? Let me know in the comment if you want me to share it.

Or even better: subscribe to the newsletter I you will get it directly in your inbox with my next mail.

Finding this content useful?


Check the full project code on Github

Check out also the gist for the visualization tool

L'articolo Motion detection with ESP32 cam only (Arduino version) proviene da Eloquent Arduino Blog.

]]>