{ "version": "https://jsonfeed.org/version/1.1", "user_comment": "This feed allows you to read the posts from this site in any feed reader that supports the JSON Feed format. To add this feed to your reader, copy the following URL -- https://eloquentarduino.github.io/tag/camera/feed/json/ -- and add it to your reader.", "home_page_url": "https://eloquentarduino.github.io/tag/camera/", "feed_url": "https://eloquentarduino.github.io/tag/camera/feed/json/", "language": "en-US", "title": "camera – Eloquent Arduino Blog", "description": "Machine learning on Arduino, programming & electronics", "items": [ { "id": "https://eloquentarduino.github.io/?p=1203", "url": "https://eloquentarduino.github.io/2020/06/easy-esp32-camera-http-video-streaming-server/", "title": "Easy ESP32 camera HTTP video streaming server", "content_html": "

This will be a short post where I introduce a new addition to the Arduino Eloquent library that makes video streaming from an ESP32 camera over HTTP super easy. It will be the first component of a larger project I'm going to implement.

\n

\n

If you Google "esp32 video streaming" you will get a bunch of results that are essentially copy-pasted from the official Espressif repo: many of them don't even copy-paste the code, they just tell you to load the example sketch.

\n

And if you try to read that sketch and modify it just a bit for your own use case, you won't understand much.

\n

This is the exact environment for an Eloquent component to live in: making painfully easy what's messy.

\n

I still have to find a good naming scheme for my libraries since Arduino IDE doesn't allow nested imports, so forgive me if "ESP32CameraHTTPVideoStreamingServer.h" was the best that came to mind.

\n

How easy is it to use?

\n

1 line of code if used in conjunction with my other library, EloquentVision.

\n
#define CAMERA_MODEL_M5STACK_WIDE\n#include "WiFi.h"\n#include "EloquentVision.h"\n#include "ESP32CameraHTTPVideoStreamingServer.h"\n\nusing namespace Eloquent::Vision;\nusing namespace Eloquent::Vision::Camera;\n\nESP32Camera camera;\nHTTPVideoStreamingServer server(81);\n\n/**\n *\n */\nvoid setup() {\n    Serial.begin(115200);\n    WiFi.softAP("ESP32", "12345678");\n\n    camera.begin(FRAMESIZE_QVGA, PIXFORMAT_JPEG);\n    server.start();\n\n    Serial.print("Camera Ready! Use 'http://");\n    Serial.print(WiFi.softAPIP());\n    Serial.println(":81' to stream");\n}\n\nvoid loop() {\n}
\n

HTTPVideoStreamingServer assumes you have already initialized your camera. You can do this in the way you prefer: the ESP32Camera class makes it a breeze.

\n

81 in the server constructor is the port you want the server to listen on.

\n

Once connected to WiFi or started in AP mode, all you have to do is call start(): that's it!

\n

What else is it good for?

\n

The main reason I wrote this piece of library is that one of you readers commented on the motion detection post asking if it would be possible to start the video streaming once motion is detected.

\n

Of course it is.

\n

It's just a matter of composing the Eloquent pieces.

\n
// not working as-is, needs refactoring\n\n#define CAMERA_MODEL_M5STACK_WIDE\n#include "WiFi.h"\n#include "EloquentVision.h"\n#include "ESP32CameraHTTPVideoStreamingServer.h"\n\n#define FRAME_SIZE FRAMESIZE_QVGA\n#define SOURCE_WIDTH 320\n#define SOURCE_HEIGHT 240\n#define CHANNELS 1\n#define DEST_WIDTH 32\n#define DEST_HEIGHT 24\n#define BLOCK_VARIATION_THRESHOLD 0.3\n#define MOTION_THRESHOLD 0.2\n\n// we're using the Eloquent::Vision namespace a lot!\nusing namespace Eloquent::Vision;\nusing namespace Eloquent::Vision::Camera;\nusing namespace Eloquent::Vision::ImageProcessing;\nusing namespace Eloquent::Vision::ImageProcessing::Downscale;\nusing namespace Eloquent::Vision::ImageProcessing::DownscaleStrategies;\n\nESP32Camera camera;\nHTTPVideoStreamingServer server(81);\n// the buffer to store the downscaled version of the image\nuint8_t resized[DEST_HEIGHT][DEST_WIDTH];\n// the downscaler algorithm\n// for more details see https://eloquentarduino.github.io/2020/05/easier-faster-pure-video-esp32-cam-motion-detection\nCross<SOURCE_WIDTH, SOURCE_HEIGHT, DEST_WIDTH, DEST_HEIGHT> crossStrategy;\n// the downscaler container\nDownscaler<SOURCE_WIDTH, SOURCE_HEIGHT, CHANNELS, DEST_WIDTH, DEST_HEIGHT> downscaler(&crossStrategy);\n// the motion detection algorithm\nMotionDetection<DEST_WIDTH, DEST_HEIGHT> motion;\n\n/**\n *\n */\nvoid setup() {\n    Serial.begin(115200);\n    WiFi.softAP("ESP32", "12345678");\n\n    camera.begin(FRAMESIZE_QVGA, PIXFORMAT_GRAYSCALE);\n    motion.setBlockVariationThreshold(BLOCK_VARIATION_THRESHOLD);\n\n    Serial.print("Camera Ready! Use 'http://");\n    Serial.print(WiFi.softAPIP());\n    Serial.println(":81' to stream");\n}\n\nvoid loop() {\n    camera_fb_t *frame = camera.capture();\n\n    // resize image and detect motion\n    downscaler.downscale(frame->buf, resized);\n    motion.update(resized);\n    motion.detect();\n\n    if (motion.ratio() > MOTION_THRESHOLD) {\n        Serial.println("Motion detected");\n        // start the streaming server when motion is detected\n        // and shut it down again after 20 seconds\n        camera.begin(FRAMESIZE_QVGA, PIXFORMAT_JPEG);\n        delay(2000);\n        Serial.print("Camera Server ready! Use 'http://");\n        Serial.print(WiFi.softAPIP());\n        Serial.println(":81' to stream");\n        server.start();\n        delay(20000);\n        server.stop();\n        camera.begin(FRAMESIZE_QVGA, PIXFORMAT_GRAYSCALE);\n        delay(2000);\n    }\n\n    // probably we don't need 30 fps, save some power\n    delay(300);\n}
\n

Does it look good?

\n

Now the rationale behind Eloquent components should be starting to become clear to you: easy-to-use objects you can compose the way you see fit to achieve the result you want.

\n

Can you suggest more pieces of functionality you would like to see wrapped in an Eloquent component?

\n
\n

You can find the class code and the example sketch on the Github repo.

\n

The post Easy ESP32 camera HTTP video streaming server first appeared on Eloquent Arduino Blog.

\n", "date_published": "2020-06-24T19:27:33+02:00", "date_modified": "2020-12-16T21:29:52+01:00", "authors": [ { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" } ], "author": { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": 
"http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" }, "tags": [ "camera", "esp32", "Eloquent library" ] }, { "id": "https://eloquentarduino.github.io/?p=1110", "url": "https://eloquentarduino.com/projects/esp32-arduino-motion-detection", "title": "Easier, faster pure video ESP32 cam motion detection", "content_html": "

If you liked my post about ESP32 cam motion detection, you'll love this updated version: it's easier to use and blazing fast!

\n

\n

The post about pure video ESP32 cam motion detection without an external PIR is my most successful post at the moment. Many of you are interested in this topic.

\n

One of my readers, though, pointed out that my implementation was quite slow: he only achieved a bare 5 fps in his project. So he asked for a better alternative.

\n

Since the post was of great interest for many people, I took the time to revisit the code and make improvements.

\n

I came up with a 100% rewrite that is both easier to use and faster. Actually, it is blazing fast!

\n

Let's see how it works.

\n

Table of contents
  1. Downsampling
    1. Nearest neighbor
    2. Full block average
    3. Core block average
    4. Cross block average
    5. Diagonal block average
    6. Implement your own
  2. Benchmarks
  3. Motion detection
  4. Full code

\n

Downsampling

\n

In the original post I introduced the idea of downsampling the image from the camera for a faster and more robust motion detection. I wrote the code in the main sketch to keep it self-contained.

\n

Looking back now it was a poor choice, since it cluttered the project and distracted from the main purpose, which is motion detection.

\n

Moreover, I thought that scanning the image buffer in sequential order would be the fastest approach.

\n

It turns out I was wrong.

\n

This time I scan the image buffer following the blocks that will compose the resulting image and the results are... much faster.

\n

Also, I decided to inject some more efficiency that will further speed up the computation: using different strategies for downsampling.

\n

The idea of downsampling is that you have to "collapse" a block of NxN from the original image to just one pixel of the resulting image.

\n

Now, there are a variety of ways you can accomplish this. The first two I present here are the most obvious, the others are of my "invention": nothing fancy nor new, but they're fast and serve the purpose well.

\n

Nearest neighbor

\n

You can just pick the center of the NxN block and use its value for the output.
\nOf course it is fast (possibly the fastest approach), but it won't be very accurate. One pixel out of NxN isn't representative of the overall region and will suffer heavily from noise.
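
As a minimal sketch of the idea (not the library's actual implementation), the function below keeps only the center pixel of every block. The name downscaleNearest and the use of std::vector are assumptions made for illustration; the image is assumed to be 8-bit grayscale with width and height that are multiples of the block size.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: keep the center pixel of each blockSize x blockSize block.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
std::vector<uint8_t> downscaleNearest(const std::vector<uint8_t>& src,
                                      size_t width, size_t height, size_t blockSize) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++) {
        for (size_t bx = 0; bx < destWidth; bx++) {
            // coordinates of the block's center pixel in the source image
            const size_t x = bx * blockSize + blockSize / 2;
            const size_t y = by * blockSize + blockSize / 2;

            dest[by * destWidth + bx] = src[y * width + x];
        }
    }

    return dest;
}
```

Only one source pixel is read per output pixel, which is why this strategy is so cheap and also why it is so sensitive to noise.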

\n
\n

Full block average

\n

This is the most intuitive alternative: use the average of all the pixels in the block as the output value. This is arguably the "proper" way to do it, since you're using all the pixels in the source image to compute the new one.
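
A possible sketch of the full block average follows. It is not the library code: the name downscaleFullBlock is made up for this example, and the image is assumed to be 8-bit grayscale with dimensions that are multiples of the block size.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: average every pixel of each blockSize x blockSize block.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
std::vector<uint8_t> downscaleFullBlock(const std::vector<uint8_t>& src,
                                        size_t width, size_t height, size_t blockSize) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++) {
        for (size_t bx = 0; bx < destWidth; bx++) {
            uint32_t sum = 0;

            // visit all the pixels of the current block
            for (size_t dy = 0; dy < blockSize; dy++)
                for (size_t dx = 0; dx < blockSize; dx++)
                    sum += src[(by * blockSize + dy) * width + bx * blockSize + dx];

            // integer average of the block
            dest[by * destWidth + bx] = static_cast<uint8_t>(sum / (blockSize * blockSize));
        }
    }

    return dest;
}
```

Every source pixel is read exactly once, so the cost grows with the full image size rather than with the size of the output.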

\n

\n

Core block average

\n

As a faster alternative, I thought that averaging only the "core" (the innermost part) of the block would be a good-enough solution. There is no theoretical proof that this holds, but our task here is to create a smaller representation of the original image, not to produce an accurate smaller version.
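
Here is one way the core average could look. The post does not say how large the core is, so this sketch assumes it is the central half of the block on each axis (skipping a quarter of the block on every side); the name downscaleCore is also hypothetical.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: average only the inner part of each block.
// Assumption: the "core" skips blockSize / 4 pixels on every side of the block.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
std::vector<uint8_t> downscaleCore(const std::vector<uint8_t>& src,
                                   size_t width, size_t height, size_t blockSize) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    const size_t start = blockSize / 4;
    const size_t end = blockSize - blockSize / 4;
    const size_t count = (end - start) * (end - start);
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++) {
        for (size_t bx = 0; bx < destWidth; bx++) {
            uint32_t sum = 0;

            // visit only the pixels of the core
            for (size_t dy = start; dy < end; dy++)
                for (size_t dx = start; dx < end; dx++)
                    sum += src[(by * blockSize + dy) * width + bx * blockSize + dx];

            dest[by * destWidth + bx] = static_cast<uint8_t>(sum / count);
        }
    }

    return dest;
}
```

With this choice of core, roughly a quarter of the pixels are read compared to the full block average.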

\n

\n

I'll stress this point: the only reason we do downsampling is to compare two sequential frames and detect if they differ above a certain threshold. This downsampling doesn't have to mimic the actual image: it can transform the source in any fancy way, as long as it stays consistent and captures the variations over time.

\n

Cross block average

\n

This time we consider all the pixels along the vertical and horizontal central axes. The idea is that you will capture a good portion of the variation along both axes, giving quite accurate results.
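
The cross strategy can be sketched as below. This is an illustration under assumptions, not the library code: downscaleCross is a made-up name, and the center pixel (shared by the two axes) is counted only once.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: average the pixels on the central row and central column of each block.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
std::vector<uint8_t> downscaleCross(const std::vector<uint8_t>& src,
                                    size_t width, size_t height, size_t blockSize) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    const size_t center = blockSize / 2;
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++) {
        for (size_t bx = 0; bx < destWidth; bx++) {
            const size_t x0 = bx * blockSize;
            const size_t y0 = by * blockSize;
            uint32_t sum = 0;

            for (size_t i = 0; i < blockSize; i++) {
                // central row
                sum += src[(y0 + center) * width + x0 + i];
                // central column, skipping the center pixel already counted above
                if (i != center)
                    sum += src[(y0 + i) * width + x0 + center];
            }

            // the cross is made of 2 * blockSize - 1 pixels
            dest[by * destWidth + bx] = static_cast<uint8_t>(sum / (2 * blockSize - 1));
        }
    }

    return dest;
}
```

For a 10x10 block this reads 19 pixels instead of 100.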

\n

\n

Diagonal block average

\n

This alternative too came to my mind out of nowhere, really. I just think it is a good way to capture all the block's variation, probably even better than the vertical and horizontal directions.
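
A sketch of the diagonal strategy, again with a hypothetical name (downscaleDiagonal) and under the assumption that both diagonals of the block are averaged, with the shared center pixel of odd-sized blocks counted once:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: average the pixels on the two diagonals of each block.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
std::vector<uint8_t> downscaleDiagonal(const std::vector<uint8_t>& src,
                                       size_t width, size_t height, size_t blockSize) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++) {
        for (size_t bx = 0; bx < destWidth; bx++) {
            const size_t x0 = bx * blockSize;
            const size_t y0 = by * blockSize;
            uint32_t sum = 0;
            uint32_t count = 0;

            for (size_t i = 0; i < blockSize; i++) {
                const size_t j = blockSize - 1 - i;

                // main diagonal (top-left to bottom-right)
                sum += src[(y0 + i) * width + x0 + i];
                count++;

                // anti-diagonal (top-right to bottom-left), skipping the shared center
                if (j != i) {
                    sum += src[(y0 + i) * width + x0 + j];
                    count++;
                }
            }

            dest[by * destWidth + bx] = static_cast<uint8_t>(sum / count);
        }
    }

    return dest;
}
```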

\n

\n

Implement your own

\n

Not satisfied with the methods above? No problem, you can still implement your own.

\n

The ones presented above are just some algorithms that came to my mind: I'm not telling you they're the best.

\n

They worked for me, that's it.

\n

If you think you found a better solution, I encourage you to implement it and even share it with me and the other readers, so we can all make progress on this together.
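
To give an idea of what "your own" strategy could look like, here is a standalone sketch that does not use the library's real interface. Both downscaleWith and maxOfBlock are hypothetical names: the first walks the image block by block, the second is a sample strategy that keeps the brightest pixel of the block.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <algorithm>
#include <vector>

// Hypothetical driver: calls reduce(src, width, x0, y0, blockSize) for every block,
// where (x0, y0) is the top-left corner of the block, and stores the returned value.
// Assumes an 8-bit grayscale image whose width and height are multiples of blockSize.
template <typename Strategy>
std::vector<uint8_t> downscaleWith(const std::vector<uint8_t>& src, size_t width, size_t height,
                                   size_t blockSize, Strategy reduce) {
    const size_t destWidth = width / blockSize;
    const size_t destHeight = height / blockSize;
    std::vector<uint8_t> dest(destWidth * destHeight);

    for (size_t by = 0; by < destHeight; by++)
        for (size_t bx = 0; bx < destWidth; bx++)
            dest[by * destWidth + bx] = reduce(src, width, bx * blockSize, by * blockSize, blockSize);

    return dest;
}

// Example custom strategy: the brightest pixel of the block
uint8_t maxOfBlock(const std::vector<uint8_t>& src, size_t width,
                   size_t x0, size_t y0, size_t blockSize) {
    uint8_t best = 0;

    for (size_t dy = 0; dy < blockSize; dy++)
        for (size_t dx = 0; dx < blockSize; dx++)
            best = std::max(best, src[(y0 + dy) * width + x0 + dx]);

    return best;
}
```

You would call it as downscaleWith(image, 320, 240, 10, maxOfBlock). Whatever strategy you pick, remember that it only has to stay consistent from frame to frame.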

\n

Benchmarks

\n

So, at the very beginning I said this new implementation is blazingly fast.

\n

How fast?

\n

As fast as it can be, arguably.

\n

I mean, so fast it won't alter your fps.

\n

Look at the results I got on my M5Stack camera.

\n
Algorithm | Time to execute (microseconds) | FPS
None | 0 | 25
Nearest neighbor | 160 | 25
Cross block | 700 | 25
Core block | 800 | 25
Diagonal block | 950 | 25
Full block | 4900 | 12
\n

As you can see, only the full block creates a delay in the process (quite a bit of delay even): the other methods won't slow down your program in any noticeable way.

\n

If you test Nearest neighbor and it works for you, then you'll be extremely light on computation resources with only 160 microseconds of delay.

\n

This is what I mean by blazing fast.

\n

Motion detection

\n

The motion detection part hasn't changed, so I point you to the original post to read more about the Block difference threshold and the Image difference threshold.

\n

Full code

\n
#define CAMERA_MODEL_M5STACK_WIDE\n#include "EloquentVision.h"\n\n#define FRAME_SIZE FRAMESIZE_QVGA\n#define SOURCE_WIDTH 320\n#define SOURCE_HEIGHT 240\n#define BLOCK_SIZE 10\n#define DEST_WIDTH (SOURCE_WIDTH / BLOCK_SIZE)\n#define DEST_HEIGHT (SOURCE_HEIGHT / BLOCK_SIZE)\n#define BLOCK_DIFF_THRESHOLD 0.2\n#define IMAGE_DIFF_THRESHOLD 0.1\n#define DEBUG 0\n\nusing namespace Eloquent::Vision;\n\nESP32Camera camera;\nuint8_t prevFrame[DEST_WIDTH * DEST_HEIGHT] = { 0 };\nuint8_t currentFrame[DEST_WIDTH * DEST_HEIGHT] = { 0 };\n\n// function prototypes\nbool motionDetect();\nvoid updateFrame();\n\n/**\n *\n */\nvoid setup() {\n    Serial.begin(115200);\n    camera.begin(FRAME_SIZE, PIXFORMAT_GRAYSCALE);\n}\n\n/**\n *\n */\nvoid loop() {\n    /**\n     * Algorithm:\n     *  1. grab frame\n     *  2. compare with previous to detect motion\n     *  3. update previous frame\n     */\n\n    unsigned long start = millis();\n    camera_fb_t *frame = camera.capture();\n\n    downscaleImage(frame->buf, currentFrame, nearest, SOURCE_WIDTH, SOURCE_HEIGHT, BLOCK_SIZE);\n\n    if (motionDetect()) {\n        Serial.print("Motion detected @ ");\n        Serial.print(floor(1000.0f / (millis() - start)));\n        Serial.println(" FPS");\n    }\n\n    updateFrame();\n}\n\n/**\n * Compute the number of different blocks\n * If there are enough, then motion happened\n */\nbool motionDetect() {\n    uint16_t changes = 0;\n    const uint16_t blocks = DEST_WIDTH * DEST_HEIGHT;\n\n    for (int y = 0; y < DEST_HEIGHT; y++) {\n        for (int x = 0; x < DEST_WIDTH; x++) {\n            float current = currentFrame[y * DEST_WIDTH + x];\n            float prev = prevFrame[y * DEST_WIDTH + x];\n            // relative change of the block; a black block (prev == 0) would divide by zero, so use 1 instead\n            float delta = abs(current - prev) / (prev > 0 ? prev : 1);\n\n            if (delta >= BLOCK_DIFF_THRESHOLD)\n                changes += 1;\n        }\n    }\n\n    return (1.0 * changes / blocks) > IMAGE_DIFF_THRESHOLD;\n}\n\n/**\n * Copy current frame to previous\n */\nvoid updateFrame() {\n    memcpy(prevFrame, currentFrame, DEST_WIDTH * DEST_HEIGHT);\n}
\n
\n

Check the full project code on Github and remember to star!

\n

The post Easier, faster pure video ESP32 cam motion detection first appeared on Eloquent Arduino Blog.

\n", "date_published": "2020-05-10T21:26:08+02:00", "date_modified": "2020-05-13T21:19:35+02:00", "authors": [ { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" } ], "author": { "name": "simone", "url": 
"https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" }, "tags": [ "camera", "esp32", "Computer vision" ] }, { "id": "https://eloquentarduino.github.io/?p=931", "url": "https://eloquentarduino.github.io/2020/02/handwritten-digit-classification-with-arduino-and-microml/", "title": "Handwritten digit classification with Arduino and MicroML", "content_html": "

We continue exploring the endless possibilities of the MicroML (Machine Learning for Microcontrollers) framework on Arduino and ESP32 boards: in this post we're back to image classification. In particular, we'll distinguish handwritten digits using an ESP32 camera.

\n

\n

If this is the first time you're reading my blog, you may have missed that I'm on a journey to push the limits of Machine learning on embedded devices like the Arduino boards and ESP32.

\n

I started with accelerometer data classification, then did Wifi indoor positioning as a proof of concept.

\n

In the last few weeks, though, I took a more difficult path: image classification.

\n

Image classification is where Convolutional Neural Networks really shine, but I'm here to challenge this assumption and demonstrate that it is possible to come up with much lighter alternatives.

\n

In this post we continue with the examples, replicating a "benchmark" dataset in Machine learning: the handwritten digits classification.

\n
\nIf you are curious about a specific image classification task you would like to see implemented, let me know in the comments: I'm always open to new ideas\n
\n

The task

\n

The objective of this example is to be able to tell what a handwritten digit is, taking as input a photo from the ESP32 camera.

\n

In particular, we have 3 handwritten numbers and the task of our model will be to distinguish which image is what number.

\n

I only have a single image per digit, but you're free to draw as many samples as you like: it should help improve the performance of your classifier.

\n

1. Feature extraction

\n

When dealing with images, if you use a CNN this step is often overlooked: CNNs are made on purpose to handle raw pixel values, so you just throw the image in and it is handled properly.

\n

When using other types of classifiers, a bit of feature engineering could help the classifier do its job and achieve high accuracy.

\n

But not this time.

\n

I wanted to be as "light" as possible in this demo, so I only took a few steps during the feature acquisition:

\n
  1. use a grayscale image
  2. downsample to a manageable size
  3. convert it to black/white with a threshold
\n

I would hardly call this feature engineering.

\n

This is an example of the result of this pipeline.

\n

The code for this pipeline is really simple and is almost the same as the one in the example on motion detection.

\n
#include "esp_camera.h"\n\n#define PWDN_GPIO_NUM     -1\n#define RESET_GPIO_NUM    15\n#define XCLK_GPIO_NUM     27\n#define SIOD_GPIO_NUM     22\n#define SIOC_GPIO_NUM     23\n#define Y9_GPIO_NUM       19\n#define Y8_GPIO_NUM       36\n#define Y7_GPIO_NUM       18\n#define Y6_GPIO_NUM       39\n#define Y5_GPIO_NUM        5\n#define Y4_GPIO_NUM       34\n#define Y3_GPIO_NUM       35\n#define Y2_GPIO_NUM       32\n#define VSYNC_GPIO_NUM    25\n#define HREF_GPIO_NUM     26\n#define PCLK_GPIO_NUM     21\n\n#define FRAME_SIZE FRAMESIZE_QQVGA\n#define WIDTH 160\n#define HEIGHT 120\n#define BLOCK_SIZE 5\n#define W (WIDTH / BLOCK_SIZE)\n#define H (HEIGHT / BLOCK_SIZE)\n#define THRESHOLD 127\n\ndouble features[H*W] = { 0 };\n\nvoid setup() {\n    Serial.begin(115200);\n    Serial.println(setup_camera(FRAME_SIZE) ? "OK" : "ERR INIT");\n    delay(3000);\n}\n\nvoid loop() {\n    if (!capture_still()) {\n        Serial.println("Failed capture");\n        delay(2000);\n        return;\n    }\n\n    print_features();\n    delay(3000);\n}\n\nbool setup_camera(framesize_t frameSize) {\n    camera_config_t config;\n\n    config.ledc_channel = LEDC_CHANNEL_0;\n    config.ledc_timer = LEDC_TIMER_0;\n    config.pin_d0 = Y2_GPIO_NUM;\n    config.pin_d1 = Y3_GPIO_NUM;\n    config.pin_d2 = Y4_GPIO_NUM;\n    config.pin_d3 = Y5_GPIO_NUM;\n    config.pin_d4 = Y6_GPIO_NUM;\n    config.pin_d5 = Y7_GPIO_NUM;\n    config.pin_d6 = Y8_GPIO_NUM;\n    config.pin_d7 = Y9_GPIO_NUM;\n    config.pin_xclk = XCLK_GPIO_NUM;\n    config.pin_pclk = PCLK_GPIO_NUM;\n    config.pin_vsync = VSYNC_GPIO_NUM;\n    config.pin_href = HREF_GPIO_NUM;\n    config.pin_sscb_sda = SIOD_GPIO_NUM;\n    config.pin_sscb_scl = SIOC_GPIO_NUM;\n    config.pin_pwdn = PWDN_GPIO_NUM;\n    config.pin_reset = RESET_GPIO_NUM;\n    config.xclk_freq_hz = 20000000;\n    config.pixel_format = PIXFORMAT_GRAYSCALE;\n    config.frame_size = frameSize;\n    config.jpeg_quality = 12;\n    config.fb_count = 1;\n\n    bool ok = 
esp_camera_init(&config) == ESP_OK;\n\n    // the sensor handle is only valid if the init succeeded\n    if (!ok)\n        return false;\n\n    sensor_t *sensor = esp_camera_sensor_get();\n    sensor->set_framesize(sensor, frameSize);\n\n    return ok;\n}\n\nbool capture_still() {\n    camera_fb_t *frame = esp_camera_fb_get();\n\n    if (!frame)\n        return false;\n\n    // reset all the features\n    for (size_t i = 0; i < H * W; i++)\n      features[i] = 0;\n\n    // for each pixel, compute the position in the downsampled image\n    for (size_t i = 0; i < frame->len; i++) {\n      const uint16_t x = i % WIDTH;\n      const uint16_t y = floor(i / WIDTH);\n      const uint8_t block_x = floor(x / BLOCK_SIZE);\n      const uint8_t block_y = floor(y / BLOCK_SIZE);\n      const uint16_t j = block_y * W + block_x;\n\n      features[j] += frame->buf[i];\n    }\n\n    // hand the frame buffer back to the driver so it can be reused\n    esp_camera_fb_return(frame);\n\n    // apply threshold\n    for (size_t i = 0; i < H * W; i++) {\n      features[i] = (features[i] / (BLOCK_SIZE * BLOCK_SIZE) > THRESHOLD) ? 1 : 0;\n    }\n\n    return true;\n}\n\nvoid print_features() {\n    for (size_t i = 0; i < H * W; i++) {\n        Serial.print(features[i]);\n\n        if (i != H * W - 1)\n          Serial.print(',');\n    }\n\n    Serial.println();\n}
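If you want to try the downsample-and-threshold step off the board, here is a minimal sketch in plain C++, assuming the frame is already available as a flat grayscale buffer. The function name `binarize_blocks` and its parameters are illustrative and not part of the sketch above; it compares the block sum against `threshold * block * block`, which is equivalent to comparing the block mean against the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Downsample a grayscale image by summing block x block tiles, then
// binarize each tile: 1 if its mean is above the threshold, 0 otherwise.
// Names are illustrative, not taken from the Arduino sketch.
std::vector<uint8_t> binarize_blocks(const std::vector<uint8_t>& pixels,
                                     size_t width, size_t height,
                                     size_t block, uint8_t threshold) {
    assert(block > 0 && width % block == 0 && height % block == 0);
    assert(pixels.size() == width * height);

    const size_t w = width / block;
    const size_t h = height / block;
    std::vector<uint32_t> sums(w * h, 0);

    // accumulate every pixel into the tile it falls in
    for (size_t i = 0; i < pixels.size(); i++) {
        const size_t x = i % width;
        const size_t y = i / width;  // integer division already floors
        sums[(y / block) * w + (x / block)] += pixels[i];
    }

    // mean > threshold  <=>  sum > threshold * tile area
    std::vector<uint8_t> features(w * h);
    for (size_t j = 0; j < sums.size(); j++) {
        features[j] = (sums[j] > static_cast<uint32_t>(threshold) * block * block) ? 1 : 0;
    }

    return features;
}
```

With `WIDTH = 160`, `HEIGHT = 120` and `BLOCK_SIZE = 5` this yields the same 32 x 24 = 768 binary features as the sketch.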
\n

2. Samples recording

\n

To create your own dataset, you need a collection of handwritten digits.

\n

You can do this part as you like, by using pieces of paper or a monitor. I used a tablet because it was well illuminated and I could open a bunch of tabs to keep a record of my samples.

\n

As in the apple vs orange example, keep in mind that you should be consistent between the training phase and the inference phase.

\n

This is why I used tape to fix my ESP32 camera to the desk and kept the tablet in the exact same position.

\n

If you like, you could experiment with slightly varying the capture setup during training and see if your classifier still achieves good accuracy: this is a test I didn't run.

\n

3. Train and export the classifier

\r\n\r\n

For a detailed guide refer to the tutorial

\r\n\r\n

\r\n

from sklearn.ensemble import RandomForestClassifier\r\nfrom micromlgen import port\r\n\r\n# put your samples in the dataset folder\r\n# one class per file\r\n# one feature vector per line, in CSV format\r\nfeatures, classmap = load_features('dataset/')\r\nX, y = features[:, :-1], features[:, -1]\r\nclassifier = RandomForestClassifier(n_estimators=30, max_depth=10).fit(X, y)\r\nc_code = port(classifier, classmap=classmap)\r\nprint(c_code)
\r\n\r\n

At this point you have to copy the printed code and import it in your Arduino project, in a file called model.h.

\n

4. The result

\n

Okay, at this point you should have all the working pieces to do handwritten digit image classification on your ESP32 camera. Include your model in the sketch and run the classification.

\n
#include "model.h"\n\nvoid loop() {\n    if (!capture_still()) {\n        Serial.println("Failed capture");\n        delay(2000);\n\n        return;\n    }\n\n    Serial.print("Number: ");\n    Serial.println(classIdxToName(predict(features)));\n    delay(3000);\n}
\n

Done.

\n

You can see a demo of my results in the video below.

\n
\n
\n

Project figures

\n

My dataset is composed of 25 training samples in total and the SVM with linear kernel produced 17 support vectors.

\n

On my M5Stick camera board, the overhead for the model is 6.8 KB of flash and the inference takes 7 ms: not that bad!

\n
\r\n

Check the full project code on GitHub

\n

The post Handwritten digit classification with Arduino and MicroML appeared first on Eloquent Arduino Blog.

\n", "content_text": "We continue exploring the endless possibilities on the MicroML (Machine Learning for Microcontrollers) framework on Arduino and ESP32 boards: in this post we're back to image classification. In particular, we'll distinguish handwritten digits using an ESP32 camera.\n\n\nIf this is the first time you're reading my blog, you may have missed that I'm on a journey to push the limits of Machine learning on embedded devices like the Arduino boards and ESP32.\nI started with accelerometer data classification, then did Wifi indoor positioning as a proof of concept.\nIn the last weeks, though, I undertook a more difficult path that is image classification.\nImage classification is where Convolutional Neural Networks really shine, but I'm here to question this settlement and demostrate that it is possible to come up with much lighter alternatives.\nIn this post we continue with the examples, replicating a "benchmark" dataset in Machine learning: the handwritten digits classification.\n\nIf you are curious about a specific image classification task you would like to see implemented, let me know in the comments: I'm always open to new ideas\n\nThe task\nThe objective of this example is to be able to tell what an handwritten digit is, taking as input a photo from the ESP32 camera.\nIn particular, we have 3 handwritten numbers and the task of our model will be to distinguish which image is what number.\n\nI only have a single image per digit, but you're free to draw as many samples as you like: it should help improve the performance of you're classifier.\n1. 
Feature extraction\nWhen dealing with images, if you use a CNN this step is often overlooked: CNNs are made on purpose to handle raw pixel values, so you just throw the image in and it is handled properly.\nWhen using other types of classifiers, it could help add a bit of feature engineering to help the classifier doing its job and achieve high accuracy.\nBut not this time.\nI wanted to be as "light" as possible in this demo, so I only took a couple steps during the feature acquisition:\n\nuse a grayscale image\ndownsample to a manageable size\nconvert it to black/white with a threshold\n\nI would hardly call this feature engineering.\nThis is an example of the result of this pipeline.\n\nThe code for this pipeline is really simple and is almost the same from the example on motion detection.\n#include "esp_camera.h"\n\n#define PWDN_GPIO_NUM -1\n#define RESET_GPIO_NUM 15\n#define XCLK_GPIO_NUM 27\n#define SIOD_GPIO_NUM 22\n#define SIOC_GPIO_NUM 23\n#define Y9_GPIO_NUM 19\n#define Y8_GPIO_NUM 36\n#define Y7_GPIO_NUM 18\n#define Y6_GPIO_NUM 39\n#define Y5_GPIO_NUM 5\n#define Y4_GPIO_NUM 34\n#define Y3_GPIO_NUM 35\n#define Y2_GPIO_NUM 32\n#define VSYNC_GPIO_NUM 25\n#define HREF_GPIO_NUM 26\n#define PCLK_GPIO_NUM 21\n\n#define FRAME_SIZE FRAMESIZE_QQVGA\n#define WIDTH 160\n#define HEIGHT 120\n#define BLOCK_SIZE 5\n#define W (WIDTH / BLOCK_SIZE)\n#define H (HEIGHT / BLOCK_SIZE)\n#define THRESHOLD 127\n\ndouble features[H*W] = { 0 };\n\nvoid setup() {\n Serial.begin(115200);\n Serial.println(setup_camera(FRAME_SIZE) ? 
"OK" : "ERR INIT");\n delay(3000);\n}\n\nvoid loop() {\n if (!capture_still()) {\n Serial.println("Failed capture");\n delay(2000);\n return;\n }\n\n print_features();\n delay(3000);\n}\n\nbool setup_camera(framesize_t frameSize) {\n camera_config_t config;\n\n config.ledc_channel = LEDC_CHANNEL_0;\n config.ledc_timer = LEDC_TIMER_0;\n config.pin_d0 = Y2_GPIO_NUM;\n config.pin_d1 = Y3_GPIO_NUM;\n config.pin_d2 = Y4_GPIO_NUM;\n config.pin_d3 = Y5_GPIO_NUM;\n config.pin_d4 = Y6_GPIO_NUM;\n config.pin_d5 = Y7_GPIO_NUM;\n config.pin_d6 = Y8_GPIO_NUM;\n config.pin_d7 = Y9_GPIO_NUM;\n config.pin_xclk = XCLK_GPIO_NUM;\n config.pin_pclk = PCLK_GPIO_NUM;\n config.pin_vsync = VSYNC_GPIO_NUM;\n config.pin_href = HREF_GPIO_NUM;\n config.pin_sscb_sda = SIOD_GPIO_NUM;\n config.pin_sscb_scl = SIOC_GPIO_NUM;\n config.pin_pwdn = PWDN_GPIO_NUM;\n config.pin_reset = RESET_GPIO_NUM;\n config.xclk_freq_hz = 20000000;\n config.pixel_format = PIXFORMAT_GRAYSCALE;\n config.frame_size = frameSize;\n config.jpeg_quality = 12;\n config.fb_count = 1;\n\n bool ok = esp_camera_init(&config) == ESP_OK;\n\n sensor_t *sensor = esp_camera_sensor_get();\n sensor->set_framesize(sensor, frameSize);\n\n return ok;\n}\n\nbool capture_still() {\n camera_fb_t *frame = esp_camera_fb_get();\n\n if (!frame)\n return false;\n\n // reset all the features\n for (size_t i = 0; i < H * W; i++)\n features[i] = 0;\n\n // for each pixel, compute the position in the downsampled image\n for (size_t i = 0; i < frame->len; i++) {\n const uint16_t x = i % WIDTH;\n const uint16_t y = floor(i / WIDTH);\n const uint8_t block_x = floor(x / BLOCK_SIZE);\n const uint8_t block_y = floor(y / BLOCK_SIZE);\n const uint16_t j = block_y * W + block_x;\n\n features[j] += frame->buf[i];\n }\n\n // apply threshold\n for (size_t i = 0; i < H * W; i++) {\n features[i] = (features[i] / (BLOCK_SIZE * BLOCK_SIZE) > THRESHOLD) ? 
1 : 0;\n }\n\n return true;\n}\n\nvoid print_features() {\n for (size_t i = 0; i < H * W; i++) {\n Serial.print(features[i]);\n\n if (i != H * W - 1)\n Serial.print(',');\n }\n\n Serial.println();\n}\n2. Samples recording\nTo create your own dataset, you need a collection of handwritten digits.\nYou can do this part as you like, by using pieces of paper or a monitor. I used a tablet because it was well illuminated and I could open a bunch of tabs to keep a record of my samples.\nAs in the apple vs orange, keep in mind that you should be consistent during both the training phase and the inference phase.\nThis is why I used tape to fix my ESP32 camera to the desk and kept the tablet in the exact same position.\nIf you desire, you could experiment varying slightly the capturing setup during the training and see if your classifier still achieves good accuracy: this is a test I didn't make.\n3. Train and export the classifier\r\n\r\nFor a detailed guide refer to the tutorial\r\n\r\n\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom micromlgen import port\r\n\r\n# put your samples in the dataset folder\r\n# one class per file\r\n# one feature vector per line, in CSV format\r\nfeatures, classmap = load_features('dataset/')\r\nX, y = features[:, :-1], features[:, -1]\r\nclassifier = RandomForestClassifier(n_estimators=30, max_depth=10).fit(X, y)\r\nc_code = port(classifier, classmap=classmap)\r\nprint(c_code)\r\n\r\nAt this point you have to copy the printed code and import it in your Arduino project, in a file called model.h.\n4. The result\nOkay, at this point you should have all the working pieces to do handwritten digit image classification on your ESP32 camera. 
Include your model in the sketch and run the classification.\n#include "model.h"\n\nvoid loop() {\n if (!capture_still()) {\n Serial.println("Failed capture");\n delay(2000);\n\n return;\n }\n\n Serial.print("Number: ");\n Serial.println(classIdxToName(predict(features)));\n delay(3000);\n}\nDone.\nYou can see a demo of my results in the video below.\n\nhttps://eloquentarduino.github.io/wp-content/uploads/2020/02/MNIST-mute.mp4\nProject figures\nMy dataset is composed of 25 training samples in total and the SVM with linear kernel produced 17 support vectors.\nOn my M5Stick camera board, the overhead for the model is 6.8 Kb of flash and the inference takes 7ms: not that bad!\n\r\nCheck the full project code on Github\nL'articolo Handwritten digit classification with Arduino and MicroML proviene da Eloquent Arduino Blog.", "date_published": "2020-02-23T11:53:03+01:00", "date_modified": "2020-05-31T18:50:44+02:00", "authors": [ { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" } ], "author": { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" }, "tags": [ "camera", "esp32", "microml", "svm", "Arduino Machine learning", "Computer vision" ], "attachments": [ { "url": "https://eloquentarduino.github.io/wp-content/uploads/2020/02/MNIST-mute.mp4", "mime_type": "video/mp4", "size_in_bytes": 6424809 } ] }, { "id": "https://eloquentarduino.github.io/?p=820", "url": "https://eloquentarduino.github.io/2020/01/image-recognition-with-esp32-and-arduino/", "title": "Apple or Orange? Image recognition with ESP32 and Arduino", "content_html": "

Do you have an ESP32 camera?

\n

Want to do image recognition directly on your ESP32, without a PC?

\n

In this post we'll look into a very basic image recognition task: distinguish apples from oranges with machine learning.

\n

\n

Image recognition is a very hot topic these days in the AI/ML landscape. Convolutional Neural Networks really shine at this task and can achieve almost perfect accuracy in many scenarios.

\n

Sadly, a typical CNN is just too large to run on a microcontroller like the ESP32.

\n

Since in this series about Machine Learning on Microcontrollers we're exploring the potential of Support Vector Machines (SVMs) in solving different classification tasks, we'll take a look at image classification too.

\n

Table of contents
  1. What we're going to do
  2. Features definition
  3. Extracting RGB components
  4. Record samples image
  5. Training the classifier
  6. Real world example
    1. Disclaimer

\n

What we're going to do

\n

In a previous post about color identification with Machine learning, we used an Arduino to detect the object we were pointing at with a color sensor (TCS3200) by its color: if we detected yellow, for example, we knew we had a banana in front of us.

\n

Of course such a process is not object recognition at all: yellow may be a banana, or a lemon, or an apple.

\n

Object inference, in that case, works only if you have exactly one object for a given color.

\n

The objective of this post, instead, is to investigate if we can use the MicroML framework to do simple image recognition on the images from an ESP32 camera.

\n

This is much more similar to the tasks you do on your PC with CNN or any other form of NN you are comfortable with. Sure, we will still apply some restrictions to fit the problem on a microcontroller, but this is a huge step forward compared to the simple color identification.

\n
\nIn this context, image recognition means deciding which class (from the trained ones) the current image belongs to. This algorithm can't locate interesting objects in the image, nor can it detect whether an object is present in the frame. It will classify the current image based on the samples recorded during training.\n
\n

As in any introductory machine learning project on image classification worthy of respect, our task will be to distinguish an orange from an apple.

\n

Features definition

\n

I have to admit that I rarely use NN, so I may be wrong here, but from the examples I read online it looks to me that feature engineering is not a fundamental task with NN.

\n

The few times I used a CNN, I always used the whole image as input, as-is. I didn't extract any features from it (e.g. a color histogram): the CNN worked perfectly fine with raw images.

\n

I don't think this will work best with SVM, but in this first post we're starting as simple as possible, so we'll be using the RGB components of the image as our features. In a future post, we'll introduce additional features to try to improve our results.

\n

I said we're using the RGB components of the image. But not all of them.

\n

Even at the lowest resolution of 160x120 pixels, a raw RGB image from the camera would generate 160x120x3 = 57600 features: way too many.

\n

We need to reduce this number to the bare minimum.

\n

How many pixels do you think are necessary to get reasonable results in this task of telling apples from oranges?

\n

You would be surprised to know that I got 90% accuracy with an RGB image of 8x6!

\n

Yes, that's all we really need to do a good enough classification.

\n



\n

Of course this is a tradeoff: you can't expect to achieve 99% accuracy while keeping the model small enough to fit on a microcontroller. 90% is an acceptable accuracy for me in this context.

\n

Keep in mind, moreover, that the feature vector size grows quadratically with the image's linear dimensions (if you keep the aspect ratio). A raw RGB image of 8x6 generates 144 features, while an image of 16x12 generates 576 features, which was already causing random crashes on my ESP32.
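The arithmetic behind those numbers is easy to check at compile time. This is a minimal sketch with illustrative helper names; the block size of 20 is an assumption, chosen because it is the value that maps 160x120 to 8x6.

```cpp
#include <cassert>
#include <cstddef>

// One feature per pixel per channel (helper names are illustrative).
constexpr size_t feature_count(size_t width, size_t height, size_t channels) {
    return width * height * channels;
}

// Side of the downsampled image for a given block size.
constexpr size_t downsampled(size_t side, size_t block) {
    return side / block;
}

static_assert(feature_count(160, 120, 3) == 57600, "raw QQVGA RGB image");
static_assert(feature_count(8, 6, 3) == 144, "8x6 RGB image");
static_assert(feature_count(16, 12, 3) == 576, "16x12 RGB image");

// Assumed block size of 20: 160 / 20 = 8 and 120 / 20 = 6.
static_assert(downsampled(160, 20) == 8, "downsampled width");
static_assert(downsampled(120, 20) == 6, "downsampled height");
```

Doubling both sides from 8x6 to 16x12 multiplies the feature count by four, which is the quadratic growth mentioned above.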

\n

So we'll stick to 8x6 images.

\n

Now, how do you compact a 160x120 image to 8x6? With downsampling.

\n

This is the same technique we used in the post about motion detection on ESP32: we define a block size and average all the pixels inside the block to get a single value (you can refer to that post for more details).

\n

This time, though, we're working with RGB images instead of grayscale, so we'll repeat the exact same process 3 times, one for each channel.

\n

This is the code excerpt that does the downsampling.

\n
uint16_t rgb_frame[HEIGHT / BLOCK_SIZE][WIDTH / BLOCK_SIZE][3] = { 0 };\n\nvoid grab_image() {\n    for (size_t i = 0; i < len; i += 2) {\n        // get r, g, b from the buffer\n        // see later\n\n        const size_t j = i / 2;\n        // transform x, y in the original image to x, y in the downsampled image\n        // by dividing by BLOCK_SIZE\n        const uint16_t x = j % WIDTH;\n        const uint16_t y = floor(j / WIDTH);\n        const uint8_t block_x = floor(x / BLOCK_SIZE);\n        const uint8_t block_y = floor(y / BLOCK_SIZE);\n\n        // average pixels in block (accumulate)\n        rgb_frame[block_y][block_x][0] += r;\n        rgb_frame[block_y][block_x][1] += g;\n        rgb_frame[block_y][block_x][2] += b;\n    }\n}
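The excerpt relies on `len`, `r`, `g` and `b` being defined elsewhere. As a self-contained illustration of the same idea, here is a minimal sketch in plain C++ that works on already-decoded pixels. The `Rgb` struct and `downsample_rgb` are hypothetical names, and unlike the excerpt (which only accumulates) this version divides by the block area to get true averages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb { uint8_t r, g, b; };  // hypothetical decoded pixel

// Average each block x block tile per channel. The result is a flat vector
// ordered tile by tile, row by row: r, g, b of tile (0,0), then tile (1,0)...
std::vector<double> downsample_rgb(const std::vector<Rgb>& pixels,
                                   size_t width, size_t height, size_t block) {
    assert(block > 0 && width % block == 0 && height % block == 0);
    assert(pixels.size() == width * height);

    const size_t w = width / block;
    const size_t h = height / block;
    std::vector<double> features(w * h * 3, 0.0);

    // accumulate each pixel into its tile, one slot per channel
    for (size_t i = 0; i < pixels.size(); i++) {
        const size_t block_x = (i % width) / block;
        const size_t block_y = (i / width) / block;
        const size_t j = (block_y * w + block_x) * 3;

        features[j + 0] += pixels[i].r;
        features[j + 1] += pixels[i].g;
        features[j + 2] += pixels[i].b;
    }

    // turn the sums into averages
    const double area = static_cast<double>(block * block);
    for (double& value : features) {
        value /= area;
    }

    return features;
}
```

The output is already linearized, so it can be printed or passed to a classifier without a separate flattening step.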
\n\r\n
\r\n
\r\n
\r\n\t

\r\n\r\n\n

Extracting RGB components

\n

The ESP32 camera can store the image in different formats. These are the ones of interest to us (there are a couple more available):

\n
  1. grayscale: no color information, just the intensity is stored. The buffer has size HEIGHT * WIDTH
  2. RGB565: stores each RGB pixel in two bytes, with 5 bits for red, 6 for green and 5 for blue. The buffer has size HEIGHT * WIDTH * 2
  3. JPEG: encodes (in hardware?) the image to JPEG. The buffer has a variable length, based on the encoding results
\n

For our purpose, we'll use the RGB565 format and extract the 3 components from the 2 bytes with the following code.

\n
config.pixel_format = PIXFORMAT_RGB565;\n\nfor (size_t i = 0; i < len; i += 2) {\n    // each pixel takes two bytes, high byte first\n    const uint8_t high = buf[i];\n    const uint8_t low  = buf[i+1];\n    const uint16_t pixel = (high << 8) | low;\n\n    // 5 bits of red, 6 bits of green, 5 bits of blue\n    const uint8_t r = (pixel & 0b1111100000000000) >> 11;\n    const uint8_t g = (pixel & 0b0000011111100000) >> 5;\n    const uint8_t b = (pixel & 0b0000000000011111);\n}
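To check the bit arithmetic without a camera, here is a minimal sketch of the decoding as a standalone function. `Rgb565` and `decode_rgb565` are illustrative names, and the byte order (high byte first) is the one assumed by the loop above. Green occupies bits 5 to 10 of the 16-bit pixel, so it has to be shifted right by 5 to keep all 6 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decoded components: r and b are in 0..31, g is in 0..63.
struct Rgb565 { uint8_t r, g, b; };

// Decode the pixel at `pixel_index` from a buffer of 2-byte RGB565 pixels,
// assuming the high byte comes first (illustrative helper, not a library call).
Rgb565 decode_rgb565(const uint8_t* buf, size_t len, size_t pixel_index) {
    assert(2 * pixel_index + 1 < len);

    const uint8_t high = buf[2 * pixel_index];
    const uint8_t low = buf[2 * pixel_index + 1];
    const uint16_t pixel = static_cast<uint16_t>((high << 8) | low);

    Rgb565 out;
    out.r = (pixel >> 11) & 0x1F;  // top 5 bits
    out.g = (pixel >> 5) & 0x3F;   // middle 6 bits
    out.b = pixel & 0x1F;          // bottom 5 bits
    return out;
}
```

A pure green pixel is stored as the bytes `0x07 0xE0` and decodes to g = 63 with r and b at 0.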
\n

Record samples image

\n

Now that we can grab the images from the camera, we'll need to take a few samples of each object we want to recognize.

\n

Before doing so, we'll linearize the image matrix to a 1-dimensional vector, because that's what our prediction function expects.

\n
#define H (HEIGHT / BLOCK_SIZE)\n#define W (WIDTH / BLOCK_SIZE)\n\nvoid linearize_features() {\n  size_t i = 0;\n  double features[H*W*3] = {0};\n\n  for (int y = 0; y < H; y++) {\n    for (int x = 0; x < W; x++) {\n      features[i++] = rgb_frame[y][x][0];\n      features[i++] = rgb_frame[y][x][1];\n      features[i++] = rgb_frame[y][x][2];\n    }\n  }\n\n  // print to serial\n  for (size_t i = 0; i < H*W*3; i++) {\n    Serial.print(features[i]);\n    Serial.print('\\t');\n  }\n\n  Serial.println();\n}
\n

Now you can set up your acquisition environment and take the samples: 15-20 of each object will do the job.

\n
\nImage acquisition is a very noisy process: even keeping the camera still, you will get fluctuating values.
You need to be very accurate during this phase if you want to achieve good results.
I suggest you immobilize your camera with tape to a flat surface or use some kind of photographic easel.\n
\n

Training the classifier

\n

To train the classifier, save the features for each object in a file, one feature vector per line. Then follow the steps on how to train a ML classifier for Arduino to get the exported model.

\n

You can experiment with different classifier configurations.

\n

My features were well distinguishable, so I had great results (100% accuracy) with any kernel (even linear).

\n

One odd thing happened with the RBF kernel: I had to use an extremely low gamma value (0.0000001). Can anyone explain why? I usually go with a default value of 0.001.
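One plausible explanation, offered as an assumption rather than something tested in this post: the features are raw, unscaled block sums, so the squared distance between two samples is huge and the RBF kernel exp(-gamma * ||a - b||^2) collapses to zero unless gamma is tiny. The sketch below computes the kernel for two made-up vectors of 144 features that differ by 500 on every component (hypothetical magnitudes, not measured values).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RBF kernel: exp(-gamma * squared Euclidean distance between a and b).
double rbf_kernel(const std::vector<double>& a, const std::vector<double>& b,
                  double gamma) {
    assert(a.size() == b.size());

    double squared_distance = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        const double d = a[i] - b[i];
        squared_distance += d * d;
    }

    return std::exp(-gamma * squared_distance);
}
```

For those two vectors the squared distance is 144 * 500^2 = 36,000,000. With gamma = 0.001 the kernel is exp(-36000), which underflows to 0, so every pair of samples looks completely dissimilar. With gamma = 0.0000001 it becomes exp(-3.6), roughly 0.027, which the classifier can actually work with. Scaling the features to a small range such as [0, 1] should make more conventional gamma values usable again.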

\n

The model produced 13 support vectors.

\n

I did no feature scaling: you could try it if you're classifying more than 2 classes and getting poor results.

\n

Real world example

\n

If you followed all the steps above, you should now have a model capable of detecting whether your camera is pointed at an apple or an orange, as you can see in the following video.

\n
\n

\n

The little white object you see at the bottom of the image is the camera, taped to the desk.

\n

Did you think it was possible to do simple image classification on your ESP32?

\n

Disclaimer

\n

This is not full-fledged object recognition: it can't label objects while you walk as TensorFlow can do, for example.

\n

You have to carefully craft your setup and be as consistent as possible between training and inference.

\n

Still, I think this is a fun proof-of-concept that can have useful applications in simple scenarios where you can live with a fixed camera and don't want to use a full Raspberry Pi.

\n

In the coming weeks I plan to finally try TensorFlow Lite for Microcontrollers on my ESP32, so I'll try to compare it with this example and report my results.

\n

Now that you can do image classification on your ESP32, can you think of a use case you will be able to apply this code to?

\n

Let me know in the comments, we could even try to build it together if you need some help.

\n
\r\n

Check the full project code on GitHub

\n

The post Apple or Orange? Image recognition with ESP32 and Arduino appeared first on Eloquent Arduino Blog.

\n", "content_text": "Do you have an ESP32 camera? \nWant to do image recognition directly on your ESP32, without a PC?\nIn this post we'll look into a very basic image recognition task: distinguish apples from oranges with machine learning.\n\n\nImage recognition is a very hot topic these days in the AI/ML landscape. Convolutional Neural Networks really shines in this task and can achieve almost perfect accuracy on many scenarios.\nSadly, you can't run CNN on your ESP32, they're just too large for a microcontroller.\nSince in this series about Machine Learning on Microcontrollers we're exploring the potential of Support Vector Machines (SVMs) at solving different classification tasks, we'll take a look into image classification too.\nTable of contentsWhat we're going to doFeatures definitionExtracting RGB componentsRecord samples imageTraining the classifierReal world exampleDisclaimer\nWhat we're going to do\nIn a previous post about color identification with Machine learning, we used an Arduino to detect the object we were pointing at with a color sensor (TCS3200) by its color: if we detected yellow, for example, we knew we had a banana in front of us.\nOf course such a process is not object recognition at all: yellow may be a banane, or a lemon, or an apple.\nObject inference, in that case, works only if you have exactly one object for a given color.\nThe objective of this post, instead, is to investigate if we can use the MicroML framework to do simple image recognition on the images from an ESP32 camera.\nThis is much more similar to the tasks you do on your PC with CNN or any other form of NN you are comfortable with. Sure, we will still apply some restrictions to fit the problem on a microcontroller, but this is a huge step forward compared to the simple color identification.\n\nIn this context, image recognition means deciding which class (from the trained ones) the current image belongs to. 
This algorithm can't locate interesting objects in the image, neither detect if an object is present in the frame. It will classify the current image based on the samples recorded during training.\n\nAs any beginning machine learning project about image classification worth of respect, our task will be to distinguish an orange from an apple.\nFeatures definition\nI have to admit that I rarely use NN, so I may be wrong here, but from the examples I read online it looks to me that features engineering is not a fundamental task with NN.\nThose few times I used CNN, I always used the whole image as input, as-is. I didn't extracted any feature from them (e.g. color histogram): the CNN worked perfectly fine with raw images.\nI don't think this will work best with SVM, but in this first post we're starting as simple as possible, so we'll be using the RGB components of the image as our features. In a future post, we'll introduce additional features to try to improve our results.\nI said we're using the RGB components of the image. But not all of them.\nEven at the lowest resolution of 160x120 pixels, a raw RGB image from the camera would generate 160x120x3 = 57600 features: way too much.\nWe need to reduce this number to the bare minimum.\nHow much pixels do you think are necessary to get reasonable results in this task of classifying apples from oranges?\nYou would be surprised to know that I got 90% accuracy with an RGB image of 8x6!\n\nYes, that's all we really need to do a good enough classification.\nYou can distinguish apples from oranges on ESP32 with 8x6 pixels only!Click To Tweet\nOf course this is a tradeoff: you can't expect to achieve 99% accuracy while mantaining the model size small enough to fit on a microcontroller. 90% is an acceptable accuracy for me in this context.\nYou have to keep in mind, moreover, that the features vector size grows quadratically with the image size (if you keep the aspect ratio). 
A raw RGB image of 8x6 generates 144 features: an image of 16x12 generates 576 features. This was already causing random crashes on my ESP32.\nSo we'll stick to 8x6 images.\nNow, how do you compact a 160x120 image to 8x6? With downsampling.\nThis is the same tecnique we've used in the post about motion detection on ESP32: we define a block size and average all the pixels inside the block to get a single value (you can refer to that post for more details).\n\nThis time, though, we're working with RGB images instead of grayscale, so we'll repeat the exact same process 3 times, one for each channel.\nThis is the code excerpt that does the downsampling.\nuint16_t rgb_frame[HEIGHT / BLOCK_SIZE][WIDTH / BLOCK_SIZE][3] = { 0 };\n\nvoid grab_image() {\n for (size_t i = 0; i < len; i += 2) {\n // get r, g, b from the buffer\n // see later\n\n const size_t j = i / 2;\n // transform x, y in the original image to x, y in the downsampled image\n // by dividing by BLOCK_SIZE\n const uint16_t x = j % WIDTH;\n const uint16_t y = floor(j / WIDTH);\n const uint8_t block_x = floor(x / BLOCK_SIZE);\n const uint8_t block_y = floor(y / BLOCK_SIZE);\n\n // average pixels in block (accumulate)\n rgb_frame[block_y][block_x][0] += r;\n rgb_frame[block_y][block_x][1] += g;\n rgb_frame[block_y][block_x][2] += b;\n }\n}\n\r\n\r\n\r\n \r\n\tFinding this content useful?\r\n\r\n\t\r\n\r\n\t\r\n\t\t\r\n\t\t\r\n\t \r\n \r\n \r\n \r\n\r\n\r\n\r\n\nExtracting RGB components\nThe ESP32 camera can store the image in different formats (of our interest \u2014 there are a couple more available):\n\ngrayscale: no color information, just the intensity is stored. The buffer has size HEIGHT*WIDTH\nRGB565: stores each RGB pixel in two bytes, with 5 bit for red, 6 for green and 5 for blue. The buffer has size HEIGHT * WIDTH * 2\nJPEG: encodes (in hardware?) the image to jpeg. 
The buffer has a variable length, based on the encoding results\n\nFor our purpose, we'll use the RGB565 format and extract the 3 components from the 2 bytes with the following code.\n\nconfig.pixel_format = PIXFORMAT_RGB565;\n\nfor (size_t i = 0; i < len; i += 2) {\n const uint8_t high = buf[i];\n const uint8_t low = buf[i+1];\n const uint16_t pixel = (high << 8) | low;\n\n const uint8_t r = (pixel & 0b1111100000000000) >> 11;\n const uint8_t g = (pixel & 0b0000011111100000) >> 6;\n const uint8_t b = (pixel & 0b0000000000011111);\n}\nRecord samples image\nNow that we can grab the images from the camera, we'll need to take a few samples of each object we want to racognize.\nBefore doing so, we'll linearize the image matrix to a 1-dimensional vector, because that's what our prediction function expects.\n#define H (HEIGHT / BLOCK_SIZE)\n#define W (WIDTH / BLOCK_SIZE)\n\nvoid linearize_features() {\n size_t i = 0;\n double features[H*W*3] = {0};\n\n for (int y = 0; y < H; y++) {\n for (int x = 0; x < W; x++) {\n features[i++] = rgb_frame[y][x][0];\n features[i++] = rgb_frame[y][x][1];\n features[i++] = rgb_frame[y][x][2];\n }\n }\n\n // print to serial\n for (size_t i = 0; i < H*W*3; i++) {\n Serial.print(features[i]);\n Serial.print('\\t');\n }\n\n Serial.println();\n}\nNow you can setup your acquisition environment and take the samples: 15-20 of each object will do the job.\n\nImage acquisition is a very noisy process: even keeping the camera still, you will get fluctuating values. You need to be very accurate during this phase if you want to achieve good results. I suggest you immobilize your camera with tape to a flat surface or use some kind of photographic easel.\n\nTraining the classifier\nTo train the classifier, save the features for each object in a file, one features vector per line. Then follow the steps on how to train a ML classifier for Arduino to get the exported model.\nYou can experiment with different classifier configurations. 
\nMy features were well distinguishable, so I had great results (100% accuracy) with any kernel (even linear).\nOne odd thing happened with the RBF kernel: I had to use an extremely low gamma value (0.0000001). Does anyone can explain me why? I usually go with a default value of 0.001.\nThe model produced 13 support vectors.\nI did no features scaling: you could try it if classifying more than 2 classes and having poor results.\n\nReal world example\nIf you followed all the steps above, you should now have a model capable of detecting if your camera is shotting an apple or an orange, as you can see in the following video.\nhttps://eloquentarduino.github.io/wp-content/uploads/2020/01/Apple-vs-Orange.mp4\n\nThe little white object you see at the bottom of the image is the camera, taped to the desk.\nDid you think it was possible to do simple image classification on your ESP32?\nDisclaimer\nThis is not full-fledged object recognition: it can't label objects while you walk as Tensorflow can do, for example.\nYou have to carefully craft your setup and be as consistent as possible between training and inferencing.\nStill, I think this is a fun proof-of-concept that can have useful applications in simple scenarios where you can live with a fixed camera and don't want to use a full Raspberry Pi.\nIn the next weeks I settled to finally try TensorFlow Lite for Microcontrollers on my ESP32, so I'll try to do a comparison between them and this example and report my results.\nNow that you can do image classification on your ESP32, can you think of a use case you will be able to apply this code to? \nLet me know in the comments, we could even try realize it together if you need some help.\n\r\nCheck the full project code on Github\nL'articolo Apple or Orange? 
Image recognition with ESP32 and Arduino proviene da Eloquent Arduino Blog.", "date_published": "2020-01-12T11:32:08+01:00", "date_modified": "2020-05-31T18:51:27+02:00", "authors": [ { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" } ], "author": { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" }, "tags": [ "camera", "esp32", "microml", "svm", "Arduino Machine learning", "Computer vision" ], "attachments": [ { "url": "https://eloquentarduino.github.io/wp-content/uploads/2020/01/Apple-vs-Orange.mp4", "mime_type": "video/mp4", "size_in_bytes": 1642079 } ] }, { "id": "https://eloquentarduino.github.io/?p=779", "url": "https://eloquentarduino.com/projects/esp32-arduino-motion-detection", "title": "Motion detection with ESP32 cam only (Arduino version)", "content_html": "

Do you have an ESP32 camera? Do you want to do motion detection WITHOUT ANY external hardware?

\n

Here's a tutorial made just for you: 30 lines of code and you will know when something changes in your video stream \ud83c\udfa5

\n


\n

\n

** See the updated version of this project: it's easier to use and waaay faster: Easier, faster, pure video ESP32 cam motion detection **

\n

Table of contents
  1. What is (naive) motion detection?
  2. Can't I use an external PIR?
    1. External hardware
    2. Field of View
    3. Cold objects
  3. What do you need?
  4. How does it work?
    1. Downsampling
    2. Blocks difference threshold
    3. Image difference threshold
    4. Combining all together
  5. Real world example

\n

What is (naive) motion detection?

\n

Quoting from Wikipedia

\n
\n

Motion detection is the process of detecting a change in the position of an object relative to its surroundings or a change in the surroundings relative to an object

\n
\n

In this project, we're implementing what I call naive motion detection: that is, we're not focusing on a particular object and following its motion.

\n

We'll only detect if any considerable portion of the image changed from one frame to the next.

\n

We won't identify the location of the motion (that's the subject of a future project), nor what caused it. We will analyze the video stream in (almost) real time and compare it frame by frame: if lots of pixels changed, we'll call it motion.

\n

Can't I use an external PIR?

\n

Several projects on the internet about motion detection with an ESP32 cam use an external PIR sensor to trigger the video recording.

\n

What's the problem with that approach?

\n

1. External hardware

\n

First of all, you need external hardware. If you're using a breadboard, no problem, you just need a couple more wires and you're good to go. But I have a nice M5stick camera (no affiliate link), that's already well packaged, so it won't be that easy to add a PIR sensor.

\n

2. Field of View

\n

PIR sensors have a limited FOV (field of view), so you will need more than one to cover the whole range of the camera.

\n

My camera, for example, has a fish-eye lens which gives me a 160\u00b0 field of view. Most cheap PIR sensors have a 120\u00b0 field of view, so one will not suffice. This makes my project even bulkier.

\n

3. Cold objects

\n

PIR sensors get triggered by infrared light. Infrared light gets emitted by hot bodies (like people and animals).

\n

But motion in a video stream can happen for a variety of reasons, not necessarily due to hot bodies, for example if you want to monitor a street for cars passing by.

\n

A PIR sensor can't do this: video motion detection can.

\n



\n
Do you like the motion effect at the beginning of the post? Check it out on Github
\n

What do you need?

\n

All you need for this project is a board with a camera sensor. As I said, I have an M5Stick Camera with a fish-eye lens, but any ESP32 based camera should work out of the box:

\n
  - ESP32 cam
  - ESP32 eye
  - TTGO camera
  - ... any other flavor of ESP32 camera
\n

\n

How does it work?

\n

Ok, let's go to the "technical" stuff.

\n

Simply put, the algorithm counts the number of different pixels from one frame to the next: if many pixels changed, it will detect motion.

\n

Well, it's almost like this.

\n

Of course such an algorithm will be very sensitive to noise (which is quite high on these low-cost cameras). We need to mitigate false-positive triggers.

\n

Downsampling

\n

One super-simple and super-effective way of doing this is to work with blocks, instead of pixels. A block is simply an N x N square, whose value is the average of the pixels it contains.

\n

This greatly reduces sensitivity to noise, providing a more robust detection. Here's an example of what the "block-ing" operation does to an image.

\n


\n

It's really a "pixelating" effect: you take the original image (let's say 320x240 pixels) and shrink it by a factor of 10, to 32x24.

\n

This has the added benefit that it's much more lightweight to work with a 32x24 matrix instead of a 320x240 one: if you want to do real-time detection, this is a MUST.
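
The block averaging described above can be sketched as follows. This is only a minimal sketch, not the project's actual code: the function name `downsample`, the row-major 8-bit grayscale layout and the requirement that width and height are exact multiples of the block size are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original sketch): average each
// blockSize x blockSize square of a row-major grayscale image into one value.
// Assumes width and height are exact multiples of blockSize.
void downsample(const uint8_t *image, size_t width, size_t height,
                size_t blockSize, uint8_t *blocks) {
    const size_t blocksPerRow = width / blockSize;
    const size_t blocksPerCol = height / blockSize;

    for (size_t by = 0; by < blocksPerCol; by++) {
        for (size_t bx = 0; bx < blocksPerRow; bx++) {
            uint32_t sum = 0;

            // accumulate every pixel that falls inside this block
            for (size_t y = by * blockSize; y < (by + 1) * blockSize; y++)
                for (size_t x = bx * blockSize; x < (bx + 1) * blockSize; x++)
                    sum += image[y * width + x];

            // the block value is the mean of its pixels
            blocks[by * blocksPerRow + bx] =
                static_cast<uint8_t>(sum / (blockSize * blockSize));
        }
    }
}
```

With a 320x240 frame and a block size of 10, the `blocks` buffer has to hold 32 x 24 = 768 values.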

\n

How should you choose the scale factor?

\n

Well, it depends.

\n

It depends on the sensitivity you want to achieve. The higher the downsampling, the less sensitive your detection will be.

\n

If you want to detect a person passing 50cm away from the camera, you can increase this number without any problem. If you want to detect a dog 10m away, you should keep it in the 5-10 range.

\n

Experiment with your own use case and tweak by trial and error.

\n

Blocks difference threshold

\n

Once we've defined the block size, we need to detect if a block changed from one frame to the next.

\n

Of course, just testing for difference (current != prev) would be again too sensitive to noise. A block can change for a variety of reasons, the first of which is the bad camera quality.

\n

So we instead define a percent threshold above which we can say for sure the block actually changed. A good starting point could be 10-20%, but again you need to tweak this to your needs.

\n

The higher the threshold, the less sensitive the algorithm will be.

\n

In code it is calculated as

\n
// cast to float so the division is not truncated; assumes prevBlockValue is non-zero
float delta = fabs((float) currentBlockValue - prevBlockValue) / prevBlockValue;
\n

which indicates the relative increment/decrement from the previous value.
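
Wrapped in a function, the per-block test might look like the sketch below. The name `blockChanged`, the 8-bit block values and the clamping of a zero previous value to 1 (to avoid a division by zero on pure-black blocks) are assumptions of this example, not something taken from the original sketch.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: true when a block's value moved by at least
// `threshold` (e.g. 0.15f for 15%) relative to its previous value.
bool blockChanged(uint8_t current, uint8_t prev, float threshold) {
    // Assumption: clamp the denominator to 1 so a pure-black previous
    // block (value 0) does not cause a division by zero.
    const float reference = prev > 0 ? prev : 1.0f;
    const float delta = std::fabs((float) current - (float) prev) / reference;

    return delta >= threshold;
}
```

For instance, with a 15% threshold a block going from 100 to 120 (a 20% change) counts as changed, while one going from 100 to 105 (5%) does not.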

\n

Image difference threshold

\n

Now that we can detect if a block changed from one frame to the next, we can actually detect if the image changed.

\n

You could decide to trigger motion even if a single block changed, but I suggest you set a higher value here.

\n

Let's return to the 320x240 image example. With a 10x10 block, you'll be working with 32x24 = 768 blocks: will you call it "motion" if 1 out of 768 blocks changed value?

\n

I don't think so. You want something more robust. You want 50 blocks to change. Or at least 20 blocks. If you do the math, 20 blocks out of 768 is only about 2.6% of the image, which is hardly noticeable.

\n

If you want to be robust, don't set this threshold too low. Again, tweak it through real-world experiments.

\n

In code it is calculated as:

\n
float changedBlocksPercent = (float) changedBlocks / totalBlocks;
\n
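
As a small sketch of the image-level check, the function below counts the flagged blocks and returns their share of the total. The name `changedBlocksRatio` and the array of per-block flags are hypothetical; the cast matters because dividing two integers would truncate the result to 0 or 1.

```cpp
#include <cstddef>

// Hypothetical helper: fraction (0..1) of blocks flagged as changed.
// `changed` holds one flag per block, e.g. 768 flags for a 32x24 grid.
float changedBlocksRatio(const bool *changed, size_t totalBlocks) {
    if (totalBlocks == 0) return 0.0f;

    size_t changedBlocks = 0;
    for (size_t i = 0; i < totalBlocks; i++)
        if (changed[i]) changedBlocks++;

    // cast before dividing: integer division would only yield 0 or 1
    return static_cast<float>(changedBlocks) / totalBlocks;
}
```

With 20 flagged blocks out of 768 the ratio is 20 / 768, roughly 0.026, so an image threshold of 0.20 would not be reached.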

Combining all together

\n

Recapping: when running the motion detection algorithm you have 3 parameters to set:

\n
    \n
  1. the block size
  2. the block difference threshold
  3. the image difference threshold
\n

Let's pick 3 sensible defaults: block size = 10, block threshold = 15%, image threshold = 20%.

\n

What do these parameters translate to in practice?

\n

They mean that motion will be detected if at least 20% of the image, averaged in blocks of 10x10 pixels, changed its value by at least 15% from one frame to the next.

\n

\n

As you can see, you don't need high-definition images to (naively) detect if something happened to the image. Large areas of motion are easily detectable, even at very low resolution.
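
Putting the three parameters together, one detection pass over two already-downsampled frames could look like the sketch below. It is an illustration under stated assumptions, not the project's code: the constant names, the function name `detectMotion`, the 8-bit block values and the clamping of a zero previous value are all hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Defaults quoted in the text; the constant names are illustrative.
const float BLOCK_THRESHOLD = 0.15f;  // a block must change by at least 15%
const float IMAGE_THRESHOLD = 0.20f;  // at least 20% of the blocks must change

// Hypothetical helper: compare two already-downsampled frames
// (one value per 10x10 block) and report whether motion occurred.
bool detectMotion(const uint8_t *current, const uint8_t *prev, size_t totalBlocks) {
    if (totalBlocks == 0) return false;

    size_t changedBlocks = 0;

    for (size_t i = 0; i < totalBlocks; i++) {
        // assumption: clamp the reference to 1 to avoid dividing by zero
        const float reference = prev[i] > 0 ? prev[i] : 1.0f;
        const float delta = std::fabs((float) current[i] - (float) prev[i]) / reference;

        if (delta >= BLOCK_THRESHOLD) changedBlocks++;
    }

    // motion only if enough of the image changed
    return (float) changedBlocks / totalBlocks >= IMAGE_THRESHOLD;
}
```

In a real sketch you would call this once per captured frame and then copy the current blocks into the previous buffer before grabbing the next frame.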

\n

Real world example

\n

Now the fun part. I'll show you how it performs in a real-world scenario.

\n

To keep it simple, I wrote a sketch that does only motion detection, not video streaming over HTTP.

\n

This means you won't be able to see the original image recorded from the camera. Nevertheless, I have kept the block size to a minimum to allow for the best quality possible.

\n
\n

This is me passing my arm in front of the camera a few times.

\n

The grid you see represents the actual pixels used for the computation. Each cell corresponds to one pixel of the downscaled image.

\n

The orange cells highlight the pixels that the algorithm sees as "different" from one frame to the next. As you can see, some pixels are detected even if no motion is happening. That's the noise I talked about multiple times throughout the post.

\n

When I move my arm in the frame, you see lots of pixels become activated, so the "Motion" text appears.

\n

While moving the arm, you may notice what I call the "ghost" effect. You actually see 2 regions of motion: one is where my arm is now, which of course changed. The other is the region where my arm was in the previous frame, which returned to its original content.

\n

This is why I suggest you keep the image difference threshold high: when real motion happens you will notice it for sure, because the activated region of the image is actually bigger than the moving object itself.
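
The "ghost" effect is easy to reproduce on a toy one-dimensional row of blocks. The sketch below is purely illustrative (the helper names and pixel values are made up, and it uses a plain inequality instead of a percent threshold): it draws a bright object on a dark background in two positions and counts the cells that differ.

```cpp
#include <cstddef>
#include <cstdint>

// Toy illustration (not from the original sketch): count how many cells
// differ between two 1-D rows of block values.
size_t countDifferent(const uint8_t *current, const uint8_t *prev, size_t length) {
    size_t different = 0;
    for (size_t i = 0; i < length; i++)
        if (current[i] != prev[i]) different++;
    return different;
}

// Draw a bright object (value 200) of the given width on a dark
// background (value 50), starting at index `start`.
void drawObject(uint8_t *row, size_t length, size_t start, size_t objectWidth) {
    for (size_t i = 0; i < length; i++)
        row[i] = (i >= start && i < start + objectWidth) ? 200 : 50;
}
```

If a 3-block-wide object moves from index 2 to index 8 in a 16-block row, 6 cells differ: 3 where the object is now and 3 where it was in the previous frame.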

\n

Do you like the grid effect of the sample video? Let me know in the comments if you want me to share it.

\n

Or even better: subscribe to the newsletter and you will get it directly in your inbox with my next email.


Check the full project code on Github

\n

Also check out the gist for the visualization tool

\n

The post Motion detection with ESP32 cam only (Arduino version) first appeared on Eloquent Arduino Blog.

\n", "content_text": "Do you have an ESP32 camera? Do you want to do motion detection WITHOUT ANY external hardware?\nHere's a tutorial made just for you: 30 lines of code and you will know when something changes in your video stream \n\n\n ** See the updated version of this project: it's easier to use and waaay faster: Easier, faster, pure video ESP32 cam motion detection **\nTable of contents\nWhat is (naive) motion detection?\nCan't I use an external PIR?\nExternal hardware\nField of View\nCold objects\nWhat do you need?\nHow does it work?\nDownsampling\nBlocks difference threshold\nImage difference threshold\nCombining all together\nReal world example\nWhat is (naive) motion detection?\nQuoting from Wikipedia\n\nMotion detection is the process of detecting a change in the position of an object relative to its surroundings or a change in the surroundings relative to an object\n\nIn this project, we're implementing what I call naive motion detection: that is, we're not focusing on a particular object and following its motion.\nWe'll only detect if any considerable portion of the image changed from one frame to the next.\nWe won't identify the location of the motion (that's the subject of a future project), nor what caused it. We will analyze the video stream in (almost) real time and compare it frame by frame: if lots of pixels changed, we'll call it motion.\nCan't I use an external PIR?\nSeveral projects on the internet about motion detection with an ESP32 cam use an external PIR sensor to trigger the video recording.\nWhat's the problem with that approach? \n1. External hardware\nFirst of all, you need external hardware. If you're using a breadboard, no problem, you just need a couple more wires and you're good to go. But I have a nice M5stick camera (no affiliate link), that's already well packaged, so it won't be that easy to add a PIR sensor.\n2. Field of View\nPIR sensors have a limited FOV (field of view), so you will need more than one to cover the whole range of the camera. 
\nMy camera, for example, has a fish-eye lens which gives me a 160\u00b0 field of view. Most cheap PIR sensors have a 120\u00b0 field of view, so one will not suffice. This makes my project even bulkier.\n3. Cold objects\nPIR sensors get triggered by infrared light. Infrared light gets emitted by hot bodies (like people and animals).\nBut motion in a video stream can happen for a variety of reasons, not necessarily due to hot bodies, for example if you want to monitor a street for cars passing by.\nA PIR sensor can't do this: video motion detection can.\n Do you like the motion effect at the beginning of the post? Check it out on Github\nWhat do you need?\nAll you need for this project is a board with a camera sensor. As I said, I have an M5Stick Camera with a fish-eye lens, but any ESP32 based camera should work out of the box:\n\nESP32 cam\nESP32 eye\nTTGO camera\n... any other flavor of ESP32 camera\n\n\nHow does it work?\nOk, let's go to the "technical" stuff.\nSimply put, the algorithm counts the number of different pixels from one frame to the next: if many pixels changed, it will detect motion.\nWell, it's almost like this.\nOf course such an algorithm will be very sensitive to noise (which is quite high on these low-cost cameras). We need to mitigate false-positive triggers.\nDownsampling\nOne super-simple and super-effective way of doing this is to work with blocks, instead of pixels. A block is simply an N x N square, whose value is the average of the pixels it contains.\nThis greatly reduces sensitivity to noise, providing a more robust detection. Here's an example of what the "block-ing" operation does to an image.\n\nIt's really a "pixelating" effect: you take the original image (let's say 320x240 pixels) and shrink it by a factor of 10, to 32x24. 
\nThis has the added benefit that it's much more lightweight to work with a 32x24 matrix instead of a 320x240 one: if you want to do real-time detection, this is a MUST.\nHow should you choose the scale factor?\nWell, it depends.\nIt depends on the sensitivity you want to achieve. The higher the downsampling, the less sensitive your detection will be. \nIf you want to detect a person passing 50cm away from the camera, you can increase this number without any problem. If you want to detect a dog 10m away, you should keep it in the 5-10 range.\nExperiment with your own use case and tweak by trial and error.\nBlocks difference threshold\nOnce we've defined the block size, we need to detect if a block changed from one frame to the next.\nOf course, just testing for difference (current != prev) would be again too sensitive to noise. A block can change for a variety of reasons, the first of which is the bad camera quality.\nSo we instead define a percent threshold above which we can say for sure the block actually changed. A good starting point could be 10-20%, but again you need to tweak this to your needs.\nThe higher the threshold, the less sensitive the algorithm will be.\nIn code it is calculated as\n// cast to float so the division is not truncated; assumes prevBlockValue is non-zero\nfloat delta = fabs((float) currentBlockValue - prevBlockValue) / prevBlockValue;\nwhich indicates the relative increment/decrement from the previous value.\nImage difference threshold\nNow that we can detect if a block changed from one frame to the next, we can actually detect if the image changed.\nYou could decide to trigger motion even if a single block changed, but I suggest you set a higher value here.\nLet's return to the 320x240 image example. With a 10x10 block, you'll be working with 32x24 = 768 blocks: will you call it "motion" if 1 out of 768 blocks changed value?\nI don't think so. You want something more robust. You want 50 blocks to change. Or at least 20 blocks. 
If you do the math, 20 blocks out of 768 is only about 2.6% of the image, which is hardly noticeable.\nIf you want to be robust, don't set this threshold too low. Again, tweak it through real-world experiments.\nIn code it is calculated as:\nfloat changedBlocksPercent = (float) changedBlocks / totalBlocks;\nCombining all together\nRecapping: when running the motion detection algorithm you have 3 parameters to set:\n\nthe block size\nthe block difference threshold\nthe image difference threshold\n\nLet's pick 3 sensible defaults: block size = 10, block threshold = 15%, image threshold = 20%.\nWhat do these parameters translate to in practice?\nThey mean that motion will be detected if at least 20% of the image, averaged in blocks of 10x10 pixels, changed its value by at least 15% from one frame to the next.\n\nAs you can see, you don't need high-definition images to (naively) detect if something happened to the image. Large areas of motion are easily detectable, even at very low resolution.\nReal world example\nNow the fun part. I'll show you how it performs in a real-world scenario.\nTo keep it simple, I wrote a sketch that does only motion detection, not video streaming over HTTP. \nThis means you won't be able to see the original image recorded from the camera. Nevertheless, I have kept the block size to a minimum to allow for the best quality possible.\nhttps://eloquentarduino.github.io/wp-content/uploads/2020/01/ESP32-camera-motion-detection-example.mp4\nThis is me passing my arm in front of the camera a few times.\nThe grid you see represents the actual pixels used for the computation. Each cell corresponds to one pixel of the downscaled image.\nThe orange cells highlight the pixels that the algorithm sees as "different" from one frame to the next. As you can see, some pixels are detected even if no motion is happening. 
That's the noise I talked about multiple times throughout the post.\nWhen I move my arm in the frame, you see lots of pixels become activated, so the "Motion" text appears. \nWhile moving the arm, you may notice what I call the "ghost" effect. You actually see 2 regions of motion: one is where my arm is now, which of course changed. The other is the region where my arm was in the previous frame, which returned to its original content.\nThis is why I suggest you keep the image difference threshold high: when real motion happens you will notice it for sure, because the activated region of the image is actually bigger than the moving object itself.\nDo you like the grid effect of the sample video? Let me know in the comments if you want me to share it.\nOr even better: subscribe to the newsletter and you will get it directly in your inbox with my next email.\nCheck the full project code on Github\nAlso check out the gist for the visualization tool\nThe post Motion detection with ESP32 cam only (Arduino version) first appeared on Eloquent Arduino Blog.", "date_published": "2020-01-05T12:08:08+01:00", "date_modified": "2020-06-03T13:17:09+02:00", "authors": [ { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" } ], "author": { "name": "simone", "url": "https://eloquentarduino.github.io/author/simone/", "avatar": "http://1.gravatar.com/avatar/d670eb91ca3b1135f213ffad83cb8de4?s=512&d=mm&r=g" }, "tags": [ "camera", "esp32", "Computer vision" ], "attachments": [ { "url": "https://eloquentarduino.github.io/wp-content/uploads/2020/01/ESP32-camera-motion-detection-example.mp4", "mime_type": "video/mp4", "size_in_bytes": 1673368 } ] } ] }