SIMD-accelerated computer vision on a $2 microcontroller

DeathArrow · 2024-06-25T07:30:41 1719300641

>For silicon that's cheaper than the average coffee, that's pretty cool.

Maybe it's not the chip that's too cheap. Maybe it's the coffee that's too expensive.

mppm · 2024-06-25T10:19:38 1719310778

OTOH, I've been waiting for disposable coffee cups with OLED-based video ads ever since Minority Report. But tech progress is just too damn slow :P

yjftsjthsd-h · 2024-06-25T14:45:05 1719326705

I dunno about OLED, but now that you say it the costs do make some sort of "smart" coffee disturbingly plausible.

TheAdamist · 2024-06-25T23:14:31 1719357271

Based on the recent post about the disposable Montreal subway tickets with a super cheap NFC chip (and amusingly on a paper ticket with a printed-on fake smart chip connection), it should be super cheap to have an automated kiosk that pairs your drink order to a paper cup, so that when a barista swipes the cup your order shows up, or a machine fills it automatically.

https://www.righto.com/2024/06/montreal-mifare-ultralight-nf... (It was linked from here but I don't have the HN link)

BizarroLand · 2024-06-28T17:11:10 1719594670

Seems like it would be cheaper to have a simple QR code be thermally printed on the cup. A fraction of a penny's worth of chemical spritz on the bottom of the cup would do the trick.

surfingdino · 2024-06-25T15:59:23 1719331163

Almost there... https://www.moveelectric.com/e-motorbikes/super-soco-aims-se...

jacoblambda · 2024-06-25T21:05:29 1719349529

I wish but tbh coffee is probably artificially cheaper than it really should be since larger corporations exploit local farms and effectively maintain local monopolies where farms have to sell to said corporations for a fraction of the price it's actually worth.

throwaway211 · 2024-06-25T10:46:29 1719312389

Drink more microcontrollers.

rldjbpin · 2024-06-26T08:09:21 1719389361

more like the labour to get one made for you.

rhelz · 2024-06-25T19:53:57 1719345237

> Maybe it's the coffee that's too expensive.

Ha, well, there is a disturbing reason why computer vision with ultra-cheap hardware is possible: countries all over the world are buying these by the billions in order to keep an eye on their citizens :-(

Big brother is enabling incredible economies of scale....

evanjrowley · 2024-06-25T04:24:13 1719289453

A comparable board is the ESP32-CAM, which is supported by this really practical computer vision project: https://github.com/jomjol/AI-on-the-edge-device?tab=readme-o...

hi-v-rocknroll · 2024-06-25T06:11:21 1719295881

In the CV department, I recently ordered a cheap FPGA + ARM Cortex-M3 + 64 Mbit SRAM + 32 Mbit flash that does camera input and HDMI output. Like a budget Zynq for CV.

https://wiki.sipeed.com/hardware/en/tang/Tang-Nano-4K/Nano-4...

https://www.aliexpress.us/item/3256806880637138.html

unwind · 2024-06-25T06:50:04 1719298204

Cool board!

Would any of the "retro" game/home computer firmwares fit in that FPGA? I find comparing capacity hard for stuff like that.

hi-v-rocknroll · 2024-06-25T07:04:56 1719299096

There's absolutely no reason ROMs have to waste scarce resources of a hybrid FPGA. Micro SD cards (called TF in China) and eMMC are the usual solutions.

Example: https://www.aliexpress.us/item/3256806498688867.html

londons_explore · 2024-06-25T07:09:58 1719299398

Yes, easily, but unless someone has done it already, 'porting' them to this board would be a lot of work.

qingcharles · 2024-06-27T05:50:52 1719467452

What an awesome amount of tech for so little money.

3abiton · 2024-06-25T22:18:51 1719353931

I wish I had the time to tinker with these bad boys

maven29 · 2024-06-25T05:34:42 1719293682

There is an ESP32-S3 version of this camera breakout board, which is presumably what OP might have used for prototyping.

The S3 variant easily justifies the slight additional cost, given that it's easily faster by an order of magnitude or greater, having SIMD and an FPU.

https://github.com/espressif/esp-dl/tree/master/examples/fac...

julius · 2024-06-25T09:29:36 1719307776

Oh wow TIL ESP32 can run TensorFlowLite. Person detection in 54ms! https://github.com/espressif/esp-tflite-micro?tab=readme-ov-...

amelius · 2024-06-25T07:42:03 1719301323

How many fps can that project do?

picture · 2024-06-25T06:25:31 1719296731

Also see this short post about SIMD on ESP32-S3, discussed previously. https://bitbanksoftware.blogspot.com/2024/01/surprise-esp32-...

dansitu · 2024-06-25T18:05:16 1719338716

If you're interested in this stuff and wanna try it yourself, check out our product, Edge Impulse:

https://edgeimpulse.com/ai-practitioners

We work directly with vendors to perform low level optimization of deep learning, computer vision, and DSP workloads for dozens of architectures of microcontrollers and CPUs, plus exotic accelerators (neuromorphic compute!) and edge GPUs. This includes ESP32:

https://docs.edgeimpulse.com/docs/edge-ai-hardware/mcu/espre...

You can upload a TensorFlow, PyTorch, or JAX model and receive an optimized C++ library direct from your notebook in a couple lines of Python. It's honestly pretty amazing.

And we also have a full Studio for training models, including architectures we've designed specifically to run well on various embedded hardware, plus hardware-aware hyperparameter optimization that will find the best model to fit your target device (in terms of latency and memory use).

RobotToaster · 2024-06-25T21:03:37 1719349417

I don't think the output from this can be used in any open source project due to the community plan restrictions, FYI.

dansitu · 2024-06-26T14:07:34 1719410854

That's definitely not our intention: the output of all Community projects is by default Apache 2.0 licensed, unless the developer specifies a different one.

The community plan does have commercial use restrictions; it's designed for education, demos, and research. We have a pretty good presence in the academic community with tons of papers, code, and projects developed using our community version.

Here's a Google Scholar search showing a bunch of papers:

https://scholar.google.com/scholar?start=0&q=%22edge+impulse...

We also have our own public sharing platform:

https://edgeimpulse.com/projects/overview

TheMagicHorsey · 2024-06-25T19:34:45 1719344085

Yo! This is awesome stuff!

dansitu · 2024-06-25T19:39:37 1719344377

Thank you! We're trying to bring embedded ML in reach of all engineering teams and domain experts.

Previously you needed a crazy mixture of ML knowledge and low-level embedded engineering skills even to get started, which is not a common occurrence!

qiqitori · 2024-06-26T00:25:11 1719361511

Why C++? Does the C++ code use any difficult C++ features or is it more C with classes?

dansitu · 2024-06-26T14:02:23 1719410543

We use C++11 because we depend on some source that relies on it, but you can of course compile the library and then link to a C program:

https://docs.edgeimpulse.com/docs/run-inference/cpp-library/...

westurner · 2024-06-25T03:35:33 1719286533

> As I've been really interested in computer vision lately, I decided on writing a SIMD-accelerated implementation of the FAST feature detector for the ESP32-S3 [...]

> In the end, I was able to improve the throughput of the FAST feature detector by about 220%, from 5.1MP/s to 11.2MP/s in my testing. This is well within the acceptable range of performance for realtime computer vision tasks, enabling the ESP32-S3 to easily process a 30fps VGA stream.
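As a sanity check on the quoted figures, assuming "MP/s" means millions of pixels per second and VGA means 640x480: a 30fps VGA stream needs about 9.2MP/s, which sits between the scalar and SIMD throughputs. The ratio 11.2/5.1 is about 2.2, so the "220%" reads most naturally as the new throughput relative to the old one (roughly a 120% increase). The names below are illustrative only.

```python
# Assumption: VGA = 640x480 and "MP/s" = 1e6 pixels per second.
VGA_WIDTH, VGA_HEIGHT, FPS = 640, 480, 30


def required_mpixels_per_s(width, height, fps):
    """Pixel rate, in MP/s, needed to keep up with a video stream."""
    return width * height * fps / 1e6


required = required_mpixels_per_s(VGA_WIDTH, VGA_HEIGHT, FPS)

# Throughput figures quoted from the article.
baseline_mps = 5.1   # scalar implementation
simd_mps = 11.2      # SIMD implementation

speedup = simd_mps / baseline_mps          # new throughput relative to old
keeps_up_scalar = baseline_mps >= required
keeps_up_simd = simd_mps >= required

print(f"required: {required:.3f} MP/s, speedup: {speedup:.2f}x")
print(f"scalar keeps up: {keeps_up_scalar}, SIMD keeps up: {keeps_up_simd}")
```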

What are some use cases for FAST?

Features from accelerated segment test: https://en.wikipedia.org/wiki/Features_from_accelerated_segm...

Is there TPU-like functionality in anything in this price range of chips yet?

Neon is an optional SIMD instruction set extension for ARMv7 and ARMv8, so the Pi 2, Pi Zero 2 W and later boards have SIMD extensions (the original Pi Zero is ARMv6 and lacks Neon)

Orin Nano has 40 TOPS, which is sufficient for Copilot+ AFAIU. "A PCIe Coral TPU Finally Works on Raspberry Pi 5" https://news.ycombinator.com/item?id=38310063

From https://phys.org/news/2024-06-infrared-visible-device-2d-mat... :

> Using this method, they were able to up-convert infrared light of wavelength around 1550 nm to 622 nm visible light. The output light wave can be detected using traditional silicon-based cameras.

> "This process is coherent—the properties of the input beam are preserved at the output. This means that if one imprints a particular pattern in the input infrared frequency, it automatically gets transferred to the new output frequency," explains Varun Raghunathan, Associate Professor in the Department of Electrical Communication Engineering (ECE) and corresponding author of the study published in Laser & Photonics Reviews.

"Show HN: PicoVGA Library – VGA/TV Display on Raspberry Pi Pico" https://news.ycombinator.com/item?id=35117847#35120403 https://news.ycombinator.com/item?id=40275530

"Designing a SIMD Algorithm from Scratch" https://news.ycombinator.com/item?id=38450374

shraiwi · 2024-06-25T04:01:22 1719288082

Thanks for reading!

> What are some use cases for FAST?

The FAST feature detector is an algorithm for finding regions of an image that are visually distinctive, which can be used as a first step in motion tracking and SLAM (simultaneous localization and mapping) algorithms typically seen in XR, robotics, etc.
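To make the idea concrete, here is a minimal sketch of the segment test at the heart of FAST, in plain Python rather than the SIMD assembly from the article. It assumes the common FAST-9 variant (a 16-pixel ring of radius 3, with at least 9 contiguous ring pixels all brighter or all darker than the centre by a threshold). The function names, the threshold of 20 and the synthetic test image are illustrative assumptions, not taken from the article.

```python
# Offsets (dx, dy) of the 16-pixel ring of radius 3, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as a ring."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:  # doubling the list handles wrap-around
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test: n contiguous ring pixels all brighter or all darker."""
    centre = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [v > centre + threshold for v in ring]
    darker = [v < centre - threshold for v in ring]
    return (longest_circular_run(brighter) >= n
            or longest_circular_run(darker) >= n)


def detect(img, threshold=20, n=9):
    """Return (x, y) of every pixel passing the test, skipping a 3-pixel border."""
    height, width = len(img), len(img[0])
    return [(x, y)
            for y in range(3, height - 3)
            for x in range(3, width - 3)
            if is_fast_corner(img, x, y, threshold, n)]


# Synthetic input: dark 16x16 image with a bright square covering x >= 6, y >= 6.
img = [[200 if x >= 6 and y >= 6 else 0 for x in range(16)] for y in range(16)]
corners = detect(img)
```

On this image the square's corner pixel at (6, 6) passes the test, while a pixel on a straight edge such as (10, 6) does not, because only 7 contiguous ring pixels differ from it. Real implementations add a quick rejection test and non-maximum suppression, both omitted here.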

> Is there TPU-like functionality in anything in this price range of chips yet?

I think that in the case of the ESP32-S3, its SIMD instructions are designed to accelerate the inference of quantized AI models (see: https://github.com/espressif/esp-dl), and also some signal processing like FFTs. I guess you could call the SIMD instructions TPU-like, in the sense that the chip has specific instructions that facilitate ML inference (EE.VRELU.Sx performs the ReLU operation). Using these instructions will still take away CPU time, whereas TPUs are typically their own processing core, operating asynchronously. I’d say this is closer to ARM NEON.
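For anyone unsure what "performs the ReLU operation" means for a vector register, here is a minimal conceptual sketch in plain Python: one ReLU applied independently to each int8 lane. It is not a model of the exact behaviour of EE.VRELU.Sx (its operands and any scaling are not reproduced here), and the function name and lane count are assumptions for illustration.

```python
INT8_MIN, INT8_MAX = -128, 127


def relu_int8_lanes(lanes):
    """Apply ReLU (max(0, v)) to every signed 8-bit lane of a vector."""
    for v in lanes:
        if not INT8_MIN <= v <= INT8_MAX:
            raise ValueError(f"{v} does not fit in a signed 8-bit lane")
    return [max(0, v) for v in lanes]


# Hypothetical 16-lane vector, as a 128-bit register of int8 values would hold.
vector = [-128, -64, -1, 0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 127, -7]
activated = relu_int8_lanes(vector)
```

A SIMD instruction does the same work for all lanes at once, which is where the speedup over a scalar loop comes from.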

westurner · 2024-06-25T22:44:11 1719355451

SimSIMD https://github.com/ashvardanian/SimSIMD :

> Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE

github.com/topics/simd: https://github.com/topics/simd

https://news.ycombinator.com/item?id=37805810#37808036

westurner · 2024-06-30T04:32:45 1719721965

From gh-topics/SIMD:

SIMDe: SIMD everywhere: https://github.com/simd-everywhere/simde :

> The SIMDe header-only library provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM. There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).

> This makes porting code to other architectures much easier in a few key ways:

implements · 2024-06-25T09:21:44 1719307304

> The FAST feature detector is an algorithm for finding regions of an image that are visually distinctive, …

Is that related to ‘Energy Function’ in any way?

(I ask because a long time ago I was involved in an Automated Numberplate Reading startup that was using an FPGA to quickly find the vehicle numberplate in an image)

ska · 2024-06-25T18:12:57 1719339177

What you are thinking of operates at a different level of abstraction. Energy functions are a general way of structuring a problem, used (sometimes abused) to apply an optimization algorithm to find a reasonable solution for it.

FAST is an algorithm for efficiently looking for "interesting" parts (basically, corners) of an image, so you can safely (in theory) ignore the rest of it. The output from a feature detector may end up contributing to an energy function later, directly or indirectly.

kylixz · 2024-06-25T04:29:44 1719289784

Interested in doing more of this type of work optimizing a SLAM/factorgraph pipeline?

Email in bio and would love to chat!

yatopifo · 2024-06-25T06:18:12 1719296292

> Is there TPU-like functionality in anything in this price range of chips yet?

Kendryte K210 supports 1x1 and 3x3 convolutions on the "TPU". It was pretty well supported in terms of software & documentation but sadly it hasn't become popular.
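For readers unfamiliar with the operation, a 3x3 convolution as used in CNN layers is a sliding multiply-accumulate, and a 1x1 convolution is the same thing with a single weight. A minimal sketch in plain Python, assuming a single channel, stride 1 and no padding. The function name and sample data are illustrative, and this says nothing about how the K210 implements it in hardware.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products.

    Like most CNN frameworks, this does not flip the kernel, so strictly
    speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[j][i]
            out[y][x] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box_3x3 = [[1, 1, 1] for _ in range(3)]  # sums each 3x3 neighbourhood
scale_1x1 = [[2]]                        # scales every pixel by 2

summed = conv2d_valid(image, box_3x3)
scaled = conv2d_valid(image, scale_1x1)
```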

These days, you can easily find cheap RV1103 ("LuckFox"), BL808 ("Ox64/Pine64") and CV1800B/SG2002 ("MilkV") based dev boards, all of which have some sort of basic TPU. Unfortunately, they are designed to be Linux boards, meaning that all TPU-related stuff is extremely abstracted with zero under-the-hood documentation. So it's absolutely unclear whether their TPUs are real or faked with clever code optimizations.

koerakoonlane · 2024-06-25T08:27:44 1719304064

> These days, you can easily find cheap RV1103 ("LuckFox"), BL808 ("Ox64/Pine64") and CV1800B/SG2002 ("MilkV") based dev boards, all of which have some sort of basic TPU. Unfortunately, they are designed to be Linux boards, meaning that all TPU-related stuff is extremely abstracted with zero under-the-hood documentation. So it's absolutely unclear whether their TPUs are real or faked with clever code optimizations.

They all have TPU in hardware, my team has been verifying and benchmarking them. Documentation is only available for the high-level C APIs to the libraries that a programmer is expected to use, and even that tends to be extremely lacking.

rldjbpin · 2024-06-26T08:12:12 1719389532

tinyml fascinates me because its principles can be directly applied to web-based applications imho.

micropython seems pretty accessible from first glance. would it be easy to create a webassembly port of its code?

jononor · 2024-06-26T12:26:01 1719404761

MicroPython port for WASM/JS already exists. For device usecases, one can play with it at https://micropython.org/unicorn/

It can be used in PyScript for client side development. And has JavaScript/DOM bridge as well. https://pyscript.net/tech-preview/micropython/about.html

robxorb · 2024-06-25T19:20:40 1719343240

I wonder how hard it would be, presumably with some trade-off with detection windows, to use a few of these in parallel and process higher resolutions and frame rates?

ladyanita22 · 2024-06-25T10:34:59 1719311699

Could anyone with experience of Rust on ESP32 controllers chime in on whether this is feasible in Rust as well?

Qwuke · 2024-06-25T14:51:34 1719327094

Compared to ESP8266, there's generally pretty good ESP32 support for Rust, but you'll likely need to use your C++ toolchain if you want to use the standard library. no-std in Rust for ESP32 isn't terrible in my experience, though, just not as fleshed out, particularly for hooking into components like wifi/networking and probably a camera as well.

Like the other commenter said, there's plenty of support for SIMD and asm in Rust.

You might ask around on a Rust embedded or Rust ESP32 chatroom before making the dive.

the__alchemist · 2024-06-25T15:05:57 1719327957

You can actually use the IDF system in Rust to use the std lib, at least on ESP32-C3. Probably others too.

If you are on Windows, you will need to place the project folder at the top level drive directory, and there are other quirks as well, but it works.

Qwuke · 2024-06-27T12:11:59 1719490319

Haha yes, I misphrased a bit but that's what I meant when I said you'll need to use C++ to use the stdlib. It's not quite pure embedded Rust but yes it does work.

f_devd · 2024-06-25T13:23:44 1719321824

It is possible; it mainly depends on LLVM/clang support, as inline asm in Rust is very easy to do

hoseja · 2024-06-26T11:19:40 1719400780

Am I reading wrong or is the penultimate part just basic two's complement?

sylware · 2024-06-25T11:34:10 1719315250

Yep, SIMD seems to win the race vs SMT for that type of processing.

hajile · 2024-06-26T17:18:33 1719422313

I don't think SIMD and SMT are an either/or proposition. SMT-4 or SMT-8 with a bunch of SIMD has the potential to get better perf/area due to the threads hiding the latency.

restricted_ptr · 2024-06-25T02:52:57 1719283977

I wonder if ESP32 has VLIW slots and a tighter instruction packaging is possible?

duskwuff · 2024-06-25T04:42:44 1719290564

Neither Xtensa nor RISC-V are VLIW architectures.

restricted_ptr · 2024-06-25T06:33:15 1719297195

Xtensa architecture is flexible and extendable by the user. Ability to define new instructions, hw features and VLIW configurations are some of the key features. You can find more details on the internet https://en.m.wikipedia.org/wiki/Tensilica

jki275 · 2024-06-26T02:01:41 1719367301

I don't think that applies to the ESP32 family of devices. I've never heard of DSP hardware onboard them.

I think the comment you're referring to is talking about the architecture in general, but not the silicon we're discussing here.

restricted_ptr · 2024-06-26T03:24:01 1719372241

ESP32 ee.* operations in assembly look pretty much like aliases for VLIW bundles, issuing loads used in the next op on the same cycle while also doing multiplication on other operands. This is not a minimal Xtensa. They might not have the Tensilica toolchain for redistribution to use these features freely but apparently they exposed these extensions in their assembler in some form.

thrtythreeforty · 2024-06-25T16:51:14 1719334274

Generally speaking, this is not correct. Base Xtensa is not VLIW, but Xtensa's various vector extensions do allow VLIW instructions, collectively called "FLIX."

It is doubtful that ESP32's Xtensa is VLIW-capable, though. Presumably their compiler would emit FLIX instructions if it were.

rurban · 2024-06-25T06:48:26 1719298106

We prefer something more expensive and better: https://up-board.org/upsquared/specifications/

Intel UpSquared

rowanG077 · 2024-06-25T09:37:04 1719308224

More expensive sure. But better is pretty rich considering it is Intel. My money is on this platform just evaporating in the next 5 years. Esp32 has proven you can rely on supply and longevity.

c0balt · 2024-06-26T02:17:06 1719368226

Arguably the UP^2 is another class of device. Up to 8 GB of RAM and up to 128 GB of storage + a whole x86 CPU with dual gigabit LAN.

And the price, size and power consumption are also quite a bit higher but it will certainly grant a better general compute environment, if you want to run Linux or smth.

rurban · 2024-06-26T14:06:02 1719410762

It's very easy to use any pytorch or tensorflow packages, or open3d, pcl, librealsense or similar vision packages. Powerful enough to do realtime vision tasks, which you certainly cannot do with 2€ boards.