> ESP32-S31 is particularly well suited for edge AI and machine learning workloa...

mattalex · 2026-06-03T20:53:57 1780520037

Regarding Depth Anything specifically: you're not running this on a microcontroller. In general, CNNs still reign supreme on microcontrollers because their peak memory demand is much lower, and peak memory is what usually kills you. In this case you have a couple of _kilobytes_ of SRAM, potentially extendable to a couple of megabytes of PSRAM.

Even for small CNNs you often need to do some quite complex interleaving of layers (i.e. running parts of layer 1 and parts of layer 2 alternately, to take advantage of the downsampling in CNNs) to keep performance and memory impact reasonable (see e.g. https://openreview.net/pdf?id=2O8qbyxH6X).
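To make the interleaving idea concrete, here is a toy sketch (not the method from the linked paper). It assumes a made-up "layer" that just sums non-overlapping pairs, so both `downsample` and the tile size of 4 are hypothetical. The point is only that both schedules give the same output while the interleaved one keeps far fewer intermediate values alive.

```python
def downsample(values):
    """Toy 'layer': sum each non-overlapping pair (kernel 2, stride 2)."""
    return [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]


def layer_by_layer(signal):
    """Run layer 1 over the whole input, then layer 2 over the whole result."""
    hidden = downsample(signal)  # full intermediate buffer is alive here
    return downsample(hidden), len(hidden)


def interleaved(signal):
    """Run layer 1 and layer 2 tile by tile, 4 input values at a time."""
    out = []
    peak_hidden = 0
    for start in range(0, len(signal) - 3, 4):
        hidden = downsample(signal[start:start + 4])  # only 2 values alive
        peak_hidden = max(peak_hidden, len(hidden))
        out.extend(downsample(hidden))
    return out, peak_hidden


signal = list(range(16))
full_out, full_hidden = layer_by_layer(signal)
tiled_out, tiled_hidden = interleaved(signal)
print(full_out == tiled_out, full_hidden, tiled_hidden)
```

Real convolutions have overlapping receptive fields, so tiles need to share or recompute border values, which is where the complexity comes from.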

Think more "image classifier", less "run an image-to-image transformer". For Depth Anything, a single layer's activation is probably significantly larger than the available SRAM (I think it is (224/16)^2 patches, each with activations [48, 96, 192, 384] for Depth Anything Small, so you aren't running this).
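A rough back-of-the-envelope version of that arithmetic, taking the numbers above at face value. The assumptions are mine: one value per channel per patch, 4 bytes for fp32 and 1 byte for int8, and nothing counted for attention matrices or MLP expansion, so treat it as a lower bound.

```python
# Numbers quoted above; everything else is an assumption for illustration.
IMAGE_SIDE = 224
PATCH_SIDE = 16
CHANNELS = [48, 96, 192, 384]

patches = (IMAGE_SIDE // PATCH_SIDE) ** 2  # (224/16)^2


def activation_bytes(channels, bytes_per_value):
    """Size of one patches-by-channels activation tensor, in bytes."""
    return patches * channels * bytes_per_value


sizes_fp32 = {c: activation_bytes(c, 4) for c in CHANNELS}
sizes_int8 = {c: activation_bytes(c, 1) for c in CHANNELS}

print(f"{patches} patches")
for c in CHANNELS:
    print(f"{c:>3} channels: {sizes_fp32[c] / 1024:7.1f} KiB fp32, "
          f"{sizes_int8[c] / 1024:6.1f} KiB int8")
```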

otterdude · 2026-06-03T17:53:49 1780509229

I was wondering this as well. What exactly makes this a good AI chip vs. others?

Unless they're not listing a major feature in their spec, a dual-core 320 MHz microcontroller is not bad, but you're not going to be running any kind of vision model on it, at least not very fast.

kcb · 2026-06-03T22:18:18 1780525098

A real example: https://github.com/OHF-Voice/micro-wake-word

porridgeraisin · 2026-06-03T19:50:00 1780516200

Memory is the main constraint. You have, what, 8 MB of PSRAM?

Compute-wise you can manage. You can quantise and run a small 10-15 layer CNN, perhaps. Image classification is possible. Keep in mind that the channel count and input resolution cannot be high, since memory will be a problem. You can maybe do face _detection_, and maybe "is my cat on my keyboard" classification as well.
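To show why resolution and channel count bite, here is a rough sketch of peak activation memory for a made-up 10-layer CNN. The layer list, the "same" padding, and the int8 default are all my assumptions, and it only counts one layer's input plus output, ignoring weights and scratch buffers.

```python
# Hypothetical small CNN: (stride, output_channels) per conv layer.
LAYERS = [(2, 8), (1, 8), (2, 16), (1, 16), (2, 32),
          (1, 32), (2, 64), (1, 64), (2, 128), (1, 128)]


def peak_activation_bytes(side, in_channels, layers, bytes_per_value=1):
    """Largest input-plus-output activation footprint over all layers."""
    peak = 0
    channels = in_channels
    for stride, out_channels in layers:
        out_side = -(-side // stride)  # ceiling division, assumes "same" padding
        in_bytes = side * side * channels * bytes_per_value
        out_bytes = out_side * out_side * out_channels * bytes_per_value
        peak = max(peak, in_bytes + out_bytes)
        side, channels = out_side, out_channels
    return peak


for side in (96, 224):
    int8_kib = peak_activation_bytes(side, 3, LAYERS) / 1024
    fp32_kib = peak_activation_bytes(side, 3, LAYERS, bytes_per_value=4) / 1024
    print(f"{side}x{side} RGB: {int8_kib:.1f} KiB int8, {fp32_kib:.1f} KiB fp32")
```

In this sketch the peak sits at the first layer, where the feature maps are still large, so input resolution dominates.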

With audio, you can do a lot more. Wake word detection happens on _much_ smaller accelerators inside iPhones. On this one you can do slightly heavier classification: maybe speaker identification ("which member of the family") or "which dog is barking".

asadm · 2026-06-03T20:29:48 1780518588

nope. not happening. at most YOLO or mayyybe FastDepth