The pitch for edge ML used to be a hand-wave: lower latency, better privacy, less bandwidth. The pitch is now a spec sheet. An ESP32-S3 with vector instructions and 8 MB of PSRAM can run a quantized convolutional model in tens of milliseconds while sipping single-digit milliamps on average. That changes what you can put in a battery-powered sensor, a yacht cabin panel, or a hotel access fob. This article is a working engineer’s guide to deploying TensorFlow Lite Micro on ESP32-S3, with two real examples – keyword spotting and accelerometer anomaly detection – and benchmarks you can compare against.
Four constraints push inference to the edge. Latency, because round-tripping audio or vibration data to a cloud model adds 200-2000 ms that you cannot have for closed-loop control. Privacy, because the cleanest way to comply with GDPR or hospitality data norms is to never move the raw signal. Bandwidth, because a fleet of ten thousand vibration sensors streaming raw IMU data at 1 kHz will saturate any reasonable backhaul. And offline operation, because boats, factories, and remote installations cannot assume connectivity.
The counter-pressure has always been compute. ESP32 classic could do toy models. ESP32-S3 changed the math. The S3 added a vector instruction set (PIE – Processor Instruction Extension) that accelerates the multiply-accumulate operations at the heart of every neural net, and supports up to 8 MB of octal PSRAM for model and tensor arena storage. We now routinely deploy 200-500 kB models that would have been laughable on the original ESP32, and we run them fast enough that the radio is the dominant power draw, not the compute.
The dual cores matter. We typically pin TFLite Micro inference to core 1 and leave core 0 for networking, sensor sampling, and the OS. That separation prevents inference jitter from blowing your TLS handshakes or your MQTT keepalives. If you need a refresher on choosing between ESP32 family members and STM32, our ESP32 vs STM32 guide covers the tradeoffs in depth.
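A minimal sketch of that split on ESP-IDF, assuming the inference work lives in its own FreeRTOS task; the task name, stack size, priority, and the queue plumbing implied by the comment are illustrative:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// Runs all TFLite Micro work on core 1; core 0 keeps WiFi, MQTT, and sampling.
static void inference_task(void* arg) {
  for (;;) {
    // ... pull a feature window from a queue, call Invoke(), post the result ...
    vTaskDelay(pdMS_TO_TICKS(200));  // placeholder pacing at the 200 ms hop
  }
}

void start_inference_task(void) {
  xTaskCreatePinnedToCore(inference_task, "tflm_infer",
                          8 * 1024,    // stack depth in bytes on ESP-IDF
                          nullptr,     // no task argument
                          5,           // mid priority
                          nullptr,     // handle not kept
                          1);          // pin to core 1
}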
There are three serious options for ESP32-class targets, and they are not mutually exclusive – we have shipped products that use Edge Impulse for development and TFLite Micro for production deployment. With TFLite Micro you embed the .tflite model, link the runtime, register your op resolver, and call Invoke(): the most control, the most boilerplate, and the smallest footprint. It is what we use in production for performance-critical paths. For new projects with internal ML capacity, we default to TFLite Micro; for teams without dedicated ML engineers, Edge Impulse pays for itself. The deployment artifact in either case is a static library you link into your firmware build, the same way you would link any other component in your connected devices project.
Three numbers govern what you can ship on an ESP32-S3 with PSRAM: flash for the model, RAM for the tensor arena, and milliseconds per inference.
The tensor arena is the variable that surprises people; call MicroInterpreter::arena_used_bytes() after the first Invoke() to size it precisely. Allocate the arena in PSRAM, not internal SRAM, unless inference time is dominated by memory bandwidth and you have measured it. PSRAM access is slower than SRAM but vastly cheaper in terms of available capacity. The penalty is real but predictable, and it almost always loses to running out of internal RAM at the worst possible moment.
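A quick way to land on the right number, sketched against the keyword-spotting setup shown later in this article; kArenaSize here is a deliberate over-estimate you trim after measuring:

// Over-provision the arena in PSRAM first, measure, then shrink it.
static constexpr size_t kArenaSize = 128 * 1024;
static uint8_t* arena =
    (uint8_t*)heap_caps_malloc(kArenaSize, MALLOC_CAP_SPIRAM);
// ... build the op resolver and MicroInterpreter ("interp") as shown below ...
interp.AllocateTensors();
ESP_LOGI(TAG, "arena used: %u of %u bytes",
         (unsigned)interp.arena_used_bytes(), (unsigned)kArenaSize);
// Set kArenaSize to the reported value plus a small safety margin.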
Float32 models are not deployable on this class of hardware. INT8 quantization, done with representative data, typically costs 1-3 percentage points of accuracy in exchange for 4x smaller models and 2-4x faster inference. We use TensorFlow’s post-training quantization with a representative dataset of 100-500 samples drawn from the actual deployment distribution.
# Python: post-training INT8 quantization with representative data
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# calibration_samples: 100-500 arrays drawn from the deployment distribution.
# Each yielded sample must match the model's input shape, batch dim included.
converter.representative_dataset = lambda: (
    [sample.astype(np.float32)] for sample in calibration_samples
)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
INT4 quantization is real and shipping in 2026, but operator coverage on TFLite Micro is still partial. We use INT4 selectively for very large models where the size win is worth the engineering cost. For the keyword spotter and anomaly detector below, INT8 is the right answer.
Wake-word detection is the canonical TinyML demo because it exercises the full pipeline – audio capture, feature extraction, neural network, debouncing – in a footprint that fits everywhere. Our reference architecture for a hotel-room voice control puck follows that pipeline: capture microphone audio, compute MFCC features over a sliding window with a 200 ms hop, classify each window with an INT8 convolutional network, and debounce detections before acting on them.
The inference call itself is straightforward once the model and ops are registered:
// Setup, called once
static constexpr int kArenaSize = 96 * 1024;
static uint8_t *arena = (uint8_t*)heap_caps_malloc(kArenaSize, MALLOC_CAP_SPIRAM);
const tflite::Model* model = tflite::GetModel(g_kws_model);
static tflite::MicroMutableOpResolver<8> resolver;
resolver.AddDepthwiseConv2D();
resolver.AddConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddReshape();
resolver.AddAveragePool2D();
resolver.AddQuantize();
resolver.AddDequantize();
static tflite::MicroInterpreter interp(model, resolver, arena, kArenaSize);
interp.AllocateTensors();
TfLiteTensor* input = interp.input(0);
TfLiteTensor* output = interp.output(0);
// Hot loop, called every 200 ms
void run_inference(const int8_t* mfcc) {
  memcpy(input->data.int8, mfcc, input->bytes);
  int64_t t0 = esp_timer_get_time();
  if (interp.Invoke() != kTfLiteOk) return;
  int64_t t1 = esp_timer_get_time();
  ESP_LOGD(TAG, "infer %lld us", t1 - t0);
  int8_t score = output->data.int8[KEYWORD_INDEX];
  if (score > kThresholdInt8) on_keyword_detected();
}
Measured on an ESP32-S3 at 240 MHz with PSRAM at 80 MHz: inference takes 18-22 ms per window, MFCC extraction another 4-6 ms, leaving more than 170 ms of slack per hop for radio, app logic, and OS overhead. Average current is dominated by the microphone and amplifier, not the compute.
For predictive maintenance on a yacht’s water pump or a hotel’s chiller, the canonical model is a 1D convolutional autoencoder trained on healthy vibration signatures. Anomalies present as elevated reconstruction error. The deployment is similar to keyword spotting but smaller and faster.
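A minimal sketch of the scoring step, assuming a second interpreter (ae_interp, with tensors ae_input and ae_output) set up the same way as the keyword-spotting example above; the function name and kAnomalyThreshold are illustrative:

// Mean squared reconstruction error between the dequantized input window
// and the autoencoder's output. Higher error means less healthy vibration.
float reconstruction_error(const int8_t* window) {
  memcpy(ae_input->data.int8, window, ae_input->bytes);
  if (ae_interp.Invoke() != kTfLiteOk) return -1.0f;  // signal failure to caller

  const float in_scale  = ae_input->params.scale;
  const int   in_zp     = ae_input->params.zero_point;
  const float out_scale = ae_output->params.scale;
  const int   out_zp    = ae_output->params.zero_point;

  float err = 0.0f;
  const size_t n = ae_input->bytes;
  for (size_t i = 0; i < n; ++i) {
    const float x  = in_scale  * (window[i] - in_zp);
    const float xr = out_scale * (ae_output->data.int8[i] - out_zp);
    err += (x - xr) * (x - xr);
  }
  return err / n;
}

// In the sampling loop:
// if (reconstruction_error(window) > kAnomalyThreshold) report_anomaly(window);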
Inference comes in around 6-8 ms per window. The whole detector – sampling, normalization, inference, threshold logic – uses under 5% of one core, leaving headroom for the device to also publish raw telemetry to the cloud for retraining. Detected anomalies trigger a richer payload: the reconstruction error trace, the raw window, and a model-version stamp so the cloud team can retroactively assess true positives. That feedback loop is what turns a one-shot deployment into a system that improves; it is part of why we treat edge computing as inseparable from cloud strategy rather than a substitute for it.
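The record itself can stay small; a sketch of the fields described above, with names and sizes that are illustrative rather than a wire format:

static constexpr size_t kWindowSamples = 256;  // placeholder; match the model input

// One record per detection, queued for publication alongside raw telemetry.
struct AnomalyReport {
  char    model_version[16];            // stamp from the model's metadata
  float   reconstruction_error;         // the value that tripped the threshold
  float   threshold;                    // threshold in force at detection time
  int64_t timestamp_us;                 // esp_timer_get_time() at detection
  int8_t  raw_window[kWindowSamples];   // the window that fired, for retraining
};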
Shipping a TinyML model is not a one-time act. It is a versioned artifact like any other. Our standard workflow: export model_int8.tflite together with a metadata JSON describing input shape, classes, and threshold, then embed the model with xxd -i or the ESP-IDF model embedding flow.
Treating the model as data, not code, is the unlock for fast iteration. We push new model partitions on a weekly cadence in some deployments, while the underlying firmware moves on its own quarterly cycle. This separation also lets the data science team ship without waiting on a firmware release train.
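A sketch of the model-as-data idea on the device side, assuming a dedicated data partition labeled "model" in the partition table; an OTA push of that partition then updates the model without touching application firmware:

#include "esp_heap_caps.h"
#include "esp_partition.h"

// Reads the model blob out of its own partition into PSRAM at boot.
// Returns nullptr on any failure so the caller can fall back to a built-in model.
const uint8_t* load_model_blob(void) {
  const esp_partition_t* part = esp_partition_find_first(
      ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_ANY, "model");
  if (part == nullptr) return nullptr;

  uint8_t* blob = (uint8_t*)heap_caps_malloc(part->size, MALLOC_CAP_SPIRAM);
  if (blob == nullptr) return nullptr;

  if (esp_partition_read(part, 0, blob, part->size) != ESP_OK) {
    heap_caps_free(blob);
    return nullptr;
  }
  return blob;  // pass to tflite::GetModel()
}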
Our reference benchmarks on ESP32-S3 at 240 MHz, PSRAM at 80 MHz, INT8 models, single core, are the numbers quoted above: 18-22 ms per keyword-spotting window and 6-8 ms per vibration-anomaly window.
If your model takes substantially longer than these numbers suggest, the usual culprits are unsupported ops falling back to reference implementations, an arena placed in slow memory, or a model topology that defeats the vector unit (lots of tiny ops, channel counts not aligned to 8 or 16). Profile before you optimize, and always verify which ops are actually being accelerated.
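One way to verify is TFLite Micro's MicroProfiler; the exact constructor arguments and logging calls vary a little between TFLM releases, so treat this as a starting point rather than a drop-in recipe:

#include "tensorflow/lite/micro/micro_profiler.h"

// Attach a profiler to the interpreter and dump per-op timings once.
static tflite::MicroProfiler profiler;
static tflite::MicroInterpreter interp(model, resolver, arena, kArenaSize,
                                       nullptr,      // no resource variables
                                       &profiler);
interp.AllocateTensors();
interp.Invoke();
profiler.LogCsv();  // one row per op: tag plus ticks, slow ops stand out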
Edge ML is only useful if the device still meets its battery target. The good news: with WiFi off and the radio gated by inference results, ESP32-S3 spends most of its time in light sleep around 0.8 mA, wakes for a microphone window, runs MFCC plus inference in around 25 ms at roughly 35 mA, and falls back asleep. For a keyword spotter sampling continuously at a 200 ms hop, average current lands near 12-18 mA depending on duty, dominated by the always-on microphone path. A 2000 mAh cell will run roughly five days to a week on continuous listening, longer if you wake on acoustic energy first.
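The back-of-envelope model behind those numbers, with the always-on microphone and amplifier current as an assumed figure; measure on the production PCB before trusting it:

constexpr float kSleepCurrent_mA  = 0.8f;   // light sleep, radio off
constexpr float kActiveCurrent_mA = 35.0f;  // MFCC + inference burst
constexpr float kActiveWindow_ms  = 25.0f;
constexpr float kHop_ms           = 200.0f;
constexpr float kMicPath_mA       = 10.0f;  // assumed always-on mic + amp

constexpr float duty   = kActiveWindow_ms / kHop_ms;    // 12.5% duty cycle
constexpr float avg_mA = kMicPath_mA
                       + kSleepCurrent_mA * (1.0f - duty)
                       + kActiveCurrent_mA * duty;      // ~15 mA average
constexpr float hours  = 2000.0f / avg_mA;              // ~130 h, about 5 days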
The accelerometer detector is friendlier still. We typically wake on a motion threshold, run a 320 ms window, and sleep until the next scheduled check. Average current drops to tens of microamps, and a CR123A primary cell carries the device through a multi-year deployment. Always measure with a real coulomb counter on the production PCB; simulation is a starting point, not an answer.
Edge inference does not eliminate the cloud. You still need centralized training, model versioning, drift detection, and a feedback loop that captures the cases where the edge model was wrong. The cloud also remains the right place for any reasoning that needs cross-device context – one device sees one machine, the cloud sees the fleet. Our broader AI platform work assumes hybrid by default and pushes inference to whichever side of the wire makes sense per use case.
If you are sizing up a TinyML deployment – whether it is a yacht cabin sensor, a hotel access device, or an industrial vibration node – we build these end-to-end on ESP32-S3 and adjacent silicon. Start with our connected devices service or browse the broader IoT capabilities we offer.