The pitch for edge ML used to be a hand-wave: lower latency, better privacy, less bandwidth. The pitch is now a spec sheet. An ESP32-S3 with vector instructions and 8 MB of PSRAM can run a quantized convolutional model in tens of milliseconds while sipping single-digit milliamps on average. That changes what you can put in a battery-powered sensor, a yacht cabin panel, or a hotel access fob. This article is a working engineer’s guide to deploying TensorFlow Lite Micro on ESP32-S3, with two real examples – keyword spotting and accelerometer anomaly detection – and benchmarks you can compare against.
Four constraints push inference to the edge. Latency, because round-tripping audio or vibration data to a cloud model adds 200-2000 ms that you cannot have for closed-loop control. Privacy, because the cleanest way to comply with GDPR or hospitality data norms is to never move the raw signal. Bandwidth, because a fleet of ten thousand vibration sensors streaming raw IMU data at 1 kHz will saturate any reasonable backhaul. And offline operation, because boats, factories, and remote installations cannot assume connectivity.
The counter-pressure has always been compute. ESP32 classic could do toy models. ESP32-S3 changed the math. The S3 added a vector instruction set (PIE – Processor Instruction Extension) that accelerates the multiply-accumulate operations at the heart of every neural net, and supports up to 8 MB of octal PSRAM for model and tensor arena storage. We now routinely deploy 200-500 kB models that would have been laughable on the original ESP32, and we run them fast enough that the radio is the dominant power draw, not the compute.
The dual cores matter. We typically pin TFLite Micro inference to core 1 and leave core 0 for networking, sensor sampling, and the OS. That separation prevents inference jitter from blowing your TLS handshakes or your MQTT keepalives. If you need a refresher on choosing between ESP32 family members and STM32, our ESP32 vs STM32 guide covers the tradeoffs in depth.
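A minimal sketch of that split on ESP-IDF, assuming the inference work lives in its own FreeRTOS task; the task name, stack size, priority, and the queue plumbing implied by the comment are illustrative:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// Runs all TFLite Micro work on core 1; core 0 keeps WiFi, MQTT, and sampling.
static void inference_task(void* arg) {
  for (;;) {
    // ... pull a feature window from a queue, call Invoke(), post the result ...
    vTaskDelay(pdMS_TO_TICKS(200));  // placeholder pacing at the 200 ms hop
  }
}

void start_inference_task(void) {
  xTaskCreatePinnedToCore(inference_task, "tflm_infer",
                          8 * 1024,    // stack depth in bytes on ESP-IDF
                          nullptr,     // no task argument
                          5,           // mid priority
                          nullptr,     // handle not kept
                          1);          // pin to core 1
}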
There are three serious options for ESP32-class targets, and they are not mutually exclusive – we have shipped products that use Edge Impulse for development and TFLite Micro for production deployment. With TFLite Micro you embed the .tflite model, link the runtime, register your op resolver, and call Invoke(): the most control, the most boilerplate, and the smallest footprint. It is what we use in production for performance-critical paths. For new projects with internal ML capacity, we default to TFLite Micro; for teams without dedicated ML engineers, Edge Impulse pays for itself. The deployment artifact in either case is a static library you link into your firmware build, the same way you would link any other component in your connected devices project.
Three numbers govern what you can ship on an ESP32-S3 with PSRAM: flash for the model, RAM for the tensor arena, and milliseconds per inference.
The tensor arena is the variable that surprises people; call MicroInterpreter::arena_used_bytes() after the first Invoke() to size it precisely. Allocate the arena in PSRAM, not internal SRAM, unless inference time is dominated by memory bandwidth and you have measured it. PSRAM access is slower than SRAM but vastly cheaper in terms of available capacity. The penalty is real but predictable, and it almost always loses to running out of internal RAM at the worst possible moment.
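A quick way to land on the right number, sketched against the keyword-spotting setup shown later in this article; kArenaSize here is a deliberate over-estimate you trim after measuring:

// Over-provision the arena in PSRAM first, measure, then shrink it.
static constexpr size_t kArenaSize = 128 * 1024;
static uint8_t* arena =
    (uint8_t*)heap_caps_malloc(kArenaSize, MALLOC_CAP_SPIRAM);
// ... build the op resolver and MicroInterpreter ("interp") as shown below ...
interp.AllocateTensors();
ESP_LOGI(TAG, "arena used: %u of %u bytes",
         (unsigned)interp.arena_used_bytes(), (unsigned)kArenaSize);
// Set kArenaSize to the reported value plus a small safety margin.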
Float32 models are not deployable on this class of hardware. INT8 quantization, done with representative data, typically costs 1-3 percentage points of accuracy in exchange for 4x smaller models and 2-4x faster inference. We use TensorFlow’s post-training quantization with a representative dataset of 100-500 samples drawn from the actual deployment distribution.
# Python: post-training INT8 quantization with representative data
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# calibration_samples: 100-500 arrays drawn from the deployment distribution.
# Each yielded sample must match the model's input shape, batch dim included.
converter.representative_dataset = lambda: (
    [sample.astype(np.float32)] for sample in calibration_samples
)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
INT4 quantization is real and shipping in 2026, but operator coverage on TFLite Micro is still partial. We use INT4 selectively for very large models where the size win is worth the engineering cost. For the keyword spotter and anomaly detector below, INT8 is the right answer.
Wake-word detection is the canonical TinyML demo because it exercises the full pipeline – audio capture, feature extraction, neural network, debouncing – in a footprint that fits everywhere. Our reference architecture for a hotel-room voice control puck follows that pipeline: capture microphone audio, compute MFCC features over a sliding window with a 200 ms hop, classify each window with an INT8 convolutional network, and debounce detections before acting on them.
The inference call itself is straightforward once the model and ops are registered:
// Setup, called once
static constexpr int kArenaSize = 96 * 1024;
static uint8_t *arena = (uint8_t*)heap_caps_malloc(kArenaSize, MALLOC_CAP_SPIRAM);
const tflite::Model* model = tflite::GetModel(g_kws_model);
static tflite::MicroMutableOpResolver<8> resolver;
resolver.AddDepthwiseConv2D();
resolver.AddConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddReshape();
resolver.AddAveragePool2D();
resolver.AddQuantize();
resolver.AddDequantize();
static tflite::MicroInterpreter interp(model, resolver, arena, kArenaSize);
interp.AllocateTensors();
TfLiteTensor* input = interp.input(0);
TfLiteTensor* output = interp.output(0);
// Hot loop, called every 200 ms
void run_inference(const int8_t* mfcc) {
  memcpy(input->data.int8, mfcc, input->bytes);
  int64_t t0 = esp_timer_get_time();
  if (interp.Invoke() != kTfLiteOk) return;
  int64_t t1 = esp_timer_get_time();
  ESP_LOGD(TAG, "infer %lld us", t1 - t0);
  int8_t score = output->data.int8[KEYWORD_INDEX];
  if (score > kThresholdInt8) on_keyword_detected();
}
Measured on an ESP32-S3 at 240 MHz with PSRAM at 80 MHz: inference takes 18-22 ms per window, MFCC extraction another 4-6 ms, leaving more than 170 ms of slack per hop for radio, app logic, and OS overhead. Average current is dominated by the microphone and amplifier, not the compute.
For predictive maintenance on a yacht’s water pump or a hotel’s chiller, the canonical model is a 1D convolutional autoencoder trained on healthy vibration signatures. Anomalies present as elevated reconstruction error. The deployment is similar to keyword spotting but smaller and faster.
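A minimal sketch of the scoring step, assuming a second interpreter (ae_interp, with tensors ae_input and ae_output) set up the same way as the keyword-spotting example above; the function name and kAnomalyThreshold are illustrative:

// Mean squared reconstruction error between the dequantized input window
// and the autoencoder's output. Higher error means less healthy vibration.
float reconstruction_error(const int8_t* window) {
  memcpy(ae_input->data.int8, window, ae_input->bytes);
  if (ae_interp.Invoke() != kTfLiteOk) return -1.0f;  // signal failure to caller

  const float in_scale  = ae_input->params.scale;
  const int   in_zp     = ae_input->params.zero_point;
  const float out_scale = ae_output->params.scale;
  const int   out_zp    = ae_output->params.zero_point;

  float err = 0.0f;
  const size_t n = ae_input->bytes;
  for (size_t i = 0; i < n; ++i) {
    const float x  = in_scale  * (window[i] - in_zp);
    const float xr = out_scale * (ae_output->data.int8[i] - out_zp);
    err += (x - xr) * (x - xr);
  }
  return err / n;
}

// In the sampling loop:
// if (reconstruction_error(window) > kAnomalyThreshold) report_anomaly(window);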
Inference comes in around 6-8 ms per window. The whole detector – sampling, normalization, inference, threshold logic – uses under 5% of one core, leaving headroom for the device to also publish raw telemetry to the cloud for retraining. Detected anomalies trigger a richer payload: the reconstruction error trace, the raw window, and a model-version stamp so the cloud team can retroactively assess true positives. That feedback loop is what turns a one-shot deployment into a system that improves; it is part of why we treat edge computing as inseparable from cloud strategy rather than a substitute for it.
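The record itself can stay small; a sketch of the fields described above, with names and sizes that are illustrative rather than a wire format:

static constexpr size_t kWindowSamples = 256;  // placeholder; match the model input

// One record per detection, queued for publication alongside raw telemetry.
struct AnomalyReport {
  char    model_version[16];            // stamp from the model's metadata
  float   reconstruction_error;         // the value that tripped the threshold
  float   threshold;                    // threshold in force at detection time
  int64_t timestamp_us;                 // esp_timer_get_time() at detection
  int8_t  raw_window[kWindowSamples];   // the window that fired, for retraining
};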
Shipping a TinyML model is not a one-time act. It is a versioned artifact like any other. Our standard workflow: export model_int8.tflite together with a metadata JSON describing input shape, classes, and threshold, then embed the model with xxd -i or the ESP-IDF model embedding flow.
Treating the model as data, not code, is the unlock for fast iteration. We push new model partitions on a weekly cadence in some deployments, while the underlying firmware moves on its own quarterly cycle. This separation also lets the data science team ship without waiting on a firmware release train.
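A sketch of the model-as-data idea on the device side, assuming a dedicated data partition labeled "model" in the partition table; an OTA push of that partition then updates the model without touching application firmware:

#include "esp_heap_caps.h"
#include "esp_partition.h"

// Reads the model blob out of its own partition into PSRAM at boot.
// Returns nullptr on any failure so the caller can fall back to a built-in model.
const uint8_t* load_model_blob(void) {
  const esp_partition_t* part = esp_partition_find_first(
      ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_ANY, "model");
  if (part == nullptr) return nullptr;

  uint8_t* blob = (uint8_t*)heap_caps_malloc(part->size, MALLOC_CAP_SPIRAM);
  if (blob == nullptr) return nullptr;

  if (esp_partition_read(part, 0, blob, part->size) != ESP_OK) {
    heap_caps_free(blob);
    return nullptr;
  }
  return blob;  // pass to tflite::GetModel()
}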
Our reference benchmarks on ESP32-S3 at 240 MHz, PSRAM at 80 MHz, INT8 models, single core, are the numbers quoted above: 18-22 ms per keyword-spotting window and 6-8 ms per vibration-anomaly window.
If your model takes substantially longer than these numbers suggest, the usual culprits are unsupported ops falling back to reference implementations, an arena placed in slow memory, or a model topology that defeats the vector unit (lots of tiny ops, channel counts not aligned to 8 or 16). Profile before you optimize, and always verify which ops are actually being accelerated.
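One way to verify is TFLite Micro's MicroProfiler; the exact constructor arguments and logging calls vary a little between TFLM releases, so treat this as a starting point rather than a drop-in recipe:

#include "tensorflow/lite/micro/micro_profiler.h"

// Attach a profiler to the interpreter and dump per-op timings once.
static tflite::MicroProfiler profiler;
static tflite::MicroInterpreter interp(model, resolver, arena, kArenaSize,
                                       nullptr,      // no resource variables
                                       &profiler);
interp.AllocateTensors();
interp.Invoke();
profiler.LogCsv();  // one row per op: tag plus ticks, slow ops stand out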
Edge ML is only useful if the device still meets its battery target. The good news: with WiFi off and the radio gated by inference results, ESP32-S3 spends most of its time in light sleep around 0.8 mA, wakes for a microphone window, runs MFCC plus inference in around 25 ms at roughly 35 mA, and falls back asleep. For a keyword spotter sampling continuously at a 200 ms hop, average current lands near 12-18 mA depending on duty, dominated by the always-on microphone path. A 2000 mAh cell will run roughly five days to a week on continuous listening, longer if you wake on acoustic energy first.
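The back-of-envelope model behind those numbers, with the always-on microphone and amplifier current as an assumed figure; measure on the production PCB before trusting it:

constexpr float kSleepCurrent_mA  = 0.8f;   // light sleep, radio off
constexpr float kActiveCurrent_mA = 35.0f;  // MFCC + inference burst
constexpr float kActiveWindow_ms  = 25.0f;
constexpr float kHop_ms           = 200.0f;
constexpr float kMicPath_mA       = 10.0f;  // assumed always-on mic + amp

constexpr float duty   = kActiveWindow_ms / kHop_ms;    // 12.5% duty cycle
constexpr float avg_mA = kMicPath_mA
                       + kSleepCurrent_mA * (1.0f - duty)
                       + kActiveCurrent_mA * duty;      // ~15 mA average
constexpr float hours  = 2000.0f / avg_mA;              // ~130 h, about 5 days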
The accelerometer detector is friendlier still. We typically wake on a motion threshold, run a 320 ms window, and sleep until the next scheduled check. Average current drops to tens of microamps, and a CR123A primary cell carries the device through a multi-year deployment. Always measure with a real coulomb counter on the production PCB; simulation is a starting point, not an answer.
Edge inference does not eliminate the cloud. You still need centralized training, model versioning, drift detection, and a feedback loop that captures the cases where the edge model was wrong. The cloud also remains the right place for any reasoning that needs cross-device context – one device sees one machine, the cloud sees the fleet. Our broader AI platform work assumes hybrid by default and pushes inference to whichever side of the wire makes sense per use case.
If you are sizing up a TinyML deployment – whether it is a yacht cabin sensor, a hotel access device, or an industrial vibration node – we build these end-to-end on ESP32-S3 and adjacent silicon. Start with our connected devices service or browse the broader IoT capabilities we offer.