TinyML in Practice: Running Neural Networks on Microcontrollers
Quantization, model pruning, and the engineering reality of deploying ML on devices with half a megabyte of RAM.
The Promise and the Constraint
Machine learning on microcontrollers sounds like an oxymoron. Modern ML models are measured in millions of parameters and gigabytes of memory. An ESP32 has roughly 520KB of SRAM, with far less actually free once the RTOS and radio stacks take their share. The gap between what ML frameworks assume and what embedded hardware provides is enormous.
Yet TinyML is real, and it's powerful. Running inference at the edge means no network latency, no cloud costs, no privacy concerns from transmitting raw data, and no dependency on connectivity. For applications like agricultural monitoring, industrial anomaly detection, or wearable health devices, these properties are not nice-to-haves — they're requirements.
The Model Compression Pipeline
Getting a neural network to fit on a microcontroller is a multi-stage process:
1. Architecture Selection
Start small. Don't train a ResNet-50 and try to shrink it. Choose architectures designed for efficiency: MobileNets, EfficientNet-Lite, or custom shallow networks. For my agricultural classification task, a 3-layer convolutional network with depthwise separable convolutions gave 92% accuracy on the validation set before any compression.
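For a sense of why depthwise separable convolutions keep models small, here is a toy parameter-count comparison. The layer sizes are illustrative, not the actual network's:

```python
# Hypothetical illustration: parameter counts for a standard 3x3 convolution
# vs. a depthwise separable one at the same channel widths (biases ignored).

def conv_params(k, c_in, c_out):
    # standard convolution: one k*k*c_in kernel per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k*k filter per input channel,
    # then pointwise: a 1x1 convolution mixing channels up to c_out
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 32, 64)                   # 18,432 parameters
separable = depthwise_separable_params(3, 32, 64)   # 288 + 2,048 = 2,336
print(standard, separable, round(standard / separable, 1))  # → 18432 2336 7.9
```

The savings compound per layer, which is why MobileNet-style networks fit in hundreds of KB where a standard CNN of the same depth would not.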
2. Pruning
Structured pruning removes entire neurons or filters that contribute least to the output. I pruned 40% of filters based on L1-norm magnitude — the filters with the smallest weights were removed entirely. This reduced model size by ~35% with only a 0.8% accuracy drop.
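The ranking step can be sketched in a few lines. The shapes and random weights here are placeholders, not the real model's:

```python
# Sketch of L1-norm structured pruning: rank each filter by the sum of its
# absolute weights, keep the top 60%, and drop the weakest 40% entirely.
import random

random.seed(0)
n_filters, filter_size = 10, 27          # e.g. ten 3x3x3 kernels (hypothetical)
filters = [[random.uniform(-1, 1) for _ in range(filter_size)]
           for _ in range(n_filters)]

l1 = [sum(abs(w) for w in f) for f in filters]       # L1 norm per filter
ranked = sorted(range(n_filters), key=lambda i: l1[i], reverse=True)
keep = sorted(ranked[: int(n_filters * 0.6)])        # top 60%, original order
pruned = [filters[i] for i in keep]
print(len(pruned))                                   # → 6
```

Because whole filters are removed, downstream layers shrink too, which is what makes structured pruning (unlike unstructured weight sparsity) pay off on hardware without sparse-math support.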
3. Quantization
This is where the biggest gains happen. Post-training quantization converts 32-bit floating-point weights to 8-bit integers, shrinking weight storage by ~4x on its own (the end-to-end figures below also reflect the earlier pruning pass). INT8 operations are also significantly faster than FP32 on microcontroller-class CPUs, including the ESP32's Xtensa LX6 cores (note the ESP32 is not an ARM Cortex-M part) and especially Cortex-M chips with DSP extensions.
- Float32 model: 2.1MB → INT8 model: 180KB
- Accuracy: 92.0% → 90.8% (acceptable tradeoff)
- Inference time: reduced by ~3x on ESP32
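For intuition, here is the affine scale/zero-point scheme that per-tensor INT8 quantization applies, in plain Python with made-up weights:

```python
# Minimal sketch of affine (scale + zero-point) quantization. The weight
# values are invented for illustration; real tooling does this per tensor.

def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128..127 for INT8
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

w = [-0.42, 0.0, 0.37, 1.02, -1.1]
q, s, z = quantize(w)
w_hat = dequantize(q, s, z)
# each float32 (4 bytes) becomes one int8 (1 byte): the 4x size reduction
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # reconstruction error < scale
```

The error per weight is bounded by roughly one quantization step (the scale), which is why accuracy degrades only slightly for well-conditioned networks.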
4. TensorFlow Lite Micro Conversion
The final step converts the quantized TensorFlow model to a TF Lite FlatBuffer and generates a C array that can be compiled directly into the firmware. The TF Lite Micro runtime handles memory allocation from a pre-allocated tensor arena — you specify the arena size at compile time, and all intermediate activations are allocated within it.
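The C-array step is typically done with `xxd -i model.tflite`; this small Python equivalent shows the shape of the output (the `g_model` name is a common convention, not something TF Lite Micro requires):

```python
# Mimics `xxd -i`: turn a byte blob into a C array definition that can be
# compiled into firmware. Input bytes below are illustrative; real .tflite
# files carry the "TFL3" FlatBuffer identifier at offset 4.

def to_c_array(data: bytes, name: str = "g_model") -> str:
    rows = [
        ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        for i in range(0, len(data), 12)
    ]
    body = ",\n  ".join(rows)
    return (
        f"const unsigned char {name}[] = {{\n  {body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )

header = bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33])
print(to_c_array(header))
```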
Getting the arena size right is critical. Too small and inference fails. Too large and you don't have memory for the rest of your application (communication stack, sensor drivers, etc.). I used the TF Lite Micro memory profiling tools to find the minimum arena size: 48KB for my model.
Real-World Inference Pipeline
The inference pipeline on the ESP32 runs as follows:
- Accumulate sensor readings in a circular buffer (30s window)
- Extract features: rolling mean, variance, rate-of-change, min/max over the window
- Normalize features using pre-computed training statistics (stored in flash)
- Run TF Lite Micro inference (~45ms)
- Apply a confidence threshold (0.7) — below threshold, defer to cloud
- On high confidence, trigger local actuation and LoRa transmission to the gateway
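The pre- and post-inference steps above can be sketched as follows. The function names, sample window, normalization statistics, and dummy confidence value are mine for illustration; the firmware does the equivalent in C:

```python
# Hypothetical sketch of the window-feature and confidence-gating logic.

def window_features(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    rate = (samples[-1] - samples[0]) / (n - 1)   # rate-of-change across window
    return [mean, var, rate, min(samples), max(samples)]

def normalize(features, train_mean, train_std):
    # on-device, the training statistics would be constants baked into flash
    return [(f - m) / s for f, m, s in zip(features, train_mean, train_std)]

window = [20.1, 20.4, 20.3, 21.0, 22.8, 23.5]     # stand-in for a 30s window
feats = window_features(window)
normed = normalize(feats, [21.0, 1.0, 0.1, 20.0, 23.0],
                          [2.0, 1.5, 0.5, 2.0, 2.0])   # dummy statistics

confidence = 0.64                                  # pretend model output
decision = "local" if confidence >= 0.7 else "defer_to_cloud"
print(decision)
```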
What Surprised Me
Feature engineering mattered more than model architecture. I spent weeks trying different neural network architectures, gaining maybe 1-2% accuracy. Then I spent a day crafting better input features — adding time-domain statistics, frequency-domain features via a simple FFT, and cross-sensor correlations — and accuracy jumped 8%.
On constrained devices, the cost of computing complex features is far less than the cost of running a larger model. A few dozen floating-point operations for feature engineering vs thousands of multiply-accumulate operations for extra network layers — the math is clear.
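As an illustration of how cheap frequency-domain features can be, here is a naive DFT over a short window. A real firmware build would use an optimized (often fixed-point) FFT library instead:

```python
# Naive DFT magnitudes over a short window. O(n^2) is fine for tiny windows;
# an FFT brings it to O(n log n) on-device.
import cmath
import math

def dft_magnitudes(samples):
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))) / n
        for k in range(n // 2)          # keep the non-redundant half
    ]

# a sine completing exactly 2 cycles per window shows up as one spectral peak
window = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
mags = dft_magnitudes(window)
print(mags.index(max(mags)))  # → 2 (the peak sits in bin 2)
```

A handful of spectral magnitudes like these, concatenated with the time-domain statistics, is the kind of cross-domain feature vector that produced the accuracy jump.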