TinyML in Practice: Running Neural Networks on Microcontrollers

Quantization, model pruning, and the engineering reality of deploying ML on devices with 512KB of RAM.

The Promise and the Constraint

Machine learning on microcontrollers sounds like an oxymoron. Modern ML models are measured in millions of parameters and gigabytes of memory. An ESP32 has 512KB of SRAM. The gap between what ML frameworks assume and what embedded hardware provides is enormous.

Yet TinyML is real, and it's powerful. Running inference at the edge means no network latency, no cloud costs, no privacy concerns from transmitting raw data, and no dependency on connectivity. For applications like agricultural monitoring, industrial anomaly detection, or wearable health devices, these properties are not nice-to-haves — they're requirements.

The Model Compression Pipeline

Getting a neural network to fit on a microcontroller is a multi-stage process:

1. Architecture Selection

Start small. Don't train a ResNet-50 and try to shrink it. Choose architectures designed for efficiency: MobileNets, EfficientNet-Lite, or custom shallow networks. For my agricultural classification task, a 3-layer convolutional network with depthwise separable convolutions gave 92% accuracy on the validation set before any compression.
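The efficiency argument is easy to see in the arithmetic. A rough multiply-accumulate (MAC) count for a single 3x3 layer, with illustrative dimensions that are not from the project, shows why depthwise separable convolutions are the usual choice:

    // MACs for one 3x3 conv over a 32x32 feature map, 16 channels in, 32 out.
    // Illustrative sizes, not the project's actual layer dimensions.
    constexpr long long kH = 32, kW = 32, kCin = 16, kCout = 32, kK = 3;

    // Standard convolution: every output channel looks at every input channel.
    constexpr long long standard_macs = kK * kK * kCin * kCout * kH * kW;  // 4,718,592

    // Depthwise separable: a per-channel 3x3 filter, then a 1x1 pointwise mix.
    constexpr long long depthwise_macs = kK * kK * kCin * kH * kW          // 147,456
                                       + kCin * kCout * kH * kW;           // 524,288
    // Total 671,744 MACs, roughly 7x fewer than the standard convolution.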

2. Pruning

Structured pruning removes entire neurons or filters that contribute least to the output. I pruned 40% of filters based on L1-norm magnitude — the filters with the smallest weights were removed entirely. This reduced model size by ~35% with only a 0.8% accuracy drop.
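The ranking criterion is simple enough to sketch. In practice this step runs inside the training framework, but the idea, shown here in C++ with an assumed row-major weight layout, is just to sum absolute weights per filter and keep the strongest:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Return the indices of the filters to keep, ranked by L1 norm.
    // weights: num_filters x weights_per_filter, row-major (an assumption here).
    std::vector<size_t> FiltersToKeep(const std::vector<float>& weights,
                                      size_t num_filters,
                                      size_t weights_per_filter,
                                      float prune_fraction) {
      // L1 norm of each filter: the sum of absolute weight values.
      std::vector<float> l1(num_filters, 0.0f);
      for (size_t f = 0; f < num_filters; ++f) {
        for (size_t i = 0; i < weights_per_filter; ++i) {
          l1[f] += std::fabs(weights[f * weights_per_filter + i]);
        }
      }
      // Sort filter indices by descending norm, keep the top (1 - prune_fraction).
      std::vector<size_t> order(num_filters);
      std::iota(order.begin(), order.end(), 0);
      std::sort(order.begin(), order.end(),
                [&](size_t a, size_t b) { return l1[a] > l1[b]; });
      const size_t keep = num_filters -
                          static_cast<size_t>(num_filters * prune_fraction);
      order.resize(keep);
      return order;
    }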

3. Quantization

This is where the biggest gains happen. Post-training quantization converts 32-bit floating-point weights to 8-bit integers. The model size drops by ~4x. On microcontroller-class cores, whether an ARM Cortex-M or the Xtensa LX6 inside the ESP32, INT8 operations are also significantly faster than FP32.

Practical tip: Use representative calibration data during quantization. The quality of your calibration dataset directly affects how well the quantized model preserves accuracy. I used 500 representative samples from each class — enough to capture the distribution without overfitting the quantization parameters.
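The calibration data matters because it fixes the scale and zero point for each tensor: TF Lite's INT8 scheme maps real values to 8-bit integers as real = (q - zero_point) * scale, and those two parameters come from the value ranges observed on the calibration set. A small sketch of the mapping (illustrative helper code, not converter internals):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Affine INT8 quantization: real = (q - zero_point) * scale.
    struct QuantParams {
      float scale;
      int32_t zero_point;
    };

    // Derive parameters from the min/max seen in the calibration data.
    QuantParams ComputeParams(float real_min, float real_max) {
      // Extend the range so that zero is representable.
      real_min = std::min(real_min, 0.0f);
      real_max = std::max(real_max, 0.0f);
      float scale = (real_max - real_min) / 255.0f;  // int8 covers 256 values
      if (scale <= 0.0f) scale = 1.0f;               // guard for a degenerate range
      const int32_t zero_point =
          static_cast<int32_t>(std::round(-128.0f - real_min / scale));
      return {scale, zero_point};
    }

    int8_t Quantize(float x, QuantParams p) {
      const int32_t q =
          p.zero_point + static_cast<int32_t>(std::round(x / p.scale));
      return static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));
    }

    float Dequantize(int8_t q, QuantParams p) {
      return static_cast<float>(q - p.zero_point) * p.scale;
    }

If the calibration samples don't cover the real input distribution, the observed min/max (and therefore the scale) are wrong, and values outside the calibrated range simply saturate at -128 or 127.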

4. TensorFlow Lite Micro Conversion

The final step converts the quantized TensorFlow model to a TF Lite FlatBuffer and generates a C array that can be compiled directly into the firmware. The TF Lite Micro runtime handles memory allocation from a pre-allocated tensor arena — you specify the arena size at compile time, and all intermediate activations are allocated within it.
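The generated source is unremarkable: the FlatBuffer bytes as a constant array plus a length, compiled into flash like any other data. Something like the following (symbol names and exact bytes are illustrative, in the common xxd -i style):

    // model_data.cc -- generated from the quantized .tflite file.
    #include <cstdint>

    // Keep alignment generous so the FlatBuffer can be read in place from flash.
    alignas(16) const unsigned char g_model_data[] = {
        0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,  // FlatBuffer header, "TFL3"
        // ... the rest of the model bytes ...
    };
    const unsigned int g_model_data_len = sizeof(g_model_data);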

Getting the arena size right is critical. Too small and inference fails. Too large and you don't have memory for the rest of your application (communication stack, sensor drivers, etc.). I used the TF Lite Micro memory profiling tools to find the minimum arena size: 48KB for my model.
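In code, the arena is just a statically allocated buffer handed to the interpreter, and a failed AllocateTensors() is the symptom of an arena that is too small. A sketch of the setup, assuming the g_model_data array above (the op list is specific to my model, and constructor arguments shift slightly between TF Lite Micro releases):

    #include <cstddef>
    #include <cstdint>

    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
    #include "tensorflow/lite/schema/schema_generated.h"

    extern const unsigned char g_model_data[];

    // 48KB arena found via memory profiling; sized per model, not per chip.
    constexpr size_t kArenaSize = 48 * 1024;
    alignas(16) static uint8_t tensor_arena[kArenaSize];

    // Register only the ops the model uses; this also keeps flash usage down.
    static tflite::MicroMutableOpResolver<4> resolver;

    tflite::MicroInterpreter* SetUpInterpreter() {
      const tflite::Model* model = tflite::GetModel(g_model_data);

      resolver.AddConv2D();
      resolver.AddDepthwiseConv2D();
      resolver.AddFullyConnected();
      resolver.AddSoftmax();

      static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                                  kArenaSize);
      if (interpreter.AllocateTensors() != kTfLiteOk) {
        return nullptr;  // arena too small, or an op missing from the resolver
      }
      return &interpreter;
    }

Once AllocateTensors() succeeds, recent TFLM versions can report how much of the arena was actually consumed via the interpreter's arena_used_bytes(), which makes trimming the size straightforward.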

Real-World Inference Pipeline

The inference pipeline on the ESP32 runs as follows:
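In outline it is the standard TF Lite Micro loop: read a window of sensor samples, compute the hand-crafted features, quantize them into the model's INT8 input tensor, invoke the interpreter, and read back the class scores. A sketch of the model-facing half (feature extraction is application code and omitted; Classify is an illustrative name, not part of the TFLM API):

    #include <cmath>
    #include <cstdint>

    #include "tensorflow/lite/micro/micro_interpreter.h"

    // Run one inference over already-computed features; returns the winning
    // class index, or -1 on failure.
    int Classify(tflite::MicroInterpreter& interpreter,
                 const float* features, int num_features) {
      // Quantize the float features using the scale/zero-point baked into the
      // model's input tensor by the converter.
      TfLiteTensor* input = interpreter.input(0);
      for (int i = 0; i < num_features; ++i) {
        int32_t q = input->params.zero_point +
                    static_cast<int32_t>(std::lround(features[i] / input->params.scale));
        if (q < -128) q = -128;
        if (q > 127) q = 127;
        input->data.int8[i] = static_cast<int8_t>(q);
      }

      if (interpreter.Invoke() != kTfLiteOk) {
        return -1;
      }

      // Pick the highest-scoring class from the INT8 output tensor.
      TfLiteTensor* output = interpreter.output(0);
      const int num_classes = output->dims->data[output->dims->size - 1];
      int best = 0;
      for (int i = 1; i < num_classes; ++i) {
        if (output->data.int8[i] > output->data.int8[best]) best = i;
      }
      return best;
    }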

What Surprised Me

Feature engineering mattered more than model architecture. I spent weeks trying different neural network architectures, gaining maybe 1-2% accuracy. Then I spent a day crafting better input features — adding time-domain statistics, frequency-domain features via a simple FFT, and cross-sensor correlations — and accuracy jumped 8%.

On constrained devices, the cost of computing complex features is far less than the cost of running a larger model. A few dozen floating-point operations for feature engineering vs thousands of multiply-accumulate operations for extra network layers — the math is clear.
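To make the comparison concrete, here is roughly what the time-domain half of such a feature stage looks like. The exact feature set in the project was different, but the cost profile is the same: a handful of operations per sample, computed once per window.

    #include <cmath>
    #include <cstddef>

    // A few time-domain features over one sensor window (assumes n >= 1).
    // Cheap relative to the MACs an extra convolutional layer would add.
    struct TimeDomainFeatures {
      float mean;
      float rms;
      float peak_to_peak;
      float zero_crossings;
    };

    TimeDomainFeatures ExtractFeatures(const float* window, size_t n) {
      TimeDomainFeatures f{};
      float sum = 0.0f, sum_sq = 0.0f;
      float min_v = window[0], max_v = window[0];
      for (size_t i = 0; i < n; ++i) {
        sum += window[i];
        sum_sq += window[i] * window[i];
        if (window[i] < min_v) min_v = window[i];
        if (window[i] > max_v) max_v = window[i];
        if (i > 0 && (window[i] >= 0.0f) != (window[i - 1] >= 0.0f)) {
          f.zero_crossings += 1.0f;
        }
      }
      f.mean = sum / static_cast<float>(n);
      f.rms = std::sqrt(sum_sq / static_cast<float>(n));
      f.peak_to_peak = max_v - min_v;
      return f;
    }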

Takeaway: TinyML is not about making big models small. It's about making small models smart. Invest in data quality, feature engineering, and problem framing before reaching for architectural complexity. The constraint of limited hardware forces a discipline that often produces better engineering.