This project aims to create a comprehensive, explainable benchmarking framework for AI model inference across multiple embedded hardware platforms (Cortex-M, DSP, NPU). Unlike traditional benchmarking that reports only raw latency numbers, this approach provides hardware-aware explainability—attributing every performance gain or loss to specific hardware behaviors like cache misses, memory bandwidth saturation, or frequency scaling.
Key Innovation: Create an energy-cycle attribution model that maps performance to measurable hardware events, enabling hardware selection guidelines for TinyML applications.
- Multi-Hardware Comparison: Benchmark the same AI model on diverse embedded platforms
- Explainable Metrics: Attribute performance to hardware-specific bottlenecks (cache, memory, compute units)
- Energy Efficiency Analysis: Measure and decompose energy consumption per inference
- Optimization Validation: Quantify the impact of model optimizations with hardware evidence
- Decision Guidelines: Generate hardware selection criteria based on application constraints
Unlike existing benchmarks (e.g., MLPerf Tiny), this project focuses on why performance differs, not just how much. Each benchmark result includes:
- Root cause analysis (e.g., "40% latency increase due to L1 cache misses")
- Hardware utilization profiles (compute vs memory-bound classification)
- Energy attribution (which operations consume which % of total energy)
- Actionable optimization recommendations
Level 4: Application Metrics (Universal across all hardware)
├─ Inference latency (ms/sample)
├─ Throughput (samples/second)
├─ Energy per inference (mJ/inference)
└─ Memory footprint (Flash + RAM in KB)
Level 3: System Metrics (Platform-specific)
├─ Clock frequency behavior (static vs dynamic scaling)
├─ Power state transitions (active → idle)
├─ DMA utilization %
└─ Bus contention events
Level 2: Microarchitecture Metrics (Architecture-specific)
├─ CPU: Pipeline stalls, cache hit rates, branch mispredictions, IPC
├─ DSP: MAC unit utilization, SIMD efficiency
└─ NPU: Tensor unit occupancy, data movement overhead
Level 1: Physical Metrics (Hardware-specific)
├─ Actual power consumption via shunt resistor/INA226
├─ Thermal behavior under sustained load
└─ Memory access patterns via hardware trace
Justification: This hierarchical structure allows:
- Universal comparison at Level 4 (compare any hardware)
- Architecture insights at Levels 2-3 (understand WHY differences exist)
- Reproducibility by documenting all levels
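As a minimal sketch (field names are illustrative, not part of any existing framework), one record per benchmark run can carry all four levels:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRecord:
    # Level 4: application metrics (universal)
    latency_ms: float
    energy_mj: float
    flash_kb: int
    ram_kb: int
    # Level 3: system metrics (platform-specific)
    clock_mhz: float = 0.0
    dma_util_pct: float = 0.0
    # Level 2: microarchitecture metrics (architecture-specific)
    counters: dict = field(default_factory=dict)  # e.g., {'l1_miss_rate': 0.12, 'ipc': 0.8}
    # Level 1: physical metrics (hardware-specific)
    avg_power_mw: float = 0.0
    peak_temp_c: float = 0.0
```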
Traditional benchmarks report: "Model runs in 50ms on Device A, 120ms on Device B"
Problem: No insight into optimization opportunities or hardware bottlenecks.
Our approach: "Model runs in 120ms on Device B because:
- 45% of time in memory-bound operations (L1 cache miss rate: 12%)
- 30% in compute operations (MAC utilization: 65% due to small batch size)
- 15% in data movement (DMA overhead)
- 10% other (kernel launch, synchronization)"
This enables targeted optimization.
| Hardware Type | Critical Counters | Why These Matter |
|---|---|---|
| Cortex-M CPU | Cycles, Instructions, Cache Misses (L1/L2), Branch Mispredictions | TinyML models are often memory-bound due to limited cache. Branch prediction matters for control flow in quantization. |
| DSP | MAC Operations, SIMD Lane Utilization, Memory Stalls | DSPs excel at vectorized operations; underutilization indicates poor SIMD mapping. |
| NPU | Tensor Unit Occupancy, HBM/SRAM Transactions, Idle Cycles | NPUs are specialized; low occupancy means model doesn't map well to hardware. |
| Universal | Power (mW), Temperature (°C), Frequency (MHz) | Energy efficiency and thermal throttling affect all platforms. |
Justification: These counters directly map to optimization strategies:
- High cache misses → Improve data locality or tile size
- Low MAC utilization → Batch operations or use SIMD intrinsics
- High memory bandwidth → Reduce precision (FP32 → INT8)
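A minimal sketch of this counter-to-strategy mapping; the thresholds (10% miss rate, 50% MAC utilization, 80% bandwidth utilization) are illustrative assumptions, not measured values:

```python
# Map counter symptoms to the optimization strategies listed above.
def suggest_optimization(counters: dict) -> str:
    if counters.get('cache_miss_rate', 0.0) > 0.10:
        return "Improve data locality or adjust tile size"
    if counters.get('mac_utilization', 1.0) < 0.50:
        return "Batch operations or use SIMD intrinsics"
    if counters.get('mem_bw_utilization', 0.0) > 0.80:
        return "Reduce precision (FP32 -> INT8)"
    return "No dominant bottleneck detected"
```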
Critical for Embedded Systems:
- Cold start: First inference after device wake-up (includes cache population, clock ramp-up)
- Warm inference: Steady-state after multiple inferences
Why measure both:
- Battery-powered IoT devices often sleep between inferences → cold start matters
- Server-like workloads → warm inference matters
Implementation:
```c
// Warm-up phase (discard first 10 inferences)
for (int i = 0; i < 10; i++) {
    model_invoke();
}

// Measurement phase
for (int i = 0; i < 100; i++) {
    uint32_t start = DWT->CYCCNT;
    model_invoke();
    latencies[i] = DWT->CYCCNT - start;
}

// Report p50, p95, p99 (not just mean)
```

- Sample size: 100+ inferences per configuration
- Outlier handling: Report percentiles (p50, p95, p99) instead of mean
- Reproducibility: Fixed CPU affinity, locked frequencies, documented environment
Justification: Embedded systems have:
- OS scheduling jitter
- Thermal throttling
- Background interrupt activity
Single measurements are meaningless; statistical distributions reveal true behavior.
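For example, reporting percentiles over the collected samples is a one-liner with NumPy (latency values below are illustrative):

```python
import numpy as np

# latencies_ms: per-inference latencies from the measurement phase (100+ samples)
latencies_ms = np.array([12.1, 12.3, 12.2, 15.8, 12.4])  # illustrative values
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.2f} ms, p95={p95:.2f} ms, p99={p99:.2f} ms")
```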
Method 1: Shunt Resistor + ADC (Most Accurate)
```
VDD ──[0.1Ω Shunt]─── Device
        │
        └─→ ADC (measure voltage drop)

Current = V_drop / R_shunt
Power = Current × V_supply
Energy = ∫ Power dt
```
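A worked example of Method 1, assuming a 0.1 Ω shunt, a 3.3 V supply, and ADC samples of the voltage drop taken at 1 kHz (all values illustrative):

```python
R_SHUNT = 0.1   # ohms
V_SUPPLY = 3.3  # volts
DT = 1e-3       # sampling period (s) at 1 kHz

v_drop = [0.008, 0.012, 0.011, 0.009]                 # measured shunt drops (V)
power_w = [(v / R_SHUNT) * V_SUPPLY for v in v_drop]  # I = V/R, P = I × V_supply
energy_mj = sum(p * DT for p in power_w) * 1000       # rectangle-rule integration
print(f"Energy over window: {energy_mj:.3f} mJ")
```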
Method 2: INA226 I²C Sensor (Easiest)
- 16-bit ADC, ±40mV range
- I²C interface for logging
- µW resolution
Method 3: Oscilloscope + Current Probe (Transient Analysis)
- See instantaneous power spikes
- Identify which layers consume most energy
Justification:
- Embedded AI is energy-constrained (battery life)
- Energy per inference is more important than latency in many IoT applications
- Energy attribution (which layer uses what energy) guides optimization
import numpy as np

def attribute_energy_to_layers(model, power_trace):
    """
    Maps power consumption to specific model layers.
    power_trace: (timestamps, power_watts) arrays from the power monitor.
    """
    timestamps, power = power_trace
    total_energy = np.trapz(power, timestamps) * 1000  # total mJ over the trace
    layer_energy = {}
    for layer in model.layers:
        # Integrate power over this layer's execution window
        window = (timestamps >= layer.start_timestamp) & (timestamps <= layer.end_timestamp)
        energy_mj = np.trapz(power[window], timestamps[window]) * 1000
        layer_energy[layer.name] = {
            'energy_mj': energy_mj,
            'percentage': energy_mj / total_energy * 100
        }
    return layer_energy

This reveals insights like:
- "Conv2D layers consume 70% of total energy"
- "Activation functions negligible (<5%)"
- "Data movement between layers costs 20%"
- Paper: "MLPerf Tiny Benchmark" (arXiv:2106.07597)
- GitHub: https://github.com/mlcommons/tiny
- Why Critical:
- Industry-standard methodology for reproducible embedded ML benchmarks
- Defines energy measurement protocols (sampling rate, duration)
- Shows how to handle measurement noise and outliers
- Key Takeaways:
- Warm-up strategy: 10 inferences before measurement
- Energy: Sample at ≥1kHz for 10 seconds
- Report p50/p90/p99 latencies, not mean
- Paper: "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning" (OSDI 2018)
- Why Critical:
- TVM optimizes both graph-level (operator fusion) and operator-level (kernel tuning)
- AutoTVM searches hardware-specific optimizations automatically
- Understanding TVM's optimization space helps explain performance gaps
- Key Takeaways:
- Graph-level: Operator fusion reduces memory transactions
- Operator-level: Tile size tuning for cache efficiency
- Your project can show: "TVM reduced L3 misses by 40% via operator fusion"
- Paper: https://arxiv.org/abs/2007.10319 (MIT Han Lab)
- Why Critical:
- State-of-the-art in memory-efficient model design
- Introduces activation memory analysis (often the bottleneck)
- TinyNAS: Hardware-aware neural architecture search
- Key Takeaways:
- Peak memory = max(weights, activations)
- Layer-wise memory profiling reveals bottlenecks
- Input resolution affects memory more than model depth
- Link: https://developer.arm.com/documentation/102589/latest/
- Why Critical:
- Official guidance on Cortex-M ML extensions (DSP, FPU, MVE)
- Cache behavior for typical NN workloads
- SIMD intrinsics for optimization
- Key Takeaways:
- Cortex-M7: 4-64KB L1 cache (document your target's config)
- DSP instructions: 2x-4x speedup for MAC operations
- Memory alignment matters (16-byte for SIMD)
- GitHub: https://github.com/ARM-software/CMSIS-NN
- Paper: "CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs"
- Why Critical:
- Reference implementation of optimized operators
- Shows how to use DSP/SIMD instructions effectively
- Your benchmarks should compare against CMSIS-NN as baseline
- Key Takeaways:
- Im2col-based convolution for cache efficiency
- INT8 quantization with ARM SSAT/USAT instructions
- Fast sigmoid via lookup table
- Why Useful:
- Documents common pitfalls (no warm-up, insufficient samples, thermal throttling)
- Proposes measurement variance reporting
- Apply to Your Project: Use their checklist for reproducibility
- GitHub: https://github.com/STMicroelectronics/stm32ai-modelzoo
- Why Useful:
- Real-world examples of deployed models
- Shows how STM32 benchmarks are structured
- Pre-optimized models for reference
- Use Case: Compare your custom benchmark against their reported numbers for validation
- Blog: https://blog.openvino.ai/blog-posts/openvino-toolkit-for-arm-platforms-overview
- Why Useful:
- Intel's perspective on ARM optimization
- Graph optimization techniques applicable beyond OpenVINO
- Note: Primarily for Cortex-A, but principles apply to Cortex-M
- Paper: https://dl.acm.org/doi/fullHtml/10.1145/3520142
- Why MUST READ (per your notes):
- Very technical deep dive into MCU inference optimization
- Sparse operator scheduling
- Memory-aware execution planning
- Expected Insight: Your project can validate Optimus's claims on real hardware
- Why Useful:
- Visualizes compute-bound vs memory-bound operations
- Helps classify where your model sits on the spectrum
- Application: Create roofline plots for each hardware platform
- Why Useful:
- Techniques to reduce peak memory usage
- Operator recomputation vs storage trade-offs
Objective: Understand model behavior on reference hardware (your CPU)
Tasks:
- Generate execution flamegraph

  ```bash
  # Using py-spy
  py-spy record -o flamegraph.svg --native -- python inference.py
  ```

  ```python
  # Using PyTorch profiler
  import torch
  from torch.profiler import ProfilerActivity

  with torch.profiler.profile(
          activities=[ProfilerActivity.CPU],
          with_stack=True) as prof:
      model(inputs)
  prof.export_chrome_trace("trace.json")
  ```
- Create operator-level breakdown

  | Operator | Time (ms) | % Total | Memory (KB) | FLOPs | Arithmetic Intensity |
  |---|---|---|---|---|---|
  | Conv2D_1 | 15.2 | 30% | 120 | 2.3M | 19.2 (compute-bound) |
  | ReLU | 0.5 | 1% | 0 | 0 | N/A |
  | ... | ... | ... | ... | ... | ... |

- Cross-reference with TVM optimization opportunities
- Which operators have low arithmetic intensity? (memory-bound candidates)
- Which operators could benefit from fusion?
Deliverable: Annotated flamegraph + operator CSV + optimization hypothesis document
Tasks:
- Collect CPU performance counters

  ```bash
  perf stat -e cycles,instructions,cache-references,cache-misses,\
  L1-dcache-loads,L1-dcache-load-misses,\
  LLC-loads,LLC-load-misses,\
  branches,branch-misses \
  python inference.py
  ```
- Calculate derived metrics

  ```
  IPC = instructions / cycles
  Cache Miss Rate = cache-misses / cache-references
  Branch Prediction Accuracy = 1 - (branch-misses / branches)
  Memory Bandwidth = LLC-loads × 64 bytes / time
  ```

- Classify bottleneck

  ```python
  def classify_bottleneck(counters):
      ipc = counters['instructions'] / counters['cycles']
      cache_miss_rate = counters['cache_misses'] / counters['cache_references']
      if cache_miss_rate > 0.10:
          return "Memory-bound (cache thrashing)"
      elif ipc < 0.5:
          return "Memory-bound (bandwidth)"
      elif ipc > 2.0:
          return "Compute-bound (good!)"
      else:
          return "Mixed bottleneck"
  ```
Deliverable: Baseline hardware counter report with bottleneck classification
Tasks:
- Read MLPerf Tiny methodology (focus on Section 3: Measurement)
- Read MCUNet paper (focus on Section 3.2: Memory Analysis)
- Skim CMSIS-NN repository (understand operator implementations)
Deliverable: 3-page synthesis document answering:
- What is the standard warm-up strategy?
- How should energy be measured?
- What are common memory bottlenecks in TinyML?
Tasks:
- Install toolchain
  - STM32CubeIDE
  - STM32CubeMX (for code generation)
  - ARM GCC compiler

- Test cycle counting

  ```c
  // Enable DWT cycle counter
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
  DWT->CYCCNT = 0;
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

  // Measure cycles
  uint32_t start = DWT->CYCCNT;
  // ... code to measure ...
  uint32_t cycles = DWT->CYCCNT - start;
  float ms = cycles / (SystemCoreClock / 1000.0f);
  ```

- Convert model to TensorFlow Lite for Microcontrollers

  ```python
  import tensorflow as tf

  # Convert to TFLite
  converter = tf.lite.TFLiteConverter.from_keras_model(model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  converter.target_spec.supported_types = [tf.int8]
  tflite_model = converter.convert()

  # Save for deployment
  with open('model.tflite', 'wb') as f:
      f.write(tflite_model)
  ```
Deliverable: Working "Hello World" inference on STM32 with cycle counting
Tasks:
- Enable all available counters

  ```c
  // Enable PMU for cache counters (Cortex-M7)
  PMU->CTRL |= PMU_CTRL_ENABLE_Msk;
  PMU->CNTENSET = (1 << 0) | (1 << 1);  // Enable counters 0,1
  PMU->EVTYPER0 = 0x03;  // L1 D-cache miss
  PMU->EVTYPER1 = 0x04;  // L1 D-cache access
  ```

- Implement per-layer timing

  ```c
  typedef struct {
      const char* name;
      uint32_t cycles;
      uint32_t cache_misses;
      uint32_t energy_uj;  // microjoules
  } LayerProfile;

  LayerProfile profiles[MAX_LAYERS];

  for (int i = 0; i < num_layers; i++) {
      uint32_t start_cycles = DWT->CYCCNT;
      uint32_t start_misses = PMU->EVCNTR0;

      // Invoke layer
      tflite_invoke_layer(i);

      profiles[i].cycles = DWT->CYCCNT - start_cycles;
      profiles[i].cache_misses = PMU->EVCNTR0 - start_misses;
  }
  ```

- Measure energy with INA226

  ```python
  import time
  from ina226 import INA226

  ina = INA226(address=0x40)
  ina.configure(avg=128, voltage_conv_time=1.1)

  samples = []
  start_time = time.time()

  # Sample at ~1 kHz during inference
  while time.time() - start_time < 1.0:
      samples.append({
          'voltage': ina.voltage(),
          'current': ina.current(),
          'power': ina.power(),
          'timestamp': time.time()
      })
      time.sleep(0.001)

  # Each sample covers ~1 ms, so summing power (W) over 1 ms steps gives mJ:
  # E [mJ] = Σ P [W] × 0.001 s × 1000
  energy_mj = sum(s['power'] for s in samples) * 1
  ```
Deliverable: Complete STM32 benchmark report with:
- Latency (cold start + warm)
- Energy per inference + per-layer breakdown
- Cache behavior analysis
- Memory usage (Flash: code + weights, RAM: activations + scratch)
Core Algorithm:
class PerformanceAttribution:
    def __init__(self, counters):
        self.counters = counters

    def attribute_latency(self):
        """
        Decompose total latency into attributable causes
        """
        total_cycles = self.counters['cycles']
        # Estimate cycles lost to cache misses
        # Assumption: L1 miss = 10 cycles, L3 miss = 100 cycles
        cache_miss_cycles = (
            self.counters['L1_miss'] * 10 +
            self.counters['L3_miss'] * 100
        )
        # Estimate cycles lost to branch misprediction
        # Assumption: misprediction = 15 cycle penalty
        branch_miss_cycles = self.counters['branch_miss'] * 15
        # Remaining cycles = useful compute
        compute_cycles = total_cycles - cache_miss_cycles - branch_miss_cycles
        attribution = {
            'cache_overhead': cache_miss_cycles / total_cycles,
            'branch_overhead': branch_miss_cycles / total_cycles,
            'compute': compute_cycles / total_cycles,
        }
        attribution['explanation'] = self._generate_explanation(attribution)
        return attribution

    def _generate_explanation(self, attribution):
        # Takes the attribution dict as a parameter (calling attribute_latency()
        # here would recurse infinitely)
        if attribution['cache_overhead'] > 0.3:
            return "Memory-bound: 30%+ cycles lost to cache misses. " \
                   "Recommendation: Reduce working set size or improve locality."
        elif attribution['compute'] > 0.7:
            return "Compute-bound: Model is well-optimized for this hardware."
        else:
            return "Mixed bottleneck: Consider operator fusion and quantization."

Visualization Suite:
import matplotlib.pyplot as plt
def create_waterfall_chart(layer_profiles):
"""
Show cumulative latency contribution per layer
"""
layers = [p['name'] for p in layer_profiles]
latencies = [p['latency_ms'] for p in layer_profiles]
fig, ax = plt.subplots(figsize=(12, 6))
ax.barh(layers, latencies, color=['red' if p['bottleneck'] == 'memory'
else 'blue' for p in layer_profiles])
ax.set_xlabel('Latency (ms)')
ax.set_title('Per-Layer Latency Contribution')
plt.tight_layout()
plt.savefig('waterfall.png')
def create_energy_pie_chart(layer_profiles):
"""
Show energy distribution across layers
"""
labels = [p['name'] for p in layer_profiles]
energies = [p['energy_mj'] for p in layer_profiles]
fig, ax = plt.subplots()
ax.pie(energies, labels=labels, autopct='%1.1f%%')
ax.set_title('Energy Distribution by Layer')
    plt.savefig('energy_pie.png')

Deliverable: Automated report generator that produces:
- Attribution analysis (text summary)
- Waterfall latency chart
- Energy pie chart
- Hardware utilization timeline
Hypothesis: TVM's graph-level optimizations will reduce memory traffic
Experiment Design:
# Baseline: Standard TFLite model
baseline_counters = benchmark(tflite_model)
# TVM-optimized model
import tvm
from tvm import relay
# Load model and compile with TVM
mod, params = relay.frontend.from_tensorflow(model)
# AutoTVM tuning for target hardware
with tvm.autotvm.apply_history_best('tuning_log.json'):
with tvm.transform.PassContext(opt_level=3):
lib = relay.build(mod, target='llvm -mcpu=cortex-m7', params=params)
# Re-benchmark
tvm_counters = benchmark(lib)
# Calculate improvements
improvements = {
'latency_reduction': (baseline_counters['latency'] - tvm_counters['latency']) / baseline_counters['latency'],
'cache_miss_reduction': (baseline_counters['cache_misses'] - tvm_counters['cache_misses']) / baseline_counters['cache_misses'],
'energy_reduction': (baseline_counters['energy'] - tvm_counters['energy']) / baseline_counters['energy']
}
# Explain improvements
print(f"TVM reduced latency by {improvements['latency_reduction']*100:.1f}% because:")
print(f" - Cache misses decreased {improvements['cache_miss_reduction']*100:.1f}%")
print(f" - Operator fusion reduced memory transactions")Deliverable: TVM optimization report with:
- Before/after metrics comparison table
- Hardware counter deltas
- Explanation of which TVM optimizations helped
- TVM tuning logs analysis
Tasks:
- Create unified comparison table

  | Metric | x86 CPU | STM32 M7 | DSP (if available) | Winner |
  |---|---|---|---|---|
  | Latency (ms) | 52 | 145 | ? | CPU |
  | Energy (mJ) | 150 | 18 | ? | STM32 |
  | Memory (KB) | Unlimited | 512 | ? | CPU |
  | $/unit | N/A | $5 | ? | STM32 |
  | Use Case | Dev/Test | Battery IoT | Audio/DSP | - |

- Generate radar chart for visualization
  ```python
  import numpy as np
  import matplotlib.pyplot as plt

  categories = ['Latency', 'Energy', 'Memory', 'Cost', 'Ease of Use']
  cpu_scores = [0.9, 0.3, 1.0, 0.1, 1.0]  # Normalized 0-1
  stm32_scores = [0.4, 1.0, 0.5, 0.9, 0.7]

  angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
  cpu_scores += cpu_scores[:1]  # Close the plot
  stm32_scores += stm32_scores[:1]
  angles += angles[:1]

  fig, ax = plt.subplots(subplot_kw=dict(projection='polar'))
  ax.plot(angles, cpu_scores, label='x86 CPU')
  ax.plot(angles, stm32_scores, label='STM32')
  ax.set_thetagrids(np.degrees(angles[:-1]), categories)
  plt.legend()
  plt.savefig('hardware_comparison_radar.png')
  ```
- Build decision tree

  ```
  IF latency_required < 50ms:
      IF energy_budget > 100mJ:
          → x86 CPU (fast, but power-hungry)
      ELSE:
          → Check if NPU available
  ELSE IF latency_required < 200ms:
      IF model_size < 500KB:
          → STM32 Cortex-M7 (best energy efficiency)
      ELSE:
          → DSP with external SRAM
  ELSE:
      → Any platform works; optimize for cost
  ```
Deliverable: Comparative analysis document (15-20 pages) with:
- Executive summary table
- Detailed per-hardware analysis
- Visualization suite (radar chart, bar charts, scatter plots)
- Decision tree for hardware selection
Structure:
# TinyML Hardware Selection Guidelines
## 1. Measurement Methodology
### 1.1 Reproducibility Checklist
- [ ] CPU frequency locked or logged
- [ ] Warm-up: 10 inferences
- [ ] Sample size: 100+ inferences
- [ ] Report p50/p95/p99
- [ ] Energy: Sample at ≥1kHz for ≥1 second
- [ ] Document: Framework version, compiler flags, hardware config
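A small sketch of the last checklist item: dump run metadata as JSON next to each result file (field values are examples, not prescribed names):

```python
import json
import platform

metadata = {
    "python": platform.python_version(),
    "machine": platform.machine(),
    "framework": "tflite-micro",              # assumption: record your framework + version
    "compiler_flags": "-O3 -mcpu=cortex-m7",  # example flags; record your actual flags
    "cpu_freq_locked": True,
}
with open("run_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```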
### 1.2 Hardware Counter Setup
**Cortex-M**:
- Enable DWT for cycle counting
- Enable PMU for cache/branch counters
- Use TIMERs for sub-cycle timing if needed
**x86 CPU**:
- Use `perf stat` for standard counters
- Use `perf mem` for memory profiling
- Consider Intel VTune for deeper analysis
## 2. Hardware-Specific Insights
### 2.1 Cortex-M Optimization Guide
**When to Choose**:
- Battery-powered applications
- Inference frequency < 1 Hz
- Energy budget < 50 mJ/inference
- Model size < 500 KB
**Common Bottlenecks**:
1. **Small L1 Cache** (typically 4-32 KB)
- Symptom: Cache miss rate > 10%
- Solution: Reduce tile size, use im2col carefully
2. **Limited SRAM** (typically 128-512 KB)
- Symptom: Model doesn't fit
- Solution: Quantize to INT8, use weight compression
3. **Low Clock Speed** (typically 80-200 MHz)
- Symptom: Compute-bound even for small models
- Solution: Use DSP instructions (CMSIS-NN), operator fusion
**Optimization Checklist**:
- [ ] Use CMSIS-NN kernels (2-4x speedup)
- [ ] Enable FPU if using FP16
- [ ] Quantize to INT8 (3-4x faster, 4x less memory)
- [ ] Fuse operators to reduce memory traffic
- [ ] Align data to 16-byte boundaries for SIMD
### 2.2 x86 CPU Optimization Guide
[Similar structure for each hardware type]
## 3. Model Optimization Impact Matrix
| Optimization | Latency Δ | Energy Δ | Accuracy Δ | Memory Δ | Difficulty |
|--------------|-----------|----------|------------|----------|------------|
| INT8 Quantization | -60% | -70% | -1.5% | -75% | Medium |
| Operator Fusion (TVM) | -25% | -30% | 0% | 0% | High |
| Pruning (50% sparsity) | -20% | -25% | -2% | -50% | High |
| Input Resolution ↓ | -40% | -45% | -5% | -60% | Easy |
**Note**: Deltas are approximate; measure on your hardware.
## 4. Trade-Off Analysis
### Latency vs Energy
**Key Insight**: Faster doesn't always mean worse energy. Energy per inference and average power pull in opposite directions.
Example from our measurements:
- STM32 @ 168 MHz: 145 ms, 18 mJ/inference → 124 mW average power
- STM32 @ 80 MHz: 305 ms, 24 mJ/inference → 79 mW average power (lower power draw, but more energy per inference)
**Recommendation**: Profile multiple frequencies to find the energy-optimal operating point.
### Accuracy vs Efficiency
[Include Pareto frontier plot showing accuracy-energy trade-offs]
## 5. When to Use Each Hardware
### Cortex-M4/M7
✅ Best for: Battery IoT, wearables, audio keywords
❌ Avoid for: Video processing, high-throughput tasks
### DSP
✅ Best for: Audio processing, signal filtering, high MAC operations
❌ Avoid for: Complex control flow, irregular memory access
### NPU/Accelerator
✅ Best for: Batch inference, convolutional workloads, high throughput
❌ Avoid for: Tiny models (<10KB), custom operators, low latency (<10ms)
### x86 CPU
✅ Best for: Development, prototyping, non-constrained environments
❌ Avoid for: Battery-powered, cost-sensitive applications
## 6. Case Studies
### Case Study 1: Keyword Spotting
**Requirements**: <20ms latency, <1mJ/inference, always-on
**Model**: MobileNetV2-like, 15KB
**Winner**: Cortex-M4 with DSP extensions
**Reasoning**:
- Latency: 12ms (meets requirement)
- Energy: 0.8mJ (best in class)
- Cost: $2/unit
- Bottleneck: Memory-bound (cache miss rate: 8%)
- Optimization: CMSIS-NN + INT8 quantization
### Case Study 2: Image Classification
**Requirements**: <100ms latency, <50mJ/inference, periodic inference
**Model**: MobileNetV2, 3.5MB
**Winner**: NPU (if available), else Cortex-M7
**Reasoning**:
- NPU: 45ms, 12mJ (tensor ops well-mapped)
- M7: 95ms, 38mJ (acceptable fallback)
- Bottleneck: Compute-bound on M7, memory-bound on NPU
## 7. Optimization Decision Tree
```
START: Model deployed, baseline measured
│
├─ Is latency acceptable?
│   NO → Profile bottleneck
│   │   ├─ Memory-bound (cache miss > 10%)?
│   │   │   YES → Reduce working set:
│   │   │         - Decrease batch size
│   │   │         - Tile operations
│   │   │         - Quantize (FP32→INT8)
│   │   │   NO → Compute-bound?
│   │   │         YES → Increase parallelism:
│   │   │               - Use SIMD/DSP instructions
│   │   │               - Enable hardware accelerators
│   │   │               - Operator fusion
│   YES → Continue
│
├─ Is energy acceptable?
│   NO → Profile energy hotspots
│   │   ├─ High static power (>40% of total)?
│   │   │   YES → Reduce inference frequency or duty cycle
│   │   ├─ High dynamic power?
│   │   │   YES → Optimize performance first (faster = less active time)
│   │   │         Then: Lower voltage/frequency if possible
│   YES → Continue
│
└─ Is memory acceptable?
    NO → Reduce model size:
         - Weight quantization
         - Pruning
         - Knowledge distillation
         - Smaller architecture
    YES → ✅ DONE
```
## 8. Measurement Pitfalls & Solutions
### Pitfall 1: Thermal Throttling
**Symptom**: Latency increases over time
**Detection**: Monitor frequency during benchmark
**Solution**: Add cooling or reduce sustained load
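A sketch of the detection step on Linux hosts, reading the standard cpufreq sysfs node so frequency can be logged next to each latency sample:

```python
def read_cpu_freq_khz(cpu: int = 0) -> int:
    """Return the current frequency of one core in kHz (Linux cpufreq sysfs)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq"
    with open(path) as f:
        return int(f.read().strip())
```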
### Pitfall 2: First-Inference Penalty
**Symptom**: First inference 10-100x slower
**Cause**: Cache cold start, JIT compilation, clock ramp-up
**Solution**: Always warm-up (10+ inferences), report cold start separately
### Pitfall 3: Background Interrupts
**Symptom**: High variance in latency (p99 >> p50)
**Detection**: Large std deviation, outliers
**Solution**: Disable non-essential interrupts, increase sample size (100+)
### Pitfall 4: Power Measurement Artifacts
**Symptom**: Energy measurements don't correlate with latency
**Cause**: Low sampling rate, measurement device overhead
**Solution**: Sample at ≥1kHz, use hardware-based measurement
### Pitfall 5: Compiler Optimization Surprises
**Symptom**: Release build 10x faster than debug
**Cause**: Dead code elimination, loop unrolling
**Solution**: Always benchmark optimized builds (-O3), but verify correctness
## 9. Advanced Techniques
### 9.1 Roofline Analysis
**Purpose**: Visualize if operations are compute-bound or memory-bound
**Method**:
1. Measure peak compute performance (GOps/s)

   ```c
   // Run compute-heavy kernel (e.g., GEMM)
   float peak_gops = measure_peak_compute();
   ```

2. Measure peak memory bandwidth (GB/s)

   ```c
   // Run memory-copy kernel
   float peak_bandwidth = measure_memory_bandwidth();
   ```

3. Calculate operational intensity per layer

   ```
   Op_Intensity = FLOPs / Bytes_Accessed
   ```

4. Plot roofline

   ```python
   import matplotlib.pyplot as plt
   import numpy as np

   # Hardware limits
   peak_compute = 10   # GOps/s
   peak_bandwidth = 5  # GB/s

   # Roofline
   intensity = np.logspace(-1, 2, 100)
   roofline = np.minimum(peak_compute, peak_bandwidth * intensity)
   plt.loglog(intensity, roofline, 'k-', linewidth=2, label='Roofline')

   # Plot your operators
   operators = {
       'Conv2D': (2.5, 8.5),  # (intensity, actual GOps/s)
       'GEMM': (15, 9.8),
       'DepthwiseConv': (0.5, 2.3)
   }
   for name, (x, y) in operators.items():
       plt.plot(x, y, 'ro', markersize=10)
       plt.text(x, y, name)

   plt.xlabel('Operational Intensity (FLOPs/Byte)')
   plt.ylabel('Performance (GOps/s)')
   plt.title('Roofline Model: Model Operators')
   plt.grid(True)
   plt.legend()
   plt.savefig('roofline.png')
   ```
Interpretation:
- Points on/near roofline: Well-optimized
- Points below roofline: Room for improvement
- Points in horizontal region: Compute-bound
- Points in diagonal region: Memory-bound
### 9.2 Energy-Cycle Regression Model

Goal: Predict energy from cycle counts (faster than direct measurement)

Method:
1. Collect training data

   ```python
   data = []
   for config in configurations:
       cycles = measure_cycles(config)
       energy = measure_energy(config)  # Expensive
       data.append((cycles, energy))
   ```

2. Fit regression model

   ```python
   import numpy as np
   from sklearn.linear_model import LinearRegression

   X = np.array([d[0] for d in data]).reshape(-1, 1)
   y = np.array([d[1] for d in data])

   model = LinearRegression()
   model.fit(X, y)

   # Now predict energy from cycles alone
   predicted_energy = model.predict([[new_cycles]])[0]
   ```

3. Validate on test set
- R² > 0.9 means good correlation
- Use for rapid optimization iteration
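A sketch of the validation step, assuming `data` is the (cycles, energy) list collected in step 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out a quarter of the measurements for validation
X = np.array([d[0] for d in data]).reshape(-1, 1)
y = np.array([d[1] for d in data])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

reg = LinearRegression().fit(X_tr, y_tr)
print(f"Held-out R^2: {r2_score(y_te, reg.predict(X_te)):.3f}")  # target: > 0.9
```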
### 9.3 Memory Access Pattern Visualization

Purpose: Understand cache behavior

Method:
import matplotlib.pyplot as plt
def visualize_memory_access(layer_profiles):
"""
Show memory access patterns that cause cache misses
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Cache miss rate per layer
layers = [p['name'] for p in layer_profiles]
miss_rates = [p['cache_misses'] / p['cache_accesses']
for p in layer_profiles]
ax1.barh(layers, miss_rates, color=['red' if r > 0.1 else 'green'
for r in miss_rates])
ax1.set_xlabel('Cache Miss Rate')
ax1.set_title('Per-Layer Cache Behavior')
ax1.axvline(x=0.1, color='orange', linestyle='--',
label='10% threshold')
ax1.legend()
# Working set size vs cache size
working_sets = [p['working_set_kb'] for p in layer_profiles]
cache_size = 32 # KB (example)
ax2.plot(working_sets, label='Working Set Size')
ax2.axhline(y=cache_size, color='red', linestyle='--',
label=f'L1 Cache Size ({cache_size}KB)')
ax2.set_xlabel('Layer Index')
ax2.set_ylabel('Size (KB)')
ax2.set_title('Working Set vs Cache Capacity')
ax2.legend()
plt.tight_layout()
    plt.savefig('memory_access_patterns.png')

### 9.4 Automated Optimization Search

Purpose: Systematically explore optimization space
Method:
class OptimizationSearch:
def __init__(self, model, hardware):
self.model = model
self.hardware = hardware
self.results = []
def search(self):
"""
Grid search over optimization parameters
"""
# Quantization levels
for quant in ['fp32', 'fp16', 'int8']:
# Compiler optimizations
for opt_level in ['-O0', '-O2', '-O3', '-Ofast']:
# Batch sizes (if applicable)
for batch in [1, 2, 4]:
config = {
'quantization': quant,
'opt_level': opt_level,
'batch_size': batch
}
# Compile and benchmark
model_compiled = self.compile(config)
metrics = self.benchmark(model_compiled)
self.results.append({
'config': config,
'latency': metrics['latency'],
'energy': metrics['energy'],
'accuracy': metrics['accuracy'],
'memory': metrics['memory']
})
# Find Pareto frontier
return self.find_pareto_optimal()
def find_pareto_optimal(self):
"""
Return configurations that are not dominated
"""
pareto = []
for r1 in self.results:
dominated = False
for r2 in self.results:
if (r2['latency'] <= r1['latency'] and
r2['energy'] <= r1['energy'] and
r2['accuracy'] >= r1['accuracy'] and
(r2['latency'] < r1['latency'] or
r2['energy'] < r1['energy'] or
r2['accuracy'] > r1['accuracy'])):
dominated = True
break
if not dominated:
pareto.append(r1)
        return pareto

- Project goal
- Hardware platforms tested
- Key findings (3-5 bullet points)
- Recommendation
- Benchmark design
- Measurement setup
- Hardware counter selection
- Statistical approach
- Reproducibility checklist
- Per-hardware detailed analysis
- Comparative tables and charts
- Attribution analysis
- Bottleneck identification
- Optimization impact
- Decision tree
- Hardware selection criteria
- Optimization strategies
- Trade-off analysis
- Summary of contributions
- Limitations
- Future work
- Raw data tables
- Hardware specifications
- Software versions
- Measurement scripts
---
## 🧪 Experimental Validation Plan
### Experiment 1: Quantization Impact
**Hypothesis**: INT8 quantization reduces latency by 60% and energy by 70% with <2% accuracy loss
**Variables**:
- Independent: Precision (FP32, FP16, INT8)
- Dependent: Latency, energy, accuracy, memory
- Control: Model architecture, hardware, compiler flags
**Procedure**:
1. Train baseline FP32 model
2. Apply post-training quantization
3. Measure all metrics on each hardware
4. Plot accuracy-efficiency Pareto frontier
**Expected Outcome**: Validate that INT8 is optimal for embedded
### Experiment 2: Operator Fusion (TVM)
**Hypothesis**: Graph-level optimization reduces memory traffic by 30%
**Variables**:
- Independent: Optimization level (None, TVM-L2, TVM-L3)
- Dependent: Cache misses, memory bandwidth, latency
- Control: Model, hardware
**Procedure**:
1. Benchmark baseline model
2. Apply TVM with AutoTVM tuning
3. Re-benchmark with same counters
4. Attribute improvements to specific optimizations
**Expected Outcome**: Show that fusion reduces cache misses
### Experiment 3: Frequency Scaling
**Hypothesis**: Energy-optimal frequency ≠ max frequency
**Variables**:
- Independent: CPU frequency (50%, 75%, 100%)
- Dependent: Latency, energy, energy-delay product
- Control: Model, optimization level
**Procedure**:
1. Lock CPU frequency at each level
2. Measure latency and energy
3. Calculate: Energy-Delay Product = Energy × Latency
4. Find minimum EDP
**Expected Outcome**: Identify energy-optimal operating point
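A sketch of steps 3-4: the 168 MHz and 80 MHz points reuse the measurements quoted earlier; the 120 MHz point is a hypothetical placeholder:

```python
# MHz: (latency_ms, energy_mJ) — 120 MHz values are hypothetical
points = {168: (145, 18), 120: (210, 21), 80: (305, 24)}

# Energy-Delay Product per frequency; pick the minimum
edp = {f: lat * e for f, (lat, e) in points.items()}
best = min(edp, key=edp.get)
print(f"Energy-optimal frequency: {best} MHz (EDP = {edp[best]} mJ·ms)")
```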
### Experiment 4: Cache Tile Size Optimization
**Hypothesis**: Tile size matching L1 cache minimizes misses
**Variables**:
- Independent: Convolution tile size (8×8, 16×16, 32×32)
- Dependent: Cache miss rate, latency
- Control: Hardware, model
**Procedure**:
1. Implement tiled convolution
2. Vary tile size
3. Measure cache behavior
4. Plot miss rate vs tile size
**Expected Outcome**: Sweet spot at tile size ≈ L1 cache capacity
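A sketch of the sweep, using a blocked matmul as a stand-in for tiled convolution; on the target, pair each tile size with the PMU cache-miss counters set up earlier:

```python
import numpy as np

def blocked_matmul(A, B, tile):
    """Blocked matrix multiply; the working set per iteration grows with tile size."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
for tile in (8, 16, 32, 64):
    blocked_matmul(A, A, tile)  # measure cycles / cache misses for each tile size here
```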
---
## 📊 Visualization Examples
### 1. Performance Attribution Sunburst Chart
```python
import plotly.graph_objects as go
# Hierarchical performance breakdown
fig = go.Figure(go.Sunburst(
labels=["Total", "Memory", "Compute", "Other",
"L1 Miss", "L3 Miss", "Bandwidth",
"MAC Ops", "Activation"],
parents=["", "Total", "Total", "Total",
"Memory", "Memory", "Memory",
"Compute", "Compute"],
values=[100, 45, 40, 15,
20, 10, 15,
30, 10],
branchvalues="total"
))
fig.update_layout(title="Latency Attribution Breakdown")
fig.write_html("attribution_sunburst.html")
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create 2×2 dashboard
fig = make_subplots(
rows=2, cols=2,
subplot_titles=("Latency", "Energy", "Memory", "Cost"),
specs=[[{"type": "bar"}, {"type": "bar"}],
[{"type": "bar"}, {"type": "bar"}]]
)
hardware = ["x86 CPU", "STM32 M7", "DSP", "NPU"]
# Latency
fig.add_trace(
go.Bar(x=hardware, y=[52, 145, 98, 45], name="Latency (ms)"),
row=1, col=1
)
# Energy
fig.add_trace(
go.Bar(x=hardware, y=[150, 18, 35, 12], name="Energy (mJ)"),
row=1, col=2
)
# Memory
fig.add_trace(
go.Bar(x=hardware, y=[1000, 512, 1024, 256], name="Memory (KB)"),
row=2, col=1
)
# Cost
fig.add_trace(
go.Bar(x=hardware, y=[50, 5, 8, 15], name="Cost ($)"),
row=2, col=2
)
fig.update_layout(height=600, showlegend=False,
title_text="Multi-Hardware Comparison Dashboard")
fig.write_html("comparison_dashboard.html")import matplotlib.pyplot as plt
optimizations = ['Baseline', '+ Quantization', '+ Fusion', '+ SIMD']
latencies = [145, 58, 45, 32]
energies = [38, 15, 11, 8]
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
# Latency improvement
ax1.plot(optimizations, latencies, 'o-', linewidth=2, markersize=10)
ax1.fill_between(range(len(optimizations)), latencies, alpha=0.3)
ax1.set_ylabel('Latency (ms)', fontsize=12)
ax1.set_title('Optimization Impact on Latency', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Energy improvement
ax2.plot(optimizations, energies, 's-', linewidth=2, markersize=10, color='green')
ax2.fill_between(range(len(optimizations)), energies, alpha=0.3, color='green')
ax2.set_ylabel('Energy (mJ)', fontsize=12)
ax2.set_xlabel('Optimization Stage', fontsize=12)
ax2.set_title('Optimization Impact on Energy', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('optimization_timeline.png', dpi=300)
```

## 🛠️ Reference Implementation

### Host-side framework (`unified_benchmark.py`)

```python
#!/usr/bin/env python3
"""
unified_benchmark.py - Cross-platform AI benchmarking framework
"""
import time
import numpy as np
from abc import ABC, abstractmethod
class BenchmarkBackend(ABC):
"""Abstract base class for hardware-specific backends"""
@abstractmethod
def setup(self):
"""Initialize hardware and load model"""
pass
@abstractmethod
def run_inference(self):
"""Execute single inference"""
pass
@abstractmethod
def read_counters(self):
"""Read hardware performance counters"""
pass
@abstractmethod
def measure_energy(self):
"""Measure energy consumption"""
pass
class CPUBackend(BenchmarkBackend):
def __init__(self, model_path):
self.model_path = model_path
self.model = None
def setup(self):
import onnxruntime as ort
self.model = ort.InferenceSession(self.model_path)
def run_inference(self):
input_name = self.model.get_inputs()[0].name
dummy = np.random.randn(1, 224, 224, 3).astype(np.float32)
return self.model.run(None, {input_name: dummy})
def read_counters(self):
import subprocess
result = subprocess.run(
['perf', 'stat', '-e', 'cycles,instructions,cache-misses',
'python', '-c', 'pass'],
capture_output=True, text=True
)
# Parse perf output
return self._parse_perf_output(result.stderr)
def measure_energy(self):
# Use RAPL or external power meter
return 0.0 # Placeholder
class STM32Backend(BenchmarkBackend):
def __init__(self, model_path, serial_port='/dev/ttyACM0'):
self.model_path = model_path
self.serial_port = serial_port
def setup(self):
import serial
self.serial = serial.Serial(self.serial_port, 115200)
# Flash model to STM32
self._flash_model()
def run_inference(self):
# Send trigger command
self.serial.write(b'RUN\n')
# Read result
return self.serial.readline()
def read_counters(self):
self.serial.write(b'COUNTERS\n')
data = self.serial.readline().decode()
return self._parse_stm32_counters(data)
def measure_energy(self):
# Use INA226 sensor
from ina226 import INA226
ina = INA226(address=0x40)
return ina.energy()
class UnifiedBenchmark:
"""Main benchmarking orchestrator"""
def __init__(self, backend: BenchmarkBackend):
self.backend = backend
self.results = []
def run(self, iterations=100, warmup=10):
"""
Execute complete benchmark suite
"""
self.backend.setup()
# Warm-up phase
print(f"Warming up ({warmup} iterations)...")
for _ in range(warmup):
self.backend.run_inference()
# Measurement phase
print(f"Benchmarking ({iterations} iterations)...")
for i in range(iterations):
start = time.perf_counter()
self.backend.run_inference()
latency = time.perf_counter() - start
counters = self.backend.read_counters()
energy = self.backend.measure_energy()
self.results.append({
'iteration': i,
'latency_ms': latency * 1000,
'energy_mj': energy * 1000,
**counters
})
if (i + 1) % 10 == 0:
print(f" Progress: {i+1}/{iterations}")
return self.generate_report()
def generate_report(self):
"""
Create comprehensive benchmark report
"""
import pandas as pd
df = pd.DataFrame(self.results)
report = {
'summary': {
'latency_p50': df['latency_ms'].quantile(0.5),
'latency_p95': df['latency_ms'].quantile(0.95),
'latency_p99': df['latency_ms'].quantile(0.99),
'energy_mean': df['energy_mj'].mean(),
'energy_std': df['energy_mj'].std()
},
'attribution': self._attribute_performance(df),
'bottleneck': self._identify_bottleneck(df),
'recommendations': self._generate_recommendations(df)
}
return report
def _attribute_performance(self, df):
"""
Decompose performance into components
"""
# Example attribution logic
if 'cache_misses' in df.columns:
cache_overhead = df['cache_misses'].mean() * 10 # 10 cycles/miss
total_cycles = df['cycles'].mean()
return {
'cache_overhead_pct': cache_overhead / total_cycles * 100,
'compute_pct': (1 - cache_overhead / total_cycles) * 100
}
return {}
def _identify_bottleneck(self, df):
"""
Classify primary bottleneck
"""
if 'cache_misses' in df.columns and 'cache_accesses' in df.columns:
miss_rate = df['cache_misses'].mean() / df['cache_accesses'].mean()
if miss_rate > 0.1:
return "Memory-bound (high cache miss rate)"
if 'ipc' in df.columns:
ipc = df['ipc'].mean()
if ipc < 0.5:
return "Memory-bound (low IPC)"
elif ipc > 2.0:
return "Compute-bound"
return "Mixed bottleneck"
def _generate_recommendations(self, df):
"""
Generate actionable optimization suggestions
"""
recs = []
bottleneck = self._identify_bottleneck(df)
if "Memory-bound" in bottleneck:
recs.append("Reduce working set size via quantization (FP32→INT8)")
recs.append("Improve data locality (tiling, im2col optimization)")
recs.append("Enable operator fusion to reduce memory traffic")
if "Compute-bound" in bottleneck:
recs.append("Well-optimized for this hardware!")
recs.append("Consider SIMD/DSP instructions for further gains")
return recs
# Usage example
if __name__ == "__main__":
# Benchmark on CPU
cpu_backend = CPUBackend("model.onnx")
cpu_bench = UnifiedBenchmark(cpu_backend)
cpu_report = cpu_bench.run(iterations=100)
print("\n=== CPU Benchmark Report ===")
print(f"Latency (p50): {cpu_report['summary']['latency_p50']:.2f} ms")
print(f"Energy: {cpu_report['summary']['energy_mean']:.2f} mJ")
print(f"Bottleneck: {cpu_report['bottleneck']}")
print("\nRecommendations:")
for rec in cpu_report['recommendations']:
print(f" - {rec}")
# Benchmark on STM32
stm32_backend = STM32Backend("model.tflite")
stm32_bench = UnifiedBenchmark(stm32_backend)
stm32_report = stm32_bench.run(iterations=100)
print("\n=== STM32 Benchmark Report ===")
    # ... (similar reporting)
```

### STM32 firmware (`stm32_benchmark.c`)

```c
/* stm32_benchmark.c - Embedded benchmarking firmware */
#include "stm32f7xx_hal.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include <stdio.h>
/* Enable performance counters */
static void pmu_init(void) {
// Enable DWT cycle counter
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
// Enable PMU for cache counters (Cortex-M7)
#ifdef CORTEX_M7
PMU->CTRL |= PMU_CTRL_ENABLE_Msk;
PMU->CNTENSET = (1 << 0) | (1 << 1);
PMU->EVTYPER0 = 0x03; // L1 D-cache miss
PMU->EVTYPER1 = 0x04; // L1 D-cache access
#endif
}
typedef struct {
uint32_t cycles;
uint32_t cache_misses;
uint32_t cache_accesses;
float latency_ms;
float energy_mj;
} BenchmarkResult;
static BenchmarkResult run_inference_with_profiling(void) {
BenchmarkResult result = {0};
// Read start counters
uint32_t start_cycles = DWT->CYCCNT;
uint32_t start_misses = PMU->EVCNTR0;
uint32_t start_accesses = PMU->EVCNTR1;
// Run inference
TfLiteStatus status = interpreter->Invoke();
// Read end counters
result.cycles = DWT->CYCCNT - start_cycles;
result.cache_misses = PMU->EVCNTR0 - start_misses;
result.cache_accesses = PMU->EVCNTR1 - start_accesses;
// Calculate metrics
result.latency_ms = (float)result.cycles / (SystemCoreClock / 1000.0f);
result.energy_mj = measure_energy(); // Via INA226
return result;
}
int main(void) {
HAL_Init();
SystemClock_Config();
pmu_init();
// Initialize TFLite Micro
setup_tflite_model();
// Warm-up
for (int i = 0; i < 10; i++) {
interpreter->Invoke();
}
// Benchmark loop
BenchmarkResult results[100];
for (int i = 0; i < 100; i++) {
results[i] = run_inference_with_profiling();
// Send results over UART
printf("%d,%lu,%lu,%lu,%.2f,%.2f\n",
i,
results[i].cycles,
results[i].cache_misses,
results[i].cache_accesses,
results[i].latency_ms,
results[i].energy_mj);
}
while (1) {
// Wait for next command
}
}
```

- Mon-Tue: CPU flamegraph + operator analysis
- Milestone: Top 5 bottleneck operators identified
- Wed-Thu: Hardware counter integration
- Milestone: Baseline metrics table complete
- Fri: Literature review (MLPerf, MCUNet)
- Milestone: 3-page synthesis document
- Weekend: STM32 setup + model conversion
- Milestone: Working embedded inference
- Mon-Tue: STM32 comprehensive benchmark
- Milestone: Complete STM32 report
- Wed-Thu: Explainability engine
- Milestone: Automated attribution system
- Fri: TVM optimization experiments
- Milestone: TVM vs baseline comparison
- Weekend: Comparative analysis synthesis
- Milestone: Hardware comparison matrix
- Week 3: Run all optimization experiments
- Milestone: Pareto frontier plots
- Week 4: Draft guidelines document
- Milestone: 50% complete guidelines
- Week 5: Complete report + visualizations
- Milestone: 90% complete report
- Week 6: Prepare demo + presentation
- Milestone: Live demo ready
- ✅ Comprehensive benchmark report (30-40 pages)
- ✅ Hardware selection guidelines (10-15 pages)
- ✅ Open-source benchmark framework (GitHub repo)
- ✅ Visualization dashboard (HTML/interactive)
- ✅ Live demonstration
- ✅ Presentation slides
Innovation: Most benchmarks report energy separately from performance. This project creates a predictive model:
Energy(layer) = α × Cycles(layer) + β × MemoryAccess(layer) + γ × ClockFreq
Value:
- Predict energy without measurement hardware (useful for design-time decisions)
- Identify which operations are energy-inefficient
- Guide hardware selection based on energy constraints
Implementation:
class EnergyCycleModel:
def __init__(self):
self.alpha = None # Energy per cycle
self.beta = None # Energy per memory access
self.gamma = None # Frequency scaling factor
def train(self, measurements):
"""
Fit model from empirical measurements
"""
from sklearn.linear_model import Ridge
X = []
y = []
for m in measurements:
features = [
m['cycles'],
m['memory_accesses'],
m['frequency_mhz']
]
X.append(features)
y.append(m['energy_mj'])
model = Ridge(alpha=0.1)
model.fit(X, y)
self.alpha, self.beta, self.gamma = model.coef_
print(f"Energy Model: E = {self.alpha:.4f}×Cycles + "
f"{self.beta:.4f}×MemAccess + {self.gamma:.4f}×Freq")
def predict(self, cycles, mem_access, freq):
return (self.alpha * cycles +
self.beta * mem_access +
                self.gamma * freq)

Expected Result: R² > 0.9 correlation between predicted and measured energy
Innovation: Instead of post-hoc analysis, embed detection logic in the benchmark framework
Implementation:
def detect_bottleneck_realtime(counters):
"""
Classify bottleneck during inference
"""
ipc = counters['instructions'] / counters['cycles']
cache_miss_rate = counters['cache_misses'] / counters['cache_references']
mem_bw_util = counters['mem_bandwidth_used'] / counters['mem_bandwidth_peak']
# Decision tree
if cache_miss_rate > 0.15:
return {
'type': 'CACHE_THRASHING',
'severity': 'HIGH',
'suggestion': 'Reduce working set or increase cache locality',
'expected_gain': '30-50% latency reduction'
}
elif mem_bw_util > 0.8:
return {
'type': 'MEMORY_BANDWIDTH',
'severity': 'HIGH',
'suggestion': 'Reduce data movement via quantization or fusion',
'expected_gain': '20-40% latency reduction'
}
elif ipc < 0.5:
return {
'type': 'MEMORY_STALL',
'severity': 'MEDIUM',
'suggestion': 'Prefetch data or improve memory access patterns',
'expected_gain': '10-20% latency reduction'
}
elif ipc > 2.0:
return {
'type': 'OPTIMAL',
'severity': 'NONE',
'suggestion': 'Model is well-optimized for this hardware',
'expected_gain': 'Marginal (<5%)'
}
else:
return {
'type': 'MIXED',
'severity': 'MEDIUM',
'suggestion': 'Profile individual operators for specific bottlenecks',
'expected_gain': '15-25% latency reduction'
        }

Value: Immediate, actionable feedback during benchmarking
Innovation: Train a model to predict performance on untested hardware
Method:
class PortabilityPredictor:
"""
Predict performance on new hardware based on specs
"""
def __init__(self):
self.feature_extractors = {
'cache_size_kb': lambda hw: hw['l1_cache'] + hw['l2_cache'],
'memory_bandwidth_gbps': lambda hw: hw['mem_bandwidth'],
'peak_gops': lambda hw: hw['clock_mhz'] * hw['cores'] * hw['ops_per_cycle'] / 1000,
'power_budget_w': lambda hw: hw['tdp']
}
def train(self, benchmark_results):
"""
Learn from multiple hardware benchmarks
"""
X = []
y_latency = []
y_energy = []
for hw_name, result in benchmark_results.items():
hw_spec = HARDWARE_SPECS[hw_name]
features = [extractor(hw_spec)
for extractor in self.feature_extractors.values()]
X.append(features)
y_latency.append(result['latency_ms'])
y_energy.append(result['energy_mj'])
from sklearn.ensemble import RandomForestRegressor
self.latency_model = RandomForestRegressor(n_estimators=100)
self.energy_model = RandomForestRegressor(n_estimators=100)
self.latency_model.fit(X, y_latency)
self.energy_model.fit(X, y_energy)
def predict_performance(self, new_hardware_spec):
"""
Predict latency and energy on unseen hardware
"""
features = [extractor(new_hardware_spec)
for extractor in self.feature_extractors.values()]
predicted_latency = self.latency_model.predict([features])[0]
predicted_energy = self.energy_model.predict([features])[0]
# Confidence intervals
latency_std = np.std([tree.predict([features])[0]
for tree in self.latency_model.estimators_])
return {
'latency_ms': predicted_latency,
'latency_confidence': (predicted_latency - 2*latency_std,
predicted_latency + 2*latency_std),
'energy_mj': predicted_energy
        }

Value: Guide hardware selection without physical access to all platforms
Innovation: Predict speedup from optimizations before implementing them
Method:
class OptimizationSimulator:
"""
Estimate optimization impact based on profiling data
"""
OPTIMIZATION_MODELS = {
'int8_quantization': {
'compute_speedup': 3.5,
'memory_reduction': 4.0,
'accuracy_loss': 0.015,
'applicable_if': lambda profile: profile['precision'] == 'fp32'
},
'operator_fusion': {
'memory_traffic_reduction': 0.3,
'applicable_if': lambda profile: profile['has_consecutive_ops']
},
'simd_vectorization': {
'compute_speedup': 2.0,
'applicable_if': lambda profile: profile['has_vectorizable_ops']
}
}
def simulate(self, baseline_profile, optimization_name):
"""
Simulate effect of optimization
"""
opt = self.OPTIMIZATION_MODELS[optimization_name]
# Check if optimization is applicable
if not opt['applicable_if'](baseline_profile):
return {'applicable': False, 'reason': 'Preconditions not met'}
# Calculate expected improvements
simulated_profile = baseline_profile.copy()
if 'compute_speedup' in opt:
simulated_profile['latency_ms'] /= opt['compute_speedup']
simulated_profile['energy_mj'] /= opt['compute_speedup']
if 'memory_reduction' in opt:
simulated_profile['memory_kb'] /= opt['memory_reduction']
if 'memory_traffic_reduction' in opt:
cache_miss_latency = (baseline_profile['cache_misses'] * 10 /
baseline_profile['cycles'] *
baseline_profile['latency_ms'])
reduction = cache_miss_latency * opt['memory_traffic_reduction']
simulated_profile['latency_ms'] -= reduction
if 'accuracy_loss' in opt:
simulated_profile['accuracy'] -= opt['accuracy_loss']
return {
'applicable': True,
'baseline': baseline_profile,
'optimized': simulated_profile,
'improvements': {
'latency': (baseline_profile['latency_ms'] -
simulated_profile['latency_ms']) /
baseline_profile['latency_ms'] * 100,
'energy': (baseline_profile['energy_mj'] -
simulated_profile['energy_mj']) /
baseline_profile['energy_mj'] * 100,
'accuracy_delta': (simulated_profile['accuracy'] -
baseline_profile['accuracy'])
}
            }
        }

Usage:
simulator = OptimizationSimulator()
# Test INT8 quantization
result = simulator.simulate(cpu_baseline, 'int8_quantization')
print(f"Expected latency reduction: {result['improvements']['latency']:.1f}%")
print(f"Expected energy reduction: {result['improvements']['energy']:.1f}%")
print(f"Expected accuracy loss: {result['improvements']['accuracy_delta']:.2f}%")
print("\nShould you apply this optimization? "
f"{'YES' if result['improvements']['latency'] > 30 else 'MAYBE'}")Innovation: Quantify how well a model architecture matches hardware capabilities
Method:
def calculate_compatibility_score(model, hardware):
"""
Score: 0-100, higher = better match
"""
score = 100.0
penalties = []
# Check 1: Memory fit
model_memory = model.get_memory_footprint()
hw_memory = hardware['ram_kb']
if model_memory > hw_memory:
score = 0 # Doesn't fit at all
return score, ["Model doesn't fit in available memory"]
memory_util = model_memory / hw_memory
if memory_util > 0.9:
score -= 20
penalties.append(f"Very tight memory: {memory_util*100:.0f}% utilization")
# Check 2: Operator support
unsupported_ops = []
for op in model.get_operators():
if op not in hardware['supported_ops']:
unsupported_ops.append(op)
if unsupported_ops:
score -= 10 * len(unsupported_ops)
penalties.append(f"Unsupported ops: {unsupported_ops}")
# Check 3: Arithmetic intensity vs hardware balance
model_intensity = model.calculate_arithmetic_intensity()
hw_balance = hardware['peak_gops'] / hardware['mem_bandwidth_gbps']
mismatch = abs(model_intensity - hw_balance) / hw_balance
if mismatch > 0.5:
score -= 15
penalties.append(f"Compute/memory imbalance: {mismatch*100:.0f}%")
# Check 4: Precision match
if model.precision == 'fp32' and 'fp32' not in hardware['supported_precisions']:
score -= 25
penalties.append("Precision mismatch (FP32 model, no FP32 hardware support)")
# Check 5: Batch size efficiency
if model.batch_size > 1 and hardware['type'] == 'microcontroller':
score -= 10
penalties.append("Batched inference inefficient on microcontrollers")
return max(0, score), penalties
# Usage
model = load_model('mobilenetv2.tflite')
stm32_spec = HARDWARE_SPECS['stm32f746']
score, issues = calculate_compatibility_score(model, stm32_spec)
print(f"Compatibility Score: {score}/100")
if issues:
print("\nIssues detected:")
for issue in issues:
print(f" ⚠� {issue}")Innovation: Generate decision trees and selection criteria automatically from benchmark data
Implementation:
from sklearn.tree import DecisionTreeClassifier, export_text
class GuidelinesGenerator:
"""
Automatically generate hardware selection guidelines
"""
def __init__(self, benchmark_database):
self.db = benchmark_database # Dict of all benchmark results
def generate_decision_tree(self):
"""
Learn decision tree from benchmark results
"""
X = [] # Features: latency req, energy req, memory req, accuracy req
y = [] # Labels: recommended hardware
for config in self.db:
features = [
config['latency_requirement_ms'],
config['energy_budget_mj'],
config['memory_available_kb'],
config['min_accuracy_required']
]
X.append(features)
# Determine best hardware for this config
best_hw = self._find_best_hardware(config)
y.append(best_hw)
# Train decision tree
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=3)
clf.fit(X, y)
# Export as text
feature_names = ['Latency_Req', 'Energy_Budget', 'Memory_KB', 'Accuracy']
tree_rules = export_text(clf, feature_names=feature_names)
return tree_rules
def _find_best_hardware(self, config):
"""
Determine optimal hardware based on constraints and objectives
"""
candidates = []
for hw_name, result in config['results'].items():
# Check hard constraints
if (result['latency_ms'] <= config['latency_requirement_ms'] and
result['energy_mj'] <= config['energy_budget_mj'] and
result['memory_kb'] <= config['memory_available_kb'] and
result['accuracy'] >= config['min_accuracy_required']):
# Calculate utility score
utility = (
1.0 / result['cost_usd'] * # Lower cost better
1.0 / result['energy_mj'] # Lower energy better
)
candidates.append((hw_name, utility))
if candidates:
return max(candidates, key=lambda x: x[1])[0]
else:
return 'INFEASIBLE'
def generate_markdown_guidelines(self):
"""
Create human-readable guidelines document
"""
doc = "# Hardware Selection Guidelines\n\n"
doc += "## Decision Tree\n\n"
doc += "```\n"
doc += self.generate_decision_tree()
doc += "\n```\n\n"
doc += "## Hardware Comparison Table\n\n"
doc += "| Hardware | Best For | Avoid When | Typical Use Case |\n"
doc += "|----------|----------|------------|------------------|\n"
# Analyze each hardware's strengths
for hw_name in self.db[0]['results'].keys():
best_for = self._analyze_strengths(hw_name)
avoid_when = self._analyze_weaknesses(hw_name)
use_case = self._suggest_use_case(hw_name)
doc += f"| {hw_name} | {best_for} | {avoid_when} | {use_case} |\n"
        return doc

Output Example:
Decision Tree:
|--- Latency_Req <= 50ms
| |--- Energy_Budget <= 20mJ
| | |--- class: NPU
| |--- Energy_Budget > 20mJ
| | |--- class: CPU
|--- Latency_Req > 50ms
| |--- Memory_KB <= 512
| | |--- class: STM32_M7
| |--- Memory_KB > 512
| | |--- class: DSP
Definition: Arithmetic Intensity (AI) = Operations / Bytes Accessed
Example Calculation:
def calculate_arithmetic_intensity(layer):
"""
For a Conv2D layer: (H × W × C_in × C_out × K × K) / (H × W × C_in × 4 bytes)
"""
if layer.type == 'Conv2D':
H, W = layer.output_shape[0:2]
C_in = layer.input_channels
C_out = layer.output_channels
K = layer.kernel_size
flops = H * W * C_in * C_out * K * K
bytes_accessed = (
H * W * C_in * 4 + # Input
C_out * C_in * K * K * 4 # Weights
)
return flops / bytes_accessed
# Example: Standard Conv2D
layer = Conv2D(
input_shape=(56, 56, 64),
filters=128,
kernel_size=3
)
ai = calculate_arithmetic_intensity(layer)
print(f"Arithmetic Intensity: {ai:.1f} FLOPs/byte")
# Interpretation:
# AI < 10: Memory-bound (limited by memory bandwidth)
# AI > 50: Compute-bound (limited by compute throughput)

Why It Matters:
- Low AI operations → optimize memory access patterns
- High AI operations → optimize compute (use SIMD, DSP, etc.)
Key Insight: Cache misses have fixed latency costs
Cost Model:
Total_Latency = Compute_Cycles + (L1_Misses × 10) + (L2_Misses × 50) + (L3_Misses × 100)
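Plugging hypothetical counter values into this cost model shows how the stall attribution falls out:

```python
# Illustrative counter values, with the miss penalties from the cost model above
compute_cycles = 1_000_000
l1_miss, l2_miss, l3_miss = 20_000, 4_000, 1_000

total = compute_cycles + l1_miss * 10 + l2_miss * 50 + l3_miss * 100
stall_fraction = (total - compute_cycles) / total
print(f"Cache stalls account for {stall_fraction:.1%} of total cycles")
```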
Prediction:
def predict_cache_misses(layer, cache_size_kb):
"""
Estimate cache miss rate based on working set size
"""
working_set_kb = (
layer.input_size_kb +
layer.weight_size_kb +
layer.output_size_kb
)
if working_set_kb <= cache_size_kb:
miss_rate = 0.02 # Only compulsory misses
elif working_set_kb <= cache_size_kb * 2:
miss_rate = 0.10 # Moderate thrashing
else:
miss_rate = 0.30 # Heavy thrashing
total_accesses = layer.memory_accesses
misses = total_accesses * miss_rate
return misses
# Usage
conv_layer = model.layers[0]
l1_size = 32 # KB
predicted_misses = predict_cache_misses(conv_layer, l1_size)
predicted_latency_penalty = predicted_misses * 10 # cycles
print(f"Expected cache penalty: {predicted_latency_penalty / layer.total_cycles * 100:.1f}%")Why Energy Attribution Is Hard:
- Multiple power domains (CPU, memory, peripherals)
- Dynamic voltage/frequency scaling
- Measurement sampling rate limitations
Best Practices:
def measure_energy_high_precision(inference_func, duration_sec=1.0):
"""
High-precision energy measurement protocol
"""
    import time
    import numpy as np
    import pandas as pd
    from ina226 import INA226
ina = INA226(address=0x40)
ina.configure(
avg=128, # Average 128 samples
voltage_conv_time=1.1, # 1.1ms conversion time
current_conv_time=1.1
)
samples = []
sample_rate_hz = 1000 # 1kHz
# Start inference in background thread
import threading
inference_thread = threading.Thread(target=inference_func)
inference_thread.start()
# Collect power samples
start_time = time.time()
while time.time() - start_time < duration_sec:
sample = {
'timestamp': time.time(),
'voltage': ina.voltage(),
'current': ina.current(),
'power': ina.power()
}
samples.append(sample)
time.sleep(1.0 / sample_rate_hz)
inference_thread.join()
# Calculate energy
df = pd.DataFrame(samples)
energy_joules = np.trapz(df['power'], df['timestamp'])
energy_mj = energy_joules * 1000
# Analyze power profile
peak_power = df['power'].max()
avg_power = df['power'].mean()
std_power = df['power'].std()
return {
'energy_mj': energy_mj,
'avg_power_mw': avg_power * 1000,
'peak_power_mw': peak_power * 1000,
'power_std_mw': std_power * 1000,
'samples': samples # For detailed analysis
}| Term | Definition | Why It Matters |
|---|---|---|
| Arithmetic Intensity | Ratio of FLOPs to bytes accessed | Determines if operation is compute or memory-bound |
| IPC | Instructions Per Cycle | Low IPC (<1) indicates memory stalls |
| Cache Miss Rate | % of memory accesses that miss cache | >10% typically indicates performance problem |
| Operational Intensity | FLOPs per byte of DRAM traffic | Used in roofline model |
| Energy-Delay Product | Energy × Latency | Combined metric for efficiency |
| Pareto Frontier | Set of non-dominated solutions | Trade-off visualization |
| Quantization | Reducing numerical precision (FP32→INT8) | 3-4x speedup, 4x memory reduction |
| Operator Fusion | Combining multiple ops into one kernel | Reduces memory traffic |
| TinyML | ML on microcontrollers (<1MB RAM, <1mW) | Extremely resource-constrained |
| NPU | Neural Processing Unit (dedicated accelerator) | Specialized for tensor operations |
| MAC | Multiply-Accumulate operation | Fundamental compute unit in NNs |
| SIMD | Single Instruction Multiple Data | Vectorized operations (2-4x speedup) |
| DWT | Data Watchpoint and Trace (ARM) | Cycle-accurate profiling on Cortex-M |
| PMU | Performance Monitoring Unit | Hardware counters for cache, branches |
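Two of these metrics fall directly out of benchmark results. A minimal sketch with illustrative numbers, showing Energy-Delay Product and a simple Pareto-frontier filter:

```python
# Sketch: EDP and Pareto frontier over per-platform results (data illustrative)
results = [
    {'hw': 'STM32_M7', 'latency_ms': 120, 'energy_mj': 8.0},
    {'hw': 'DSP',      'latency_ms': 60,  'energy_mj': 6.5},
    {'hw': 'NPU',      'latency_ms': 15,  'energy_mj': 2.1},
]
for r in results:
    r['edp'] = r['latency_ms'] * r['energy_mj']   # Energy-Delay Product

def pareto_frontier(points):
    # Keep points that no other point beats on both latency and energy
    return [p for p in points
            if not any(q['latency_ms'] < p['latency_ms'] and
                       q['energy_mj'] < p['energy_mj'] for q in points)]

print(pareto_frontier(results))
```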
- ✅ Complete CPU flamegraph and baseline profiling
- ✅ Read MLPerf Tiny and MCUNet papers
- ✅ Set up STM32 development environment
- 🔄 Deploy model to STM32 and collect first metrics
- Add support for additional hardware (ESP32, Raspberry Pi Pico)
- Implement automated optimization search
- Create interactive web dashboard for results
- Open-source the benchmark framework on GitHub
- Adaptive Benchmarking: Automatically adjust measurement granularity based on variability
- Transfer Learning for Performance: Predict performance on new models based on architecture similarity
- Energy-Aware NAS: Neural architecture search with hardware energy as objective
- Cross-Compiler Optimization Study: Compare GCC vs Clang vs ARM Compiler systematically
- TVM: Automated End-to-End Optimizing Compiler (OSDI 2018)
- MLPerf Tiny Benchmark (arXiv:2106.07597)
- MCUNet: Tiny Deep Learning on IoT Devices (NeurIPS 2020)
- Optimus: Accelerating Neural Network Inference (ACM TECS 2022)
- CMSIS-NN: Efficient Neural Network Kernels
- Benchmarking TinyML Systems (SenSys 2021)
- Roofline Model (Berkeley Technical Report)
- TensorFlow Lite Micro Docs
- ARM Cortex-M7 Technical Reference Manual
- STM32F7 Reference Manual (PMU/DWT sections)
- CMSIS-DSP Library Documentation
- OpenVINO ARM Platform Guide
- perf (Linux profiler)
- py-spy (Python flamegraphs)
- PyTorch Profiler
- STM32CubeIDE
- Nsight Systems (if using NVIDIA)
- INA226 Python library
- ✅ Benchmark on ≥3 different hardware platforms
- ✅ Collect ≥8 hardware counters per platform
- ✅ Achieve <5% measurement variance (p95/p50 ratio)
- ✅ Create ≥5 optimization experiments with measured impact
- ✅ Build automated report generation system
- ✅ Final report: 30-40 pages
- ✅ Guidelines document: 10-15 pages
- ✅ Create ≥10 visualizations (charts, plots, dashboards)
- ✅ Document all measurement protocols for reproducibility
- ✅ Enable hardware selection for given application constraints
- ✅ Quantify optimization ROI (effort vs performance gain)
- ✅ Create reusable benchmark framework for future projects
1. Start Simple, Add Complexity Gradually
   - Week 1: Get basic latency + energy working
   - Week 2: Add hardware counters
   - Week 3: Add advanced attribution
2. Prioritize Reproducibility Over Novelty
   - Document everything (hardware, software, environment)
   - Use statistical rigor (warm-up, multiple runs, percentiles; see the sketch after this list)
   - Make measurements repeatable by others
3. Focus on Explainability
   - Every metric should have an interpretation
   - Every bottleneck should have a recommendation
   - Visualize to make complex data accessible
4. Validate Against Existing Work
   - Compare your STM32 results against the STM32 Model Zoo
   - Sanity-check energy numbers (1-100 mJ per inference is typical for TinyML)
   - Verify optimization gains match the literature (INT8 quantization → 3-4x expected)
5. Build for Reuse
   - Modular code (easy to add new hardware backends)
   - Clear documentation
   - Consider open-sourcing after project completion
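A minimal sketch of the statistical protocol referenced in point 2 (warm-up runs, repeated timed iterations, percentile reporting); `run_inference` is a placeholder for your platform's inference call:

```python
# Sketch: warm-up + repeated timed runs + percentile reporting
import time
import numpy as np

def benchmark(run_inference, warmup=10, iterations=100):
    for _ in range(warmup):           # stabilize caches, allocators, DVFS
        run_inference()
    latencies_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    lat = np.array(latencies_ms)
    p50, p95 = np.percentile(lat, [50, 95])
    return {'p50_ms': p50, 'p95_ms': p95, 'p95_p50_ratio': p95 / p50}
```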
- ❌ Don't: Measure only once and trust the result. ✅ Do: Run 100+ iterations and report percentiles.
- ❌ Don't: Compare across hardware without fixing all variables. ✅ Do: Document every difference (compiler, optimization flags, etc.).
- ❌ Don't: Report only latency without context. ✅ Do: Include hardware counters to explain why.
- ❌ Don't: Optimize blindly without profiling first. ✅ Do: Profile → identify bottleneck → optimize → validate.
- ❌ Don't: Ignore energy measurement. ✅ Do: Measure energy; in embedded systems it often matters more than latency.
Hardware Counter Issues:
- ARM Community Forums: https://community.arm.com/
- STM32 Forum: https://community.st.com/
TVM/Optimization Questions:
- TVM Discuss: https://discuss.tvm.apache.org/
- TensorFlow Lite Micro GitHub Issues
Measurement Methodology:
- MLPerf Tiny repository (reference implementation)
- Papers' supplementary material (often has detailed protocols)
General ML Optimization:
- Papers With Code (https://paperswithcode.com/)
- ArXiv Sanity (http://www.arxiv-sanity.com/)
- Literature review complete (≥5 papers)
- Hardware platforms selected and available
- Development environment set up and tested
- Baseline model trained/selected
- Measurement protocols documented
- Model deployed on all platforms
- Benchmark framework implemented
- Hardware counters integrated
- Energy measurement working
- Per-layer profiling functional
- Baseline benchmarks complete (≥3 platforms)
- Attribution analysis working
- ≥3 optimizations tested and measured
- Comparative analysis complete
- Bottleneck identification automated
- Comprehensive report written
- Guidelines document created
- All visualizations generated
- Code commented and organized
- Presentation slides prepared
- Results validated against literature
- Methodology reviewed for reproducibility
- Demo tested and working
- Peer review feedback incorporated
- Final deliverables submitted
By completing this project, you will have mastered:
Technical Skills:
- Hardware performance profiling and attribution
- Embedded systems programming and optimization
- ML model deployment on resource-constrained devices
- Energy measurement and analysis
- Statistical benchmarking methodology
Analytical Skills:
- Bottleneck identification and root cause analysis
- Trade-off analysis (latency vs energy vs accuracy)
- Decision tree generation from empirical data
- Performance prediction and modeling
Communication Skills:
- Technical writing (30-40 page report)
- Data visualization for complex multi-dimensional data
- Guidelines creation for non-experts
- Live demonstration and presentation
Research Skills:
- Literature review and synthesis
- Experimental design and validation
- Reproducible research practices
- Open-source contribution
This comprehensive guide provides everything needed to execute your TinyML hardware benchmarking project with explainability as the core focus. The combination of rigorous methodology, novel contributions (energy-cycle model, real-time bottleneck detection), and practical tools (benchmark harness, automated guidelines) positions this work to make a significant contribution to the embedded ML community.
Remember: The goal isn't just to measure performance—it's to understand and explain it. Every number should tell a story, and every bottleneck should point to a solution.