Fastest GPIO Read Speed on STM32F103 (Tight Loops vs DMA)
When I was building Lookshi, a portable mini logic analyzer, I needed to know how fast GPIO can be read on the STM32F103.
I ran four benchmarks, each measuring the number of CPU cycles needed to perform 128 consecutive GPIO reads. This small experiment taught me quite a bit about the limitations of the STM32F103, and about the ins and outs of DMA.
Setup
I used the ARM Cortex-M3 DWT (Data Watchpoint and Trace) cycle counter to measure
how long each method took to complete the benchmark. The benchmarks were
compiled with arm-none-eabi-gcc with -Os. The STM32F103 was clocked at
72 MHz.
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
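CYCCNT is a free-running 32-bit counter, so elapsed time is best taken as an unsigned difference, which stays correct even if the counter wraps once between the two samples. Below is a minimal host-side sketch of that arithmetic. The helper names `cycles_elapsed` and `cycles_to_ns` are hypothetical and not part of the original benchmark code, and the 72 MHz figure is simply the clock stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two DWT->CYCCNT samples. Unsigned subtraction
 * wraps modulo 2^32, so the result is still correct if the counter
 * overflowed once between t0 and t1. */
uint32_t cycles_elapsed(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}

/* Convert a cycle count to nanoseconds for a core clock given in Hz. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0);
    return ((uint64_t)cycles * 1000000000u) / core_hz;
}
```

In the benchmarks below, this corresponds to computing `t1 - t0` on the two `uint32_t` samples.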
Benchmark 1: Tight-Loop GPIO Reads
To begin with, I wanted to know how fast it would be if I just read from GPIO in a simple for-loop.
volatile uint32_t dest[128] = {0};
uint32_t t0 = DWT->CYCCNT;
for(uint8_t i = 0; i < 128; ++i){
dest[i] = GPIOA->IDR;
}
uint32_t t1 = DWT->CYCCNT;
Results
- Total number of cycles: 1540
- Cycles/read: 12.03
- Effective read frequency: 5.98 MHz
Reading GPIO in a tight loop allowed for reading speeds of just under 6 MHz. That’s quite low for a logic analyzer.
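The per-read and frequency figures follow directly from the total cycle count. As a sketch of that arithmetic (the function names are hypothetical, and the core clock is assumed to be the 72 MHz stated in the setup):

```c
#include <assert.h>
#include <stdint.h>

/* Average CPU cycles spent per GPIO read. */
double cycles_per_read(uint32_t total_cycles, uint32_t reads)
{
    assert(reads > 0);
    return (double)total_cycles / (double)reads;
}

/* Effective read frequency in MHz for a core clock given in MHz. */
double read_freq_mhz(uint32_t total_cycles, uint32_t reads, double core_mhz)
{
    assert(total_cycles > 0);
    return core_mhz * (double)reads / (double)total_cycles;
}
```

Calling `cycles_per_read(1540, 128)` and `read_freq_mhz(1540, 128, 72.0)` reproduces the figures listed above to within rounding.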
Side note: It’s faster to use an unsigned 32-bit buffer than a 16-bit buffer
because the compiler emits a uxth instruction if we store IDR as a
16-bit value. This instruction zero-extends the low halfword, throwing away the top 16 bits.
Assembly
; setup
mov r1, r4
ldr r2, [r5, #4]
add.w r4, r4, #1073741824
add.w r4, r4, #67584
; read GPIOA->IDR
ldr r0, [r4, #8]
; write buffer
str.w r0, [r3, r1, lsl #2]
; loop exit condition check
adds r1, #1
cmp r1, #128
; exit loop
bne.n 8000320 <main+0x44>
; read DWT->CYCCNT
ldr r3, [r5, #4]
Looking at the assembly, it’s actually quite a short sequence: 4 instructions to
set up the benchmark (loop and t0 assignment), 2 to read from GPIO and write to
the buffer, 2 to check the loop exit condition, 1 conditional branch that repeats
or exits the loop, and 1 to load DWT->CYCCNT into t1.
APB2 Peripheral Access Latency
Each loop iteration is only five instructions (load, store, increment, compare, branch), yet it takes about 12 cycles to complete. I learned that this is because APB2 peripheral access takes quite a few cycles: to read from GPIO, the CPU has to cross from AHB into APB2 to access the peripheral. Considering this, a CPU-driven loop is already close to the limit of what the chip can do.
Benchmark 2: Tight Loop in RAM (RamFunc)
This is the same loop, but placed in RAM to rule out possible flash wait states.
__attribute__ ((section(".RamFunc")))
static inline void read_gpioa(void){
for(uint8_t i = 0; i < 128; ++i){
dest[i] = GPIOA->IDR;
}
}
Results
- Total number of cycles: 1699
- Cycles/read: 13.27
- Effective read frequency: 5.42 MHz
Surprisingly, this method took longer! Only about 1.2 cycles more per read, but that’s half a megahertz in performance. The assembly was even one instruction shorter. I’m not sure why this is the case, so if anyone knows, let me know too.
Benchmark 3: Timer-Triggered DMA Sampling
The previous two methods tried to read GPIO as fast as possible, but they have two obvious problems. First, every read pays the for-loop overhead (the increment, compare and branch seen in the assembly above). Second, the sampling rate cannot be controlled.
A solution is to have a timer trigger the DMA, which transfers
GPIOA->IDR directly to memory.
uint16_t dest[128] = {0};
// enable peripheral clocks
RCC->AHBENR |= RCC_AHBENR_DMA1EN;
RCC->APB2ENR |= RCC_APB2ENR_IOPAEN;
RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;
// setup timer to trigger a DMA request on an update event
TIM1->ARR = 9;
TIM1->DIER |= TIM_DIER_UDE;
// setup DMA
DMA1_Channel5->CPAR = (uint32_t)&GPIOA->IDR;
DMA1_Channel5->CMAR = (uint32_t)dest;
DMA1_Channel5->CCR |= DMA_CCR_MINC |
DMA_CCR_PSIZE_0 |
DMA_CCR_MSIZE_0;
DMA1_Channel5->CNDTR = 128;
DMA1_Channel5->CCR |= DMA_CCR_EN;
// benchmark
uint32_t t0 = DWT->CYCCNT;
TIM1->CR1 |= TIM_CR1_CEN;
while(DMA1_Channel5->CNDTR > 0){
}
uint32_t t1 = DWT->CYCCNT;
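The sampling period in the code above comes from ARR. With the prescaler left at 0, the counter runs from 0 to ARR, so an update event (and with it a DMA request) fires every ARR + 1 timer clocks, which is why `ARR = 9` gives a 10-cycle period. Here is a small host-side sketch of that arithmetic. It assumes TIM1 is clocked at the full 72 MHz, and `timer_arr_for_rate` is a hypothetical helper, not part of the benchmark code.

```c
#include <assert.h>
#include <stdint.h>

/* Auto-reload value for a target update rate, assuming prescaler PSC = 0.
 * The counter runs 0..ARR, so one update event occurs every ARR + 1
 * timer clocks. TIM1 is a 16-bit timer, so the period is capped at 65536. */
uint16_t timer_arr_for_rate(uint32_t timer_hz, uint32_t sample_hz)
{
    assert(sample_hz > 0 && sample_hz <= timer_hz);
    uint32_t period = timer_hz / sample_hz; /* timer clocks per sample, rounded down */
    assert(period >= 1 && period <= 65536u);
    return (uint16_t)(period - 1u);
}
```

For example, `timer_arr_for_rate(72000000u, 6000000u)` gives 11, the 12-cycle period used for the 6 MHz run.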
Results
With this method, there’s a new variable: the timer period. So this benchmark was run with different timer periods.
| Timer Frequency (MHz) | Timer Period (cycles) | Total Cycles | Cycles/Read | Read Frequency (MHz) |
|---|---|---|---|---|
| 6 | 12 | 1569 | 12.25 | 5.87 |
| 7.2 | 10 | 1314 | 10.26 | 7.01 |
| 12 | 6 | 1309 | 10.22 | 7.04 |
| 24 | 3 | 1306 | 10.2 | 7.05 |
Much faster. With DMA, the peak reading speed is 7.05 MHz. Still low if you’re looking for an industrial logic analyzer, but I think it is good enough for hobbyists.
The DMA cannot be pushed to read faster than 10 cycles/read. Comparing the results with the previous 2 benchmarks, removing the for-loop overhead brings a full megahertz of performance.
Note that the measured sampling rate with DMA is slightly below the timer frequency: with a 6 MHz timer, the run averaged 5.87 MHz. Using 16-bit or 32-bit DMA transfer sizes made no difference here.
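One way to read the table is to compare each run against the ideal count of 128 × period cycles. The sketch below uses hypothetical helper names and only the numbers from the table. It shows that the 12- and 10-cycle runs finish within a few dozen cycles of the ideal, while the 6-cycle run overshoots by hundreds, which is what you would expect if the DMA cannot keep up with the timer. What the small fixed excess consists of was not measured here, so treat that part as an open question.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles a run would take if every read landed exactly one timer
 * period after the previous one. */
uint32_t ideal_cycles(uint32_t reads, uint32_t period_cycles)
{
    return reads * period_cycles;
}

/* Measured cycles beyond the ideal count. A small value suggests the
 * run is timer-limited; a large value suggests the transfers themselves
 * are the bottleneck. */
int32_t excess_cycles(uint32_t measured, uint32_t reads, uint32_t period_cycles)
{
    assert(reads > 0 && period_cycles > 0);
    return (int32_t)measured - (int32_t)ideal_cycles(reads, period_cycles);
}
```

With the table values, `excess_cycles(1569, 128, 12)` is 33 and `excess_cycles(1314, 128, 10)` is 34, whereas `excess_cycles(1309, 128, 6)` is 541.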
Now the limit of how fast we can read from GPIO on an STM32F103 has really been reached.
Benchmark 4: DMA with Interrupt on Completion
Although unlikely, I wondered if using an interrupt instead of polling CNDTR
would improve performance. Interrupts carry at least 10 cycles of latency,
so I did not expect much.
Results
| Timer Frequency (MHz) | Timer Period (cycles) | Total Cycles | Cycles/Read | Read Frequency (MHz) |
|---|---|---|---|---|
| 6 | 12 | 1596 | 12.46 | 5.77 |
| 7.2 | 10 | 1340 | 10.46 | 6.88 |
| 12 | 6 | 1336 | 10.43 | 6.9 |
| 24 | 3 | 1333 | 10.41 | 6.91 |
Each run took longer, exactly as expected, incurring 25 to 30 extra CPU cycles.
Wrapping up
As a beginner in embedded microcontroller programming, it’s not easy to go from datasheet and reference manual stats (“This chip runs at 72 MHz!”, “The formula for DMA service time is Ts = Ta + Trd + Twr!”, …) to a definitive conclusion about how fast things can really go. So I encourage those who, like me, do not yet have practical field experience to carry out little benchmarks and experiments like these to test their hypotheses.
👉 The code for these benchmarks can be found in this Codeberg repository.