Fastest GPIO Read Speed on STM32F103 (Tight Loops vs DMA)

When I was building Lookshi—a portable mini logic analyzer—I asked myself: How fast can you read GPIO on the STM32F1?

I needed to figure out the maximum GPIO read speed on the STM32F103. I ran four benchmarks and measured the number of CPU cycles needed to perform 128 consecutive GPIO reads. This small experiment taught me quite a bit about the limitations of the STM32F103, as well as the ins and outs of DMA.

Setup

I used the ARM Cortex-M3 DWT (Data Watchpoint and Trace) cycle counter to measure how long each method took to complete the benchmark. The benchmarks were compiled with arm-none-eabi-gcc using -Os, and the STM32F103 was clocked at 72 MHz.

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
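Each benchmark takes two snapshots of the counter, t0 and t1, and the cost is their difference. A minimal sketch of that subtraction follows; the helper name cycles_elapsed is hypothetical and not part of the original code. CYCCNT is a 32-bit counter, so at 72 MHz it wraps roughly once a minute, and unsigned subtraction still gives the right result across a single wrap.

```c
#include <stdint.h>

/* Cycles elapsed between two DWT->CYCCNT snapshots.
 * CYCCNT is a free-running 32-bit counter, so unsigned subtraction
 * stays correct even if the counter wrapped once between t0 and t1.
 * (Hypothetical helper, not taken from the benchmark repository.) */
static inline uint32_t cycles_elapsed(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}
```

On the target, this would be called as cycles_elapsed(t0, t1) right after the second read of DWT->CYCCNT.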

Benchmark 1: Tight-Loop GPIO Reads

To begin with, I wanted to know how fast it would be if I just read from GPIO in a simple for-loop.

volatile uint32_t dest[128] = {0};
uint32_t t0 = DWT->CYCCNT;
for(uint8_t i = 0; i < 128; ++i){
  dest[i] = GPIOA->IDR;
}
uint32_t t1 = DWT->CYCCNT;

Results

Total number of cycles: 1540
Cycles/read: 12.03
Effective read frequency: 5.98 MHz
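The two derived figures come straight from the cycle total. As a sketch, assuming the 72 MHz core clock from the setup section (the function names here are made up for illustration):

```c
#include <stdint.h>

#define CPU_HZ 72000000.0  /* core clock used in these benchmarks */

/* Average CPU cycles spent per GPIO read. */
static double cycles_per_read(uint32_t total_cycles, uint32_t reads)
{
    return (double)total_cycles / (double)reads;
}

/* Effective read frequency in MHz: core clock divided by cycles per read. */
static double read_freq_mhz(uint32_t total_cycles, uint32_t reads)
{
    return CPU_HZ / cycles_per_read(total_cycles, reads) / 1e6;
}
```

With 1540 cycles for 128 reads, this works out to about 12.03 cycles per read and about 5.98 MHz.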

Reading GPIO in a tight loop gives a read rate of just under 6 MHz. That’s quite low for a logic analyzer.

Side note: It’s faster to use an unsigned 32-bit buffer than a 16-bit buffer, because storing IDR as a 16-bit value makes the compiler emit an extra uxth instruction, which throws away the top 16 bits.
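If 16-bit samples are still wanted later (only the low 16 bits of IDR carry pin states), one option is to capture into the 32-bit buffer and narrow it afterwards, outside the timed loop. This is only a sketch of that idea; narrow_samples is a hypothetical helper, not something from the original benchmarks.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy captured 32-bit IDR words into a 16-bit buffer, keeping the
 * low 16 bits (the pin states) and dropping the upper half.
 * Meant to run after the capture, so its cost is not part of the
 * sampling loop. */
static void narrow_samples(const volatile uint32_t *src, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (uint16_t)(src[i] & 0xFFFFu);
    }
}
```

This trades twice the buffer RAM during capture for a shorter inner loop.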

Assembly

; setup
mov     r1, r4
ldr     r2, [r5, #4]
add.w   r4, r4, #1073741824
add.w   r4, r4, #67584
; read GPIOA->IDR
ldr     r0, [r4, #8]
; write buffer
str.w   r0, [r3, r1, lsl #2]
; loop exit condition check
adds    r1, #1
cmp     r1, #128
; branch back to the top of the loop until the counter reaches 128
bne.n   8000320 <main+0x44>
; read DWT->CYCCNT
ldr     r3, [r5, #4]

Looking at the assembly, it’s actually quite a short sequence: 4 instructions to set up the benchmark (loop and t0 assignment), 2 to read from GPIO and write to the buffer, 2 to check the loop exit condition, 1 to branch back to the top of the loop (it falls through on exit), and 1 to read DWT->CYCCNT into t1.

APB2 Peripheral Access Latency

Each iteration executes only 5 instructions (the load, the store, and the 3 loop instructions), yet it takes 12 cycles to complete. I learned that this is because APB2 peripheral access takes quite a few cycles: to read from GPIO, the CPU has to cross from AHB into APB2 through a bridge. Considering this, a CPU-driven loop is already close to its limit on this chip.

Benchmark 2: Tight Loop in RAM (RamFunc)

This is the same loop, but placed in RAM to avoid possible flash wait states.

__attribute__ ((section(".RamFunc")))
static inline void read_gpioa(void){
  for(uint8_t i = 0; i < 128; ++i){
    dest[i] = GPIOA->IDR;
  }
}

Results

Total number of cycles: 1699
Cycles/read: 13.27
Effective read frequency: 5.42 MHz

Surprisingly, this method took longer! Only about 1.2 cycles more per read, but that’s half a megahertz in performance. The assembly was even one instruction shorter. I’m not sure why this is the case, so if anyone knows, let me know too.

Benchmark 3: Timer-Triggered DMA Sampling

The previous 2 methods tried to read GPIO as fast as possible, but there were 2 obvious problems. The first is that the for-loop itself always cost a few instructions of overhead per read. The second is that the sampling rate cannot be controlled.

A solution to both is to let a timer trigger the DMA controller, which then transfers GPIOA->IDR directly to memory.

uint16_t dest[128] = {0};
// enable peripheral clocks
RCC->AHBENR |= RCC_AHBENR_DMA1EN;
RCC->APB2ENR |= RCC_APB2ENR_IOPAEN;
RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;
// setup timer to trigger a DMA request on an update event
TIM1->ARR = 9; // update period = ARR + 1 = 10 timer clocks
TIM1->DIER |= TIM_DIER_UDE;
// setup DMA
DMA1_Channel5->CPAR = (uint32_t)&GPIOA->IDR;
DMA1_Channel5->CMAR = (uint32_t)dest;
DMA1_Channel5->CCR |= DMA_CCR_MINC |
                      DMA_CCR_PSIZE_0 |
                      DMA_CCR_MSIZE_0;
DMA1_Channel5->CNDTR = 128;
DMA1_Channel5->CCR |= DMA_CCR_EN;
// benchmark
uint32_t t0 = DWT->CYCCNT;
TIM1->CR1 |= TIM_CR1_CEN;
while(DMA1_Channel5->CNDTR > 0){
}
uint32_t t1 = DWT->CYCCNT;
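The timer period is set through ARR: the counter runs from 0 up to ARR inclusive, so one DMA request fires every ARR + 1 timer clocks. Below is a small sketch of that relationship. It assumes the prescaler (PSC) is left at its reset value of 0 and that TIM1 is clocked at 72 MHz; both helper names are hypothetical.

```c
#include <stdint.h>

/* ARR value for one update event every `period_ticks` timer clocks,
 * assuming PSC = 0. The counter counts 0..ARR inclusive, hence the -1.
 * ARR is 16 bits wide, and an ARR of 0 stops the counter, so the
 * period is clamped to the range 2..65536. */
static uint16_t arr_for_period(uint32_t period_ticks)
{
    if (period_ticks < 2u)     period_ticks = 2u;
    if (period_ticks > 65536u) period_ticks = 65536u;
    return (uint16_t)(period_ticks - 1u);
}

/* Update-event (DMA request) rate in Hz for a given ARR, PSC = 0. */
static uint32_t update_rate_hz(uint32_t timer_clk_hz, uint16_t arr)
{
    return timer_clk_hz / ((uint32_t)arr + 1u);
}
```

For example, a period of 10 timer clocks gives ARR = 9, which at a 72 MHz timer clock is a 7.2 MHz request rate.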

Results

With this method, there’s a new variable: the timer period. So this benchmark was run with different timer periods.

Timer Frequency (MHz)	Timer Period (cycles)	Total Cycles	Cycles/Read	Read Frequency (MHz)
6	12	1569	12.25	5.87
7.2	10	1314	10.26	7.01
12	6	1309	10.22	7.04
24	3	1306	10.2	7.05

Much faster. With DMA, the peak reading speed is 7.05 MHz. Still low if you’re looking for an industrial logic analyzer, but I think it is good enough for hobbyists.

The DMA cannot be pushed to read faster than about 10.2 cycles/read. Compared with the previous 2 benchmarks, removing the for-loop overhead gains a full megahertz of performance.

Note that the actual sampling rate with DMA is a couple of kilohertz behind the timer frequency. 16-bit/32-bit DMA sizes did not matter here.

Now the limit of how fast we can read from GPIO on an STM32F103 has really been reached.

Benchmark 4: DMA with Interrupt on Completion

Although it was unlikely, I wondered if using an interrupt instead of polling CNDTR would improve performance. Interrupts come with at least 10 cycles of latency.
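The interrupt variant is not shown here, so the following is only a sketch of how a transfer-complete handler for DMA1 channel 5 could look, not the code that was benchmarked. To keep it self-contained, a small stand-in struct replaces the CMSIS registers (DMA1->ISR, DMA1->IFCR, DMA1_Channel5->CCR); the struct and function names are hypothetical. On the target, the interrupt would also need enabling in the NVIC, and the handler body would live in DMA1_Channel5_IRQHandler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the STM32F1 DMA register layout. */
#define CCR_TCIE_BIT   (1u << 1)   /* transfer-complete interrupt enable */
#define TCIF5_BIT      (1u << 17)  /* channel 5 transfer-complete flag   */

/* Stand-in for the few registers this sketch touches (assumption:
 * on real hardware these are DMA1->ISR, DMA1->IFCR, DMA1_Channel5->CCR). */
typedef struct {
    volatile uint32_t isr;
    volatile uint32_t ifcr;
    volatile uint32_t ccr5;
} dma_regs_t;

static volatile bool     capture_done;
static volatile uint32_t capture_end_cycles;

/* Enable the transfer-complete interrupt; call before setting DMA_CCR_EN. */
static void capture_irq_enable(dma_regs_t *dma)
{
    capture_done = false;
    dma->ccr5 |= CCR_TCIE_BIT;
}

/* Body of the channel 5 handler. `now` is DWT->CYCCNT read on entry. */
static void capture_irq_handler(dma_regs_t *dma, uint32_t now)
{
    if (dma->isr & TCIF5_BIT) {
        dma->ifcr = TCIF5_BIT;      /* writing 1 clears the flag */
        capture_end_cycles = now;   /* plays the role of t1 */
        capture_done = true;
    }
}
```

The main loop would then wait on capture_done instead of polling CNDTR.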

Results

Timer Frequency (MHz)	Timer Period (cycles)	Total Cycles	Cycles/Read	Read Frequency (MHz)
6	12	1596	12.46	5.77
7.2	10	1340	10.46	6.88
12	6	1336	10.43	6.9
24	3	1333	10.41	6.91

Each run took longer, exactly as expected, incurring 26 to 27 extra CPU cycles compared with polling.

Wrapping up

As a beginner to embedded microcontroller programming, it’s not easy to go from datasheet and reference manual stats (“This chip runs at 72 MHz!”, “The formula for DMA service time is Ts = Ta + Trd + Twr!”, …) to a definitive conclusion about how fast things can really go. So I encourage those like me who do not yet have practical field experience to carry out little benchmarks and experiments like these to test their hypotheses.

👉 The code for these benchmarks can be found in this Codeberg repository.

blog.soch.cc
