blog.soch.cc

Nov 2025

Fastest GPIO Read Speed on STM32F103 (Tight Loops vs DMA)

When I was building Lookshi—a portable mini logic analyzer—I asked myself: How fast can you read GPIO on the STM32F1?

To find the maximum GPIO read speed on the STM32F103, I ran four benchmarks and measured the number of CPU cycles needed to perform 128 consecutive GPIO reads. This small experiment taught me quite a bit about the limitations of the STM32F103, and also about the ins and outs of DMA.

Setup

I used the ARM Cortex-M3 DWT (Data Watchpoint and Trace) cycle counter to measure how long each method took to complete the benchmark. The benchmarks were compiled with arm-none-eabi-gcc at -Os. The STM32F103 was clocked at 72 MHz.

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
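Each measurement is the difference between two reads of the counter. Here is a minimal sketch of that arithmetic, written so it also runs on a PC. The helper names dwt_elapsed and cycles_to_us are hypothetical and not part of the benchmark code. CYCCNT is a 32-bit counter that wraps around, and unsigned subtraction still gives the right answer as long as fewer than 2^32 cycles pass between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two DWT->CYCCNT samples.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single counter wrap
 * between t0 and t1 still yields the correct difference. */
uint32_t dwt_elapsed(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}

/* Hypothetical helper: convert a cycle count to microseconds for a
 * given core clock (72000000 Hz in this post). */
double cycles_to_us(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    return (double)cycles * 1e6 / (double)cpu_hz;
}
```

On the target, t0 and t1 would be the two DWT->CYCCNT reads that bracket each benchmark.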

Benchmark 1: Tight-Loop GPIO Reads

To begin with, I wanted to know how fast it would be if I just read from GPIO in a simple for-loop.

volatile uint32_t dest[128] = {0};
uint32_t t0 = DWT->CYCCNT;
for(uint8_t i = 0; i < 128; ++i){
  dest[i] = GPIOA->IDR;
}
uint32_t t1 = DWT->CYCCNT;

Results

  • Total number of cycles: 1540
  • Cycles/read: 12.03
  • Effective read frequency: 5.98 MHz

Reading GPIO in a tight loop allowed for reading speeds of just under 6 MHz. That’s quite low for a logic analyzer.
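For reference, the two derived figures come straight from the total cycle count. Here is a small sketch of that arithmetic; the function names are hypothetical, and the 72 MHz core clock is the one stated in the setup.

```c
#include <assert.h>
#include <stdint.h>

/* Average CPU cycles spent per GPIO read. */
double cycles_per_read(uint32_t total_cycles, uint32_t reads)
{
    assert(reads > 0u);
    return (double)total_cycles / (double)reads;
}

/* Effective read frequency in MHz: reads per cycle times the core clock. */
double read_freq_mhz(uint32_t total_cycles, uint32_t reads, double cpu_mhz)
{
    assert(total_cycles > 0u);
    return cpu_mhz * (double)reads / (double)total_cycles;
}
```

Calling cycles_per_read(1540, 128) and read_freq_mhz(1540, 128, 72.0) reproduces the 12.03 cycles/read and 5.98 MHz listed above.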

Side note: It’s faster to use an unsigned 32-bit buffer than a 16-bit buffer, because the compiler emits an extra uxth instruction when IDR is stored as a 16-bit value. This instruction zero-extends the low 16 bits, which throws away the top 16 bits.
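To make the side note concrete, here is a sketch of what storing into a 16-bit buffer does to the value. The function name store_sample16 is hypothetical. A GPIO port only has 16 pins, so nothing useful is lost by the truncation; the cost is the extra instruction, not the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit register value into a 16-bit buffer slot.
 * The narrowing conversion keeps only bits 15:0, which is the same
 * truncation the side note describes. */
void store_sample16(uint16_t *buf, size_t i, uint32_t idr)
{
    assert(buf != NULL);
    buf[i] = (uint16_t)idr;
}
```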

Assembly

; setup
mov     r1, r4
ldr     r2, [r5, #4]
add.w   r4, r4, #1073741824
add.w   r4, r4, #67584
; read GPIOA->IDR
ldr     r0, [r4, #8]
; write buffer
str.w   r0, [r3, r1, lsl #2]
; loop exit condition check
adds    r1, #1
cmp     r1, #128
; exit loop
bne.n   8000320 <main+0x44>
; read DWT->CYCCNT
ldr     r3, [r5, #4]

Looking at the assembly, it’s actually quite a fast operation: 4 instructions to set up the benchmark (loop and t0 assignment), 2 to read from GPIO and write to the buffer, 2 to check the loop exit condition, 1 to branch back to the top of the loop, and 1 to load DWT->CYCCNT for t1.
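The two add.w immediates in the setup look odd in decimal, but they appear to simply build the GPIOA base address. The sketch below decodes them, assuming r4 holds 0 at that point (it is copied into the loop counter just before). The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ADD_IMM_FIRST   1073741824u  /* 0x40000000, first add.w immediate  */
#define ADD_IMM_SECOND  67584u       /* 0x00010800, second add.w immediate */
#define GPIO_IDR_OFFSET 8u           /* offset used by ldr r0, [r4, #8]    */

/* Address that the loop's ldr instruction ends up reading from. */
uint32_t gpioa_idr_address(void)
{
    uint32_t base = ADD_IMM_FIRST + ADD_IMM_SECOND;
    assert(base == 0x40010800u);  /* GPIOA base in the STM32F1 memory map */
    return base + GPIO_IDR_OFFSET;
}
```

The result, 0x40010808, is the address of GPIOA->IDR, so the only load inside the loop is the peripheral read itself.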

APB2 Peripheral Access Latency

The loop body is only 5 instructions per read (load, store, increment, compare, branch), yet it takes 12 cycles to complete. I learned that this is because APB2 peripheral access takes quite a few cycles. To read from GPIO, the CPU has to cross from AHB into APB2 to reach the peripheral. Considering this, a CPU-driven loop is already close to its limit on this chip.

Benchmark 2: Tight Loop in RAM (RamFunc)

This is the same loop, but placed in RAM to rule out possible flash wait states.

__attribute__ ((section(".RamFunc")))
static inline void read_gpioa(void){
  for(uint8_t i = 0; i < 128; ++i){
    dest[i] = GPIOA->IDR;
  }
}

Results

  • Total number of cycles: 1699
  • Cycles/read: 13.27
  • Effective read frequency: 5.42 MHz

Surprisingly, this method took longer! Only about 1.2 cycles more per read, but that costs half a megahertz in performance. The assembly was even one instruction shorter. I’m not sure why this is the case, so if anyone knows, let me know too.

Benchmark 3: Timer-Triggered DMA Sampling

The previous 2 methods tried to read GPIO as fast as possible, but there were 2 obvious problems. The first problem is that every read carries for-loop overhead (increment, compare, branch). The second problem is that the sampling rate cannot be controlled.

A solution to this is to use DMA, triggered by a timer, to transfer GPIOA->IDR directly to memory.

uint16_t dest[128] = {0};
// enable peripheral clocks
RCC->AHBENR |= RCC_AHBENR_DMA1EN;
RCC->APB2ENR |= RCC_APB2ENR_IOPAEN;
RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;
// setup timer to trigger a DMA request on an update event
TIM1->ARR = 9;
TIM1->DIER |= TIM_DIER_UDE;
// setup DMA
DMA1_Channel5->CPAR = (uint32_t)&GPIOA->IDR;
DMA1_Channel5->CMAR = (uint32_t)dest;
DMA1_Channel5->CCR |= DMA_CCR_MINC |
                      DMA_CCR_PSIZE_0 |
                      DMA_CCR_MSIZE_0;
DMA1_Channel5->CNDTR = 128;
DMA1_Channel5->CCR |= DMA_CCR_EN;
// benchmark
uint32_t t0 = DWT->CYCCNT;
TIM1->CR1 |= TIM_CR1_CEN;
while(DMA1_Channel5->CNDTR > 0){
}
uint32_t t1 = DWT->CYCCNT;
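The sampling rate is set by the timer's update rate, which follows from the prescaler and auto-reload values as f_clk / ((PSC + 1) * (ARR + 1)). Here is a small sketch of that relationship. The function names are hypothetical, and it assumes TIM1 is clocked at the full 72 MHz with the prescaler left at its reset value of 0, since the code above never writes PSC.

```c
#include <assert.h>
#include <stdint.h>

/* Update-event rate of a timer: f_clk / ((PSC + 1) * (ARR + 1)).
 * The divisor is computed in 64 bits so that PSC = ARR = 0xFFFF
 * cannot overflow. */
uint32_t timer_update_hz(uint32_t timer_clk_hz, uint16_t psc, uint16_t arr)
{
    uint64_t divisor = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(timer_clk_hz / divisor);
}

/* ARR value for a desired period in timer clock cycles (prescaler 0). */
uint16_t arr_for_period(uint32_t period_cycles)
{
    assert(period_cycles >= 1u && period_cycles <= 65536u);
    return (uint16_t)(period_cycles - 1u);
}
```

Under these assumptions, ARR = 9 gives a period of 10 cycles, i.e. a 7.2 MHz trigger rate.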

Results

With this method, there’s a new variable: the timer period. So this benchmark was run with several different timer periods.

Timer Frequency (MHz)   Timer Period (cycles)   Total Cycles   Cycles/Read   Read Frequency (MHz)
6                       12                      1569           12.25         5.87
7.2                     10                      1314           10.26         7.07
12                      6                       1309           10.22         7.04
24                      3                       1306           10.2          7.05

Much faster. With DMA, the peak reading speed is 7.05 MHz. Still low if you’re looking for an industrial logic analyzer, but I think it is good enough for hobbyists.

The DMA cannot be pushed to read faster than 10 cycles/read. Comparing the results with the previous 2 benchmarks, removing the for-loop overhead brings a full megahertz of performance.
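One way to picture this saturation is a rough model in which the achieved read rate is the timer rate, capped by a fixed minimum cost per DMA transfer. This is my own simplification rather than anything from the reference manual, and the function name is hypothetical. It ignores any fixed startup cost and bus contention, so measured values may come in slightly lower.

```c
#include <assert.h>

/* Rough model (an assumption, not a datasheet formula): the read rate
 * equals the timer rate until the DMA needs more cycles per transfer
 * than the timer period provides, at which point it flattens out. */
double modeled_read_mhz(double timer_mhz, double cpu_mhz,
                        double min_cycles_per_read)
{
    assert(min_cycles_per_read > 0.0);
    double ceiling_mhz = cpu_mhz / min_cycles_per_read;
    return timer_mhz < ceiling_mhz ? timer_mhz : ceiling_mhz;
}
```

With the observed floor of about 10.2 cycles/read at 72 MHz, the model caps out near 7.06 MHz no matter how fast the timer runs.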

Note that the actual sampling rate with DMA is a couple of kilohertz behind the timer frequency. Using 16-bit or 32-bit DMA transfer sizes made no difference here.

Now the limit of how fast we can read from GPIO on an STM32F103 has really been reached.

Benchmark 4: DMA with Interrupt on Completion

Although unlikely, I wondered if using an interrupt instead of polling CNDTR would improve performance. Interrupts come with at least 10 cycles of latency.

Results

Timer Frequency (MHz)   Timer Period (cycles)   Total Cycles   Cycles/Read   Read Frequency (MHz)
6                       12                      1596           12.46         5.77
7.2                     10                      1340           10.46         6.88
12                      6                       1336           10.43         6.9
24                      3                       1333           10.47         6.91

Each run took longer, as expected, incurring roughly 25 to 30 extra CPU cycles.
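The overhead can be read off by subtracting the polled totals of Benchmark 3 from the interrupt totals of Benchmark 4, run by run. Here is a sketch of that comparison, with the cycle counts copied from the two results tables. The array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUNS 4

/* Total cycles from the two results tables, ordered by timer
 * frequency: 6, 7.2, 12 and 24 MHz. */
const uint32_t polled_cycles[RUNS]    = {1569u, 1314u, 1309u, 1306u};
const uint32_t interrupt_cycles[RUNS] = {1596u, 1340u, 1336u, 1333u};

/* Extra cycles the interrupt-driven variant cost in a given run. */
uint32_t interrupt_overhead(size_t run)
{
    assert(run < RUNS);
    return interrupt_cycles[run] - polled_cycles[run];
}
```

The four differences come out at 27, 26, 27 and 27 cycles.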

Wrapping up

As a beginner in embedded microcontroller programming, it’s not easy to go from datasheet and reference manual stats (“This chip runs at 72 MHz!”, “The formula for DMA service time is Ts = Ta + Trd + Twr!”, …) to a definitive conclusion about how fast things can really go. So I encourage those who, like me, do not yet have practical field experience to carry out little benchmarks and experiments like these to test their hypotheses.

👉 The code for these benchmarks can be found in this Codeberg repository.
