Home

Awesome

GPIO Speed using CPU and DMA on the Raspberry Pi

Experiments to measure speed of various ways to output data to GPIO. Also convenient code snippets to help get you started with GPIO.

I provide the code in gpio-dma-test.c to the public domain. If you use DMA you need the mailbox implementation; for that note the Broadcom copyright header with permissive license in mailbox.h.

You can compile this for Raspberry Pi 1..5 by passing the PI_VERSION variable when compiling

 PI_VERSION=1 make
 PI_VERSION=2 make  # works for Pi 2 and 3
 PI_VERSION=4 make  # works for Pi 4
 PI_VERSION=5 make  # preliminary Pi 5 testing (only example 1 works)

The resulting program gives you a set of 6 experiments to conduct. By default, it toggles GPIO 14 (which is pin 8 on the Raspberry Pi header).

Usage ./gpio-dma-test [1...6]
Give number of test operation as argument to ./gpio-dma-test
Test operation
== Baseline tests, using CPU directly ==
1 - CPU: Writing to GPIO directly in tight loop
2 - CPU: reading word from memory, write masked to GPIO set/clr.
3 - CPU: reading prepared set/clr from memory, write to GPIO.
4 - CPU: reading prepared set/clr from UNCACHED memory, write to GPIO.

== DMA tests, using DMA to pump data to ==
5 - DMA: Single control block per set/reset GPIO
6 - DMA: Sending a sequence of set/clear with one DMA control block and negative destination stride.

To understand the details, you want to read BCM2835 ARM Peripherals, an excellent dataheet to get started (if you are the datasheet-reading kinda person).

Measurements

In these experiments, we want to see how fast things can go, so we do a very simple operation in which we toggle the output of a pin (see TOGGLE_GPIO in the source). In real applications, the data would certainly be slightly more useful :)

The pictures in these experiments show the output wave-form on the given pin for the various Raspberry Pis. These are screen-shots straight from an oscilloscope, the time-base is with 100ns the same for all measurements to be able to compare them easily.

All measurements were done on unmodified Pis in their respective default clock speed with a default minimal Raspian operating-system.

TODO: For the Pi4, there are no images yet, just some preliminary measurements. Stay tuned.

Writing from CPU to GPIO

The most common way to get data out on the GPIO port is using the CPU to send the data. Let's do some measurements how the Pis perform here.

Direct Output Loop to GPIO

sudo ./gpio-dma-test 1

In this simplest way to control the output, we essentially just write to the GPIO set and clear register in a tight loop:

// Pseudocode
for (;;) {
    *gpio_set_register = (1<<TOGGLE_PIN);
    *gpio_clr_register = (1<<TOGGLE_PIN);
}
Result

The resulting output wave on the Raspberry Pi 1 of 22.7Mhz, the Raspberry Pi 2 reaches 41.7Mhz and the Raspberry Pi 3 65.8 Mhz.

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 131Mhz)

The limited resolution in the 100ns range of the scope did not read the frequency correctly for the Pi 3 (so it only shows 58.8Mhz above) but if we zoom in, we see the 65.8Mhz

Preliminary measurements with the Pi 5 shows that it is slower than even the Raspberry Pi 1: The Raspberry Pi 5 only reaches about 20Mhz. This is probably due to the GPIO not handled anymore by the processor directly, but by the external RP1 peripheral chip that is connected with a PCIe lane.

Reading Word from memory, write masked set/clr

sudo ./gpio-dma-test 2

The most common way you'd probably send data to GPIO: you have an array of 32 bit data representing the bits to be written to GPIO and a mask that defines which are the relevant bits in your application.

// Pseudocode
uint32_t data[256];         // Words to be written to GPIO

const uint32_t mask = ...;  // The GPIO pins used in the program.
const uint32_t *start = data;
const uint32_t *end = start + 256;
for (const uint32_t *it = start; it < end; ++it) {
    if (( *it & mask) != 0) *gpio_set_register =  *it & mask;
    if ((~*it & mask) != 0) *gpio_clr_register = ~*it & mask;
}
Result

Raspberry Pi 2 and Pi 3 are unimpressed and output in the same speed as writing directly, Raspberry Pi 1 takes a performance hit and drops to 14.7Mhz:

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 131Mhz)

Reading prepared set/clr from memory

sudo ./gpio-dma-test 3

This would be a bit more unusal way to prepare and write data: break out the set and clr bits beforehand and store in memory before writing them to GPIO. It uses twice as much memory per operation. It does help the Raspberry Pi 1 to be as fast as possible writing from memory though, while there is no additional advantage for the Raspberry Pi 2 or 3.

Primarily, this is a good preparation to understand the way we have to send data with DMA.

// Pseudocode
struct GPIOData {
   uint32_t set;
   uint32_t clr;
};
struct GPIOData data[256];  // Preprocessed set/clr to be written to GPIO

const struct GPIOData *start = data;
const struct GPIOData *end = start + 256;
for (const struct GPIOData *it = start; it < end; ++it) {
    *gpio_set_register = it->set;
    *gpio_clr_register = it->clr;
}
Result

The Raspberry Pi 2 and Pi 3 have the same high speed as in the previous examples, but Raspberry Pi 1 can digest the prepared data faster and gets up to 20.8Mhz out of this (compared to the 14.7Mhz we got with masked writing):

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 83Mhz)

Reading prepared set/clr from UNCACHED memory

sudo ./gpio-dma-test 4

This next example is not useful in real life, but it helps to better understand the performance impact of accessing memory that does not go through a cache (L1 or L2).

The DMA subsystem, which we are going to explore in the next examples, has to read from physical memory, as it cannot use the caches (or can it ? Somewhere I read that it can make at least use of L2 cache ?).

The example is the same as before: reading pre-processed set/clr values from memory and writing them to GPIO. Only the type of memory is different.

Result

The speed is significantly reduced - it is very slow to read from uncached memory (a testament of how fast CPUs are these days or slow DRAM actually is).

One interesting finding is, that the Raspberry Pi 2 and Pi 3 both are actually significantly slower than the Raspberry Pi 1. Maybe the makers were relying more on various caches and choose to equip the machine with slower memory to keep the price while increasing memory ? At least the Pi 3 is faster than the 2, so the relative order there is preserved.

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 2.7Mhz)

Using DMA to write to GPIO

Using the Direct Memory Access (DMA) subsystem allows to free the CPU and let independently running hardware do the job.

In various code that involve using DMA and GPIO on the Raspberry Pi, it is used in conjunction with the PWM or PCM hardware to create slower paced output with very reliable timing. Examples are PiBits by richardghirst or the icrobotics PiFM transmitter.

In our example, by contrast, we want to measure the raw speed that is possible using DMA (which is not very impressive as we'll see).

In order to use DMA, the DMA controller needs access to the actual memory bus address, as it can't deal with virtual memory (which means as well: it needs to be in physical memory and can't be swapped). There are various ways to allocate that memeory and do the mapping, but it looks like a reliable way for all PIs is to use the /dev/vcio interface provided by the Pi kernel; we are using a mailbox implementation provided in an raspberrypi/userland fft example. We have an abstraction around that called UncachedMemBlock in gpio-dma-test.c.

The DMA channel we are using in these examples is channel 5, as it is usually free, but you can configure that in the source. It can not be a Lite channel, as we need DMA 2D features for both examples.

DMA: using one Control Block per GPIO operation

sudo ./gpio-dma-test 5

With DMA, we can't do any data manipulation operations (such as masking) at the time the data is written, so just like in the last CPU example, we have to prepare the source data as separate set and clear operations:

struct GPIOData {
   uint32_t set;
   uint32_t clr;
};

The output needs to be written to the GPIO registers. Unfortunately, we can't just do a plain 1:1 memory copy from the source data to the destination registers, as the layout is a bit different than our input data: The set and clr register are a few bytes apart, so there is a gap between the two write operations:

Addr-offsetGPIO RegisterwidthOperation
0x1cset (lower-32 bits)32 bits = 4 bytes<-write here
0x20unused upper bits32 bits = 4 bytesskip
0x24(reserved)32 bits = 4 bytesskip
0x28clr (lower-32 bits)32 bits = 4 bytes<-.. and here

So what we are doing is to set up a DMA Control Block with a 2D operation:

We set up the control block's 'next' pointer to point to itself, so the DMA controller goes in an endless loop. No CPU needed, yay :)

(If you are wondering what "DMA 2D" operations means, this is not the place to explain the details, but there is an old embedded article to get you started on this standard feature of many DMA controllers. Please also look at the code which contains some documentation.)

One thing to note is that we only can set up a single output operation in this matter. Once we have written to the registers, the destination pointer is at the end of the relevant register block and the only way to go back to the beginning is to start with a fresh control block that has the starting address set to the beginning of the register block. We'll address that in the next example.

This is incredibly inefficent in use of memory, in particular if you need to send more than just a few blocks. We need 40 Bytes per output operation (8 Bytes GPIO data + 32 Bytes Control block). Note, all this memory needs to be locked into RAM.

Result

First thing we notice is how slow things are in comparison to the write from the CPU. As found out in the uncached CPU example, we see the influence of slower memory in the Raspberry Pi 2 and Pi 3 here as well. The latter two show exactly the same speed.

The live scope shows that the output has quite a bit of jitter, so DMA alone will not give you very reliable timing, you always have to combine that with PWM/PCM gating if you need this in a realtime context.

Another interesting finding is the asymmetry between the set/clr time. It looks like it takes about 100ns after the set operation until the clear operation arrives - but that then is lasting much longer. This is probably due some extra time needed when switching between control blocks (even though the 'next' control block is exactly the same):

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 1.82Mhz)

DMA: multiple GPIO operations per Control Block

sudo ./gpio-dma-test 6

One of the obvious down-sides of the previous example is, that we have to set up one DMA control block for each write operation which is a lot of memory overhead. Does it also mean a bad performance impact ?

If we set up the input data in a way that it has the same layout as the output registers, we could use the stride operation on the destination side to go back to the beginning after each write and so can do many write operations with a single control block setup.

// Input data has same layout as the output registers.
struct GPIOData {
    uint32_t set;
    uint32_t ignored_upper_set_bits; // bits 33..54 of GPIO. Not needed.
    uint32_t reserved_area;          // gap between GPIO registers.
    uint32_t clr;
};

Of course, this means that we are writing to 'reserved' places in the GPIO registers. The first 4 bytes after the 'set' register are benign, as these are essentially upper bits of GPIO bits - if we write 0 to these it should be fine. Slightly problematic could be the next block of 4 bytes, as it is 'reserved' according to the data sheet. We are writing zeros in here and hope for the best that this is not doing any harm (it does seem to be fine :) ).

So in this example the 2D DMA operation is set up the following way:

Now, we only need one control-block per n operations, but each operation takes 16 bytes to store in memory. Amortized still better than in the previous case.

As usual, if we want to do that endlessly, we can link that control block back to itself.

Result

Again, the Raspberry Pi 2 is slightly slower than the Raspberry Pi 1. In general, this method is even slower than one control block per data item. And again, Raspberry Pi 3 shows the same speed as Raspberry Pi 2.

Similar to the previous example, the output has quite some jitter.

It is interesting, that the positive pulse is shorter (about 50ns) than in the previous example. It suggests that writing the data in sequence with 8 dead bytes is faster than the 8 byte stride skip that we had in the previous example. Also it means that the DMA probably has a small cache for the 16 byte block, as it emits that part faster than it can read from the uncached memory.

Now the 'low' part of the pulse is even longer than before, apparently the minus 16 Byte stride takes its sweet time even though we don't switch between control blocks:

Raspberry Pi 1Raspberry Pi 2Raspberry Pi 3Raspberry Pi 4
(about 1.54Mhz)

Conclusions