Fast data capture with the Raspberry Pi

Video signal captured at 2.6 megasamples per second

Adding an Analog-to-Digital Converter (ADC) to the Raspberry Pi isn’t difficult, and there is ample support for reading a single voltage value, but what about getting a block of samples, in order to generate an oscilloscope-like trace, as shown above?

By careful manipulation of the Linux environment, it is possible to read the voltage samples in at a decent rate, but the big problem is the timing of the readings; the CPU is frequently distracted by other high-priority tasks, so there is a lot of jitter in the timing, which makes the analysis & display of the waveforms a lot more difficult – even a fast board such as the RPi 4 can suffer from this problem.

We need a way of grabbing the data samples at regular intervals without any CPU intervention; that means using Direct Memory Access, which operates completely independently of the processor, so even the cheapest Pi Zero board delivers rock-solid sample timing.

Direct Memory Access

Direct Memory Access (DMA) can be set up to transfer data between memory and peripherals, without any CPU intervention. It is a very powerful technique, and as a result, can easily cause havoc if programmed incorrectly. I strongly recommend you read my previous post on the subject, which includes some simple demonstrations of DMA in action, but here is a simplified summary:

  1. The CPU has three memory spaces: virtual, bus and physical. DMA accesses use bus memory addresses, but a user program employs virtual addresses, so it is necessary to translate between the two.
  2. When writing to memory, the CPU is actually writing to an on-chip cache, and sometime later the data is written to main memory. If the DMA controller tries to fetch the data before the cache has been emptied, it will get incorrect values. So it is necessary for all DMA data to be in uncached memory.
  3. If compiler optimisation is enabled, it can bypass some memory read operations, giving a false picture of what is actually in memory. The qualifier ‘volatile’ might be needed to make sure that variables changed by DMA are correctly read by the processor.
  4. The DMA controller receives its next instruction via a Control Block (CB) which specifies the source & destination addresses, and the number of bytes to be transferred. Control Blocks can be chained, so as to create a sequence of actions.
  5. DMA transactions are normally triggered by a data request from a peripheral, otherwise they run through at full speed without stopping.
  6. If the DMA controller receives incorrect data, it can overwrite any area of memory, or any peripheral, without warning. This can cause unusual malfunctions, system crashes or file corruption, so care is needed.

For this project, I’ve abstracted the DMA and I/O functions into the new files rpi_dma_utils.c and rpi_dma_utils.h. The handling of the memory spaces has also been improved, with a single structure for each peripheral or memory area:

// Structure for mapped peripheral or memory
typedef struct {
    int fd,         // File descriptor
        h,          // Memory handle
        size;       // Memory size
    void *bus,      // Bus address
        *virt,      // Virtual address
        *phys;      // Physical address
} MEM_MAP;

To access a peripheral, the structure is initialised with the physical address:

#define SPI0_BASE       (PHYS_REG_BASE + 0x204000)

// Use mmap to obtain virtual address, given physical
void *map_periph(MEM_MAP *mp, void *phys, int size)
{
    mp->phys = phys;
    mp->size = PAGE_ROUNDUP(size);
    mp->bus = phys - PHYS_REG_BASE + BUS_REG_BASE;
    mp->virt = map_segment(phys, mp->size);
    return(mp->virt);
}

MEM_MAP spi_regs;
map_periph(&spi_regs, (void *)SPI0_BASE, PAGE_SIZE);
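
To make the address arithmetic in map_periph easier to follow, here is a minimal Python sketch of the physical-to-bus conversion and the page rounding. The physical base is the Pi 2/3 value; BUS_REG_BASE = 0x7E000000 is an assumption taken from the bus addresses printed in the BCM2835 document, and the helper names are hypothetical, not part of rpi_dma_utils.c.

```python
# Sketch of the address arithmetic used when mapping a peripheral.
PAGE_SIZE = 0x1000
PHYS_REG_BASE = 0x3F000000   # Pi 2/3 physical register base
BUS_REG_BASE = 0x7E000000    # assumed bus register base (BCM2835 document)
SPI0_BASE = PHYS_REG_BASE + 0x204000


def page_roundup(size):
    """Round a size up to a whole number of pages (hypothetical helper)."""
    return (size + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE


def periph_bus_addr(phys):
    """Convert a peripheral's physical address to its bus address."""
    return phys - PHYS_REG_BASE + BUS_REG_BASE


print(hex(periph_bus_addr(SPI0_BASE)))
```

The DMA controller would be given the bus address, while the user program reads and writes through the virtual address returned by mmap.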

Then a macro is used to access a specific register:

#define REG32(m, x) ((volatile uint32_t *)((uint32_t)(m.virt)+(uint32_t)(x)))
#define SPI_DLEN        0x0c

*REG32(spi_regs, SPI_DLEN) = 0;

The advantage of this approach is that it is easy to set or clear individual bits within a register, e.g.

*REG32(spi_regs, SPI_CS) |= 1;

Note that the REG32 macro uses the ‘volatile’ qualifier to ensure that the register access will still be executed if compiler optimisation is enabled.

Analog-to-Digital Converters (ADCs)

There are 3 ways an ADC can be linked to the Raspberry Pi (RPi):

  1. Inter-Integrated Circuit (I2C) serial bus
  2. Serial Peripheral Interface (SPI) serial bus
  3. Parallel bus

The I2C interface is the simplest from a hardware point of view, since it only has 2 connections: clock and data. However, these devices tend to be a bit slow, and the RPi I2C interface doesn’t support DMA, so we won’t be using this method.

The parallel interface is the most complicated, as it has 1 wire for each data bit, plus one or more clock lines: I’ll be looking at this in a future blog post.

This leaves the SPI interface, which is a good compromise between complexity and speed; it has only 4 connections (clock, data out, data in and chip select) but is capable of achieving over 1 megasample per second.

In this post we’ll be using 2 SPI ADC chips; the Microchip MCP3008 which is specified as 100 Ksamples/sec maximum (though I’ve only achieved 80 KS/s, for reasons I’ll discuss later), and the Texas Instruments ADS7884 which can theoretically achieve 3 Msample/s; I’ve run that at 2.6 MS/s. Both chips are 10-bit, so return a value of 0 to 1023, when measuring 0 to 3.3 volts.
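
As a quick illustration of the scaling, here is a minimal Python sketch that converts a raw 10-bit count to volts. It assumes a 3.3 V reference and divides by 1024, in the same way as the ADS7884 Python example in this post; the function name is hypothetical.

```python
ADC_BITS = 10
VREF = 3.3   # volts, assumed reference voltage


def adc_to_volts(raw, vref=VREF, bits=ADC_BITS):
    """Convert a raw ADC count to volts, scaling by 2**bits."""
    if not 0 <= raw < (1 << bits):
        raise ValueError("raw value out of range")
    return raw * vref / (1 << bits)


print("%1.3f" % adc_to_volts(212))
```

A count of 212 comes out at roughly 0.683 V.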

MCP3008

The RasPiO Analog Zero board ( https://rasp.io/analogzero/ ) has the Microchip MCP3008 ADC on it, and very little else.

It is in the same form-factor as the RPi Zero, but I used a version 3 CPU board for most of my testing. There are 8 analogue input channels, but only a single ADC, that has to be switched to the appropriate channel prior to conversion. The voltage reference is taken from the RPi 3.3 volt rail; if you need greater stability & accuracy, a standalone voltage reference can be used instead.

SPI interface

The board is tied to the SPI0 interface on the RPi, using 4 connections:

  • GPIO8 CE0: SPI 0 Chip Enable 0
  • GPIO11 SCLK: Clock signal
  • GPIO10 MOSI: data output to ADC
  • GPIO9 MISO: data input from ADC

The Chip Enable (or Chip Select as it is often known) is used to frame the overall transfer; it is normally high, then is set low to start the analog-to-digital conversion, and is held low while the data is transferred to & from the device.

Getting a single sample from the ADC is really easy in Python:

from gpiozero import MCP3008
adc = MCP3008(channel=0)
print(adc.value * 3.3)
adc.close()

We’ll be diving a bit deeper into the way the SPI interface works, so here is the same operation in Python, but direct-driving the SPI interface:

import spidev
spi = spidev.SpiDev()
spi.open(0, 0)
spi.max_speed_hz = 500000
spi.mode = 0
msg = [0x01,0x80,0x00]
rsp = spi.xfer2(msg)
val = ((rsp[1]*256 + rsp[2]) & 0x3ff) * 3.3 / 1024
print(val)

The most useful diagnostic method is to view the signals on an oscilloscope, so here are the corresponding traces; the scale is 20 microseconds per division (per square) horizontally, and 5 volts per division vertically:

RPi SPI access of an MCP3008 ADC

You can see the Chip Select frames the transaction, but remains active (low) for about 120 microseconds after the transfer is finished; that is something we’ll need to improve to get better speeds. The clock is 500 kHz as specified in the code, but this can be up to 2 MHz. The MOSI (CPU output) data is as specified in the data sheet: a value of 01 80 hex has a ‘1’ start bit, followed by another ‘1’ to select single-ended mode (not differential). MISO (CPU input) data reflects the voltage value measured by the ADC. The data is always sent most-significant-bit first, and the first return byte is ignored (since the ADC hadn’t started the conversion), so the second byte has to be multiplied by 256, and added to the third byte; the result is then masked to 10 bits.
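
The framing described above can be sketched as two small Python helpers, one building the 3-byte request and one extracting the 10-bit result. The function names are hypothetical; the byte values follow the 0x01, 0x80, 0x00 message used in this post.

```python
def mcp3008_request(chan):
    """Build the 3-byte single-ended request for channel 0-7."""
    if not 0 <= chan <= 7:
        raise ValueError("channel must be 0 to 7")
    return [0x01, 0x80 | (chan << 4), 0x00]


def mcp3008_decode(rsp):
    """Extract the 10-bit result from a 3-byte response."""
    return ((rsp[1] << 8) | rsp[2]) & 0x3FF


print(mcp3008_request(0))
print(mcp3008_decode([0xFF, 0xFE, 0x12]))
```

The mask discards the first byte and the upper 6 bits of the second, which carry no conversion data.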

You’ll see there is a downward curve at the end of the MISO trace; this shows that the line isn’t being driven high or low, and is floating. It is worth watching out for signals like this, since they can cause problems as they drift between 1 and 0; in this case the transition is harmless as the transfer cycle has already finished.

MCP3008 software

Here is the C equivalent of the Python code:

// Set / clear SPI chip select
void spi_cs(int set)
{
    uint32_t csval = *REG32(spi_regs, SPI_CS);

    *REG32(spi_regs, SPI_CS) = set ? csval | 0x80 : csval & ~0x80;
}

// Transfer SPI bytes
void spi_xfer(uint8_t *txd, uint8_t *rxd, int len)
{
    while (len--)
    {
        *REG8(spi_regs, SPI_FIFO) = *txd++;
        while((*REG32(spi_regs, SPI_CS) & (1<<17)) == 0) ;
        *rxd++ = *REG32(spi_regs, SPI_FIFO);
    }
}

// Fetch single 10-bit sample from ADC
int adc_get_sample(int chan)
{

    uint8_t txdata[3]={0x01,0x80|(chan<<4),0}, rxdata[3];
    
    spi_cs(1);
    spi_xfer(txdata, rxdata, sizeof(txdata));
    spi_cs(0);
    return(((rxdata[1]<<8) | rxdata[2]) & 0x3ff);
}

This takes 3 bytes to transfer 10 data bits, which is a bit wasteful. It is worth reading the MCP3008 data sheet, which explains that the leading ‘1’ of the outgoing data is used to trigger the conversion, so the whole cycle can be compressed into 16 bits, if you ignore the last data bit:

// Fetch 9-bit sample from ADC
int adc_get_sample(int chan)
{
    uint8_t txdata[2]={0xc0|(chan<<3),0}, rxdata[2];
    
    spi_cs(1);
    spi_xfer(txdata, rxdata, sizeof(txdata));
    spi_cs(0);
    return((((int)rxdata[0] << 9) | ((int)rxdata[1] << 1)) & 0x3ff);
}

You’ll see that the transmit bytes 0x01,0x80 have been shifted left by 7 bits to make one byte 0xc0, and this results in the response data being shifted left by the same amount.
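
Here is a minimal Python sketch of the same 16-bit framing, to show where the bits end up. The simulate_response helper is an idealised model I have assumed for illustration (it ignores whatever the ADC drives before its result), and all the function names are hypothetical.

```python
def frame16_request(chan):
    """Start bit, single-ended bit and channel packed into the first byte."""
    return [0xC0 | (chan << 3), 0x00]


def frame16_decode(rsp):
    """Rebuild a 10-bit value from the 9 bits received (LSB is lost)."""
    return ((rsp[0] << 9) | (rsp[1] << 1)) & 0x3FF


def simulate_response(value):
    """Model the ADC reply: the 24-bit frame shifted left 7, top 16 bits kept."""
    word = ((value << 7) >> 8) & 0xFFFF
    return [word >> 8, word & 0xFF]


# The 16-bit command is the 24-bit command shifted left by 7 bits
assert (0x0180 << 7) == 0xC000
print(frame16_decode(simulate_response(0x2AB)))
```

The decoded value always has a zero least-significant bit, which is the price paid for the shorter transfer.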

A single transfer can easily be done using DMA, since the SPI controller has an auto-chip-select mode that handles the CE signal for us:

// Fetch single sample from MCP3008 ADC using DMA
int adc_dma_sample_mcp3008(MEM_MAP *mp, int chan)
{
    DMA_CB *cbs=mp->virt;
    uint32_t dlen, *txd=(uint32_t *)(cbs+2);
    uint8_t *rxdata=(uint8_t *)(txd+0x10);

    enable_dma(DMA_CHAN_A);
    enable_dma(DMA_CHAN_B);
    dlen = 4;
    txd[0] = (dlen << 16) | SPI_TFR_ACT;
    mcp3008_tx_data(&txd[1], chan);
    cbs[0].ti = DMA_SRCE_DREQ | (DMA_SPI_RX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_DEST_INC;
    cbs[0].tfr_len = dlen;
    cbs[0].srce_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    cbs[0].dest_ad = MEM_BUS_ADDR(mp, rxdata);
    cbs[1].ti = DMA_DEST_DREQ | (DMA_SPI_TX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_SRCE_INC;
    cbs[1].tfr_len = dlen + 4;
    cbs[1].srce_ad = MEM_BUS_ADDR(mp, txd);
    cbs[1].dest_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    *REG32(spi_regs, SPI_DC) = (8 << 24) | (4 << 16) | (8 << 8) | 4;
    *REG32(spi_regs, SPI_CS) = SPI_TFR_ACT | SPI_DMA_EN | SPI_AUTO_CS | SPI_FIFO_CLR | SPI_CSVAL;
    *REG32(spi_regs, SPI_DLEN) = 0;
    start_dma(mp, DMA_CHAN_A, &cbs[0], 0);
    start_dma(mp, DMA_CHAN_B, &cbs[1], 0);
    dma_wait(DMA_CHAN_A);
    return(mcp3008_rx_value(rxdata));
}

// Return Tx data for MCP3008
int mcp3008_tx_data(void *buff, int chan)
{
    uint8_t txd[3]={0x01, 0x80|(chan<<4), 0x00};
    memcpy(buff, txd, sizeof(txd));
    return(sizeof(txd));
}

// Return value from ADC Rx data
int mcp3008_rx_value(void *buff)
{
    uint8_t *rxd=buff;
    return(((int)(rxd[1]&3)<<8) | rxd[2]);
}

When testing new DMA code, it is not unusual for there to be an error such that the DMA cycle never completes, so the dma_wait function has a timeout:

// Wait until DMA is complete
void dma_wait(int chan)
{
    int n = 1000;

    do {
        usleep(100);
    } while (dma_transfer_len(chan) && --n);
    if (n == 0)
        printf("DMA transfer timeout\n");
}

So we have code to do a single transfer, can’t we use the same idea to grab multiple samples in one transfer? The problem is the CS line; this has to be toggled for each value, and the auto-chip-select mode only works for a single transfer; despite a lot of experimentation, I couldn’t find any way of getting the SPI controller to pulse CS low for each ADC cycle in a multi-cycle capture.

The solution to this problem comes in treating the transmit and receive DMA operations very differently. The receive operation simply keeps copying the 32-bit data from the SPI FIFO into memory, until all the required data has been captured. In contrast, the transmit side is repeatedly sending the same trigger message to the ADC (0x01, 0x80, 0x00 in the above example). Since the same message is repeating, we could set up a small sequence of DMA Control Blocks (CBs):

CB1: set chip select high
CB2: set chip select low
CB3: write next 32-bit word to the FIFO

The controller is normally executing CB3, waiting for the next SPI data request. When this arrives, it executes CB1 then CB2, briefly setting the chip select high & low to start a new data capture. It then stops in CB3 again, waiting for the next data request. Using this method, the typical width of the CS high pulse is 330 nanoseconds, which is more than adequate to trigger the ADC.

The bulk of the code is the same as in the previous example; here are the control block definitions:

    // Control block 0: read data from SPI FIFO
    cbs[0].ti = DMA_SRCE_DREQ | (DMA_SPI_RX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_DEST_INC;
    cbs[0].tfr_len = dlen;
    cbs[0].srce_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    cbs[0].dest_ad = MEM_BUS_ADDR(mp, rxdata);
    // Control block 1: CS high
    cbs[1].srce_ad = cbs[2].srce_ad = MEM_BUS_ADDR(mp, pindata);
    cbs[1].dest_ad = REG_BUS_ADDR(gpio_regs, GPIO_SET0);
    cbs[1].tfr_len = cbs[2].tfr_len = cbs[3].tfr_len = 4;
    cbs[1].ti = cbs[2].ti = DMA_DEST_DREQ | (DMA_SPI_TX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_SRCE_INC;
    // Control block 2: CS low
    cbs[2].dest_ad = REG_BUS_ADDR(gpio_regs, GPIO_CLR0);
    // Control block 3: write data to Tx FIFO
    cbs[3].ti = DMA_DEST_DREQ | (DMA_SPI_TX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_SRCE_INC;
    cbs[3].srce_ad = MEM_BUS_ADDR(mp, &txd[1]);
    cbs[3].dest_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    // Link CB1, CB2 and CB3 in endless loop
    cbs[1].next_cb = MEM_BUS_ADDR(mp, &cbs[2]);
    cbs[2].next_cb = MEM_BUS_ADDR(mp, &cbs[3]);
    cbs[3].next_cb = MEM_BUS_ADDR(mp, &cbs[1]);
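
To visualise the endless loop formed by the next_cb links, here is a minimal Python model of the chain. It is only a sketch: it ignores the data-request pacing entirely, and it assumes the chain is entered at CB1.

```python
from itertools import islice

# Each control block: (action, number of next block).
# The keys match CB1..CB3 in the text.
CONTROL_BLOCKS = {
    1: ("CS high", 2),
    2: ("CS low", 3),
    3: ("write FIFO", 1),
}


def run_chain(start=1):
    """Yield the action of each control block, following the next-block links."""
    cb = start
    while True:
        action, next_cb = CONTROL_BLOCKS[cb]
        yield action
        cb = next_cb


print(list(islice(run_chain(), 6)))
```

Each pass round the loop produces one CS pulse and one FIFO write, which corresponds to one ADC sample.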

A disadvantage of this approach is that we’re transferring 32 bits in order to get 10 bits of ADC data, which is quite wasteful; if the DMA controller could be persuaded to transfer 16 bits at a time, we’d be able to double the speed, but all my attempts to do this have failed.

However, on the positive side, it does produce an accurately-timed data capture with no CPU intervention:

Raspberry Pi MCP3008 ADC input using DMA

The oscilloscope trace just shows 4 transfers, but the technique works just as well with larger data blocks; here is a trace of 500 samples at 80 Ksample/s

To be honest, the ADC was overclocked to achieve this sample rate; the data sheet implies that the maximum SPI clock should be around 2 MHz with a 3.3V supply voltage, and the actual value I’ve used is 2.55 MHz, so don’t be surprised if this doesn’t work reliably in a different setup.
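
The relationship between SPI clock and sample rate is simple enough to check with a couple of lines of Python. This sketch assumes every sample occupies a fixed number of clock cycles with no gaps between transfers; the function name is hypothetical.

```python
def sample_rate(spi_clock_hz, bits_per_sample):
    """Samples per second when every sample occupies a fixed number of clocks."""
    return spi_clock_hz / bits_per_sample


# MCP3008 via 32-bit DMA words at the overclocked 2.55 MHz SPI clock
print(round(sample_rate(2_550_000, 32)))
```

With 32 clock cycles per sample, a 2.55 MHz clock gives just under 80 Ksample/s.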

ADS7884

In the title of this blog post I promised ‘fast’ data capture, and I don’t think 80 Ksample/s really qualifies as fast; the generally accepted definition is at least 10 Msample/s, but that would require an SPI clock over 100MHz, which is quite unrealistic.

The ADS7884 is a fast single-channel SPI ADC; it can acquire 3 Msample/s, with an SPI clock of 48 MHz, but you do have to be quite careful when dealing with signals this fast; a small amount of stray capacitance or inductance can easily distort the signals so that the transfers are unreliable. All connections must be kept short, especially the clock, power and ground, which ideally should be less than 50 mm (2 inches) long.

The ADC chip is in a very small 6-pin package (0.95 mm pin spacing) so I soldered it to a Dual-In-Line (DIL) adaptor, with 1 uF and 10 nF decoupling capacitors as close to the power & ground pins as possible. This arrangement is then mounted on a solder prototyping board (not stripboard) with very short wires soldered to the RPi I/O connector.

ADS7884 on a prototyping board

You may think that the ADC should still work correctly in a poor layout, if the clock frequency is reduced. This may not be true as, generally speaking, the faster the device, the more sensitive it is to the quality of the external signals. If they aren’t clean enough, the ADC will still malfunction, no matter how slow the clock is.

The device pins are:

1  Supply (3.3V)
2  Ground
3  VIN (voltage to be measured)
4  SCLK (SPI clock)
5  SDO (SPI data output)
6  CS (chip select, active low)

You’ll see that there is no data input line; this is because, unlike the MCP3008, there is nothing to control; just set CS low, toggle the clock 16 times, then set CS high, and you’ll have the data.

This can be demonstrated by a Python program:

import spidev
bus, device = 0, 0
spi = spidev.SpiDev()
spi.open(bus, device)
spi.max_speed_hz = 1000000
spi.mode = 0
msg = [0x00,0x00]
spi.xfer2(msg)
res = spi.xfer2(msg)
val = ((res[0] * 256 + res[1]) >> 4) & 0x3ff
print("%1.3f" % (val * 3.3 / 1024.0))

You’ll see that I’ve discarded the first sample from the ADC; that is because it always returns the data from the previous sample, i.e. it outputs the last sample while obtaining the next.
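
The decoding and the discard-the-first-sample step can be tried without any hardware, using a stand-in for the SPI bus. This is a sketch under the assumption that each 16-bit frame holds 2 leading zero bits, 10 data bits and 4 trailing zero bits (which matches the shift of 4 used in the C decoder in this post); the function names are hypothetical.

```python
def ads7884_decode(frame):
    """Extract the 10-bit sample from a 2-byte frame.

    Assumed layout: 2 leading zero bits, 10 data bits, 4 trailing zero bits.
    """
    word = (frame[0] << 8) | frame[1]
    return (word >> 4) & 0x3FF


def read_sample(xfer):
    """Do two transfers and keep the second, as the first holds stale data.

    `xfer` is any callable with the same shape as spidev's xfer2.
    """
    xfer([0x00, 0x00])                 # result of the previous conversion
    return ads7884_decode(xfer([0x00, 0x00]))


# Stand-in for the SPI bus: replies with 1023, then 212
replies = iter([[0x3F, 0xF0], [0x0D, 0x40]])
print(read_sample(lambda msg: next(replies)))
```

On real hardware, spi.xfer2 would be passed in place of the lambda.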

When creating the DMA software, it is tempting to use the same technique I employed on the MCP3008, but I want really fast sampling, and using a 32-bit word to carry 10 bits of data seems much too wasteful.

Since the SPI transmit line is unused (as the ADS7884 doesn’t have a data input) we can use it for another purpose, so why not use it to drive the chip select line? This means we can drive CS high or low whenever we want, just by setting the transmit data.

So the connections between the ADC and RPi are:

Pin 1: 3.3V supply 
Pin 2: ground 
Pin 3: voltage to be measured
Pin 4: SPI0 clock, GPIO11
Pin 5: SPI0 MISO,  GPIO9
Pin 6: SPI0 MOSI,  GPIO10 (ADC chip select)

If you are driving other SPI devices, the absence of a proper chip select could be a major problem. The solution would be to invert the transmitted data, add a NAND gate between the MOSI line and the ADC chip select, and drive the other NAND input with a spare I/O line, to enable (when high) or disable (when low) the ADC transfers. You’d just need to keep an eye on the additional delay in the CS line, which could alter the phase shift between the transmitted and received data.
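
The gating logic can be checked with a tiny truth table. This Python sketch simply models the NAND gate described above, with `mosi` carrying the inverted transmit data; the function name is hypothetical.

```python
def adc_chip_select(mosi, enable):
    """NAND of the (inverted) MOSI data and the enable line; returns 0 or 1."""
    return int(not (mosi and enable))


for enable in (0, 1):
    for mosi in (0, 1):
        print(enable, mosi, adc_chip_select(mosi, enable))
```

With the enable line low the chip select stays high (inactive) whatever the data; with it high, the gate re-inverts the data so the ADC sees the original pattern.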

ADS7884 software

Driving the chip-select line from the SPI data output makes the software quite a bit simpler: just repeat the same 16-bit pattern on the transmit side, and save the received data in a buffer. This is the code:

// Fetch samples from ADS7884 ADC using DMA
int adc_dma_samples_ads7884(MEM_MAP *mp, int chan, uint16_t *buff, int nsamp)
{
    DMA_CB *cbs=mp->virt;
    uint32_t i, dlen, shift, *txd=(uint32_t *)(cbs+3);
    uint8_t *rxdata=(uint8_t *)(txd+0x10);

    enable_dma(DMA_CHAN_A); // Enable DMA channels
    enable_dma(DMA_CHAN_B);
    dlen = (nsamp+3) * 2;   // 2 bytes/sample, plus 3 dummy samples
    // Control block 0: store Rx data in buffer
    cbs[0].ti = DMA_SRCE_DREQ | (DMA_SPI_RX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_DEST_INC;
    cbs[0].tfr_len = dlen;
    cbs[0].srce_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    cbs[0].dest_ad = MEM_BUS_ADDR(mp, rxdata);
    // Control block 1: continuously repeat last Tx word (pulse CS low)
    cbs[1].srce_ad = MEM_BUS_ADDR(mp, &txd[2]);
    cbs[1].dest_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    cbs[1].tfr_len = 4;
    cbs[1].ti = DMA_DEST_DREQ | (DMA_SPI_TX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_SRCE_INC;
    cbs[1].next_cb = MEM_BUS_ADDR(mp, &cbs[1]);
    // Control block 2: send first 2 Tx words, then switch to CB1 for the rest
    cbs[2].srce_ad = MEM_BUS_ADDR(mp, &txd[0]);
    cbs[2].dest_ad = REG_BUS_ADDR(spi_regs, SPI_FIFO);
    cbs[2].tfr_len = 8;
    cbs[2].ti = DMA_DEST_DREQ | (DMA_SPI_TX_DREQ << 16) | DMA_WAIT_RESP | DMA_CB_SRCE_INC;
    cbs[2].next_cb = MEM_BUS_ADDR(mp, &cbs[1]);
    // DMA request every 4 bytes, panic if 8 bytes
    *REG32(spi_regs, SPI_DC) = (8 << 24) | (4 << 16) | (8 << 8) | 4;
    // Clear SPI length register and Tx & Rx FIFOs, enable DMA
    *REG32(spi_regs, SPI_DLEN) = 0;
    *REG32(spi_regs, SPI_CS) = SPI_TFR_ACT | SPI_DMA_EN | SPI_AUTO_CS | SPI_FIFO_CLR | SPI_CSVAL;
    // Data to be transmitted: 32-bit words, MS bit of LS byte is sent first
    txd[0] = (dlen << 16) | SPI_TFR_ACT;// SPI config: data len & TA setting
    txd[1] = 0xffffffff;                // Set CS high
    txd[2] = 0x01000100;                // Pulse CS low
    // Enable DMA, wait until complete
    start_dma(mp, DMA_CHAN_A, &cbs[0], 0);
    start_dma(mp, DMA_CHAN_B, &cbs[2], 0);
    dma_wait(DMA_CHAN_A);
    // Check whether Rx data has 1 bit delay with respect to Tx
    shift = rxdata[4] & 0x80 ? 3 : 4;
    // Convert raw data to 16-bit unsigned values, ignoring first 3
    for (i=0; i<nsamp; i++)
        buff[i] = ((rxdata[i*2+6]<<8 | rxdata[i*2+7]) >> shift) & 0x3ff;
    return(nsamp);
}

There are a few points that need clarification:

  1. When using DMA, the first word sent to the SPI controller isn’t the data to be transmitted; it is a configuration word that sets the SPI data length, and other parameters. In the MCP3008 implementation I sent it by direct-writing to the FIFO before DMA starts, but at high speed this can cause occasional glitches. So I send the initial SPI configuration using DMA Control Block 2; once that is sent, CB1 performs the main data output.
  2. The phase relationship between the outgoing (chip-select) data and the incoming (ADC value) data isn’t immediately obvious, and as the sampling rate gets faster, this phase relationship changes by 1 bit. To detect this, I first send an all-ones word to keep CS high, then set it low, and check which bit goes low in the received data. This is also done in control block 2, and when that is complete, control block 1 takes over for the remaining transmissions.
  3. The data decoder shifts the raw data depending on the detected phase value, then saves it as 16-bit values in the output array (which has been created in virtual memory using a conventional memory allocation call).
  4. The ADC always returns the result of the previous conversion, so the first sample has to be discarded. Also, the chip select (SPI output) defaults to being low, so the first conversion is usually spurious, and the phase-detection method mentioned above also results in incorrect data. So it is necessary to discard the first 3 samples.
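
The decoding described in points 2 to 4 can be exercised off-target with a short Python sketch. It mirrors the C loop (phase test on byte 4, skip of 6 bytes, shift of 3 or 4); the make_block helper and its test data are made up purely for illustration.

```python
def decode_ads7884_block(rxdata, nsamp):
    """Convert raw received bytes to 10-bit samples, skipping 3 dummy samples."""
    # Mirrors the C test: top bit of byte 4 set selects a shift of 3, else 4
    shift = 3 if rxdata[4] & 0x80 else 4
    samples = []
    for i in range(nsamp):
        word = (rxdata[i * 2 + 6] << 8) | rxdata[i * 2 + 7]
        samples.append((word >> shift) & 0x3FF)
    return samples


def make_block(values, shift):
    """Build made-up test data: 3 dummy samples followed by the given values."""
    raw = bytearray(6)
    raw[4] = 0x80 if shift == 3 else 0x00
    for v in values:
        word = v << shift
        raw += bytes([word >> 8, word & 0xFF])
    return raw


print(decode_ads7884_block(make_block([212, 1023], 4), 2))
```

The same sample values are recovered whichever of the two phase alignments the test data uses.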

Here is an oscilloscope trace when running at 2.6 megasample/s:

Running the code

The software is in 3 files on Github here.

rpi_adc_dma_test.c
rpi_dma_utils.c
rpi_dma_utils.h

The definition at the top of rpi_adc_dma_test.c needs to be edited to select the ADC (MCP3008 or ADS7884); rpi_dma_utils.h must also be changed to reflect the CPU board you are using (RPi 0/1, 2/3, or 4) and the master clock frequency that will be used to determine the SPI clock. Bizarrely, the RPi Zero has a 400 MHz master clock, while the later boards use 250 MHz. If you neglect to make this change when using the Pi Zero, the SPI interface will run 1.6 times too fast; I once made this mistake, and to my surprise the ADC still seemed to work fine, even though the resulting 4.2 MS/s data rate (2.6 MS/s multiplied by 1.6) is way beyond the values in the ADC data sheet. So if you are an overclocking enthusiast, there is plenty of scope for experimentation.

The code is compiled on the Raspberry Pi using gcc, then run with root privileges using ‘sudo’:

gcc -Wall -o rpi_adc_dma_test rpi_adc_dma_test.c rpi_dma_utils.c
sudo ./rpi_adc_dma_test

The usual security warnings apply when running code with root privileges; the operating system won’t protect you against any undesired operations.

The response will depend on which ADC and processor are in use, but should show the current ADC input value, and the corresponding voltage. This is the Pi Zero:

SPI ADC test v0.03
VC mem handle 5, phys 0xde510000, virt 0xb6f00000
SPI frequency 160000 Hz, 10000 sample/s
ADC value 212 = 0.683V
Closing

There are 2 command-line parameters:

-r to set sample rate        e.g. -r 100000 to set 100 Ksample/s
-n to set number of samples  e.g. -n 500 to fetch 500 samples.

The software reports the actual sample rate; on Pi 3 & 4 boards it generally won’t be the same as the requested value, due to the awkward divisor values to scale down 250 MHz into a suitable SPI clock.
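
To see why the achieved rate differs from the request, here is a minimal Python sketch of the divisor rounding. The rule used (smallest even divisor that keeps the SPI clock at or below the target) is my assumption for illustration and may not match rpi_adc_dma_test.c exactly; 16 clock cycles per sample corresponds to the ADS7884.

```python
import math


def spi_divisor(master_hz, spi_hz):
    """Smallest even divisor that keeps the clock at or below spi_hz (assumed rule)."""
    div = max(2, math.ceil(master_hz / spi_hz))
    return div + (div & 1)


def actual_sample_rate(master_hz, requested_rate, bits_per_sample=16):
    """Sample rate actually obtained after rounding the divisor."""
    div = spi_divisor(master_hz, requested_rate * bits_per_sample)
    return master_hz / div / bits_per_sample


for master in (250_000_000, 400_000_000):
    print(master, round(actual_sample_rate(master, 3_000_000)))
```

Under this assumption a request for 3 Msample/s yields about 2.6 Msample/s from a 250 MHz master clock and 2.5 Msample/s from 400 MHz, which is in line with the figures quoted in this post.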

There will be a limit as to how many samples can be gathered, as the raw data is stored in uncached memory. This limit can be increased by allocating more of the RAM to the graphics processor, see the gpu_mem option in config.txt. Alternatively, you could change the code to use cached memory (obtained with mmap) for the raw data buffer, and accept that there will be a delay while the CPU cache is emptied into it.

The output is just a list of voltages, with one sample per line; this can conveniently be redirected to a CSV file for plotting in a spreadsheet, for example:

sudo ./rpi_adc_dma_test -r 3000000 -n 500 > test1.csv

The graphs in this post were actually produced using gnuplot, running on the RPi. It is easy to install using ‘sudo apt install gnuplot’, and here is a sample command line, with the graph it produces; I’ve split the commands into multiple lines for clarity:

gnuplot -e "set term png size 420,240 font 'sans,8'; \
  set title '2.5 Msample/s'; set grid; set key noautotitle; \
  set output 'test1.png'; plot 'test1.csv' every ::4 with lines"

Data display using gnuplot

This capture (of a composite video signal) was done on a Pi ZeroW, proving that you don’t need an expensive processor to perform fast & accurate data acquisition.

Copyright (c) Jeremy P Bentham 2020. Please credit this blog if you use the information or software in it.

Raspberry Pi DMA programming in C

If you need a fast efficient way of moving data around a Raspberry Pi system, Direct Memory Access (DMA) is the preferred option; it works independently of the main processor, doing memory and I/O transfers at high speed.

Programming DMA under Linux can be quite difficult; a device driver is normally used, which needs to be custom-written for a specific application. There are also some Raspberry Pi user-mode programs on the Web that can be run from the command line, but they do need to bypass all the usual memory protections, so require root privileges (e.g. run using ‘sudo’). This means that a minor error in the code can cause random corruption of the processor’s memory, resulting in system instability or a crash.

I couldn’t find any simple explanations and code examples on the Web, so decided to write this blog, documenting all the potential problem areas, with fully commented example code.

I’ll be making extensive use of the Broadcom ‘BCM2835 ARM Peripherals’ document, you can get a copy here. There is also an errata document that is worth reading here.

Address spaces

When creating an executable program, you are (possibly unknowingly) using a ‘virtual’ memory space. The addresses you use are just a temporary fiction, created by the Operating System (OS) for the duration of that program. This allows the OS to make maximum usage of the available RAM; when it gets really crowded, your program may even be pushed out to a ‘swap file’ on disk, so it isn’t even in RAM at all.

This is fine for most user programs, but the DMA controller is a relatively simple piece of hardware, so can not handle the free-for-all nature of virtual memory. It requires everything to be at a known address location, in a memory space known as ‘bus memory’. You may already be familiar with this if you have browsed the BCM2835 document; it describes all the peripherals in terms of their bus addresses.

Accessing peripherals

Raspberry Pi peripheral addressing

Peripherals need to be accessible by the DMA controller (for data transfers) and the user program (for initialisation and configuration). It is easy for the DMA controller to access any peripheral; it just uses the bus address, as given in the documentation. However, the user program runs in its own virtual world, so usually can’t access any peripherals, except through device drivers. To gain direct read/write access, it has to specifically request permission from the OS, by making a call to ‘mmap’ with the physical address of the peripheral we want to access:

// Get virtual memory segment for peripheral regs or physical mem
void *map_segment(void *addr, int size)
{
    int fd;
    void *mem;

    if ((fd = open ("/dev/mem", O_RDWR|O_SYNC|O_CLOEXEC)) < 0)
        FAIL("Error: can't open /dev/mem, run using sudo\n");
    mem = mmap(0, size, PROT_WRITE|PROT_READ, MAP_SHARED, fd, (uint32_t)addr);
    close(fd);
    if (mem == MAP_FAILED)
        FAIL("Error: can't map memory\n");
    return(mem);
}

The procedure is slightly strange, in that you have to give the function a file descriptor for /dev/mem, and this requires root privileges, but on reflection this isn’t surprising, since we could do a lot of damage by making unauthorised access to the peripherals, so the OS needs to know we have the authority to do this. There is another device, namely /dev/gpiomem, that doesn’t require root privileges, but that is confined to the GPIO pins, so we can’t use it for DMA.

The mmap function takes a physical address of the peripheral, and opens a window in virtual memory that our program can access; any read or write to the window is automatically redirected to the peripheral.

I’ve said the mmap function needs a physical address, and you may think this is the same as the bus address, but sadly that isn’t true; there are a total of 3 address spaces: bus, physical and virtual. The conversion between bus & physical is quite easy, but changes depending on the Pi board version: this is the code for Pi 2 or 3, with an example of user-mode GPIO access:

#define PHYS_REG_BASE    0x3F000000
#define GPIO_BASE       (PHYS_REG_BASE + 0x200000)
#define PAGE_SIZE       0x1000

void *virt_gpio_regs;
virt_gpio_regs = map_segment((void *)GPIO_BASE, PAGE_SIZE);

#define VIRT_GPIO_REG(a) ((uint32_t *)((uint32_t)virt_gpio_regs + (a)))
#define GPIO_LEV0       0x34

// Get an I/P pin value
uint8_t gpio_in(int pin)
{
    uint32_t *reg = VIRT_GPIO_REG(GPIO_LEV0) + pin/32;
    return (((*reg) >> (pin % 32)) & 1);
}
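
The register and bit selection in gpio_in can be sketched in Python as follows. Note that in the C code the ‘+ pin/32’ is pointer arithmetic on a 32-bit pointer, so it advances 4 bytes per step; the sketch makes that explicit. The function names are hypothetical.

```python
GPIO_LEV0 = 0x34   # offset of the first pin-level register


def gpio_level_location(pin):
    """Return (register byte offset, bit number) holding a pin's input level."""
    return GPIO_LEV0 + 4 * (pin // 32), pin % 32


def pin_level(reg_value, pin):
    """Extract one pin's level from the 32-bit register value."""
    return (reg_value >> (pin % 32)) & 1


print(gpio_level_location(9), pin_level(0x00000200, 9))
```

Pins 0 to 31 are in the register at offset 0x34, and higher-numbered pins are in the next register at 0x38.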

Accessing memory

Raspberry Pi memory addressing

Memory accesses by the DMA controller are more complicated, as a known fixed address is required. This can be done by mmap; if it is given a zero address, it will allocate a block of memory, and return a virtual pointer to that block:

#define MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE|MAP_LOCKED)
mem = mmap(0, size, PROT_READ|PROT_WRITE, MMAP_FLAGS, fd, 0);

We now have a virtual memory address, which is fine for our user code to access, but can’t be used by the DMA controller, so we need to look up the physical address by consulting the mapping table:

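// Return physical address of virtual memory
void *phys_mem(void *virt)
{
    uint64_t pageInfo = 0;
    // Each page has an 8-byte entry in the page map
    off_t offset = (off_t)(((size_t)virt / PAGE_SIZE) * 8);
    int file = open("/proc/self/pagemap", O_RDONLY);

    if (file < 0 || lseek(file, offset, SEEK_SET) != offset ||
        read(file, &pageInfo, 8) != 8)
        printf("Error: can't find page map for %p\n", virt);
    if (file >= 0)
        close(file);
    // Bits 0-54 of the entry hold the page frame number
    return((void*)(size_t)((pageInfo & ((1ULL << 55) - 1)) * PAGE_SIZE));
}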
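
The arithmetic behind the lookup is easier to see in isolation, so here is a minimal Python sketch of it. It assumes the usual Linux pagemap layout (one 8-byte entry per page, frame number in bits 0 to 54, ‘present’ flag in bit 63); the example entry is made up, and the function names are hypothetical.

```python
PAGE_SIZE = 0x1000
PFN_MASK = (1 << 55) - 1   # bits 0-54 of a pagemap entry hold the frame number


def pagemap_offset(virt):
    """File offset of the 8-byte pagemap entry for a virtual address."""
    return (virt // PAGE_SIZE) * 8


def entry_to_phys(entry):
    """Physical address of the page described by a pagemap entry."""
    return (entry & PFN_MASK) * PAGE_SIZE


entry = (1 << 63) | 0x1E510     # made-up entry: 'present' flag plus a PFN
print(hex(pagemap_offset(0xB6F00000)), hex(entry_to_phys(entry)))
```

Bear in mind that recent kernels report a frame number of zero to unprivileged processes, so the real lookup needs root privileges, like the rest of the code.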

This physical address can be converted to a bus address, and given to the DMA controller, but you will find the end result is quite unreliable; there is a disconnect between the data that the user program is writing, and the values that the DMA controller is reading; the two don’t match up, unless you include very significant delays in the code. This is due to the CPU caching memory accesses.

Memory caching

Raspberry Pi cache areas

Caches are used to temporarily store data values within the CPU, so they can be accessed much faster than main memory. Normally they are completely transparent to the software; the CPU manipulates the cached value of a variable, then the value is written out to main memory after a suitable delay. The length of this delay is dependent on the CPU workload, but may be around 1 second.

This is a major problem when working with DMA; it fetches data and descriptors directly from memory, but if that data was prepared less than a second ago, it may only be in the CPU cache; the memory will still have random values from a previous program, making the DMA controller behave in a totally unpredictable way.

This has the potential to be a very nasty problem, since it will come & go depending on the CPU workload and other programs, so can be really difficult to diagnose. We must be absolutely sure that all the cached data has been written to memory before starting DMA. There are various ways this can be done in theory, for example there is a GCC built-in function:

void __clear_cache(void *start, void *end)

however this seems to be more applicable to instruction than data caches, and I didn’t have any success using it.

Another approach is to use the aliases in bus memory, as shown in the diagram above. Basically the same memory appears 4 times in the memory map, with varying degrees of caching, so if the bus address is Cxxxxxxx hex, the memory is uncached. This gives rise to the method:

Allocate memory using mmap with phys addr 0, get virt addr
Convert the virt addr to phys & bus addr
De-allocate the memory
Allocate memory using mmap with same phys addr, in uncached area

I did quite a bit of experimentation with this method, and wasn’t convinced it always works; it was still necessary to include arbitrary delays in the code, otherwise there was still a tendency to sometimes crash.
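The address conversion in the second step of that method can be sketched as follows. This is only an illustration, assuming the four aliases are selected by the top two bits of the 32-bit bus address, which is consistent with the Cxxxxxxx uncached alias mentioned above; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define BUS_UNCACHED_BASE 0xC0000000u   // Uncached alias (Cxxxxxxx hex)
#define BUS_ALIAS_MASK    0x3FFFFFFFu   // Assumption: top 2 bits select the alias

// Convert a physical RAM address to its uncached bus address
uint32_t phys_to_uncached_bus(uint32_t phys)
{
    return (phys & BUS_ALIAS_MASK) | BUS_UNCACHED_BASE;
}

// Convert any bus alias of RAM back to the physical address
uint32_t bus_to_phys(uint32_t bus)
{
    return bus & BUS_ALIAS_MASK;
}
```

For example, physical address 0x1E000000 would be given to the DMA controller as 0xDE000000.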

Eventually my searches for a completely reliable method of getting uncached memory led me to the VideoCore Mailbox.

VideoCore graphics processor

It may seem strange that I’m tinkering with the graphics processor in order to get uncached memory, but the VideoCore IV Graphics Processing Unit (GPU) controls some primary functionality of the RPi, including the split between main & video memories.

Communication with the GPU is via a confusingly-named ‘mailbox’; this is nothing to do with emails, it is just an ioctl calling mechanism, e.g.

// Open mailbox interface, return file descriptor
int open_mbox(void)
{
   int fd;

   if ((fd = open("/dev/vcio", 0)) < 0)
       FAIL("Error: can't open VC mailbox\n");
   return(fd);
}
// Send message to mailbox, return first response int, 0 if error
uint32_t msg_mbox(int fd, VC_MSG *msgp)
{
    uint32_t ret=0, i;

    // Zero any padding after the data, plus the end tag
    for (i=msgp->dlen/4; i<=msgp->blen/4; i++)
        msgp->uints[i] = 0;
    msgp->len = (msgp->blen + 6) * 4;
    msgp->req = 0;
    if (ioctl(fd, _IOWR(100, 0, void *), msgp) < 0)
        printf("VC IOCTL failed\n");
    else if ((msgp->req&0x80000000) == 0)
        printf("VC IOCTL error\n");
    else if (msgp->req == 0x80000001)
        printf("VC IOCTL partial error\n");
    else
        ret = msgp->uints[0];
    return(ret);
}
// Allocate memory on PAGE_SIZE boundary, return handle
uint32_t alloc_vc_mem(int fd, uint32_t size, VC_ALLOC_FLAGS flags)
{
    VC_MSG msg={.tag=0x3000c, .blen=12, .dlen=12,
        .uints={PAGE_ROUNDUP(size), PAGE_SIZE, flags}};
    return(msg_mbox(fd, &msg));
}
// Lock allocated memory, return bus address
void *lock_vc_mem(int fd, int h)
{
    VC_MSG msg={.tag=0x3000d, .blen=4, .dlen=4, .uints={h}};
    return(h ? (void *)msg_mbox(fd, &msg) : 0);
}

The ioctl call requires a 128-byte structure, consisting of a 20-byte header with the command, plus up to 108 bytes of data; it returns the response in the same structure:

// Mailbox command/response structure
typedef struct {
    uint32_t len,   // Overall length (bytes)
        req,        // Zero for request, 1<<31 for response
        tag,        // Command number
        blen,       // Buffer length (bytes)
        dlen;       // Data length (bytes)
        uint32_t uints[32-5];   // Data (108 bytes maximum)
} VC_MSG __attribute__ ((aligned (16)));
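As a quick check on that layout, the sketch below repeats the structure, verifies its size at compile time, and fills in an 'allocate memory' request without sending it. The PAGE_ROUNDUP definition and the build_alloc_msg helper are assumptions made for illustration; the original source may define them differently.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 0x1000
#endif
// Round a byte count up to a whole number of pages (illustrative definition)
#define PAGE_ROUNDUP(n) (((n) + PAGE_SIZE - 1) & ~(uint32_t)(PAGE_SIZE - 1))

// Mailbox command/response structure, as in the text
typedef struct {
    uint32_t len, req, tag, blen, dlen;
    uint32_t uints[32-5];
} VC_MSG __attribute__ ((aligned (16)));

// 5 header words plus 27 data words make 32 words in total
_Static_assert(sizeof(VC_MSG) == 128, "VC_MSG should be 128 bytes");
_Static_assert(offsetof(VC_MSG, uints) == 20, "header should be 20 bytes");

// Fill in an 'allocate memory' request (tag 0x3000c); nothing is sent
void build_alloc_msg(VC_MSG *msgp, uint32_t size, uint32_t flags)
{
    memset(msgp, 0, sizeof(*msgp));
    msgp->tag = 0x3000c;
    msgp->blen = msgp->dlen = 12;           // Three 32-bit arguments
    msgp->uints[0] = PAGE_ROUNDUP(size);    // Size in bytes
    msgp->uints[1] = PAGE_SIZE;             // Alignment
    msgp->uints[2] = flags;                 // Allocation flags
}
```

A request for 5000 bytes is therefore rounded up to 8192 bytes, i.e. two pages.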

As you can see, the mailbox functions are quite easy to use; for details of other functionality, see the documentation.

So at last we have a reliable source of uncached memory; for simplicity my software just allocates a single block, which is then subdivided into the control blocks and data needed by the DMA controller.

Code optimisation

One final issue needs to be mentioned in this context; if compiler optimisation is enabled (e.g. gcc command line options -O2 or -O3) then some of the memory accesses may be optimised out, leading to confusing results. For example, you may be using DMA to transfer a data value, and are polling the destination in a tight loop to see when the transfer is complete.

int *destp = ...    // Pointer to somewhere in uncached memory
*destp = 0;
while (*destp == 0) // While DMA data not received..
    sleep(1);       // ..sleep

On the first poll cycle, the code will read the memory, but subsequent read cycles may be optimised out, so the CPU just re-uses the same data value without re-checking memory.

The solution is simple: declare the variable as volatile, e.g.

volatile int *destp = ...

This ensures that the compiler generates a real memory access for every read, rather than re-using a previously fetched value.
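A polling loop of this kind is often wrapped in a small helper with an upper limit on the number of reads, so that a DMA transfer that never completes can't hang the program. The following is a minimal sketch; poll_nonzero and its max_polls parameter are hypothetical, not part of the original code.

```c
#include <stdint.h>

// Poll a location that is written by hardware, giving up after max_polls
// reads; returns the value seen, or 0 on timeout
uint32_t poll_nonzero(volatile uint32_t *destp, uint32_t max_polls)
{
    uint32_t val = 0;

    while (max_polls-- && (val = *destp) == 0)
        ;   // The volatile qualifier forces a real read on each pass
    return val;
}
```

Without the volatile qualifier on destp, an optimising compiler would be free to read the location once and spin on the stale value.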

DMA controller

The primary configuration mechanism for the DMA controller is a Control Block (CB). This fully defines the required transfer, including source & destination addresses, data lengths, and the like:

// DMA control block (must be 32-byte aligned)
typedef struct {
    uint32_t ti,    // Transfer info
        srce_ad,    // Source address
        dest_ad,    // Destination address
        tfr_len,    // Transfer length
        stride,     // Transfer stride
        next_cb,    // Next control block
        debug,      // Debug register
        unused;
} DMA_CB __attribute__ ((aligned(32)));
#define DMA_CB_DEST_INC (1<<4)
#define DMA_CB_SRC_INC  (1<<8)

The next_cb address means that you can create a chain of CBs; the controller will work through them all until it encounters a next_cb value of zero.
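The chaining behaviour can be tried out without any hardware by using a small software model of the controller. In this sketch, which is purely illustrative, an ordinary byte array stands in for the uncached memory block, and every address in a control block is a byte offset into that array rather than a real bus address; sim_dma_chain is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

// DMA control block, as in the text
typedef struct {
    uint32_t ti, srce_ad, dest_ad, tfr_len, stride, next_cb, debug, unused;
} DMA_CB __attribute__ ((aligned(32)));

// Software model of the controller: fetch a CB, copy tfr_len bytes from
// source to destination, then follow next_cb until it is zero.
// All addresses are offsets into 'mem'; there is no bounds checking, and
// offset 0 can't be used as a next_cb target since zero ends the chain.
// Returns the number of control blocks executed.
int sim_dma_chain(uint8_t *mem, uint32_t first_cb, int max_cbs)
{
    uint32_t cb_ad = first_cb;
    int n = 0;

    while (n < max_cbs)
    {
        DMA_CB cb;

        memcpy(&cb, mem + cb_ad, sizeof(cb));       // Fetch control block
        memmove(mem + cb.dest_ad, mem + cb.srce_ad, cb.tfr_len);
        n++;
        if (cb.next_cb == 0)                        // Zero ends the chain
            break;
        cb_ad = cb.next_cb;
    }
    return n;
}
```

The max_cbs limit is only there to stop the model running forever if the blocks are linked in a loop, which the real controller would happily do.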

1st example: memory-to-memory transfer

We’ll start with a really simple operation: a memory-to-memory transfer.

// DMA memory-to-memory test
int dma_test_mem_transfer(void)
{
    DMA_CB *cbp = virt_dma_mem;
    char *srce = (char *)(cbp+1);
    char *dest = srce + 0x100;

    strcpy(srce, "memory transfer OK");
    memset(cbp, 0, sizeof(DMA_CB));
    cbp->ti = DMA_CB_SRC_INC | DMA_CB_DEST_INC;
    cbp->srce_ad = BUS_DMA_MEM(srce);
    cbp->dest_ad = BUS_DMA_MEM(dest);
    cbp->tfr_len = strlen(srce) + 1;
    start_dma(cbp);
    usleep(10);
#if DEBUG
    disp_dma();
#endif
    printf("DMA test: %s\n", dest[0] ? dest : "failed");
    return(dest[0] != 0);
}

The variable virt_dma_mem is pointing to an area of uncached memory, which has been used to house a control block, and the source & destination arrays. The DMA controller starts with that control block, and after a brief delay, the destination is checked to see if the data has been transferred.

I originally thought that the DMA transfer would be so fast that no delay is required, but this isn’t true; some delay is necessary, though even a nominally zero delay, i.e. usleep(0), is sufficient, so the 10 microseconds I’ve used is more than adequate.

2nd example: memory-to-GPIO transfer

Assuming the above example works, it is time to try writing to a peripheral, namely a GPIO pin, that can be connected to an LED to provide a simple flashing indication.

On most CPUs you’d write 1 or 0 to a GPIO register to turn the LED on or off, but the Broadcom hardware doesn’t work that way; there is one register to turn it on, and another to turn it off. So we just need to flip the register address between DMA transfers, and the LED will flash.

// DMA memory-to-GPIO test: flash LED
void dma_test_led_flash(int pin)
{
    DMA_CB *cbp=virt_dma_mem;
    uint32_t *data = (uint32_t *)(cbp+1), n;

    printf("DMA test: flashing LED on GPIO pin %u\n", pin);
    memset(cbp, 0, sizeof(DMA_CB));
    *data = 1 << pin;
    cbp->tfr_len = 4;
    cbp->srce_ad = BUS_DMA_MEM(data);
    for (n=0; n<16; n++)
    {
        usleep(200000);
        cbp->dest_ad = BUS_GPIO_REG(n&1 ? GPIO_CLR0 : GPIO_SET0);
        start_dma(cbp);
    }
}

As before, the CB and source data are placed in uncached memory, but the transfer destination is either the ‘set’ or ‘clear’ GPIO registers.

After each on/off transition, the DMA stops, and needs to be restarted with the modified control block.
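The reason the same data word can be used for both the 'on' and 'off' transfers is easier to see with a software model of the set and clear registers. This is a sketch only: the register offsets 0x1c and 0x28 are those given for GPSET0 and GPCLR0 in the BCM2835 peripherals documentation, and the GPIO_MODEL type and function names are hypothetical.

```c
#include <stdint.h>

#define GPIO_SET0   0x1c    // Write 1 bits here to drive pins high
#define GPIO_CLR0   0x28    // Write 1 bits here to drive pins low

// Software model of one bank of 32 GPIO pins
typedef struct {
    uint32_t level;
} GPIO_MODEL;

// Model a register write; zero bits in 'val' leave other pins untouched
void gpio_model_write(GPIO_MODEL *gp, uint32_t reg, uint32_t val)
{
    if (reg == GPIO_SET0)
        gp->level |= val;
    else if (reg == GPIO_CLR0)
        gp->level &= ~val;
}

// Get a pin value from the model, in the same way as gpio_in()
uint8_t gpio_model_in(const GPIO_MODEL *gp, int pin)
{
    return (gp->level >> pin) & 1;
}
```

Writing 1 << 21 to the 'set' register turns pin 21 on, and writing the identical value to the 'clear' register turns it off again, without disturbing any other pin.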

3rd example: timed triggering

The previous 2 examples are useful demonstrations that DMA is working, but have little practical application since they require significant CPU intervention to keep them running. What we really need is a way of triggering the DMA cycles from a timer, so the transfers carry on automatically while the CPU is doing other tasks.

Unlike most microcontrollers, the Broadcom hardware has no real timers, but it does have a Pulse-Width Modulation (PWM) controller, that can be used instead; it can be programmed to request a data update on a regular basis, i.e. issue a DMA request, and once the update data is received, wait for a fixed time before issuing another request.

That gives us a regular stream of DMA requests at specific intervals, but how do we use that to toggle an LED pin? The answer is that we create 4 control blocks in an endless circular loop:

CB0: clear LED
CB1: write data to PWM controller
CB2: set LED
CB3: write data to PWM controller

You need to bear in mind that the DMA controller will continue processing CBs while its request line is asserted. If we didn’t have CB1 & 3, the DMA cycles would run continuously, toggling the LED very fast; this isn’t recommended, since it uses up a lot of memory bandwidth, though on the few occasions I’ve done it, the system seemed to cope quite well, and didn’t crash. With the above arrangement, the controller will execute CB0 & 1, then delay, CB2 & 3, another delay, CB0 & 1, and so on.

// PWM clock frequency and range (FREQ/RANGE = LED flash freq)
#define PWM_FREQ        100000
#define PWM_RANGE       20000

// DMA trigger test: flash LED using PWM trigger
void dma_test_pwm_trigger(int pin)
{
    DMA_CB *cbs=virt_dma_mem;
    uint32_t n, *pindata=(uint32_t *)(cbs+4), *pwmdata=pindata+1;

    printf("DMA test: PWM trigger, ctrl-C to exit\n");
    memset(cbs, 0, sizeof(DMA_CB)*4);
    // Transfers are triggered by PWM request
    cbs[0].ti = cbs[1].ti = cbs[2].ti = cbs[3].ti = (1 << 6) | (DMA_PWM_DREQ << 16);
    // Control block 0 and 2: clear & set LED pin, 4-byte transfer
    cbs[0].srce_ad = cbs[2].srce_ad = BUS_DMA_MEM(pindata);
    cbs[0].dest_ad = BUS_GPIO_REG(GPIO_CLR0);
    cbs[2].dest_ad = BUS_GPIO_REG(GPIO_SET0);
    cbs[0].tfr_len = cbs[2].tfr_len = 4;
    *pindata = 1 << pin;
    // Control block 1 and 3: update PWM FIFO (to clear DMA request)
    cbs[1].srce_ad = cbs[3].srce_ad = BUS_DMA_MEM(pwmdata);
    cbs[1].dest_ad = cbs[3].dest_ad = BUS_PWM_REG(PWM_FIF1);
    cbs[1].tfr_len = cbs[3].tfr_len = 4;
    *pwmdata = PWM_RANGE / 2;
    // Link control blocks 0 to 3 in endless loop
    for (n=0; n<4; n++)
        cbs[n].next_cb = BUS_DMA_MEM(&cbs[(n+1)%4]);
    // Enable PWM with data threshold 1, and DMA
    init_pwm(PWM_FREQ);
    *VIRT_PWM_REG(PWM_DMAC) = PWM_DMAC_ENAB|1;
    start_pwm();
    start_dma(&cbs[0]);
    // Nothing to do while LED is flashing
    sleep(4);
}
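The sequencing of the four control blocks (CB0 & 1, delay, CB2 & 3, delay) can be checked with a simple software model. It assumes, as described above, that each PWM request releases the chain, and that the request is cleared as soon as a control block writes to the PWM FIFO; the function name and parameters are hypothetical.

```c
#include <stdint.h>

#define NUM_CBS 4

// Software model of the 4-CB loop: for each PWM request, run the chain
// until a CB writes to the PWM FIFO. The current CB index and LED state
// are updated in place; returns the number of CBs executed.
int sim_pwm_paced(int requests, int *cb_idx, int *led)
{
    int executed = 0;

    while (requests-- > 0)
    {
        int fifo_written = 0;

        while (!fifo_written)
        {
            if (*cb_idx == 0)
                *led = 0;               // CB0: clear LED
            else if (*cb_idx == 2)
                *led = 1;               // CB2: set LED
            else
                fifo_written = 1;       // CB1 or CB3: write to PWM FIFO
            *cb_idx = (*cb_idx + 1) % NUM_CBS;
            executed++;
        }
    }
    return executed;
}
```

Each request executes exactly two control blocks, so the LED changes state once per PWM request.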

PWM clock setting

Before leaving the code, it is worth mentioning another area of difficulty: setting the clock frequency of the PWM controller. I arbitrarily chose 100 kHz, since that could be divided by 20,000 to flash the LED at 5 Hz.

The recommended way of setting the clock is using the VideoCore mailbox:

void set_vc_clock(int fd, int id, uint32_t freq)
{
    VC_MSG msg1={.tag=0x38001, .blen=8, .dlen=8, .uints={id, 1}};
    VC_MSG msg2={.tag=0x38002, .blen=12, .dlen=12, .uints={id, freq, 0}};
    msg_mbox(fd, &msg1);
    msg_mbox(fd, &msg2);
}

This method works sometimes, but not always; it can take several attempts to change from one frequency to another, and I don’t understand why.

A fall-back option is to write to the (undocumented) PWM clock registers, which is the method I use by default:

#define USE_VC_CLOCK_SET 0

#if USE_VC_CLOCK_SET
    set_vc_clock(mbox_fd, PWM_CLOCK_ID, freq);
#else
    int divi=(CLOCK_KHZ*1000) / freq;                       // Integer divisor
    *VIRT_CLK_REG(CLK_PWM_CTL) = CLK_PASSWD | (1 << 5);     // Kill the clock
    while (*VIRT_CLK_REG(CLK_PWM_CTL) & (1 << 7)) ;         // Wait until not busy
    *VIRT_CLK_REG(CLK_PWM_DIV) = CLK_PASSWD | (divi << 12); // Set integer divisor
    *VIRT_CLK_REG(CLK_PWM_CTL) = CLK_PASSWD | 6 | (1 << 4); // Source 6 (PLLD), enable
    while ((*VIRT_CLK_REG(CLK_PWM_CTL) & (1 << 7)) == 0) ;  // Wait until running
#endif
    usleep(100);

The PWM controller seems to be very sensitive to changes in its clock frequency, so before any change, it is essential to disable it, and wait some time before re-enabling. On one occasion, it locked up completely and just wouldn’t work until I re-powered the board, so care is needed when modifying the clocking code – it is certainly an area that merits further investigation.
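Given that sensitivity, it may be worth checking the divisor before writing it to the hardware. The sketch below shows the arithmetic, assuming a 250 MHz source clock (the CLOCK_KHZ value here is an assumption and depends on the board) and the 12-bit integer divisor field used by the BCM2835 clock registers; the function names are hypothetical.

```c
#include <stdint.h>

#define CLK_PASSWD  0x5a000000u // Password in the top byte of each write
#define CLOCK_KHZ   250000u     // Assumption: source clock rate in kHz

// Integer divisor for the requested PWM clock, or 0 if it does not fit
// in the 12-bit integer field of the divisor register
uint32_t pwm_clock_divi(uint32_t freq)
{
    uint32_t divi;

    if (freq == 0)
        return 0;
    divi = (CLOCK_KHZ * 1000u) / freq;
    return (divi >= 1 && divi <= 0xfff) ? divi : 0;
}

// Value to write to the divisor register (integer part in bits 12 to 23)
uint32_t pwm_clock_div_word(uint32_t divi)
{
    return CLK_PASSWD | (divi << 12);
}

// Rate at which the PWM controller issues DMA requests
uint32_t pwm_request_hz(uint32_t pwm_freq, uint32_t range)
{
    return range ? pwm_freq / range : 0;
}
```

With these assumptions, a 100 kHz PWM clock needs a divisor of 2500, and a range of 20,000 then gives 5 requests per second, matching the figures used in the test.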

Running the code

There is a single source file rpi_dma_test.c on GitHub here.

You’ll need to change the definition at the top depending on the RPi version you are using:

//#define PHYS_REG_BASE  0x20000000  // Pi Zero or 1
#define PHYS_REG_BASE    0x3F000000  // Pi 2 or 3
//#define PHYS_REG_BASE  0xFE000000  // Pi 4
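The base address matters because every peripheral register is addressed relative to it, both by the CPU (after mapping the physical address) and by the DMA controller, which sees the peripherals at a different bus address. As a rough sketch, assuming the 0x7E000000 peripheral bus base and the 0x200000 GPIO offset given in the BCM2835 peripherals documentation, the conversion looks like this; reg_phys_to_bus is a hypothetical helper.

```c
#include <stdint.h>

#define PHYS_REG_BASE   0x3F000000u // Pi 2 or 3, as in the definition above
#define BUS_REG_BASE    0x7E000000u // Peripheral base as seen by DMA
#define GPIO_OFFSET     0x200000u   // GPIO block within the peripherals
#define GPIO_SET0       0x1c        // 'Set' register within the GPIO block

// Convert the physical address of a peripheral register to the bus
// address needed in a DMA control block
uint32_t reg_phys_to_bus(uint32_t phys)
{
    return phys - PHYS_REG_BASE + BUS_REG_BASE;
}
```

So with the Pi 2 or 3 setting, the GPIO 'set' register is at physical address 0x3F20001C, and would be given to the DMA controller as 0x7E20001C.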

Then the code can be compiled with GCC, and run with ‘sudo’:

gcc -Wall -o rpi_dma_test rpi_dma_test.c
sudo ./rpi_dma_test

You can optionally compile with -O2 or -O3 optimisation.

To view the results you need to connect an LED (with a 330 ohm resistor in series) to ground and LED_PIN, which I’ve set to GPIO pin 21. This is at the far end of the I/O connector, conveniently next to a ground pin.

Raspberry Pi LED connection

The positive leg of the LED goes to the output pin, which is nearest the camera.

The usual warnings apply when running a program with root privileges; there is a security risk, since it has unrestricted access to all system functions.

To see DMA being used for data acquisition, take a look at my next post.

Copyright (c) Jeremy P Bentham 2020. Please credit this blog if you use the information or software in it.