Raspberry Pi DMA programming in C

If you need a fast efficient way of moving data around a Raspberry Pi system, Direct Memory Access (DMA) is the preferred option; it works independently of the main processor, doing memory and I/O transfers at high speed.

Programming DMA under Linux can be quite difficult; a device driver is normally used, which needs to be custom-written for a specific application. There are also some Raspberry Pi user-mode programs on the Web that can be run from the command line, but they do need to bypass all the usual memory protections, so require root privileges (e.g. run using ‘sudo’). This means that a minor error in the code can cause random corruption of the processor’s memory, resulting in system instability or a crash.

I couldn’t find any simple explanations and code examples on the Web, so decided to write this blog, documenting all the potential problem areas, with fully commented example code.

I’ll be making extensive use of the Broadcom ‘BCM2835 ARM Peripherals’ document, you can get a copy here. There is also an errata document that is worth reading here.

Address spaces

When creating an executable program, you are (possibly unknowingly) using a ‘virtual’ memory space. The addresses you use are just a temporary fiction, created by the Operating System (OS) for the duration of that program. This allows the OS to make maximum usage of the available RAM; when it gets really crowded, your program may even be pushed out to a ‘swap file’ on disk, so it isn’t even in RAM at all.

This is fine for most user programs, but the DMA controller is a relatively simple piece of hardware, so can not handle the free-for-all nature of virtual memory. It requires everything to be at a known address location, in a memory space known as ‘bus memory’. You may already be familiar with this if you have browsed the BCM2835 document; it describes all the peripherals in terms of their bus addresses.

Accessing peripherals

Peripherals need to be accessible by the DMA controller (for data transfers) and the user program (for initialisation and configuration). It is easy for the DMA controller to access any peripheral; it just uses the bus address, as given in the documentation. However, the user program runs in its own virtual world, so usually can’t access any peripherals, except through device drivers. To gain direct read/write access, it has to specifically request permission from the OS, by making a call to ‘mmap’ with the physical address of the peripheral we want to access:

// Get virtual memory segment for peripheral regs or physical mem
void *map_segment(void *addr, int size)
{
    int fd;
    void *mem;

    if ((fd = open ("/dev/mem", O_RDWR|O_SYNC|O_CLOEXEC)) < 0)
        FAIL("Error: can't open /dev/mem, run using sudo\n");
    mem = mmap(0, size, PROT_WRITE|PROT_READ, MAP_SHARED, fd, (uint32_t)addr);
    close(fd);
    return(mem);
}

The procedure is slightly strange, in that you have to give the function a file descriptor for /dev/mem, and this requires root privileges, but on reflection this isn’t surprising, since we could do a lot of damage by making unauthorised access to the peripherals, so the OS needs to know we have the authority to do this. There is another descriptor, namely /dev/iomem, that doesn’t require root privileges, but that is confined to the GPIO pins, so we can’t use it for DMA.

The mmap function takes a physical address of the peripheral, and opens a window in virtual memory that our program can access; any read or write to the window is automatically redirected to the peripheral.

I’ve said the mmap function needs a physical address, and you may think this is the same as the bus address, but sadly that isn’t true; there are a total of 3 address spaces: bus, physical and virtual. The conversion between bus & physical is quite easy, but changes depending on the Pi board version: this is the code for Pi 2 or 3, with an example of user-mode GPIO access:

#define PHYS_REG_BASE    0x3F000000
#define GPIO_BASE       (PHYS_REG_BASE + 0x200000)
#define PAGE_SIZE       0x1000

void *virt_gpio_regs
virt_gpio_regs = map_segment((void *)GPIO_BASE, PAGE_SIZE);

#define VIRT_GPIO_REG(a) ((uint32_t *)((uint32_t)virt_gpio_regs + (a)))
#define GPIO_LEV0       0x34

// Get an I/P pin value
uint8_t gpio_in(int pin)
{
    uint32_t *reg = VIRT_GPIO_REG(GPIO_LEV0) + pin/32;
    return (((*reg) >> (pin % 32)) & 1);
}

Accessing memory

Memory accesses by the DMA controller are a more complicated, as a known fixed address is required. This can be done by mmap; if it is given a zero address, it will allocate a block of memory, and return a virtual pointer to that block:

#define MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE|MAP_LOCKED)
mem = mmap(0, size, MMAP_FLAGS, fd, 0);

We now have a virtual memory address, which is fine for our user code to access, but can’t be used by the DMA controller, so we need to look up the physical address by consulting the mapping table:

// Return physical address of virtual memory
void *phys_mem(void *virt)
{
    uint64_t pageInfo;
    int file = open("/proc/self/pagemap", 'r');
    
    if (lseek(file, (((size_t)virt)/PAGE_SIZE)*8, SEEK_SET) != (size_t)virt>>9)
        printf("Error: can't find page map for %p\n", virt);
    read(file, &pageInfo, 8);
    close(file);
    return((void*)(size_t)((pageInfo*PAGE_SIZE)));
}

This physical address can be converted to a bus address, and given to the DMA controller, but you will find the end result is quite unreliable; there is a disconnect between the data that the user program is writing, and the values that the DMA controller is reading; the two don’t match up, unless you include very significant delays in the code. This is due to the CPU caching memory accesses.

Memory caching

Caches are used to temporarily store data values within the CPU, so they can be accessed much faster than main memory. Normally they are completely transparent to the software; the CPU manipulates the cached value of a variable, then the value is written out to main memory after a suitable delay. The length of this delay is dependant on the CPU workload, but may be around 1 second.

This is a major problem when working with DMA; it fetches data and descriptors directly from memory, but if that data was prepared less than a second ago, it may only be in the CPU cache; the memory will still have random values from a previous program, making the DMA controller behave in a totally unpredictable way.

This has the potential to be very nasty problem, since it will come & go depending on the CPU workload and other programs, so can be really difficult to diagnose. We must be absolutely sure that all the cached data has been written to memory before starting DMA. There are various ways this can be done in theory, for example there is a GCC command:

void __clear_cache(void *start, void *end)

however this seems to be more applicable to instruction than data caches, and I didn’t have any success using it.

Another approach is to use the aliases in bus memory, as shown in the diagram above. Basically the same memory appears 4 times in the memory map, with varying degrees of caching, so if the bus address is Cxxxxxxx hex, the memory is uncached. This gives rise to the method:

Allocate memory using mmap with phys addr 0, get virt addr
Convert the virt addr to phys & bus addr
De-allocate the memory
Allocate memory using mmap with same phys addr, in uncached area

I did quite a bit of experimentation with this method, and wasn’t convinced it always works; it was still necessary to include arbitrary delays in the code, otherwise there was still a tendency to sometimes crash.

Eventually my searches for a completely reliable method of getting uncached memory lead me to the VideoCore Mailbox.

VideoCore graphics processor

It may seem strange that I’m tinkering with the graphics processor in order to get uncached memory, but the VideoCore IV Graphics Processing Unit (GPU) controls some primary functionality of the RPi, including the split between main & video memories.

Communication with the GPU is via a confusingly-named ‘mailbox’; this is nothing to do with emails, it is just an ioctl calling mechanism, e.g.

// Open mailbox interface, return file descriptor
int open_mbox(void)
{
   int fd;

   if ((fd = open("/dev/vcio", 0)) < 0)
       FAIL("Error: can't open VC mailbox\n");
   return(fd);
}
// Send message to mailbox, return first response int, 0 if error
uint32_t msg_mbox(int fd, VC_MSG *msgp)
{
    uint32_t ret=0, i;

    for (i=msgp->dlen/4; i<=msgp->blen/4; i+=4)
        msgp->uints[i++] = 0;
    msgp->len = (msgp->blen + 6) * 4;
    msgp->req = 0;
    if (ioctl(fd, _IOWR(100, 0, void *), msgp) < 0)
        printf("VC IOCTL failed\n");
    else if ((msgp->req&0x80000000) == 0)
        printf("VC IOCTL error\n");
    else if (msgp->req == 0x80000001)
        printf("VC IOCTL partial error\n");
    else
        ret = msgp->uints[0];
    return(ret);
}
// Allocate memory on PAGE_SIZE boundary, return handle
uint32_t alloc_vc_mem(int fd, uint32_t size, VC_ALLOC_FLAGS flags)
{
    VC_MSG msg={.tag=0x3000c, .blen=12, .dlen=12,
        .uints={PAGE_ROUNDUP(size), PAGE_SIZE, flags}};
    return(msg_mbox(fd, &msg));
}
// Lock allocated memory, return bus address
void *lock_vc_mem(int fd, int h)
{
    VC_MSG msg={.tag=0x3000d, .blen=4, .dlen=4, .uints={h}};
    return(h ? (void *)msg_mbox(fd, &msg) : 0);
}

The ioctl call requires a 108-byte structure with the command plus data; it returns the response in the same structure:

// Mailbox command/response structure
typedef struct {
    uint32_t len,   // Overall length (bytes)
        req,        // Zero for request, 1<<31 for response
        tag,        // Command number
        blen,       // Buffer length (bytes)
        dlen;       // Data length (bytes)
        uint32_t uints[32-5];   // Data (108 bytes maximum)
} VC_MSG __attribute__ ((aligned (16)));

As you can see, the mailbox functions are quite easy to use; for details of other functionality, see the documentation.

So at last we have a reliable source of uncached memory; for simplicity my software just allocates a single block, which is then subdivided into the control blocks and data needed by the DMA controller.

Code optimisation

One final issue needs to be mentioned in this context; if compiler optimisation is enabled (e.g. gcc command line options -O2 or -O3) then some of the memory accesses may be optimised out, leading to confusing results. For example, you may be using DMA to transfer a data value, and are polling the destination in a tight loop to see when the transfer is complete.

int *destp = ...    // Pointer to somewhere in uncached memory
*destp = 0;
while (*desp == 0)  // While DMA data not received..
    sleep(1);       // ..sleep

On the first poll cycle, the code will read the memory, but subsequent read cycles may be optimised out, so the CPU just re-uses the same data value without re-checking memory.

The solution is simple: declare the variable as volatile, e.g.

volatile int *destp = ...

This ensures that the CPU will always access the memory on every read cycle.

DMA controller

The primary configuration mechanism for the DMA controller is a Control Block (CB). This fully defines the required transfer, including source & destination addresses, data lengths, and the like:

// DMA control block (must be 32-byte aligned)
typedef struct {
    uint32_t ti,    // Transfer info
        srce_ad,    // Source address
        dest_ad,    // Destination address
        tfr_len,    // Transfer length
        stride,     // Transfer stride
        next_cb,    // Next control block
        debug,      // Debug register
        unused;
} DMA_CB __attribute__ ((aligned(32)));
#define DMA_CB_DEST_INC (1<<4)
#define DMA_CB_SRC_INC  (1<<8)

The next_cb address means that you can create a chain of CBs; the controller will work through them all until it encounters a next_cb value of zero.

1st example: memory-to-memory transfer

We’ll start with a really simple operation: a memory-to-memory transfer.

// DMA memory-to-memory test
int dma_test_mem_transfer(void)
{
    DMA_CB *cbp = virt_dma_mem;
    char *srce = (char *)(cbp+1);
    char *dest = srce + 0x100;

    strcpy(srce, "memory transfer OK");
    memset(cbp, 0, sizeof(DMA_CB));
    cbp->ti = DMA_CB_SRC_INC | DMA_CB_DEST_INC;
    cbp->srce_ad = BUS_DMA_MEM(srce);
    cbp->dest_ad = BUS_DMA_MEM(dest);
    cbp->tfr_len = strlen(srce) + 1;
    start_dma(cbp);
    usleep(10);
#if DEBUG
    disp_dma();
#endif
    printf("DMA test: %s\n", dest[0] ? dest : "failed");
    return(dest[0] != 0);
}

The variable virt_dma_mem is pointing to an area of uncached memory, which has been used to house a control block, and the source & destination arrays. The DMA controller starts with that control block, and after a brief delay, the destination is checked to see if the data has been transferred.

I originally thought that the DMA transfer would be so fast that no delay is required, but this isn’t true; some delay is necessary, but even a zero delay is sufficient, i.e. usleep(0), so the 10 microseconds I’ve used is more than adequate.

2nd example: memory-to-GPIO transfer

Assuming the above example works, it is time to try writing to a peripheral, namely a GPIO pin, that can be connected to an LED to provide a simple flashing indication.

On most CPUs you’d write 1 or 0 to a GPIO register to turn the LED on or off, but the Broadcom hardware doesn’t work that way; there is on register to turn it on, and another to turn it off. So we just need to flip the register address between DMA transfers, and the LED will flash.

// DMA memory-to-GPIO test: flash LED
void dma_test_led_flash(int pin)
{
    DMA_CB *cbp=virt_dma_mem;
    uint32_t *data = (uint32_t *)(cbp+1), n;

    printf("DMA test: flashing LED on GPIO pin %u\n", pin);
    memset(cbp, 0, sizeof(DMA_CB));
    *data = 1 << pin;
    cbp->tfr_len = 4;
    cbp->srce_ad = BUS_DMA_MEM(data);
    for (n=0; n<16; n++)
    {
        usleep(200000);
        cbp->dest_ad = BUS_GPIO_REG(n&1 ? GPIO_CLR0 : GPIO_SET0);
        start_dma(cbp);
    }
}

As before, the CB and source data are placed in uncached memory, but the transfer destination is either the ‘set’ or ‘clear’ GPIO registers.

After each on/off transition, the DMA stops, and needs to be restarted with the modified control block.

3rd example: timed triggering

The previous 2 examples are useful demonstrations that DMA is working, but have little practical application since they require significant CPU intervention to keep them running. What we really need is a way of triggering the DMA cycles from a timer, so the transfers carry on automatically while the CPU is doing other tasks.

Unlike most microcontrollers, the Broadcom hardware has no real timers, but it does have a Pulse-Width Modulation (PWM) controller, that can be used instead; it can be programmed to request a data update on a regular basis, i.e. issue a DMA request, and once the update data is received, wait for a fixed time before issuing another request.

That gives us a regular stream of DMA requests at specific intervals, but how do we use that to toggle an LED pin? The answer is that we create 4 control blocks in an endless circular loop:

CB0: clear LED
CB1: write data to PWM controller
CB2: set LED
CB3: write data to PWM controller

You need to bear in mind that the DMA controller will continue processing CBs while its request line is asserted. If we didn’t have CB1 & 3, the DMA cycles would be running continuously, and toggling the LED very fast; this isn’t recommended, since it does use up a lot of memory bandwidth, but on the few occasions I’ve done that, the system seemed to cope quite well, and didn’t crash. With the above arrangement, the controller will execute CB0 & 1, then delay, CB2 & 3, another delay, CB 0 & 1, and so on.

// PWM clock frequency and range (FREQ/RANGE = LED flash freq)
#define PWM_FREQ        100000
#define PWM_RANGE       20000

// DMA trigger test: fLash LED using PWM trigger
void dma_test_pwm_trigger(int pin)
{
    DMA_CB *cbs=virt_dma_mem;
    uint32_t n, *pindata=(uint32_t *)(cbs+4), *pwmdata=pindata+1;

    printf("DMA test: PWM trigger, ctrl-C to exit\n");
    memset(cbs, 0, sizeof(DMA_CB)*4);
    // Transfers are triggered by PWM request
    cbs[0].ti = cbs[1].ti = cbs[2].ti = cbs[3].ti = (1 << 6) | (DMA_PWM_DREQ << 16);
    // Control block 0 and 2: clear & set LED pin, 4-byte transfer
    cbs[0].srce_ad = cbs[2].srce_ad = BUS_DMA_MEM(pindata);
    cbs[0].dest_ad = BUS_GPIO_REG(GPIO_CLR0);
    cbs[2].dest_ad = BUS_GPIO_REG(GPIO_SET0);
    cbs[0].tfr_len = cbs[2].tfr_len = 4;
    *pindata = 1 << pin;
    // Control block 1 and 3: update PWM FIFO (to clear DMA request)
    cbs[1].srce_ad = cbs[3].srce_ad = BUS_DMA_MEM(pwmdata);
    cbs[1].dest_ad = cbs[3].dest_ad = BUS_PWM_REG(PWM_FIF1);
    cbs[1].tfr_len = cbs[3].tfr_len = 4;
    *pwmdata = PWM_RANGE / 2;
    // Link control blocks 0 to 3 in endless loop
    for (n=0; n<4; n++)
        cbs[n].next_cb = BUS_DMA_MEM(&cbs[(n+1)%4]);
    // Enable PWM with data threshold 1, and DMA
    init_pwm(PWM_FREQ);
    *VIRT_PWM_REG(PWM_DMAC) = PWM_DMAC_ENAB|1;
    start_pwm();
    start_dma(&cbs[0]);
    // Nothing to do while LED is flashing
    sleep(4);
}

PWM clock setting

Before leaving the code, it is worth mentioning another area of difficulty: setting the clock frequency of the PWM controller. I arbitrarily chose 100 kHz, since that could be divided by 20,000 to flash the LED at 5 Hz.

The recommended way of setting the clock is using the VideoCore mailbox:

void set_vc_clock(int fd, int id, uint32_t freq)
{
    VC_MSG msg1={.tag=0x38001, .blen=8, .dlen=8, .uints={id, 1}};
    VC_MSG msg2={.tag=0x38002, .blen=12, .dlen=12, .uints={id, freq, 0}};
    msg_mbox(fd, &msg1);
    msg_mbox(fd, &msg2);
}

This method works sometimes, but not always; it can take several attempts to change from one frequency to another, and I don’t understand why.

A fall-back option is to write to the (undocumented) timer registers, which is the method I use by default:

#define USE_VC_CLOCK_SET 0

#if USE_VC_CLOCK_SET
    set_vc_clock(mbox_fd, PWM_CLOCK_ID, freq);
#else
    int divi=(CLOCK_KHZ*1000) / freq;
    *VIRT_CLK_REG(CLK_PWM_CTL) = CLK_PASSWD | (1 << 5);
    while (*VIRT_CLK_REG(CLK_PWM_CTL) & (1 << 7)) ;
    *VIRT_CLK_REG(CLK_PWM_DIV) = CLK_PASSWD | (divi << 12);
    *VIRT_CLK_REG(CLK_PWM_CTL) = CLK_PASSWD | 6 | (1 << 4);
    while ((*VIRT_CLK_REG(CLK_PWM_CTL) & (1 << 7)) == 0) ;
#endif
    usleep(100);

The PWM controller seems to be very sensitive to changes in its clock frequency, so before any change, it is essential to disable it, and wait some time before re-enabling. On one occasion, it locked up completely and just wouldn’t work until I re-powered the board, so care is needed when modifying the clocking code – it is certainly an area that merits further investigation.

Running the code

There is a single source file rpi_dma_test.c on Github here.

You’ll need to change the definition at the top depending on the RPi version you are using:

//#define PHYS_REG_BASE  0x20000000  // Pi Zero or 1
#define PHYS_REG_BASE    0x3F000000  // Pi 2 or 3
//#define PHYS_REG_BASE  0xFE000000  // Pi 4

Then the code can be compiled with GCC, and run with ‘sudo’:

gcc -Wall -o rpi_dma_test rpi_dma_test.c
sudo ./rpi_dma_test

You can optionally compile with -O2 or -O3 optimisation.

To view the results you need to connect an LED (with a 330 ohm resistor in series) to ground and LED_PIN, which I’ve set to GPIO pin 21. This is at the far end of the I/O connector, conveniently next to a ground pin.

The positive leg of the LED goes to the output pin, which is nearest the camera.

The usual warnings apply when running a program with root privileges -there is a security risk, since it has unrestricted access to all system functions.

To see DMA being used for data acquisition, take a look at my next post.

Update

Since I first wrote this post, I’ve been using DMA in various projects, most recently an ADC streaming application, and need to clarify a few items in this post based on that experience.

Choice of DMA channel number

It is necessary to pick an unused channel, to avoid clashes with the operating system. There is various contradictory information posted on the Internet, so I wrote my own DMA-detection utility, which suggests that the Pi 4 (or 400) uses channels 2, 11, 12, 13, 14, and the earlier boards use 0, 2, 4, 6, so the choice of channel 5 in this post isn’t a bad one – but of course this might change in a future OS release.

PWM master clock frequency

The CLOCK_KHZ value of 250000 is correct for Raspberry Pi versions 0 – 3, but versions 4 & 400 use a value of 375000.

Videocore memory allocation

I have been using MEM_FLAG_DIRECT when allocating the uncached memory, but subsequent tests suggest that MEM_FLAG_COHERENT is a better bet when working with fast-changing data – but this isn’t an issue when dealing with with slow-changing I/O as in these examples.

Structuring the DMA data

The method I’ve used to define the data & CBs in uncached memory is a bit messy, so I’ve been looking for a cleaner way to do this, to reduce the likelihood of errors.

I’ve achieved this by using a single structure to house the data and Control Blocks, the latter being at the front of the structure so they’re on a 32-byte boundary. The steps then become:

Prepare the CBs and data in user memory.
Copy the CBs and data across to uncached memory
Start the DMA controller
Start the DMA pacing

Here is the PWM-triggered LED flash function, rewritten to use the new method; hopefully you’ll find it easier to understand and modify.

// DMA control block macros
#define NUM_CBS         4
#define GPIO(r)         BUS_GPIO_REG(r)
#define PWM(r)          BUS_PWM_REG(r)
#define MEM(m)          BUS_DMA_MEM(m)
#define CBS(n)          BUS_DMA_MEM(&dp->cbs[(n)])
#define PWM_TI          ((1 << 6) | (DMA_PWM_DREQ << 16))

// Control Blocks and data to be in uncached memory
typedef struct {
    DMA_CB cbs[NUM_CBS];
    uint32_t pindata, pwmdata;
} DMA_TEST_DATA;

// Updated DMA trigger test, using data structure
void dma_test_pwm_trigger(int pin)
{
    DMA_TEST_DATA *dp=virt_dma_mem;
    DMA_TEST_DATA dma_data = {
        .pindata=1<<pin, .pwmdata=PWM_RANGE/2,
        .cbs = {
          // TI      Srce addr          Dest addr        Len   Next CB
            {PWM_TI, MEM(&dp->pindata), GPIO(GPIO_CLR0), 4, 0, CBS(1), 0},  // 0
            {PWM_TI, MEM(&dp->pwmdata), PWM(PWM_FIF1),   4, 0, CBS(2), 0},  // 1
            {PWM_TI, MEM(&dp->pindata), GPIO(GPIO_SET0), 4, 0, CBS(3), 0},  // 2
            {PWM_TI, MEM(&dp->pwmdata), PWM(PWM_FIF1),   4, 0, CBS(0), 0},  // 3
        }
    };
    memcpy(dp, &dma_data, sizeof(dma_data));    // Copy data into uncached memory
    init_pwm(PWM_FREQ);                         // Enable PWM with DMA
    *VIRT_PWM_REG(PWM_DMAC) = PWM_DMAC_ENAB|1;
    start_dma(&dp->cbs[0]);                     // Start DMA
    start_pwm();                                // Start PWM
    sleep(4);                                   // Do nothing while LED flashing
}

Copyright (c) Jeremy P Bentham 2020. Please credit this blog if you use the information or software in it.

13 thoughts on “Raspberry Pi DMA programming in C”

mperak says:

November 7, 2020 at 9:40 am

Very nice article, thanks!
Code works fine and explanation is great.

In the 3rd example, if instead of toggling GPIO pin, I read GPIO_LEV0 to memory using DMA_CB_DEST_INC it seems GPIO values are transfered, but overwriting the SAME address in memory without address ever increasing from one write to the next as it should be (as though DMA_CB_DEST_INC is ignored). Here is the changed/added code:

uint32_t *data = (uint32_t *)(pwmdata + 1);

cbs[1].ti = DMA_CB_DEST_INC;
cbs[1].srce_ad = BUS_GPIO_REG(GPIO_LEV0);
cbs[1].dest_ad = BUS_DMA_MEM(data);
cbs[1].tfr_len = 4;

cbs[3] is same as cbs[1]

Any idea why DMA_CB_DEST_INC does not seem to work & how to solve this?
(when using a single CB to transfer say 1K length of data from GPIO_LEV0 to memory, DMA_CB_DEST_INC seems to work fine but if using same CB in a CB chain, DMA_CB_DEST_INC seems to be ignored somehow)

Thank you!

LikeLiked by 1 person

Reply
1. iosoftcode says:
  
  November 7, 2020 at 2:16 pm
  
  Due to time limitations, I can’t usually provide responses to specific problems, however in this case you’ve fallen into the common trap of thinking that when a CB is loaded as part of a CB chain, the transfer will carry on from where it left off. In reality, every time a CB is loaded, the DMA controller re-starts that transfer from scratch, as you have seen.
  
  I’d guess that your follow-up question would be “so how can I DMA a block of data, whilst toggling some I/O lines when each word is transferred?”. So far, I haven’t been able to find a general-purpose answer to this problem; it is quite do-able with SPI, and possibly with SMI, but is really tricky (or impossible) with general-purpose I/O lines, unless you use a very large number of CBs (two or more for each word transferred).
  
  I’m working on a new post that has a detailed explanation of this issue with regard to SPI, using simpler C code.
  
  LikeLiked by 1 person
  
  Reply
Dubularity says:

December 3, 2020 at 4:47 pm

Thanks. Very interesting/useful article. I think/suspect things have changed somewhat on the Pi 4…(based on experiments trying to access video RAM direct). Any chance of an update?

LikeLiked by 1 person

Reply
1. iosoftcode says:
  
  December 4, 2020 at 11:11 am
  
  I’ve had no problems with DMA on the Pi 4 or 400, providing you take notice of the points I’ve made in the ‘update’ section.
  I don’t think you are supposed to DMA straight to video RAM, but don’t know enough about the memory organisation to say what the consequences might be. It also raises the question as to what application might need this capability, when you can do all sort of clever (and very fast) things with the GPU; I’m currently writing a post on that subject, which should be on the site in the next week or two.
  
  LikeLiked by 1 person
  
  Reply
mohammadhefny says:

December 14, 2022 at 6:27 am

Thank you for your great article. I have problems with Raspberry OS64.
it gives segmentation fault.
#0 0x0000005555551c20 in enable_dma () at rpi_dma_test.c:472
seems that map_segment is not working well in 64OS version.
Kindly advise.

LikeLike

Reply
1. iosoftcode says:
  
  December 14, 2022 at 3:08 pm
  
  Sorry, I haven’t used the 64-bit OS, so can’t help you.
  
  LikeLike
  
  Reply
Graeme says:

November 11, 2023 at 1:59 am

Hi there, I was just wondering if you knew what the REG_BASE value should be for the Raspberry Pi 5. Do you know if it will be the same as for the 4? Thank you very much.

LikeLike

Reply
1. iosoftcode says:
  
  November 11, 2023 at 10:02 am
  
  I’ve only just got my hands on a Pi 5, but I think it is highly unlikely that my DMA code will work on it, not least because most people seem to be running the 64-bit OS, and that is definitely incompatible.
  
  LikeLike
  
  Reply
2. Andrew says:
  
  March 23, 2025 at 12:12 pm
  
  Hello!The interesting article! I worked with STM32 DMA engine and there was interrupt on -half , full transfer.It would be very good to have such a DMA mode fot USART0 for UART-UDP bridge.Is it possible with Ubuntu server OS and Raspi Pi4B?
  
  LikeLike
  
  Reply
  1. iosoftcode says:
    
    March 24, 2025 at 7:40 pm
    
    Sorry, I don’t think that it’d be a good idea to have a user-mode program handling the UART with DMA. Anyway, the UART data rate is usually sufficiently low that there wouldn’t be much of a benefit, when compared with a conventional interrupt-driven device driver.
    
    LikeLike
christoph says:

February 21, 2024 at 10:25 am

Hi, thanks for your great work and documentation of it. It helps me alot to develop my own DMA peripheral access.

As remark, I was able to run your example code on a Compute Module 4 with 64-bit OS. All I have to do is changing the cast from or to void pointers to 64-bit, for example:

#define REG32(m, x) ((volatile uint32_t *)((uint64_t)(m.virt)+(uint64_t)(x)))

The mmap works as expected.

@Greame: I do not have a PI5, but you can give a try on the bcm_host_get_peripheral_address() function.

LikeLike

Reply
1. Graeme says:
  
  July 27, 2024 at 10:25 pm
  
  Thank you so much for this information. I updated the REG defines in rpi_dma_utils.h to the following:
  
  define REG8(m, x) ((volatile uint8_t *) ((uint64_t)(m.virt)+(uint64_t)(x))) //64-bit
  
  define REG32(m, x) ((volatile uint32_t *)((uint64_t)(m.virt)+(uint64_t)(x))) //64-bit
  
  define REG_BUS_ADDR(m, x) ((uint64_t)(m.bus) + (uint64_t)(x)) //64-bit
  
  define MEM_BUS_ADDR(mp, a) ((uint64_t)a-(uint64_t)mp->virt+(uint64_t)mp->bus) //64-bit
  
  define BUS_PHYS_ADDR(a) ((void *)((uint64_t)(a)&~0xC0000000)) //64-bit
  
  Note that I also changed the following line in the map_periph function of rpi_dma_utils.c (although this one might be redundant):
  
  mp->bus = (void *)((uint64_t)phys – PHYS_REG_BASE + BUS_REG_BASE); //64-bit
  
  My code now works beautifully on my Pi4 with Raspbian Buster 64-bit. My project is one that uses DMA passthrough mode 80 to interface the Pi4 with the 16-bit parallel interface of an old video game console deck and it involves transferring 1 MB block of code from the Pi to the console at a very high transfer speed (~5MB/sec) at console startup. I use it as a gamepad driver for Retropie. It now works flawlessly on a 64-bit OS which I honestly never expected to be possible.
  
  I really didn’t think that it would be simple or even possible to get this code working on a 64-bit OS. That said, this is an extremely simple fix and it seems to work with full compatibility. I think that there is strong evidence here that Jeremy’s code is basically 64-bit compatible.
  
  Also, I gave up my investigation into the Pi5. The CaribouLite folks explain on their github page that the Pi5 doesn’t have a SMI (their RPi HAT relies on the SMI). Pi engineers claim that the SMI could eventually be emulated through the Pi5’s PIO, but it doesn’t sound likely that this will happen anytime soon, if ever. If we’re lucky, maybe the Pi5 will eventually get some other kind high-speed parallel interface implemented for its GPIO pins, hopefully one with more publicly available documentation. That said, I am not holding my breath that we will see one with a DMA Passthrough Mode 80 where an external peripheral can drive the read/write operations at very high speed. For this reason, an overclocked Pi4 is probably as far as I will ever go for my current use case.
  
  LikeLike
  
  Reply
  1. iosoftcode says:
    
    July 28, 2024 at 9:19 am
    
    Thanks very much for the information, it is good that there is a simple solution for the 64-bit OS. I agree with your comments on the Pi 5; basically the new silicon lacks some of the obscure-but-useful functionality of the older devices.
    
    LikeLike