Parallel Processing Arduino Style – Make Massive NeoPixel Displays With Nanoscale Concurrent Computing

We’ve already seen that it is possible to drive thousands of WS2812B NeoPixels with a lowly Arduino using careful bit-banging. But what if we could bang out 8 bits at a time rather than sending them single file? Could it be possible to drive 8 times as many strings (or get 8 times the refresh rate) from our Arduino by processing bits in parallel? It would be like having a tiny pipelined GPU render engine inside our Arduino!

Read on for the results of a quick proof-of-concept test!

Perfunctory Video

Spoiler alert: If you want to keep up the suspense, read the article first and then come back and watch the video.

Pins and Ports

Each output pin on the Arduino maps to a single bit of a “port”. A port is just an internal register that happens to be connected to the pins, so writing to the port changes the output of the pins it is connected to. Ports are named B, C, D, etc. The pin-to-port mappings for an Uno are shown here…


For example, Digital Pin 4 maps to bit 4 of port D (shown as PD4), so if we set bit 4 of port D then digital pin #4 will go high (assuming it is set to output mode). You can read more about pins and ports here.
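To make the mapping concrete, here is a small desktop-side sketch of the standard Uno layout (digital pins 0-7 on port D, 8-13 on port B, and the analog pins A0-A5, numbered 14-19, on port C). The names `Port`, `PinLocation`, `locatePin` and `pinMask` are made up for this illustration; they are not part of the Arduino core.

```cpp
#include <cstdint>

// Which AVR port register a pin belongs to on an Uno (ATmega328P).
enum class Port : char { B = 'B', C = 'C', D = 'D' };

struct PinLocation {
    Port port;    // port register letter
    uint8_t bit;  // bit position within that port (0..7)
};

// Hypothetical helper: map an Uno digital pin number (0..19, where 14..19
// are the analog pins A0..A5) to its port and bit position.
constexpr PinLocation locatePin(uint8_t pin) {
    return pin < 8  ? PinLocation{Port::D, pin}
         : pin < 14 ? PinLocation{Port::B, static_cast<uint8_t>(pin - 8)}
                    : PinLocation{Port::C, static_cast<uint8_t>(pin - 14)};
}

// The byte that has only this pin's bit set within its port register.
constexpr uint8_t pinMask(uint8_t pin) {
    return static_cast<uint8_t>(1u << locatePin(pin).bit);
}
```

For digital pin 4 this gives port D, bit 4, and a mask of `0x10`, which matches the PD4 label on the map.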

By writing a full byte to a port, we can set all the pins at once in one very quick step. If we can compound these gains by doing all of our computations and signal generation using 8 bits in parallel, we should be able to drive nearly 8 times as many pixels or get 8 times the refresh rate when compared to individual bit banging.

Picking Our Port

Looking carefully at the above map, you’ll see that only Port D has all of its bits mapped to accessible pins, so that is the one we will use. Might as well get maximum bang for our bit-bang-buck!

To test, we very simply connect up 8 strings to the 8 pins of port D like this…


Real life is a bit messier, but still recognizable…

2016-05-04 12.55.03

Pushing Parallel Pixels

To drive a WS2812B Neopixel strip, we need to generate a sequence of specially timed signals. To drive 8 strips, we need to generate 8 of these sequences – and all at the same time. This is not as hard as it sounds since each data bit in the generated signal can be neatly divided into three phases…


Signals for sending one WS2812B data bit

Step  Color   Output  Description      Duration
1     GREEN   HIGH    always high      T0H
2     YELLOW  DATA    the data itself  T1H-T0H
3     RED     LOW     always low       T1L

(The exact timings for each phase are described here)

See how the only difference between a 1 bit and a 0 bit is the level in the time period shown in yellow? This makes things much simpler for us since the beginning and ending of each bit are always the same.
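Here is a minimal desktop-side sketch of that idea, widened to 8 pins at once. The names `Phases`, `encodePhases` and `pinLevel` are invented for this illustration. It only models which byte is on the port in each phase, not how long each phase lasts.

```cpp
#include <cstdint>

// The three values that appear on the port while one data bit is sent
// to each of 8 strings.
struct Phases {
    uint8_t first;   // phase 1 (green): every pin high
    uint8_t middle;  // phase 2 (yellow): each pin carries its own data bit
    uint8_t last;    // phase 3 (red): every pin low
};

// `data` holds one data bit per string (bit n belongs to the string on pin n).
constexpr Phases encodePhases(uint8_t data) {
    return Phases{0xFF, data, 0x00};
}

// Level seen on a single pin (0..7) when `portValue` is on the port.
constexpr bool pinLevel(uint8_t portValue, uint8_t pin) {
    return ((portValue >> pin) & 1u) != 0;
}
```

Whatever `data` is, every pin sees high in the first phase and low in the last. Only the middle phase differs from pin to pin, and that is what lets a single byte carry 8 independent data bits.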

So, to transmit a set of 8 encoded data bits (1 data bit to each string), all we need to do is…

  1. set all bits in the port to 1 (which is a single write of 0xff to the port)
  2. wait the right amount of time
  3. set the bits in the port to the data we want to send to each string (also a single write of a byte with all the 8 bits set to the correct data)
  4. wait the right amount of time
  5. set all the bits in the port to 0 (again, a single write of 0x00 to the port)
  6. wait the right amount of time and repeat

Translated into pseudo assembly code, that looks like…

out PORTD, 0xff ; set all pins on port D to 1
delay T0H ; complete 1st phase of an encoded bit
out PORTD, data ; set all pins on port D to their data values
delay T1H-T0H ; complete 2nd phase of an encoded bit
out PORTD, 0x00 ; set all pins on port D to 0
delay T1L ; complete last phase of an encoded bit

…things get a bit messier converting to real Arduino C code, but the steps are still recognizable…

[code lang="cpp"]
// Actually send the next set of 8 WS2812B encoded bits to the 8 pins.
// We must drop to asm to ensure that the compiler does
// not reorder things and make it so the delay happens in the wrong place.

static inline __attribute__ ((always_inline)) void sendBitX8( uint8_t bits ) {

  const uint8_t onBits = 0xff; // All pins go high for the first phase of every encoded bit

  asm volatile (

    "out %[port], %[onBits] \n\t" // 1st step - send T0H high

    ".rept %[T0HCycles] \n\t" // Execute NOPs to delay exactly the specified number of cycles
    "nop \n\t"
    ".endr \n\t"

    "out %[port], %[bits] \n\t" // 2nd step - set the output bits to their values for T1H-T0H
    ".rept %[dataCycles] \n\t" // Execute NOPs to delay exactly the specified number of cycles
    "nop \n\t"
    ".endr \n\t"

    "out %[port],__zero_reg__ \n\t" // last step - T1L all bits low

    // Don't need an explicit delay here since the overhead that follows will always be long enough

    :: // no output operands; the input operands follow
    [port] "I" (_SFR_IO_ADDR(PIXEL_PORT)),
    [bits] "d" (bits),
    [onBits] "d" (onBits),

    [T0HCycles] "I" (NS_TO_CYCLES(T0H) - 2), // 1-bit width less overhead for the actual bit setting, note that this delay could be longer and everything would still work

    [dataCycles] "I" (NS_TO_CYCLES((T1H-T0H)) - 2) // Minimum interbit delay. Note that we probably don't need this at all since the loop overhead will be enough, but here for correctness

  );

}
[/code]

Note that we do not explicitly wait out the T1L delay in the final phase, since the overhead between successive calls keeps the pins low for long enough between bits.
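To get a feel for how many NOPs those delays come to, here is a rough desktop-side sketch of the arithmetic. The 16 MHz clock and the 400 ns / 900 ns values for T0H and T1H are assumptions picked for illustration, not figures from this article. `nsToCycles` and `nopCount` are hypothetical stand-ins; the real `NS_TO_CYCLES` macro lives in the full sketch and may round differently.

```cpp
#include <cstdint>

// ASSUMPTIONS for illustration only: a 16 MHz CPU clock and example timings.
constexpr uint32_t kCpuHz  = 16000000UL;
constexpr uint32_t kT0H_ns = 400;  // assumed width of the always-high phase
constexpr uint32_t kT1H_ns = 900;  // assumed high time of a 1 bit

// Hypothetical stand-in for NS_TO_CYCLES: whole clock cycles that fit in `ns`.
constexpr uint32_t nsToCycles(uint32_t ns) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(ns) * kCpuHz) / 1000000000ULL);
}

// NOPs needed to pad out a phase after subtracting `overheadCycles`
// (the listing above subtracts 2). Clamped at zero so the count never
// goes negative.
constexpr uint32_t nopCount(uint32_t phaseNs, uint32_t overheadCycles) {
    return nsToCycles(phaseNs) > overheadCycles
               ? nsToCycles(phaseNs) - overheadCycles
               : 0;
}
```

Under these assumed numbers, T0H works out to 6 cycles and so 4 NOPs, while T1H-T0H (500 ns) works out to 8 cycles and so 6 NOPs. A different clock speed or different timing values would change both counts.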

A Pixel Is More Than Just a Bit

We are using color strips, so each pixel is a total of 24 bits long: 8 bits each for red, green, and blue brightness. To keep things simple for this test, we will just send each pixel as either 24 1’s for on or 24 0’s for off. The 24 1’s encode a brightness of 255 for all three colors, which corresponds to a full-brightness white pixel (visually “ON”). The 24 0’s encode a brightness of 0 for all three colors, which corresponds to a black pixel (visually “OFF”).

[code lang="cpp"]
// Send a single pixel out to each of the 8 strings
// Each bit in `row` indicates if the pixel in the corresponding string should be on or off

static inline void __attribute__ ((always_inline)) sendPixelRow( uint8_t row ) {

  // Send the bit 24 times down every row.
  // This ends up as 100% white if the bit in row is 1, or black (off) if the bit is 0.
  // Remember that each pixel is 24 bits wide (8 bits each for R,G, & B)

  uint8_t bit=24;

  while (bit--) {

    sendBitX8( row );

  }

}
[/code]

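As a sanity check on the “24 copies of the same bit” trick, here is a small desktop-side sketch that replays what the pixel on one strip receives. The name `receivedPixel` is made up for this illustration, and the sketch only models the data bits, not the signal timing.

```cpp
#include <cstdint>

// Replay the 24 repetitions for one strip: the pixel shifts in the same
// data bit 24 times, so it ends up with either all ones or all zeros.
// `strip` is the bit position (0..7) of that strip within `row`.
constexpr uint32_t receivedPixel(uint8_t row, uint8_t strip) {
    uint32_t received = 0;
    for (int i = 0; i < 24; ++i) {
        received = (received << 1) | ((row >> strip) & 1u);
    }
    return received;
}
```

A set bit yields `0xFFFFFF` (all three color channels at 255) and a clear bit yields `0x000000`, so the order of the color channels does not matter for this test.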
Commence Test Data Transmission!

Now we are ready to test!

Remember that each call to sendPixelRow() will send one full pixel to each of the 8 attached strips. Each bit in the passed byte corresponds to one of the strips: bit 0 goes to the bottom strip, bit 7 to the top one, and so on.

Let’s send an interesting pattern so we can tell if it works…

[code lang="cpp"]
sendPixelRow( 0b10000000 ); // Send an interesting and challenging pattern
sendPixelRow( 0b01000000 );
sendPixelRow( 0b00100000 );
sendPixelRow( 0b00010000 );
sendPixelRow( 0b00001000 );
sendPixelRow( 0b00000100 );
sendPixelRow( 0b00000010 );
sendPixelRow( 0b00000001 );
sendPixelRow( 0b00000010 );
sendPixelRow( 0b00000100 );
sendPixelRow( 0b00001000 );
sendPixelRow( 0b00010000 );
sendPixelRow( 0b00100000 );
sendPixelRow( 0b01000000 );
sendPixelRow( 0b10000000 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b01010101 );
sendPixelRow( 0b10101010 );
sendPixelRow( 0b01010101 );
sendPixelRow( 0b10101010 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b11111111 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b11111111 );
sendPixelRow( 0b00000000 );
[/code]

Success! Hopefully you can make out the test pattern on the 8 strips…

2016-05-04 13.37.32
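If you don’t have the hardware handy, a rough text preview of what the strips should show can be printed on a desktop machine. This is only a sketch: `cellChar` and `printPreview` are invented names. It assumes the layout described above (bit 7 on the top strip, bit 0 on the bottom) and that the first pixel sent lands nearest the Arduino, so it is the leftmost column here.

```cpp
#include <cstdint>
#include <cstdio>

// Character for one pixel: '#' if the bit for `strip` (0..7) is set in
// `row`, otherwise '.'.
constexpr char cellChar(uint8_t row, uint8_t strip) {
    return ((row >> strip) & 1u) ? '#' : '.';
}

// Print one text line per strip (bit 7 first, so the top strip is the top
// line) with one column per sendPixelRow() call, in the order sent.
inline void printPreview(const uint8_t* rows, int count) {
    for (int strip = 7; strip >= 0; --strip) {
        for (int i = 0; i < count; ++i) {
            std::putchar(cellChar(rows[i], static_cast<uint8_t>(strip)));
        }
        std::putchar('\n');
    }
}
```

To use it, copy the bytes from the sendPixelRow() calls above into an array and pass it along with its length to `printPreview`.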

This is really cool! We updated 8 strips in about the same amount of time as it takes to update just one! The power of parallel bit-banging!
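Some back-of-the-envelope arithmetic shows where the gain comes from. The sketch below assumes a nominal 1.25 µs per encoded bit, which is an illustrative figure and not something measured in this test. It also ignores the reset gap between frames and any loop overhead. The names are made up for the example.

```cpp
#include <cstdint>

// ASSUMPTION for illustration: nominal time to send one encoded bit.
constexpr uint64_t kBitTimeNs    = 1250;
constexpr uint64_t kBitsPerPixel = 24;

// Time to clock out one strip of `pixels` pixels. With the parallel scheme,
// all 8 strips are clocked out together in this same time.
constexpr uint64_t stripUpdateNs(uint64_t pixels) {
    return pixels * kBitsPerPixel * kBitTimeNs;
}

// Time to clock out the same total number of pixels as one long string.
constexpr uint64_t serialUpdateNs(uint64_t pixelsPerStrip, uint64_t strips) {
    return stripUpdateNs(pixelsPerStrip * strips);
}
```

Under that assumption, 8 strips of 100 pixels take about 3 ms when sent in parallel, versus about 24 ms for the same 800 pixels on one long string.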


Now that we have proof of concept, we need to figure out a way to put all this extra bandwidth to good use.  Ideally, we want to find an application that also lets us do our display computations in parallel so we effectively have a (tiny!) 8-way parallel processor generating our display.

Stay tuned, because I know a perfect project to make the most of our newfound power. This is going to be big…

Code Drop

Complete working sketch for an Arduino Uno here…


Q: Why?

When the question came up, I thought it was interesting enough to justify a proof of concept test.

Q: Why bother doing this on an Arduino? Just get a Beaglebone/Teensy/RaspPi/Fadecandy!

All of these platforms have more horsepower/memory/swagger than a lowly Arduino. The BeagleBone’s PRU is especially well suited to driving lots and lots of Neopixel strings in parallel.

That said, the Arduino is a widely available and popular platform, and lots and lots of people use them for driving WS2812B NeoPixels. The Arduino natively runs at 5 volts, so you can connect the strings directly to it. The Arduino is bare metal, so generating the precise timing needed is straightforward (although not necessarily easy).

Using an Arduino (or cheap clone) can be an order of magnitude cheaper than the above platforms and there is an aesthetic beauty to using the minimum hardware necessary to solve a problem.

The code presented here can even run directly on a $2 naked Arduino chip connected directly to the strips, even getting its power from them. All that’s needed is a bit of tweaking for the clock speed so the chip can run without an external crystal.

Q: Can you do color? Animation? Scrolling? Video Games? Ahhhh- head exploding with ideas!!!!

All this and more. Just wait until next time…


  1. dntruong

    I’m glad you got this working: The main concern is not actually driving 8 pins, but getting 8bits repeatedly from 8 places in your pixel array.
    I’ve managed getting up to two bits working, hence I have a 2bit bitbang mode on FAB_LED, but not eight…
    The question is: can we read 8 bytes to push before the strip times out and resets. Now I have not tried to buffer the 8 bytes but I don’t see how that makes things faster and consumes RAM.

    • bigjosh2

      I have an application that is ideally suited to processing the pixels 8 bits at a time and can generate the display pixels and resulting signals in real-time fast enough to avoid inadvertent resets. It is even super memory efficient, needing only a single byte to store 6 rows of the display. Stay tuned for some very long (and actually useful) displays!

      • dntruong

        Note: I’ve implemented blindly APA102 support and ARM support, but IDK how buggy it is. Don’t play with it yet unless you wanna debug it :).

  2. dntruong

    How to pull it off? pipelining.
    First, offset the display time to each strip by 8.
    Create a loop that loads ONE byte in a buffer register from memory at every iteration.
    Use 8 registers as buffers.
    Now pull a bit from each register to form the port’s next value. display. move bit pointer. The trick is each register displays a different bit and loads when its bit counter hits 8.
    Makes sense?
    Question : are there enough registers to hold 8 buffered bytes, indices, addresses do the math, etc. for the loop to work.

      • dntruong

        Well bitmap was not initialized. :/

        So this works for 4 ports, but it blows my mind, it doesn’t with 8, though the loop should be perfectly balanced at 8. :/

        • bigjosh2

          After a quick look, (1) I think you can make the bit scatter gather much faster with a sprinkle of ASM. Instead of all that ANDing, use only a ROR and ROL for each bit, (2) change the layout of the buffer so that each block is always 8 bytes long, this avoids any multiplies and the only overhead for each byte in the buffer is a MOV Rx, Z+. You have plenty of time to deal with the shuffled 8x buffer blocks in the foreground thread, no reason to spend time on it when in a rush to get the bits out.

          • dntruong

            I checked in code in FAB_LED, and so far it can drive max 6 pins in parallel on a 16MHz Uno.
            Example F demos it.
            Daniel Garcia, the FastLED guy, helped me iron out some of the bugs in the code.

            BTW I usually rely on gcc to use optimal instructions, just helping it getting it right with proper coding that makes it go the right path.

            I admit I’m doing it blind still, as I don’t look at the ASM (partly IDK where to find it with the IDE :P ). I should do that to check if code already generates a ROL to save a couple of cycles per register and make this handle 8 ports with spare cycles.

            I’d use 8B only for rgbw, as I want to keep the flexibility for users. I suspect the * will be replaced by a shift right.

            IDK what you mean by “foreground”. In FAB_LED the idea is I don’t buffer anything. There’s one array of data owned by the user which may hold 1 to 32 bits per pixel.

            Current working code (max 6 ports @16MHz):


  3. Nikos

    Well done, very interesting concept.
    I was, however, confused by the name
    “sendPixelRow( 0b10000000 );”

    Would it be more precise if it was sendPixelColumn? This is what it does, isn’t it?

    • bigjosh2

      Yes, I struggled with descriptive names for these functions, and sendPixelRow() probably ended up being the worst possible name for this function in this context where it is sending a column of pixels to the display. I’ll fix next time I rev the code. Thanks!

    • bigjosh2

      Thanks CW! As much as I hate the Yun, this is a solution that works and I know that lots of people will use it. The Yun has an additional Serial port that does not use the D0-D7 pins at all, so there is no conflict with the neopixel code. (It uses this extra serial port to talk to the little onboard linux computer.)

      Got any suggestions for good test websites that return interesting text to use for “”?
      Got a video of your setup to share?!

      Thanks again!
