josh.com

Parallel Processing Arduino Style – Make Massive NeoPixel Displays With Nanoscale Concurrent Computing

We’ve already seen that it is possible to drive thousands of WS2812B NeoPixels with a lowly Arduino using careful bit-banging. But what if we could bang out 8 bits at a time rather than sending them single file? Could it be possible to drive 8 times as many strings (or get 8 times the refresh rate) from our Arduino by processing bits in parallel? It would be like having a tiny pipelined GPU render engine inside our Arduino!

Read on to find out the results of a quick proof-of-concept test!

Perfunctory Video

Spoiler alert: If you want to keep up the suspense, read the article first and then come back and watch the video.

Pins and Ports

Each output pin on the Arduino maps to a single bit of a “port”. A port is just an internal register that happens to be connected to the pins, so writing to the port can change the output of the pins it is connected to. Ports are named B, C, D, etc. The pin-to-port mappings for an Uno are shown here…

For example, Digital Pin 4 maps to bit 4 of port D (shown as PD4), so if we set bit 4 of port D then digital pin #4 will go high (assuming it is set to output mode). You can read more about pins and ports here.

By writing a full byte to a port, we can set all the pins at once in one very quick step. If we can compound these gains by doing all of our computations and signal generation using 8 bits in parallel, we should be able to drive nearly 8 times as many pixels or get 8 times the refresh rate when compared to individual bit banging.
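To make the difference concrete, here is a minimal sketch that models the port as an ordinary variable so it can run on any machine. `fakePortD` and both function names are made up for illustration; on a real Uno the writes would go to `PORTD` instead.

```cpp
#include <stdint.h>

// Hypothetical stand-in for the PORTD register so this sketch runs anywhere.
static uint8_t fakePortD = 0;

// Update the eight pins one at a time: one read-modify-write per pin.
void setPinsOneByOne(uint8_t levels) {
    for (uint8_t pin = 0; pin < 8; pin++) {
        if (levels & (1 << pin)) {
            fakePortD |= (uint8_t)(1 << pin);    // drive this pin high
        } else {
            fakePortD &= (uint8_t)~(1 << pin);   // drive this pin low
        }
    }
}

// Update all eight pins with a single byte-wide write.
void setPinsAtOnce(uint8_t levels) {
    fakePortD = levels;
}
```

Both functions leave the port in the same final state, but the second gets there in one write, and all eight pins change at the same instant.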

Picking Our Port

Looking carefully at the above map, you’ll see that only Port D has all of its bits mapped to accessible pins, so that is the one we will use. Might as well get maximum bang for our bit-bang-buck!

To test, we very simply connect up 8 strings to the 8 pins of port D like this…

Real life is a bit messier, but still recognizable…

Pushing Parallel Pixels

To drive a WS2812B Neopixel strip, we need to generate a sequence of specially timed signals. To drive 8 strips, we need to generate 8 of these sequences – and all at the same time. This is not as hard as it sounds since each data bit in the generated signal can be neatly divided into three phases…

Signals for sending one WS2812B data bit

Step | Color  | Output | Description     | Duration
1    | GREEN  | HIGH   | always high     | T0H
2    | YELLOW | DATA   | the data itself | T1H-T0H
3    | RED    | LOW    | always low      | T1L

(The exact timings for each phase are described here)

See how the only difference between a 1 bit and a 0 bit is the level in the time period shown in yellow? This makes things much simpler for us since the beginning and ending of each bit are always the same.
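The three phases can be written down as a tiny model. This is only an illustration: the `EncodedBit` struct and both function names are hypothetical, and the timing values are passed in as parameters rather than assumed.

```cpp
#include <stdint.h>

// Level on the wire during each of the three phases of one encoded bit.
struct EncodedBit {
    uint8_t phase1;  // always high
    uint8_t phase2;  // the data bit itself
    uint8_t phase3;  // always low
};

EncodedBit encodeBit(bool dataBit) {
    EncodedBit e;
    e.phase1 = 1;
    e.phase2 = dataBit ? 1 : 0;
    e.phase3 = 0;
    return e;
}

// Total time the line stays high for one encoded bit, in nanoseconds.
// A 0 bit is high for T0H only; a 1 bit stays high through the second
// phase as well, for T0H + (T1H - T0H) = T1H.
uint16_t highTimeNs(bool dataBit, uint16_t t0hNs, uint16_t t1hNs) {
    return dataBit ? t1hNs : t0hNs;
}
```

Only `phase2` depends on the data, which is why the first and last port writes can be the same constants for every bit.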

So, to transmit a set of 8 encoded data bits (1 data bit to each string), all we need to do is…

  1. set all bits in the port to 1 (which is a single write of 0xff to the port)
  2. wait the right amount of time
  3. set the bits in the port to the data we want to send to each string (also a single write of a byte with all the 8 bits set to the correct data)
  4. wait the right amount of time
  5. set all the bits in the port to 0 (again, a single write of 0x00 to the port)
  6. wait the right amount of time and repeat

Translated into pseudo assembly code, that looks like…

[code]
out PORTD, 0xff ; set all pins on port D to 1
delay T0H ; complete 1st phase of an encoded bit
out PORTD, data ; set all pins on port D to their data values
delay T1H-T0H ; complete 2nd phase of an encoded bit
out PORTD, 0x00 ; set all pins on port D to 0
delay T1L ; complete last phase of an encoded bit
[/code]

…things get a bit messier converting to real Arduino C code, but the steps are still recognizable…

[code lang="cpp"]
// Actually send the next set of 8 WS2812B encoded bits to the 8 pins.
// We must drop to asm to ensure that the compiler does
// not reorder things and make it so the delay happens in the wrong place.

static inline __attribute__ ((always_inline)) void sendBitX8( uint8_t bits ) {

    const uint8_t onBits = 0xff; // We need to send all bits on on all pins as the first 1/3 of the encoded bits

    asm volatile (

        "out %[port], %[onBits] \n\t" // 1st step - send T0H high

        ".rept %[T0HCycles] \n\t" // Execute NOPs to delay exactly the specified number of cycles
        "nop \n\t"
        ".endr \n\t"

        "out %[port], %[bits] \n\t" // 2nd step - set the output bits to their values for T1H-T0H

        ".rept %[dataCycles] \n\t" // Execute NOPs to delay exactly the specified number of cycles
        "nop \n\t"
        ".endr \n\t"

        "out %[port],__zero_reg__ \n\t" // last step - T1L all bits low

        // Don't need an explicit delay here since the overhead that follows will always be long enough

        ::
        [port] "I" (_SFR_IO_ADDR(PIXEL_PORT)),
        [bits] "d" (bits),
        [onBits] "d" (onBits),

        [T0HCycles] "I" (NS_TO_CYCLES(T0H) - 2), // 1-bit width less overhead for the actual bit setting, note that this delay could be longer and everything would still work

        [dataCycles] "I" (NS_TO_CYCLES((T1H-T0H)) - 2) // Minimum interbit delay. Note that we probably don't need this at all since the loop overhead will be enough, but here for correctness

    );

}
[/code]

 

Note that we do not explicitly wait for the T1L delay during the final phase, since the loop overhead that follows each bit already holds the pins low for long enough.
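The `NS_TO_CYCLES()` conversion is not shown above. One plausible way to do it is sketched below, assuming a 16MHz clock and illustrative timings of 400ns for T0H and 900ns for T1H. Those numbers, and every name starting with `k`, are assumptions made for this example rather than values taken from this article, so check them against the full sketch and the datasheet.

```cpp
#include <stdint.h>

// Assumed values for illustration only.
constexpr uint32_t kCpuHz = 16000000UL;  // assumed clock speed
constexpr uint32_t kT0H   = 400;         // assumed T0H in ns
constexpr uint32_t kT1H   = 900;         // assumed T1H in ns

// Integer nanoseconds per CPU cycle: 62.5 truncates to 62 at 16MHz.
constexpr uint32_t kNsPerCycle = 1000000000UL / kCpuHz;

// Convert a duration in nanoseconds to whole CPU cycles (rounds down).
constexpr uint32_t nsToCycles(uint32_t ns) {
    return ns / kNsPerCycle;
}

// NOP counts for the two delays, with the same "- 2" overhead
// allowance that sendBitX8() applies.
constexpr uint32_t kT0HNops  = nsToCycles(kT0H) - 2;
constexpr uint32_t kDataNops = nsToCycles(kT1H - kT0H) - 2;
```

Under these assumptions the first delay works out to 4 NOPs and the second to 6. Because everything is computed at compile time, the counts can be handed to the assembler as immediate constants, which is what the `"I"` constraints in `sendBitX8()` require.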

A Pixel Is More Than Just a Bit

We are using color strips, so each pixel is a total of 24 bits long – 8 bits for each of the Red, Green, and Blue brightness values. To keep things simple for this test, we will just send each pixel as either 24 1s for on or 24 0s for off. The 24 1s encode a brightness of 255 for all three colors, which corresponds to a full brightness white pixel (visually “ON”). The 24 0s encode a brightness of 0 for all three colors, which corresponds to a black pixel (visually “OFF”).

[code lang="cpp"]
// Send a single pixel out to each of the 8 strings
// Each bit in `row` indicates if the pixel in the corresponding string should be on or off

static inline void __attribute__ ((always_inline)) sendPixelRow( uint8_t row ) {

    // Send the bit 24 times down every row.
    // This ends up as 100% white if the bit in row is 1, or black (off) if the bit is 0.
    // Remember that each pixel is 24 bits wide (8 bits each for R,G, & B)

    uint8_t bit=24;

    while (bit--) {

        sendBitX8( row );
    }

}

[/code]
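To see why repeating the same byte 24 times gives white or black, here is a small host-side model of what a single strip receives. It is a sketch for illustration only: `pixelSeenByStrip()` is a hypothetical helper that does not appear in the real sketch, and it ignores timing entirely.

```cpp
#include <stdint.h>

// Returns the 24-bit pixel value that the strip on pin `strip` (0-7)
// would receive when the same `row` byte is sent for all 24 bit slots.
uint32_t pixelSeenByStrip(uint8_t row, uint8_t strip) {
    uint32_t pixel = 0;
    for (uint8_t bit = 0; bit < 24; bit++) {
        uint8_t level = (row >> strip) & 1;  // level on this pin during the data phase
        pixel = (pixel << 1) | level;        // the strip shifts in one bit per slot
    }
    return pixel;
}
```

For example, with a row of `0x81` the strips on pins 7 and 0 each receive `0xFFFFFF` (white), while every other strip receives `0x000000` (off).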

Commence Test Data Transmission!

Now we are ready to test!

Remember that each call to sendPixelRow() will send one full pixel to each of the 8 attached strips. Each bit in the passed byte corresponds to one of the strips. Bit 0 goes to the bottom strip, bit 7 to the top one, etc…

Let’s send an interesting pattern so we can tell if it works…

[code lang="cpp"]
sendPixelRow( 0b10000000 ); // Send an interesting and challenging pattern
sendPixelRow( 0b01000000 );
sendPixelRow( 0b00100000 );
sendPixelRow( 0b00010000 );
sendPixelRow( 0b00001000 );
sendPixelRow( 0b00000100 );
sendPixelRow( 0b00000010 );
sendPixelRow( 0b00000001 );
sendPixelRow( 0b00000010 );
sendPixelRow( 0b00000100 );
sendPixelRow( 0b00001000 );
sendPixelRow( 0b00010000 );
sendPixelRow( 0b00100000 );
sendPixelRow( 0b01000000 );
sendPixelRow( 0b10000000 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b01010101 );
sendPixelRow( 0b10101010 );
sendPixelRow( 0b01010101 );
sendPixelRow( 0b10101010 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b11111111 );
sendPixelRow( 0b00000000 );
sendPixelRow( 0b11111111 );
sendPixelRow( 0b00000000 );
[/code]
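If you want to preview a pattern before wiring anything up, a rough host-side sketch like the one below can draw one strip as text. `renderStrip()` is a hypothetical helper, not part of the real sketch. It relies on the fact that a WS2812B chain gives the first pixel sent to the pixel nearest the controller, so the leftmost character is the pixel closest to the Arduino.

```cpp
#include <stdint.h>

// Draw one strip as text: '#' for a lit pixel, '.' for a dark one.
// rows  : the bytes passed to sendPixelRow(), in the order they are sent
// count : number of rows
// strip : which bit of each row to draw (0-7)
// out   : buffer of at least count + 1 chars; receives a NUL-terminated string
void renderStrip(const uint8_t *rows, uint8_t count, uint8_t strip, char *out) {
    for (uint8_t i = 0; i < count; i++) {
        out[i] = ((rows[i] >> strip) & 1) ? '#' : '.';
    }
    out[count] = '\0';
}
```

Calling it once for each strip from 7 down to 0 and printing the results on successive lines gives a rough picture of the whole display, with bit 7 on top.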

Success! Hopefully you can make out the test pattern on the 8 strips…

This is really cool! We updated 8 strips in about the same amount of time as it takes to update just one! The power of parallel bit-banging!

 

Now that we have proof of concept, we need to figure out a way to put all this extra bandwidth to good use.  Ideally, we want to find an application that also lets us do our display computations in parallel so we effectively have a (tiny!) 8-way parallel processor generating our display.

Stay tuned, because I know a perfect project to make the most of our newfound power. This is going to be big…

Code Drop

Complete working sketch for an Arduino Uno here…

https://github.com/bigjosh/MultiBitBangPOC/blob/master/Arduino/MultiBitBang/MultiBitBang.ino

FAQ

Q: Why?

When the question came up, I thought it was interesting enough to justify a proof of concept test.

Q: Why bother doing this on an Arduino? Just get a Beaglebone/Teensy/RaspPi/Fadecandy!

All of these platforms have more horsepower/memory/swagger than a lowly Arduino. The BeagleBone’s PRU is especially well suited to driving lots and lots of Neopixel strings in parallel.

That said, the Arduino is a widely available and popular platform, and lots and lots of people use them for driving WS2812B NeoPixels. The Arduino natively runs at 5 volts, so you can connect the strings directly to it. The Arduino is bare metal, so generating the precise timing needed is straightforward (although not necessarily easy).

Using an Arduino (or cheap clone) can be an order of magnitude cheaper than the above platforms and there is an aesthetic beauty to using the minimum hardware necessary to solve a problem.

The code presented here can even run directly on a $2 naked Arduino chip connected directly to the strips, even getting its power from them. All that’s needed is a bit of tweaking to the clock speed to avoid needing the 16MHz crystal.

Q: Can you do color? Animation? Scrolling? Video Games? Ahhhh- head exploding with ideas!!!!

All this and more. Just wait until next time…
