December 22, 2015

Getting Real – Using Blind Send SPI to Turbocharge the Adafruit DotStar Library

Last time, we experimented with blind-sending as a way to theoretically speed up SPI on AVR. While there were lots of fancy oscilloscope traces and impressive demo code, there is nothing like an actual, real, practical application to get people excited. Read on to see how much faster we can make the already highly optimized Adafruit DotStar library with a little blind-sending action… (spoiler alert – the answer is lots more faster!)

DotStars are Adafruit’s branded line of APA102 LED strips. They are a lot like NeoPixel strips, except they are much faster and so don’t have the same flicker and jitter problems. If fast is what makes DotStars good, then making them even faster should make them even better!

Punchline

If you just want your DotStars to refresh 18-23% faster and don’t care about the magic that makes that happen, you can use this drop-in fork of the Adafruit library…

https://github.com/bigjosh/Adafruit_DotStar

To get the juicy speed improvements, you must…

use a chip with dedicated SPI hardware. This includes the ATmega328 chip on the Arduino UNO. This does not include the ATtiny85 on the Trinket. The library will still work with non-SPI-enabled chips; you just won’t get the extra speed.
connect your DotStar strip to the dedicated SPI pins. On the Arduino UNO, this means that the DotStar clock line goes to digital pin #13 and the data line goes to digital pin #11.
use the hardware Adafruit_DotStar(NUMPIXELS, DOTSTAR_BRG) constructor. This is the one that does not include data and clock pin arguments.

Note that with the new library, the global brightness setting is free! The code runs the same speed with brightness controlled as it does without it.

Benchmarks

To test, I used the strandtest demo program included in the library with a 60 pixel long string. Times listed are how long it took to execute a single refresh of the whole string.

SPI Method	Without Brightness	With Brightness
Soft (bitbang)	5,690us	5,730us
Standard	573us	625us
Pipelined	341us	364us
Blind Send	279us	279us

Cases:

Soft: The code manually toggles the clock and data pins to shift out the bitstream.
Standard: This code uses the datasheet SPI sending code that waits for each byte to complete before computing and sending the next one.
Pipelined: This code is smart enough to start computing the next byte to be transmitted while the current byte is still being shifted out by the SPI hardware. Once the new byte is computed, it polls the SPI hardware to determine when it is ready to accept the next byte.
Blind Send: This mysterious and edgy code counts every cycle used while computing bytes to ensure that it proffers a new byte at exactly the clock tick when the hardware is finished sending the previous one.
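
As a sanity check on the 18-23% figure, here is the arithmetic on the table above. This is a small host-side sketch, not part of the library; the function name percentFaster is made up for illustration, and it assumes the quoted range compares the Pipelined row against the Blind Send row.

```cpp
// Percentage of refresh time saved when going from before_us to after_us.
// Both arguments are refresh times in microseconds, as in the table.
double percentFaster(double before_us, double after_us) {
    return 100.0 * (before_us - after_us) / before_us;
}
```

Plugging in the table values, percentFaster(341, 279) comes out to about 18.2 and percentFaster(364, 279) to about 23.4, which lines up with the quoted range. Against the Standard row the savings are over 50%.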

Code Size

The Blind Send code is not only faster, it is 20 bytes smaller too! Cake time!

Changes

You can see the code changes in the new version here…

https://github.com/bigjosh/Adafruit_DotStar/commit/52f573a9681909261029f149af785c539756ec69

Everything is local to the USE_HW_SPI section of the show() function.

FAQ

Q: Could running things so much faster lead to signal transmission problems, like if my cables are not so great?

A: Probably not since the speed improvements here come completely from reducing the idle time between bytes rather than changing the speed of bits inside each byte.

Q: Why not save some space and use _delay_loop_2() for your delays?

A: The docs for these functions are just too squishy. “The loop executes three CPU cycles per iteration, not including the overhead the compiler needs to setup the counter register.” How many cycles is that? Why don’t you want to tell me so I know how long it will take? I know you can just read the code, but it is easier and safer to write my code than to read someone else’s code. Plus, don’t you like my handy multiple entry point subroutine trick for having multiple delays possible from a single call?

Q: Wouldn’t completely disabling the global brightness setting speed things up even more?

A: No. With the current code, the hardware SPI clock is the limiting factor for maximum speed. We have about 16 instructions between each SPI byte to do with what we please, and it turns out that is plenty of time to do the brightness transformation, so it is effectively free.
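
To put rough numbers on that budget, here is a host-side sketch. It assumes a 16 MHz CPU clock (as on an Uno) with the SPI clock running at half that, and the brightness math is a generic fixed-point scale written for illustration. Neither function is copied from the library, and both names are hypothetical.

```cpp
#include <stdint.h>

// CPU cycles the SPI hardware needs to shift out one 8-bit byte.
// Assumption: f_cpu_hz is an exact multiple of f_spi_hz.
uint32_t cyclesPerSpiByte(uint32_t f_cpu_hz, uint32_t f_spi_hz) {
    return 8u * (f_cpu_hz / f_spi_hz);
}

// Illustrative brightness scale: one multiply and one shift per color byte.
// brightness 255 leaves the value unchanged, brightness 0 turns it off.
uint8_t scaleBrightness(uint8_t value, uint8_t brightness) {
    return (uint8_t)(((uint16_t)value * ((uint16_t)brightness + 1u)) >> 8);
}
```

Under those assumptions, cyclesPerSpiByte(16000000, 8000000) is 16, which matches the roughly 16 instructions mentioned above. A multiply and a shift per byte should be a small slice of that window.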

Q: Can we make it faster?

A: I do not think you could squeeze even 1 clock cycle of extra performance out of this SPI sending code. That said, there are probably 10-20 cycles wasted in avoidable preambles and compares in the show() function. If you care enough, you could pull the blind send code out of the show() function and inline it into your code that was calling show().

Q: I want EVEN faster!

A: Well, you could overclock your Arduino with a faster crystal and some liquid nitrogen. Or just get a Raspberry Pi or BeagleBone since these can SPI much, much faster than our humble Arduino.

Q: What’s the point? Wasn’t it fast enough before?

A: If you have to ask… If, however, you are doing hardcore light painting with temporally dithered colors, the extra performance could make a big difference, especially considering that DotStar pixels update asynchronously, so the longer the delay between the first and last pixel, the more visible tearing will be. Keep in mind that all this extra speed is completely free, so why not use it? (It is actually cheaper than free, because it uses less memory too!)

Q: You expect me to believe that your massive ~100 lines of ASM takes up 20 bytes less flash than the terse ~10 lines of C it replaces?

A: Disassemble and find out for yourself! (Or just compile the old version and then the new version and compare the “bytes used” message).

Q: I still see random glitching and tearing on my strips!

A: I bet your string refresh is getting interrupted by an interrupt. Try adding a cli() before and an sei() after your call to show().
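
Here is a minimal sketch of that suggestion. The helper name showWithoutInterrupts is hypothetical, and the non-AVR stubs are only there so the snippet also compiles on a PC; on a real AVR build, cli() and sei() come from avr/interrupt.h.

```cpp
#ifdef __AVR__
#include <avr/interrupt.h>   // provides cli() and sei() on AVR
#else
// Off-target stand-ins (assumption: present only so this compiles on a PC).
static bool interrupts_enabled = true;
static inline void cli() { interrupts_enabled = false; }
static inline void sei() { interrupts_enabled = true; }
#endif

// Hypothetical helper: works with any object that has a show() method,
// such as an Adafruit_DotStar strip.
template <typename Strip>
void showWithoutInterrupts(Strip &strip) {
    cli();          // hold off interrupts so the byte timing is not disturbed
    strip.show();   // push the whole frame out to the strip
    sei();          // turn interrupts back on
}
```

Call showWithoutInterrupts(strip) wherever you would normally call strip.show(). Note that sei() turns interrupts on unconditionally, so this assumes they were enabled beforehand.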

10 comments

December 23, 2015 - 3:41 am David Grayson

The APA102 has a 5-bit brightness setting you can send in the first byte which allows for dimmer colors than would otherwise be possible. Is there any particular reason that your library and Adafruit’s library do not expose that as a feature to the user?

- December 23, 2015 - 7:37 am bigjosh2
  
  The master 5-bit brightness setting on these pixels uses a different, much slower PWM generator than the one for the 8-bit RGB color brightnesses. It is so slow that it pretty much ruins the advantage of using the APA102. It is much better to adjust the brightness of the RGB values before you send them to the strip to preserve the high PWM rate. It would have been better if they had omitted those bits so we could have slightly faster refresh rates. It would have been *much* better if they had used those extra bits to give us slightly more dynamic range on the RGB values!
  
January 3, 2016 - 1:05 am Ralph Doncaster (Nerd Ralph)

Faster spi is possible with USI; I’ve tested USI clocking data out on every cpu cycle. For sck you’d have to use ckout and a p-channel mosfet or some sort of tri-state switch to turn on/off the clock.

January 3, 2016 - 3:05 am bigjosh2

Add hardware?!? If I am willing to take that drastic step, might as well sell a kidney and buy a Pi Zero! :)

March 8, 2016 - 12:44 pm Pingback: Turbocharge the Adafruit DotStar Library Using Blind Send (20% Faster!) « Adafruit Industries – Makers, hackers, artists, designers and engineers!
November 20, 2018 - 9:39 am Christopher

I think there is a problem when driving more than 255 LEDs, because of a uint8 in the library. At least the last 3 LEDs of my 16×16 DotStar Matrix act weird when using this library. Can you help me fix this problem?

- November 20, 2018 - 11:55 am bigjosh2
  
  Is the problem with this TurboSPI version of the library only, or do you see it with the unmodified Adafruit version of the DotStar library as well?
  
November 21, 2018 - 1:40 am Christopher

Only with this TurboSPI version. The unmodified Adafruit version works just fine.

November 21, 2018 - 11:40 am bigjosh2

You are correct Sir!

Here is the line that decrements the length counter…

https://github.com/bigjosh/Adafruit_DotStar/blob/52f573a9681909261029f149af785c539756ec69/Adafruit_DotStar.cpp#L255

…which is held in the single byte register i.

To make this code work with lengths longer than 255, you’d need to expand that counter to be a word rather than a byte.

Luckily the AVR has a handy SBIW instruction that can decrement a word in only 2 cycles, and there are 3 cycles here to work with!

So you would need to change the size of the value passed into i to be a word and then change that dec to sbiw 0x01… and then count all the cycles and make sure it still added up right.

Want to give it a go?

- November 21, 2018 - 11:48 am bigjosh2
  
  Wait! It looks like I already did this?!? Try using the master branch from the repo here…
  
  https://github.com/bigjosh/Adafruit_DotStar
  
  It seems to have all the changes described above!
  

josh.com