SPI with a Blindfold On: speed up by letting go and trusting the machine

Constantly checking to see if the coast is clear feels responsible, but it wastes cycles. Sometimes it is better to leap (or load) without looking. With a little hand-coded assembly, we can run our AVR processor lock-step with the SPI hardware and blindly dump new bytes into it at precisely the right moment. Because we don’t spend any time reading and testing status bits, we can increase the maximum throughput by more than 20%. If the prospect of screamingly fast yet perfectly safe SPI turns you on, read on…

Slow SPI = Dim LEDs

I was working on a board to drive an awesome giant display made out of vintage 1980s LED modules. Back in the ’80s, buffered shift registers were a luxury few could afford, so these old modules are unbuffered. This means we have to turn the display completely off while shifting new data into the registers, or the flying bits will show up on the display.

Vintage LED Controller

The longer the display is off, the dimmer it looks. Dim is bad, so we need to squirt the bits out as fast as possible.

I coded everything up using the top SPI speed setting (8 Mb/s) and… it was too dim.

Why so slow?

Slapping the scope on immediately showed the problem…

[Scope capture: large gaps between the transmitted SPI bytes]

See the big gaps between bytes? What a waste of time. These gaps reduce the overall speed to only about 5.4 Mb/s!
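
To put a rough number on the waste, here is a back-of-the-envelope sketch. It assumes the usual 16 MHz system clock, which is what makes clk/2 come out to the 8 Mb/s top speed.

[code lang="cpp"]
// Back-of-the-envelope check (assumed 16 MHz system clock).
const float F_CPU_HZ   = 16000000.0;               // assumed system clock
const float MEASURED   =  5400000.0;               // ~5.4 Mb/s seen on the scope

const float idealCycles = 8 * 2;                   // 8 bits * 2 cycles/bit = 16 cycles per byte
const float realCycles  = F_CPU_HZ * 8 / MEASURED; // = 128 / 5.4 ≈ 23.7 cycles per byte

// ...so those gaps are costing us roughly 8 CPU cycles for every byte sent.
[/code]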

Why so slow, SPI?

To the code

Here is the library code for the transfer() function…

[code lang="cpp"]
inline static void transfer(void *buf, size_t count) {
  if (count == 0) return;
  uint8_t *p = (uint8_t *)buf;
  SPDR = *p;
  while (--count > 0) {
    uint8_t out = *(p + 1);
    while (!(SPSR & _BV(SPIF))) ;
    uint8_t in = SPDR;
    SPDR = out;
    *p++ = in;
  }
  while (!(SPSR & _BV(SPIF))) ;
  *p = SPDR;
}
[/code]

This is very nice code. I tried rewriting it in hand-crafted assembler and could only squeeze a single cycle out of it after an hour's worth of work. Whoever wrote this library code did a great job.

So if we can’t optimize our way out of this bag, how else can we get some extra speed?

If I had a buffer, I’d buffer in the morning

No SPI transmit buffer on AVR

Unfortunately, the SPI transmitter on the AVR is not buffered, so we have to wait for one byte to be completely transmitted before loading the next one. For each byte, the above code must…

  1. Send the byte
  2. Load the SPI status into a register
  3. Check a bit in that register to see if the transmission has completed…
  4. Go back to step #2 if not

No matter how you code it, steps 2-4 burn cycles. Any time we spend loading, checking, and branching is time we are not spending actually getting the next byte out the door.
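
To see where those cycles go, here is roughly what that poll-and-test loop looks like at the instruction level. This is a hand-written sketch of the idea, not actual compiler output.

[code]
WAIT:
  in   __tmp_reg__, SPSR    // 1 - read the SPI status register (via its I/O address)
  sbrs __tmp_reg__, 7       // 1 - SPIF is bit 7; skip the jump once it is set
  rjmp WAIT                 // 2 - not set yet, go around and check again
[/code]

Each pass through that little loop costs about 4 cycles, and because we only sample the flag once per pass, we usually notice that the byte has finished a couple of cycles after it actually did.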

If only we could somehow just magically know when the current byte has completed without having to do time-consuming loading and testing. We need a psychic processor!

The Lock-step Blind Send

The SPI hardware on the AVR is completely deterministic. It takes exactly 2 processor cycles to send each bit, so 16 cycles to send a full byte. This means that if we close our eyes and start counting cycles the moment we start to send a byte, then the SPI should be done exactly when we get to 16. We don’t have to do any costly checks- we just have to count! Determinism is even better than mysticism!

For precise cycle counting like this, we are going to need more control than we can reliably get from C. We are going to have to drop down to pure assembly.

Here is code that sends one byte exactly every 16 cycles…

[code]
LOOP_LEN:
                            //  Cycles
                            //  ------
  ld   __tmp_reg__,Z+       //    2   - load next byte
  out  SPDR,__tmp_reg__     //        - (transmit byte!)

  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs

  sbiw len, 1               //    2   - dec counter
  brne LOOP_LEN             //    2   - loop back until done
                            //  ======
                            //   16

[/code]

Note that those rjmp .+0 lines do nothing but waste 2 cycles. It is exactly the same as putting two consecutive nops, but uses half as much code space.
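
In other words, these two fragments waste the same two cycles, but the first one only takes one instruction word of flash…

[code]
  rjmp .+0    // 2 cycles, 1 word - jump to the very next instruction

  nop         // 1 cycle,  1 word
  nop         // 1 cycle,  1 word - same 2 cycles total, but 2 words of flash
[/code]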

Too early for the party

Unfortunately, when we run the above code, we get very fast throughput, but no data past the first byte! That is not very useful.

[Scope capture: bytes packed tightly together, but no valid data after the first byte]

If we try adding one extra cycle in our delay loop…

[code highlight="12"]
LOOP_LEN:
                            //  Cycles
                            //  ------
  ld   __tmp_reg__,Z+       //    2   - load next byte
  out  SPDR,__tmp_reg__     //        - (transmit byte!)

  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  rjmp .+0                  //    2   - twiddle thumbs
  nop                       //    1   - twiddle thumb

  sbiw len, 1               //    2   - dec counter
  brne LOOP_LEN             //    2   - loop back until done
                            //  ======
                            //   17
[/code]

…then we have high throughput and good data! Yay!

[Scope capture: bytes packed tightly together, with good data]

Why do we need that extra cycle? I don’t know for sure, but my guess is that the silicon uses that extra cycle to copy the contents of the SPI shift register into the SPI Read Buffer.

We win!

By using blind lock-step sending, we’ve increased our maximum SPI throughput from 84.34 kBps to 107.57315 kBps!

That’s a throughput increase of more than 20% just for letting go of our constant neurotic testing and trusting the state machine!

Usable code

To clean things up, we need to add a line that checks for zero length messages.

We also need to make sure that we use up an extra cycle at the end of each transmit so that consecutive transmits don’t run into each other.

Here is the finished fast transmit function…

[code lang="cpp"]
inline static void fastSpiTransmit( const void *buf , unsigned int len ) {

  if (len == 0) return;              // Do nothing if len is zero

  asm volatile (

    "LOOP_LEN_%=:              \n\t"
    "ld __tmp_reg__,%a[buf]+   \n\t"
    "out %[spdr],__tmp_reg__   \n\t"

    "rjmp .+0                  \n\t"
    "rjmp .+0                  \n\t"
    "rjmp .+0                  \n\t"
    "rjmp .+0                  \n\t"
    "rjmp .+0                  \n\t"
    "nop                       \n\t"

    "sbiw %[len], 1            \n\t"
    "brne LOOP_LEN_%=          \n\t"

    "nop                       \n\t" // use up the cycle we saved from the above branch not taken
                                     // this makes sure that if we have two transmits inlined in a row,
                                     // the second one will not step on the first.

    : // Outputs: (these are actually inputs, but we mark them as read/write outputs since they get changed during execution)
      // "there is no way to specify that input operands get modified without also specifying them as output operands."

      [buf] "+e" (buf),              // pointer to buffer
      [len] "+w" (len)               // length of buffer

    : // Inputs:
      "[buf]" (buf),                 // pointer to buffer
      "[len]" (len),                 // length of buffer

      [spdr] "I" (_SFR_IO_ADDR(SPDR)) // SPI data register

    : // Clobbers
      "cc"                           // special name that indicates that flags may have been clobbered

  );

}
[/code]

Here is a ready-to-run demo project for Arduino that will send the message "SPI is fun" out the SPI as fast as an AVR SPI can possibly go…
https://github.com/bigjosh/FastestSPI/blob/master/FastestSPI.ino

Note that the fastSpiTransmit() function will work unaltered on either Arduino or direct bare-metal avr-gcc.
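
If you want to wire it up by hand instead of grabbing the demo, the setup is just the standard master-mode, clk/2 SPI configuration. Here is a minimal sketch along those lines; the register setup below is written from the datasheet, not copied from the demo project.

[code lang="cpp"]
#include <Arduino.h>
#include <avr/io.h>

// ...fastSpiTransmit() as defined above goes here...

void setup() {
  pinMode(MOSI, OUTPUT);          // master drives MOSI...
  pinMode(SCK,  OUTPUT);          // ...and SCK
  pinMode(SS,   OUTPUT);          // SS must be an output (or held high) or the AVR can drop out of master mode

  SPCR = _BV(SPE) | _BV(MSTR);    // enable SPI, master mode, mode 0, MSB first
  SPSR |= _BV(SPI2X);             // double speed: SCK = F_CPU / 2, the top speed this code is tuned for
}

void loop() {
  const char msg[] = "SPI is fun";
  fastSpiTransmit(msg, sizeof(msg) - 1);   // blast the bytes out back-to-back
}
[/code]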

The Takeaway

Even if you don’t care about SPI at all, hopefully this exercise has still helped to remove some magic. These amazing little computers are still just a big pile of gates. If we see them as simple state machines, sometimes we can get them to do more than if we only see them as abstract devices.

FAQ

Q: Will this work with slower SPI speeds?
A: This code is hand crafted to work with the fastest SPI speed on AVR-8, which is 1/2 the system clock. If you are running at slower speeds then you probably won’t care about the couple of lost cycles between bytes. That said, it would certainly be possible to make this code work with any SPI speed, but it would add complexity and make it harder to understand what is going on.
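
One simple way to stay safe without complicating the fast path (just a sketch of the idea, not something from the demo project): only call the blind loop when the hardware really is configured for clk/2, and fall back to a normal polled send otherwise.

[code lang="cpp"]
// True only when the SPI clock is F_CPU/2 (SPI2X set, SPR1:SPR0 == 0),
// which is the one speed the blind 17-cycle loop is tuned for.
static inline bool spiIsAtClkDiv2() {
  return (SPSR & _BV(SPI2X)) && !(SPCR & (_BV(SPR1) | _BV(SPR0)));
}

// Hypothetical wrapper: blind-send only when it is safe to do so.
static void safeFastTransmit(const void *buf, unsigned int len) {
  if (spiIsAtClkDiv2()) {
    fastSpiTransmit(buf, len);        // the cycle-counted loop from above
  } else {
    // Slower clocks: a plain polled send is fine, the gaps hardly matter here.
    const uint8_t *p = (const uint8_t *)buf;
    while (len--) {
      SPDR = *p++;
      while (!(SPSR & _BV(SPIF))) ;   // wait for this byte to finish
    }
  }
}
[/code]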

Q: The library SPI function both sends and receives, while your code only sends. Is this why yours is faster?
A: No. This code is faster because it has less delay between bytes. Since the received SPI byte is buffered on AVR, it would not be hard to also save this byte during the extra time that is spent in the NOPs in the above code.

Q: Ok, this is interesting in theory, but how could I possibly use this in practice?
A: How about a ready-to-run, drop-in replacement library that uses blind-send SPI to update your DotStar LEDs 18-23% faster?

Update!!!

Yikes! An extra RJMP crept into the final C code in a horrible copy/paste accident! If you grabbed this code before 12/21/2015, switch to this new version and get an extra 2-cycle savings!

7 comments

    • bigjosh2

      Exactly! And your C code is so much more beautiful than the harsh ASM.

      I also tried doing it in vanilla C, but found it too brittle since the compiler has a lot of leeway in translating it.

      For example, with size optimization enabled (the default on the Arduino compiler), the C code compiles to…

      11a: 81 e0 ldi r24, 0x01 ; 1
      11c: ea 30 cpi r30, 0x0A ; 10
      11e: f8 07 cpc r31, r24
      120: 71 f0 breq .+28 ; 0x13e
      122: 81 91 ld r24, Z+
      124: 8e bd out 0x2e, r24 ; 46
      ...
      13a: 00 00 nop
      13c: ee cf rjmp .-36 ; 0x11a

      …which comes out 1 cycle too long. Size optimization is now the default setting on the Arduino compiler, which a lot of people use.

      When every cycle counts, I think we have no choice but to drop down to ASM to get reliable results. Thanks!

    • bigjosh2

      Fine code, sir! Browsing your blog, I see that this is not the first time I’ve found myself walking in your (rather large and wide ranging) footsteps… :)

    • bigjosh2

      I felt like the above code was already kind of gnarly to follow so I didn’t want to add control flow jumping around to make people dizzy. In the actual example library code, I did use rcalls for space efficient delays – even reusing the same return with multiple call entry points.

  1. Wolfgang Schreiter

    Hi, thanks for your post which provided me with valuable insights about SPI. I’ve made a small change to the code above which may be of value, but don’t have the experience or equipment to check it out (other than that it works in my test setup).
    Here’s the idea: instead of burning cycles with rjmp/nop, I check for SPIF as one would normally do. Each iteration burns 4 cycles (3 on exit), for a total of 9 + 4*n. If I’m not mistaken, this should work for all SPI speeds, and should always be optimal (for send only), assuming timing and hardware of AVR processors like Atmega328.

    void SPIClass::fastTransfer(const void *buffer, size_t length)
    {
    if (length == 0) return;

    asm volatile (      
        "LOOP_%=:                  \n\t"
        "ld __tmp_reg__,%a[buf]+   \n\t"       //   2    - load next byte
        "out %[spdr],__tmp_reg__   \n\t"       //        - transmit byte
        "WAIT_%=:                  \n\t"
        "in __tmp_reg__,%[spsr]    \n\t"       //   1    - load status register
        "sbrs  __tmp_reg__,7       \n\t"       //   1/2  - test for SPIF set
        "rjmp WAIT_%=              \n\t"       //   2
        "sbiw %[len],1             \n\t"       //   2    - decrease length
        "brne LOOP_%=              \n\t"       //   2
        :         // Outputs: (these are actually inputs, but we mark as read/write output since they get changed during execution)
                  // "there is no way to specify that input operands get modified without also specifying them as output operands."
        [buf]   "+e" (buffer),                 // pointer to buffer
        [len]   "+w" (length)                  // length of buffer
        :         // Inputs:
        "[buf]" (buffer),                      // pointer to buffer
        "[len]" (length),                      // length of buffer
        [spsr]  "M" (_SFR_IO_ADDR(SPSR)),
        [spdr]  "M" (_SFR_IO_ADDR(SPDR))       // SPI data register
        :         // Clobbers
        "cc"                                   // special name that indicates that flags may have been clobbered
    );
    

    }

    P.S. Your code on github still has one rjmp too many.

    • Wolfgang Schreiter

      Well, it would have been nice… please forget my comment above. I was too focussed on counting CPU cycles, but of course the code after the “in” instruction will also be executed when the transfer is complete. So, back to the drawing board.
