A software only solution to the vexing Beagle Bone Black PHY issue

Power cycling BBB’s with clicky relays

Sometimes the Ethernet port on a Beagle Bone Black does not work on power up. It takes either a physical reset button press or a power cycle to fix it. This problem affects all BBB’s and until now could only be solved with hardware hacks.

The final official word from TI on this problem:

There is no solution for this on the BB Black

This sucks. If you are thinking of using a Beagle Bone Black for anything important… then don’t. And don’t bother reading the rest of this article.

If you already are stuck with the Beagle Bone Black and wish that it worked right each and every time you turn it on, then read this article.

Here is a workaround that can guarantee that an unmodified Beagle Bone Black will have a working Ethernet port within a few minutes after power up.

IMPORTANT UPDATE
It looks like the 4.x kernels break this code by locking the MDIO registers. I only use the 3.x kernels because of issues with the PRUSS subsystem in the newer ones, so I am not going to try to get this working on 4.x (and if you use the PRU, I’d recommend sticking with 3.x. also). If you really need 4.x and decide to get it working, please send me a PR!

tldr;

Install this package on your BBB and it will always have a working Ethernet port when it powers up – although it might take a couple of minutes and a few automatic power cycles.

Most interesting thing I learned

Relay contacts weld on closing rather than opening. This makes perfect sense once someone explains it, but was not my intuition before I welded a pile of relays running the tests for this fix.

The Problem

If you use Beagle Bone Blacks (BBBs), then I bet you can remember a few times when you plugged one in and waited a while for it to connect to the network. Eventually you gave up and power cycled it, chalking it up to maybe a loose network cable or DHCP server. Nope.

There is a design issue in the BBB that causes the Ethernet PHY chip to sometimes power up in an undefined state where it can not make a valid link.

This problem affects every single unit I have ever tested, and I’ve tested at least 100 of them. The chances that the problem will happen varies from board to board, and is also dependent which power connector you use…

Chance that BBB will power up with non-working Ethernet
Power connector Average Range
USB 2.15% 0.45%-4.95%
Barel Jack 3.68% (Only one unit tested)
Vcc Header Pins 16.18% 3.89%-42.43%

So USB is the best, but even on USB it is still possible to see thew problem almost 5% of the time. The header pins are by for the worst case, with one board coming up without Ethernet a shocking 42% of the time! These results match the geist in the forums where people powering via capes seem to be the ones freaking out while USB people try to convince them the problem is not so bad.

Just write a BASH script to PING and REBOOT when necessary?

If only it were so easy.

Doing a software reset on the PHY chip does not get it back to a defined state. There is no way to get the chip back unless you do a hardware reset on it, and on the BBB the hardware reset line of the PHY is tied to the hardware reset line of the ARM chip. Yuck.

What you can do

  1. If you are thinking of using a BBB for something, don’t. You do not want to live in an ecosystem where a deal-breaker problem like this could happen at all, and not be fixed several years and board revs later.
  2. If you are making a rev on the BBB, fix the reset circuit. Add a Schmidt trigger to the output of the reset button and/or switch the RESET line of the PHY from the master reset to a GPIO.
  3. If you have access to the hardware, use a custom cape or wire jumper to connect the header reset pin to a GPIO pin so you can make the board reset itself under software control.
  4. Use the software only workaround below.

How it works

If we look closely at the schematics, we see that the PHY reset line is connected directly to the reset button.

This is part of the cause of the problem, but also gives us a path to the workaround. This line is also connected to the nRESETIN_OUT pin of the ARM.

As it’s name hints, this line can be used as both an in and an out, so if we can get this line to go low then we can reset the PHY chip!

It might be as easy as just resetting the ARM… but there is a damn debouncing cap on the reset line (also a cause to the base issue) so we need that reset line to go low for a very long time.

There is a register that sets how long it will go low…

…but the longest we can set it to is not nearly long enough. We need a way to make that reset line stay down for hundreds of milliseconds.

The RESET line does go low if we power down the ARM, but if we power down then we are dead in the water so that does not help since we would need someone to push the power button to turn us back on.  Luckily there is a a crazy way to make this chip turn itself on!

Detecting when the PHY needs to be reset

After analyzing the PHY register contents from thousands of boots, , I found one bit that is always 1 when the Ethernet is dead and 0 when the Ethernet came up ok.

We can use the phyreg utility to test this bit in a script, and use the bbbrtc utility to keep power cycling until we get that bit in the right state with Ethernet working!

Code drop

https://github.com/bigjosh/bbbphyfix

Follow the instructions in the readme to install the package. It pulls in the bbbrtc and the phyreg utilities that it needs and then installs a little script called chkphy that runs at startup from init.d. This script checks the magic bit, and if the Ethernet is dead then it initiates a power cycle using bbbrtc. It pauses for 30 seconds before running to (1) let Linux finish mucking with the rtc before we try to use it, and (2) give you a chance to stop the process in case something goes wrong so you don’t end up in an infinite restart cycle.

There is also a directory in the repo called testing that has the scripts I used for testing this problem and fix.

Testing

All testing was done using a Raspberry Pi as the master controller. The Pi was connected to this handly device…

https://amzn.to/2LhHGhZ

..to be able to power cycle the BBBs, which were plugged into a power strip.

For the BBB’s powered via the header, I used a Meanwell RSP-200-5 to generate the 5 VDC.

Raw test data is here…

https://docs.google.com/spreadsheets/d/1WmNm-7ocITAZ6FfwkT9HOeOWOosVu3ErbyIZcFE-7zY/edit?usp=sharing

The 7 columns represent the 7 boards under test, each row represents one power cycle. The number in the cell is the number of 30-second time periods it took for the board to reply to a ping (0=board never replied after 20 tries, so assumed dead).

Look in the testing directory of the above repo for the scripts that turn the power on and off and look for the pings. There is also a little c program called pingb that pings a bunch of addresses periodically.

FAQ

Why bother checking the magic bit when you could just use ping test to see if the Ethernet is working?

The bit tells you directly if the Ethernet is dead. Using a PING test, you might end up power cycling over and over again if, say, the network cable got unplugged or the ping target is not reachable.

Why bother checking the magic bit when you could just use Yona’s test script?

Yona invented the reset-pin-to-GPIO hack that was the least invasive and most effective workaround available. Here is his test…

# Wait until eth0 is either initialized or failed to initialize
while ! dmesg | grep 'net eth0: phy found';
   do echo 'Waiting for eth0...'
   sleep 1
   if dmesg | egrep -i 'mdio:03 not found'; then
      echo "NO" >>/var/tmp/test.txt
      sync
      exit
   fi
done

This script script is very good and catches almost every time the Ethernet is dead, but it is not perfect. Depending on the board, as many as 2.5% of dead Ethernet powerups will pass this test.

After more than 10,000 test cycles, I have yet to see a case where the magic bit test did not correctly detect a dead Ethernet port.

You are depending on a hard coded value in an undocumented reserved bit! For shame! For Shame!

I 100% agree, but desperate times call for desperate measures. I’ve tested 100+ boards, and this bit works on all of them all the time. You should test any board you plan to use this fix on. Please let me know if you ever find one that breaks it.

BTW, Microchip has responded with a tiny bit of color on this bit…

“I was able to find that Bit 13 is called CO_CLK_FREQ and has to do with the internal clock generation.  The factory indicated that this is likely a NASR bit, as this condition where it is 1 is needed initially after a hardware reset until the defaults and configuration straps are programmed, then it would return to 0 either after the initial configuration was done or after a hardware reset with the first write/read to a register.”

So seems like this 1 is indicating that the PHY was unable to complete initial initialization, which is consistent with the port not working.

What is the big deal? Just push the damn reset button if the Ethernet port is not working. 

You can’t always reach the reset button if, say, the board is inside a machine or on the side of a building. Considering the BBB is touted for use as a headless industrial computer, these are places they can be found.

What is the big deal? Just power cycle the board if the Ethernet is not working.

You can’t always power cycle, especially if the board is inside some other piece of equipment, or is in a remote location, or is part of a system that you need to always power up into a working state even if unattended.

In my case, the laws of statistics made power cycling completely impractical. I have an installation with 72 BBB’s on the side of a building. Each BBB powers up with no Ethernet on average about 10% of the time. This means each time you turn the assembly on, there is only about a 1/1920 chance that all the Ethernet ports will work. It takes about 2 minutes to power cycle the system, so on average it would take about 32 hours of cycling to get the system to power up with all BBB’s working.

There is a kernel patch that fixes this

There is a kernel patch that fixes a related problem (the PHY comes up with the wrong address, but otherwise functional) caused by the same design issue, but it does not help in the failure mode we are addressing here where the PHY will never link.

###

21 comments

  1. eduardvanraalte

    Hi Josh,
    You nailed it again. I’m just curious. What are you using those 72 BBB’s in the side of a building for?

  2. David Grayson

    Thanks for the warning! I used to think that the Beagle Bone Blacks were actually better than the Raspberry Pi because of their realtime/microcontroller stuff but I hadn’t actually tried to use them. I guess I’ll stick to Raspberry Pis.

    Also, that looks like an awesome project you’re doing there, nice.

    • bigjosh2

      Yes, the 2x real-time units (PRU’s) on that ARM are really amazing and useful hardware.. but sadly the design of the BBB has too many complications and problems. :(
      I think it would be nice to design a real-time unit to snap onto a RaspberryPi to get the same kind of functionality, although you could never reproduce the amazing ability of the PRU and the ARM to share memory at full speed.

  3. Brad Griffis

    Josh,

    I had an idea of how you might resolve the issue entirely. From the bits and pieces I’ve seen, it sounds like the PHY is not getting a sufficiently long reset. Does that match your understanding? If not, the rest of this might not apply…

    The AM335x nRESET_INOUT assertion time is configurable through the PRM_RSTTIME register. Note that on a cold boot this always reverts to its default value, so in order to leverage this capability we need to configure the register and then initiate a warm reset.

    I think you could add some code to u-boot to make this change and trigger a warm reset on EVERY cold power up. The sequence would look like this:

    #define PRM_RSTCTRL (volatile unsigned int)0x44E00F00
    #define PRM_RSTTIME (volatile unsigned int)0x44E00F04
    #define PRM_RSTST (volatile unsigned int)0x44E00F08

    if ( PRM_RSTST&1 ) // check if cold reset has occurred
    {
    PRM_RSTST = 1; // clear the cold reset bit
    PRM_RSTTIME = 0x10FF; // extend nRESET_INOUT to maximum value
    PRM_RSTCTRL = 1; // initiate warm reset
    }

    FYI, I’ve not tested (or for that matter even compiled!) the above code. I hope this helps.

    • bigjosh2

      Thanks for the suggestion! This is one of the first strategies I tried, but unfortunately the longest delay possible with PRM_RSTTIME is 255 * CLK_M_OSC, which orders of magnitude too short to overcome the filter cap on the reset line.

  4. K

    Taking a quick look at the latest revision schematic, it looks like this could be a reset violation on the LAN8710. The LAN8710 primary voltage looks like it is regulated from the 5V input, to a rail labelled VDD_3V3B. The PMIC does not appear to have any influence or monitoring of the VDD_3V3B rail. The LAN8710 nRST line is pulled low by the PMIC_GOOD signal, through a buffer.

    The specification for the LAN8710, assuming I’m reading it correctly, expects nRST low when the primary voltage (VDD_3V3B in this case) reaches 80%, and expects the nRST to remain asserted for a minimum of 25ms. To meet specifications of the LAN8710, this means the output of the 7407 would have to be low before VDD_3V3B reaches 80%, and the PGOOD of the TPS65217C would have to remain low for 25ms from VDD_3V3B reaching 80%.

    The default configuration for the TPS65217C is to have 20ms delay on the PGOOD signal (DEFPG register @ address 0x0D). I’m not sure you can reliably state that the 25ms reset to the LAN8710 is being observed. By setting the lowest 2-bits of the DEFPG register, the PGDLY, to 01b the delay could be increased to 100ms, and could resolve the issue.

    Again, I didn’t look deep enough to see if there is something else that guarantees a clean reset to the LAN8710, but that is where I would look. If reprogramming the PMIC is a hassle, this could be tested by using a voltage supervisor (maybe TPS3800G27?) hooked into the LAN8710 nRST, which would monitor VDD_3V3B and guarantee the required reset time.

  5. John Buck

    Josh,
    The bbbrtc utility does not work under Linux 4.14.40 on the BBB.
    No errors are reported, but it sits and spins waiting for the RTC to stop.
    Debugging reveals that programming of the RTC_CTRL_REG with 0 does not work. For that matter, it seems, that any write does not work (set32reg()). I inserted some calls to set each of the scratch registers and they don’t get set either. By-passing the spin loop (after 100000 loops), indicates that the /dev/mem is mapped properly since the values printed by bbbrtc dump look fine, and the seconds changes between calls. However, any write to any register does not work.

    There is a fair amount of chatter on the various BB forums and Linux forums that certain restrictions may have been, or have been added to /dev/mem preventing writing to certain areas.

    The only other thought I have is that perhaps the RTC registers are “write protected”. There is mention of write-protecting the scratch registers in the BBB technical manual, so I’m wondering if maybe the newer kernel somehow write-protects them all somehow? I was wondering if you had heard about this or experienced this issue.

    I will continue digging to see why it no longer works. BTW the same binary runs fine on a 3.18.13 kernel.

    • John Buck

      Josh,
      It appears that the newer kernel locks the RTC using the KICKx registers. So, in order for bbbrtc to work, you have to unlock it:

      #define RTC_KICK0_REG 0x6c
      #define RTC_KICK1_REG 0x70
      #define KICK0_VALUE 0x83e70b13
      #define KICK1_VALUE 0x95a4f1e0

      set32reg( base, RTC_KICK0_REG, KICK0_VALUE);
      set32reg( base, RTC_KICK1_REG, KICK1_VALUE);

                  set32reg( base , RTC_CTRL_REG ,  0x00);     // Write a 0 to bit 0 to freeze the RTC so we can update
      

      I might also “relock” them at the end by writing 0’s to those registers.
      The above “fix” seems to work.

      Now, is it safe to write the KICK registers? While the doc does not explicilty say you can write those registers while the device is running, it makes sense that you should be able to… The doc does say you can only write to control and status, but if you couldn’t write to KICKx then you’d have no way to gain access to the RTC other than a (hard) reboot.

      John

    • bigjosh2

      Ah yes, I won’t touch the 4.x kernals because of all the PRUSS subsystem issues. I should have made clear this was only tested on 3.x. I will update the article tonight. Thanks!

  6. Leon Smith

    I am curious about some of the Linux 4.* PRUSS issues you allude to. Could you elaborate? (This is something I would like to play with at some point, but I haven’t gotten around to it yet.)

    • bigjosh2

      When I tried the new version, they had swapped out the uio_pruss for remote_proc, and the support for remote_proc was incomplete. Then things seamed like things bounced around for a couple of releases, so I gave up. Only so many hours you can spend chasing this stuff – I want things that work to keep working unless there is a good reason to break them! :)

  7. Chris

    Could this fix be ported to work in U-Boot rather than under Linux?

    If it would, it seems like that could both speed it up by not spending the time to boot Linux until it can be done so lastingly, and potentially avoid issues around Kernel changes.

    Trying to decide if this disqualifies the platform or if it’s worth trying to work on. BBB’s ability to boot from SPI flash was a positive for embedded applications. Could use the Pocket Beagle (or the Octavo chip itself) and try to implement the Ethernet right, but would also have to provide the eMMC for full image storage, which starts to make what was proposed as a “physical connector” board to integrate the BBB, into a product itself.

    • bigjosh2

      Could this fix be ported to work in U-Boot rather than under Linux?

      Yes, U-Boot would be the best place to do this.

      Trying to decide if this disqualifies the platform or if it’s worth trying to work on.

      Sadly I would discourage you from using this platform for production. This problem alone is a deal-killer for me. All platforms have issues, but the fact that a huge and unsubtle one like this has not be fixed (or even documented) over a half dozen revs is a signal.

      • Chris

        What would you pick instead? Pi’s just about unavoidable reliance on an SD card (or at best eMMC on the compute module) pretty much rules that out.

        Fortunately it’s sounding like we only need Ethernet during development / interactive configuration, not in deployment. And a USB Ethernet solution might be viable if that ever changed.

        I do understand the annoyance with unfixed issues and share the hesitance over that; however I also wonder how much is that this platform has been around longer for people to find the issues in, vs. a lot of others that seem to rev chip generations faster than experience can keep up with.

        • bigjosh2

          What would you pick instead?

          Yea, it is a hard question. I’ve switched to only PI’s when I need something to run linux. There are lots of issues to be sure, but at least they are known and not deal killers. I too was concerned about the reliance on an external SD and that was one of my motivations for picking BBB in the first place, but so far it has not been a problem for me.

          BBB’s have persistent power issues. These are deal killers. The standard answer seams to be “request an RMA if under warranty, throw away if not”. This is an unsatisfying answer, especially since they do not seem to be doing any post-mortem on these RMA’ed units to actually document and resolve the root problems.

          https://groups.google.com/forum/#!msg/beagleboard/jBFshjlPeHI/kPhWTnRQmnkJ

          I currently have two units in production that occasionally power off spontaneously. I think there is something going on with the PMIC. This is a persistent problem that I have already spent many hours on and will likely have to spend hours more on. It sucks.

          Here is my box BBBs that died in production….

          I hate to be so negative because there are things about the BBB I like, but I’ve spent so much time debugging and dealing with platform problems that I can not be positive.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s