A software-only solution to the vexing Beagle Bone Black PHY issue

Power cycling BBB’s with clicky relays

Sometimes the Ethernet port on a Beagle Bone Black does not work on power up. It takes either a physical reset button press or a power cycle to fix it. This problem affects all BBB’s and until now could only be solved with hardware hacks.

The final official word from TI on this problem:

There is no solution for this on the BB Black

This sucks. If you are thinking of using a Beagle Bone Black for anything important… then don’t. And don’t bother reading the rest of this article.

If you already are stuck with the Beagle Bone Black and wish that it worked right each and every time you turn it on, then read this article.

Here is a workaround that can guarantee that an unmodified Beagle Bone Black will have a working Ethernet port within a few minutes after power up.

IMPORTANT UPDATE 9/27/2021
This seems to now work on 4.x kernels. See this comment.

tl;dr

Install this package on your BBB and it will always have a working Ethernet port when it powers up – although it might take a couple of minutes and a few automatic power cycles.

Most interesting thing I learned

Relay contacts weld on closing rather than opening. This makes perfect sense once someone explains it, but was not my intuition before I welded a pile of relays running the tests for this fix.

The Problem

If you use Beagle Bone Blacks (BBBs), then I bet you can remember a few times when you plugged one in and waited a while for it to connect to the network. Eventually you gave up and power cycled it, chalking it up to maybe a loose network cable or DHCP server. Nope.

There is a design issue in the BBB that causes the Ethernet PHY chip to sometimes power up in an undefined state where it can not make a valid link.

This problem affects every single unit I have ever tested, and I’ve tested at least 100 of them. The chance that the problem will happen varies from board to board, and also depends on which power connector you use…

| Power connector | Average | Range |
|---|---|---|
| USB | 2.15% | 0.45%-4.95% |
| Barrel Jack | 3.68% | (Only one unit tested) |
| Vcc Header Pins | 16.18% | 3.89%-42.43% |

So USB is the best, but even on USB it is still possible to see the problem almost 5% of the time. The header pins are by far the worst case, with one board coming up without Ethernet a shocking 42% of the time! These results match the zeitgeist in the forums, where people powering via capes seem to be the ones freaking out while USB people try to convince them the problem is not so bad.

Just write a BASH script to PING and REBOOT when necessary?

If only it were so easy.

Doing a software reset on the PHY chip does not get it back to a defined state. There is no way to get the chip back unless you do a hardware reset on it, and on the BBB the hardware reset line of the PHY is tied to the hardware reset line of the ARM chip. Yuck.

What you can do

  1. If you are thinking of using a BBB for something, don’t. You do not want to live in an ecosystem where a deal-breaker problem like this could happen at all, and not be fixed several years and board revs later.
  2. If you are making a rev on the BBB, fix the reset circuit. Add a Schmitt trigger to the output of the reset button and/or switch the RESET line of the PHY from the master reset to a GPIO.
  3. If you have access to the hardware, use a custom cape or wire jumper to connect the header reset pin to a GPIO pin so you can make the board reset itself under software control.
  4. Use the software only workaround below.

How it works

If we look closely at the schematics, we see that the PHY reset line is connected directly to the reset button.

This is part of the cause of the problem, but also gives us a path to the workaround. This line is also connected to the nRESETIN_OUT pin of the ARM.

As its name hints, this line can be used as both an in and an out, so if we can get this line to go low then we can reset the PHY chip!

It might be as easy as just resetting the ARM… but there is a damn debouncing cap on the reset line (also a cause of the base issue) so we need that reset line to go low for a very long time.

There is a register that sets how long it will go low…

…but the longest we can set it to is not nearly long enough. We need a way to make that reset line stay down for hundreds of milliseconds.

The RESET line does go low if we power down the ARM, but if we power down then we are dead in the water, so that does not help since we would need someone to push the power button to turn us back on. Luckily there is a crazy way to make this chip turn itself on!

Detecting when the PHY needs to be reset

After analyzing the PHY register contents from thousands of boots, I found one bit that is always 1 when the Ethernet is dead and 0 when the Ethernet came up ok.

We can use the phyreg utility to test this bit in a script, and use the bbbrtc utility to keep power cycling until we get that bit in the right state with Ethernet working!
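
The bit test itself is tiny. Here is a minimal sketch, assuming the magic bit is bit 13 of PHY register 18 (the numbers used in the `phyreg test 18 13` invocation quoted in the comments). The function name is made up for illustration and is not part of phyreg; fetching the register value from the PHY is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed location of the magic bit: PHY register 18, bit 13. */
#define PHY_MAGIC_REG 18u
#define PHY_MAGIC_BIT 13u

/* Returns true when the 16-bit value read from PHY_MAGIC_REG has the
 * magic bit set, which is the state seen when the Ethernet is dead. */
static bool phy_needs_reset(uint16_t reg_value)
{
    return ((reg_value >> PHY_MAGIC_BIT) & 1u) != 0;
}
```

A startup script would call something like this once after boot and start a power cycle only when it returns true.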

Code drop

Follow the instructions in the readme to install the package. It pulls in the bbbrtc and the phyreg utilities that it needs and then installs a little script called chkphy that runs at startup from init.d. This script checks the magic bit, and if the Ethernet is dead then it initiates a power cycle using bbbrtc. It pauses for 30 seconds before running to (1) let Linux finish mucking with the rtc before we try to use it, and (2) give you a chance to stop the process in case something goes wrong so you don’t end up in an infinite restart cycle.

There is also a directory in the repo called testing that has the scripts I used for testing this problem and fix.

Testing

All testing was done using a Raspberry Pi as the master controller. The Pi was connected to this handy device…

…to be able to power cycle the BBBs, which were plugged into a power strip.

For the BBB’s powered via the header, I used a Meanwell RSP-200-5 to generate the 5 VDC.

Raw test data is here…

The 7 columns represent the 7 boards under test, each row represents one power cycle. The number in the cell is the number of 30-second time periods it took for the board to reply to a ping (0=board never replied after 20 tries, so assumed dead).
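
To turn one column of that data into a failure rate, count the zeros. A minimal sketch, assuming the column has already been parsed into an array of integers; the function name is hypothetical and is not one of the scripts in the testing directory.

```c
#include <stddef.h>

/* Fraction of power cycles in which a board never answered a ping.
 * periods[i] is the number of 30-second periods before the first reply
 * on power cycle i; 0 means the board never replied.
 * Returns 0.0 for an empty column. */
static double dead_boot_fraction(const int *periods, size_t n)
{
    size_t dead = 0;

    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++) {
        if (periods[i] == 0)
            dead++;
    }
    return (double)dead / (double)n;
}
```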

Look in the testing directory of the above repo for the scripts that turn the power on and off and look for the pings. There is also a little C program called pingb that pings a bunch of addresses periodically.

FAQ

Why bother checking the magic bit when you could just use ping test to see if the Ethernet is working?

The bit tells you directly if the Ethernet is dead. Using a PING test, you might end up power cycling over and over again if, say, the network cable got unplugged or the ping target is not reachable.

Why bother checking the magic bit when you could just use Yona’s test script?

Yona invented the reset-pin-to-GPIO hack that was the least invasive and most effective workaround available. Here is his test…

# Wait until eth0 is either initialized or failed to initialize
while ! dmesg | grep 'net eth0: phy found'; do
   echo 'Waiting for eth0...'
   sleep 1
   # Give up and log the failure if the kernel reports the PHY as missing
   if dmesg | grep -Ei 'mdio:03 not found'; then
      echo "NO" >>/var/tmp/test.txt
      sync
      exit
   fi
done

This script is very good and catches almost every time the Ethernet is dead, but it is not perfect. Depending on the board, as many as 2.5% of dead Ethernet powerups will pass this test.

After more than 10,000 test cycles, I have yet to see a case where the magic bit test did not correctly detect a dead Ethernet port.

You are depending on a hard coded value in an undocumented reserved bit! For shame! For Shame!

I 100% agree, but desperate times call for desperate measures. I’ve tested 100+ boards, and this bit works on all of them all the time. You should test any board you plan to use this fix on. Please let me know if you ever find one that breaks it.

BTW, Microchip has responded with a tiny bit of color on this bit…

“I was able to find that Bit 13 is called CO_CLK_FREQ and has to do with the internal clock generation.  The factory indicated that this is likely a NASR bit, as this condition where it is 1 is needed initially after a hardware reset until the defaults and configuration straps are programmed, then it would return to 0 either after the initial configuration was done or after a hardware reset with the first write/read to a register.”

So it seems like this 1 indicates that the PHY was unable to complete its initial configuration, which is consistent with the port not working.

What is the big deal? Just push the damn reset button if the Ethernet port is not working. 

You can’t always reach the reset button if, say, the board is inside a machine or on the side of a building. Considering the BBB is touted for use as a headless industrial computer, these are places they can be found.

What is the big deal? Just power cycle the board if the Ethernet is not working.

You can’t always power cycle, especially if the board is inside some other piece of equipment, or is in a remote location, or is part of a system that you need to always power up into a working state even if unattended.

In my case, the laws of statistics made power cycling completely impractical. I have an installation with 72 BBB’s on the side of a building. Each BBB powers up with no Ethernet on average about 10% of the time. This means each time you turn the assembly on, there is only about a 1/1920 chance that all the Ethernet ports will work. It takes about 2 minutes to power cycle the system, so on average it would take about 64 hours of cycling (roughly 1920 attempts at 2 minutes each) to get the system to power up with all BBB’s working.
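
The arithmetic can be checked with a few lines of code. This is a minimal sketch, assuming every board fails independently with the same probability. The function names are made up for illustration.

```c
/* Probability that all n boards come up with working Ethernet, assuming
 * each board fails independently with probability p_fail. */
static double all_up_probability(int n, double p_fail)
{
    double p = 1.0;

    for (int i = 0; i < n; i++)
        p *= 1.0 - p_fail;
    return p;
}

/* Expected hours of power cycling until one attempt brings up every board.
 * The number of attempts follows a geometric distribution, whose mean is
 * 1 / (probability of success per attempt). */
static double expected_hours(int n, double p_fail, double minutes_per_cycle)
{
    return minutes_per_cycle / all_up_probability(n, p_fail) / 60.0;
}
```

Because the per-board success rate is raised to the 72nd power, small rounding differences in the 10% figure shift the result noticeably, so treat the output as an estimate.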

There is a kernel patch that fixes this

There is a kernel patch that fixes a related problem (the PHY comes up with the wrong address, but otherwise functional) caused by the same design issue, but it does not help in the failure mode we are addressing here where the PHY will never link.

What about 4.x kernels?

Check out this comment.

###

92 comments

  1. eduardvanraalte

    Hi Josh,
    You nailed it again. I’m just curious. What are you using those 72 BBB’s in the side of a building for?

  2. David Grayson

    Thanks for the warning! I used to think that the Beagle Bone Blacks were actually better than the Raspberry Pi because of their realtime/microcontroller stuff but I hadn’t actually tried to use them. I guess I’ll stick to Raspberry Pis.

    Also, that looks like an awesome project you’re doing there, nice.

    • bigjosh2

      Yes, the 2x real-time units (PRU’s) on that ARM are really amazing and useful hardware.. but sadly the design of the BBB has too many complications and problems. :(
      I think it would be nice to design a real-time unit to snap onto a RaspberryPi to get the same kind of functionality, although you could never reproduce the amazing ability of the PRU and the ARM to share memory at full speed.

  3. Brad Griffis

    Josh,

    I had an idea of how you might resolve the issue entirely. From the bits and pieces I’ve seen, it sounds like the PHY is not getting a sufficiently long reset. Does that match your understanding? If not, the rest of this might not apply…

    The AM335x nRESET_INOUT assertion time is configurable through the PRM_RSTTIME register. Note that on a cold boot this always reverts to its default value, so in order to leverage this capability we need to configure the register and then initiate a warm reset.

    I think you could add some code to u-boot to make this change and trigger a warm reset on EVERY cold power up. The sequence would look like this:

    #define PRM_RSTCTRL (*(volatile unsigned int *)0x44E00F00)
    #define PRM_RSTTIME (*(volatile unsigned int *)0x44E00F04)
    #define PRM_RSTST   (*(volatile unsigned int *)0x44E00F08)

    if (PRM_RSTST & 1) // check if cold reset has occurred
    {
        PRM_RSTST = 1;        // clear the cold reset bit
        PRM_RSTTIME = 0x10FF; // extend nRESET_INOUT to maximum value
        PRM_RSTCTRL = 1;      // initiate warm reset
    }

    FYI, I’ve not tested (or for that matter even compiled!) the above code. I hope this helps.

    • bigjosh2

      Thanks for the suggestion! This is one of the first strategies I tried, but unfortunately the longest delay possible with PRM_RSTTIME is 255 * CLK_M_OSC, which is orders of magnitude too short to overcome the filter cap on the reset line.
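
To put a number on that gap, here is a minimal sketch that converts a cycle count into microseconds. The 24 MHz oscillator frequency used in the usage note is an assumption (it is not stated in this article), and the function name is made up for illustration.

```c
/* Longest reset-assertion time, in microseconds, reachable with a given
 * maximum cycle count at a given oscillator frequency in Hz. */
static double max_reset_assert_us(unsigned max_cycles, double osc_hz)
{
    return (double)max_cycles / osc_hz * 1e6;
}
```

With 255 cycles and an assumed 24 MHz CLK_M_OSC this comes out to a little over 10 microseconds, compared with the hundreds of milliseconds the article says the reset line needs.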

  4. K

    Taking a quick look at the latest revision schematic, it looks like this could be a reset violation on the LAN8710. The LAN8710 primary voltage looks like it is regulated from the 5V input, to a rail labelled VDD_3V3B. The PMIC does not appear to have any influence or monitoring of the VDD_3V3B rail. The LAN8710 nRST line is pulled low by the PMIC_GOOD signal, through a buffer.

    The specification for the LAN8710, assuming I’m reading it correctly, expects nRST low when the primary voltage (VDD_3V3B in this case) reaches 80%, and expects the nRST to remain asserted for a minimum of 25ms. To meet specifications of the LAN8710, this means the output of the 7407 would have to be low before VDD_3V3B reaches 80%, and the PGOOD of the TPS65217C would have to remain low for 25ms from VDD_3V3B reaching 80%.

    The default configuration for the TPS65217C is to have 20ms delay on the PGOOD signal (DEFPG register @ address 0x0D). I’m not sure you can reliably state that the 25ms reset to the LAN8710 is being observed. By setting the lowest 2-bits of the DEFPG register, the PGDLY, to 01b the delay could be increased to 100ms, and could resolve the issue.

    Again, I didn’t look deep enough to see if there is something else that guarantees a clean reset to the LAN8710, but that is where I would look. If reprogramming the PMIC is a hassle, this could be tested by using a voltage supervisor (maybe TPS3800G27?) hooked into the LAN8710 nRST, which would monitor VDD_3V3B and guarantee the required reset time.

    • Darryl

      I have around 15 BBBs and BBGs with custom o-scope capes monitoring some signals. Of course, I have seen the ethernet issues many times. I started using BBGs on later installs and even replaced a couple of the blacks with greens. I have yet (around 1.5 years) to see the BBGs have the ethernet issue but still see the BBBs have it. Granted, the boards don’t get power cycled very often so maybe I have just been lucky.

      • bigjosh2

        Unfortunately the BBG has the same design problem as the BBB, but how often you see the problem depends on the power supply type and manufacturing variances. You might have just gotten a lucky batch of BBGs where the problem is not as frequent with your power supplies. I have about 100 BBG’s and I see the problem painfully often.

        • Darryl

          Well, during the 6 months since I made my post I’ve had a few of the BBGs exhibit the problem. That prompted me to search out your solution, which I will implement. Thank you for helping us with this issue.

  5. John Buck

    Josh,
    The bbbrtc utility does not work under Linux 4.14.40 on the BBB.
    No errors are reported, but it sits and spins waiting for the RTC to stop.
    Debugging reveals that programming of the RTC_CTRL_REG with 0 does not work. For that matter, it seems, that any write does not work (set32reg()). I inserted some calls to set each of the scratch registers and they don’t get set either. By-passing the spin loop (after 100000 loops), indicates that the /dev/mem is mapped properly since the values printed by bbbrtc dump look fine, and the seconds changes between calls. However, any write to any register does not work.

    There is a fair amount of chatter on the various BB forums and Linux forums that certain restrictions may have been, or have been added to /dev/mem preventing writing to certain areas.

    The only other thought I have is that perhaps the RTC registers are “write protected”. There is mention of write-protecting the scratch registers in the BBB technical manual, so I’m wondering if maybe the newer kernel somehow write-protects them all? I was wondering if you had heard about this or experienced this issue.

    I will continue digging to see why it no longer works. BTW the same binary runs fine on a 3.18.13 kernel.

    • John Buck

      Josh,
      It appears that the newer kernel locks the RTC using the KICKx registers. So, in order for bbbrtc to work, you have to unlock it:

      #define RTC_KICK0_REG 0x6c
      #define RTC_KICK1_REG 0x70
      #define KICK0_VALUE 0x83e70b13
      #define KICK1_VALUE 0x95a4f1e0

      set32reg( base, RTC_KICK0_REG, KICK0_VALUE);
      set32reg( base, RTC_KICK1_REG, KICK1_VALUE);

      set32reg( base , RTC_CTRL_REG ,  0x00);     // Write a 0 to bit 0 to freeze the RTC so we can update

      I might also “relock” them at the end by writing 0’s to those registers.
      The above “fix” seems to work.

      Now, is it safe to write the KICK registers? While the doc does not explicitly say you can write those registers while the device is running, it makes sense that you should be able to… The doc does say you can only write to control and status, but if you couldn’t write to KICKx then you’d have no way to gain access to the RTC other than a (hard) reboot.

      John

    • bigjosh2

      Ah yes, I won’t touch the 4.x kernels because of all the PRUSS subsystem issues. I should have made clear this was only tested on 3.x. I will update the article tonight. Thanks!

  6. Leon Smith

    I am curious about some of the Linux 4.* PRUSS issues you allude to. Could you elaborate? (This is something I would like to play with at some point, but I haven’t gotten around to it yet.)

    • bigjosh2

      When I tried the new version, they had swapped out the uio_pruss for remote_proc, and the support for remote_proc was incomplete. Then it seemed like things bounced around for a couple of releases, so I gave up. Only so many hours you can spend chasing this stuff – I want things that work to keep working unless there is a good reason to break them! :)

  7. Chris

    Could this fix be ported to work in U-Boot rather than under Linux?

    If it would, it seems like that could both speed it up by not spending the time to boot Linux until it can be done so lastingly, and potentially avoid issues around Kernel changes.

    Trying to decide if this disqualifies the platform or if it’s worth trying to work on. BBB’s ability to boot from SPI flash was a positive for embedded applications. Could use the Pocket Beagle (or the Octavo chip itself) and try to implement the Ethernet right, but would also have to provide the eMMC for full image storage, which starts to make what was proposed as a “physical connector” board to integrate the BBB, into a product itself.

    • bigjosh2

      Could this fix be ported to work in U-Boot rather than under Linux?

      Yes, U-Boot would be the best place to do this.

      Trying to decide if this disqualifies the platform or if it’s worth trying to work on.

      Sadly I would discourage you from using this platform for production. This problem alone is a deal-killer for me. All platforms have issues, but the fact that a huge and unsubtle one like this has not been fixed (or even documented) over a half dozen revs is a signal.

      • Chris

        What would you pick instead? Pi’s just about unavoidable reliance on an SD card (or at best eMMC on the compute module) pretty much rules that out.

        Fortunately it’s sounding like we only need Ethernet during development / interactive configuration, not in deployment. And a USB Ethernet solution might be viable if that ever changed.

        I do understand the annoyance with unfixed issues and share the hesitance over that; however I also wonder how much is that this platform has been around longer for people to find the issues in, vs. a lot of others that seem to rev chip generations faster than experience can keep up with.

        • bigjosh2

          What would you pick instead?

          Yea, it is a hard question. I’ve switched to only PI’s when I need something to run linux. There are lots of issues to be sure, but at least they are known and not deal killers. I too was concerned about the reliance on an external SD and that was one of my motivations for picking BBB in the first place, but so far it has not been a problem for me.

          BBB’s have persistent power issues. These are deal killers. The standard answer seems to be “request an RMA if under warranty, throw away if not”. This is an unsatisfying answer, especially since they do not seem to be doing any post-mortem on these RMA’ed units to actually document and resolve the root problems.

          https://groups.google.com/forum/#!msg/beagleboard/jBFshjlPeHI/kPhWTnRQmnkJ

          I currently have two units in production that occasionally power off spontaneously. I think there is something going on with the PMIC. This is a persistent problem that I have already spent many hours on and will likely have to spend hours more on. It sucks.

          Here is my box of BBBs that died in production…

          I hate to be so negative because there are things about the BBB I like, but I’ve spent so much time debugging and dealing with platform problems that I can not be positive.

  8. Mattias

    Thanks for the great information!

    We have bumped in to this issue as well
    The kernel patch for finding PHY regardless of address GREATLY reduces problem, but does not completely fix all cases where it does not link

    I have a question about the solution for detecting the error though:
    I cannot get your phyreg application working, it just halts waiting for ACK (4.4 kernel)
    Is the kick-registers solution applicable here as well as with RTC?

    Also, if I just read register 0x4a101080 without writing to it I seem to be able to detect error in bit 13. Why is the write needed?

  9. bigjosh2

    Is the kick-registers solution applicable here as well as with RTC?

    I have not looked into this, but it seems like the new kernel is blocking devmem access to these addresses. It might not be possible to do this from userspace any more, in which case you’d need to make a (very tiny & simple) driver to do these accesses.

    Also, if I just read register 0x4a101080 without writing to it I
    seem to be able to detect error in bit 13. Why is the write needed?

    0x4a101080 is just a communications channel used to talk to the state machine inside the MDIO. To read a register inside the MDIO, you have to go through a dance where you tell the state machine which register you want to read, then you wait for a while for it to go get the register, and then you read back the value it got.

    Is this the write you are referring to?

    If so, then without the write that tells the MDIO to go fetch the value, the value you see at 0x4a101080 will be just whatever the last value anyone happened to have read from any MDIO register on any PHY device.
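
The request and reply words in that dance can be sketched as follows. This is an illustration only, assuming the MDIOUSERACCESS0 layout from the AM335x TRM: GO in bit 31, WRITE in bit 30, ACK in bit 29, register address in bits 25:21, PHY address in bits 20:16 and data in bits 15:0. The function names are made up and are not taken from phyreg, and the actual memory-mapped access is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MDIO_GO    (1u << 31)  /* set to start a transaction, cleared when done */
#define MDIO_WRITE (1u << 30)  /* 1 = write, 0 = read */
#define MDIO_ACK   (1u << 29)  /* set when the PHY answered a read */

/* Builds the word that asks the MDIO to read reg_addr on phy_addr. */
static uint32_t mdio_read_request(unsigned phy_addr, unsigned reg_addr)
{
    return MDIO_GO
         | ((uint32_t)(reg_addr & 0x1Fu) << 21)
         | ((uint32_t)(phy_addr & 0x1Fu) << 16);
}

/* Interprets a word read back from the access register.
 * Returns true and stores the 16-bit register value only when the
 * transaction has finished (GO clear) and the PHY acknowledged it. */
static bool mdio_read_result(uint32_t word, uint16_t *value)
{
    if (word & MDIO_GO)
        return false;          /* request still pending */
    if (!(word & MDIO_ACK))
        return false;          /* finished, but no PHY answered */
    *value = (uint16_t)(word & 0xFFFFu);
    return true;
}
```

Under this layout, a read request for register 18 on PHY address 0 encodes as 0x82400000.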

    • Mattias

      I do seem to be able to write to the register with devmem.
      Immediate readback gives me the value I wrote (0x82400000) but if I read again the value will have changed to something closely resembling what is in the devmem examples.

      I could interpret this as that I’m actually allowed to write, and as you said; it might need some time to fetch the value.

      That still does not explain why the phyreg application locks up though.

      • bigjosh2

        If you are writing 0x82400000 to the address 0x4a101080 then you are actually issuing a READ request to the MDIO register rather than a write since the 30th bit in the value is the write flag. The changes you are seeing after are likely the top bit going to 0 which indicates that the command has been executed, and possibly the 29th bit (the ACK bit) going to 1 if the read was successful. BTW, if the ACK bit is set then the bottom 16 bits have the value that was actually read from the specified register (in your example, register 2 of phy 0 if I count my bits correctly!).

        What does phyreg print immediately before locking up?

        What happens if you try to scan for all PHY’s? How about if you try to dump all regs on a found phy?

  10. Mattias

    First off, thank you for taking the time to respond! Not expected but much appreciated!

    Put shortly, I just want to do a “phyreg test 18 13”, or a devmem equivalent of it, so that we can trigger reset on ethernet failure.

    I can’t get phyreg to work for me
    If I instead use devmem directly, my understanding is that “devmem2 0x4a101080 w 0x82400000” and reading back 0x4a101080 would do the same thing
    (from the picture example with the registers)

    (in your example, register 2 of phy 0 if I count my bits correctly!).

    I’m TRYING to read register 18 on first phy, I think :)

    What does phyreg print immediately before locking up?

    
    $ phyreg test 18 13
    Alive bits:0000-0000-0000-0001
    First PHY found at address 0.
    PHY=00 REG=18 : IDLE READ    # waits forever
    

    I guess phyreg waits for ACK in this state

    What happens if you try to scan for all PHY’s?

    
    $ phyreg
    ALIVE ADDRESSES:0000-0000-0000-0001
    LINK  ADDRESSES:0000-0000-0000-0001
    

    How about if you try to dump all regs on a found phy?

    It locks up on first register 0, same as before, after IDLE READ

      • bigjosh2

        Reference: spruh73p.pdf (the AM335x Technical Reference Manual)

        To access the registers on the PHY chip, you go through the MDIO on the ARM. Here is the MDIO register you use to pick which registers on the PHY you are talking to and has the ACK bit to tell you how that communication went…

        14.5.10.11 MDIOUSERACCESS0 Register

        The process for using the MDIO to get to the PHY is documented in…

        14.4.4 Writing Data to a PHY Register

        You can see how to actually get to the MDIO regs under Linux through devmem by looking through the phyreg code.

        There are other MDIO registers for figuring out which PHY addresses have a PHY actually connected to them, which sadly is important on the BBB since, as you know, the PHY has a tendency to come up with random addresses. :)

        LMK if any questions and how you make out!

    • bigjosh2

      PHY=00 REG=18 : IDLE READ # waits forever
      I guess phyreg waits for ACK in this state

      It looks like it is probably hanging waiting for the GO bit to clear. The MDIO clears this bit to tell you that it sent your request out to the PHY and only then will the ACK bit tell you how that went. If it is ACK, then the read or write worked.

      So maybe put a printf() inside that while loop that is waiting for the GO bit to clear so we can see what is going on with *useraccessaddress while we wait?

      • Mattias

        Finally got some time to investigate.
        Thank you for all the great information

        I found out why phyreg freezes, not because of the ACK or GO bit but for another stupid reason.
        These busy-waits:

        while (*useraccessaddress & MDIO_USERACCESS0_GO_BIT);

        probably get optimized away for me since the while loop is a no-op.
        Making *useraccessaddress volatile fixes it (or adding a sleep in the loop).

        Our buildsystem(buildroot) adds -Os to $(CFLAGS), which is missing when compiling in Beaglebone Debian

        Also, with the help of the documentation I understand the MDIO registers a lot better now than when I started.

        We will probably “solve” this issue with a GPIO connection to SYS_RESETn in our peripheral hardware.
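
A minimal sketch of a busy-wait that survives optimization, assuming the GO bit is bit 31 of the access register. Without `volatile`, the compiler is allowed to read the register once and then spin on that stale value, which would likely look like the hang described above. The function name and the poll limit are additions for illustration and are not how phyreg is written.

```c
#include <stdbool.h>
#include <stdint.h>

#define MDIO_USERACCESS0_GO_BIT (1u << 31)

/* Polls the access register until the GO bit clears.
 * The volatile qualifier forces a fresh read from the register on every
 * pass through the loop. Returns false if the bit is still set after
 * max_polls reads, so the caller never hangs forever. */
static bool wait_for_go_clear(volatile uint32_t *useraccess,
                              unsigned long max_polls)
{
    while (max_polls--) {
        if ((*useraccess & MDIO_USERACCESS0_GO_BIT) == 0)
            return true;
    }
    return false;
}
```

The bounded loop is a design choice: a startup script is easier to recover if the helper reports a timeout instead of spinning.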

        • bigjosh2

          Wow, great detective work! This makes perfect sense! It’s not the kernel – it’s the compiler! There are many people waiting for this – I will update the package accordingly! Thank you so much for figuring this out!!!

  11. bigjosh2

    “If I instead use devmem directly, my understanding is that “devmem2 0x4a101080 w 0x82400000″ and reading back 0x4a101080 would do the same thing”

    The value you read back from 0x4a101080 would have to have the highest bit cleared (the GO bit) to tell you that the MDIO transaction happened (not that it worked, just that it happened. The ACK bit tells you if it worked.)

    If the value you read back has the high bit set (is higher than 0x7fffffff), then the MDIO didn’t do anything with your request. This is almost certainly what is going on and this is why phyreg is hanging, so using devmem2 directly does not change that.

    Also keep in mind that your values above assume that PHY landed at address 0 which is not a safe assumption on BBB! That is the main reason why I wrote phyreg because you need to look around for the phy address before you can talk to it so you can’t hard code values to read and write to the MDIO regs.

  12. Bill Marriott

    Just found this page YIKES! Worrying stuff since I have many BBB’s and BBG’s in the field.
    I do have a board rev in the works so I could relatively easily add a connection between the BB reset pin and a GPIO in future but is there any way to tell Debian to ‘almost or prepare-for shutdown’ prior to commanding the HW reset and hoping that nothing is being written to flash at the time of reset?

    Also, have you raised this in the Beaglebone Google group? Robert Nelson who maintains the images seems to be the guru there and may have some insight.

    • bigjosh2

      is there any way to tell Debian to ‘almost or prepare-for shutdown’ prior to commanding the HW reset and hoping that nothing is being written to flash at the time of reset?

      You can stop all the daemons that write to the disk and then do a flush and wait for it to finish. Or you could make a new init level that does nothing but do the reset. I’m sure there are other ways too.

      Also, have you raised this in the Beaglebone Google group?

      Yes, here. There is quite a bit of flak around this issue because of the different ways it shows up. When the PHY comes up, there are a few different ways it can be messed up. One way is that it can have a random address, which would cause the ethernet to not come up because the OS was talking to the wrong address. At some point they added code to Linux to search for the PHY if it was not where it was supposed to be, and then use it wherever they happened to find it. This made it so the ethernet now would work if the only thing that got messed up in the PHY at boot was the address, but there are other things that can also get scrambled and some of them will prevent the Ethernet from ever coming up (without a hardware reset) no matter what address it is at. Because of this, and also because the frequency of the problem changes from board to board and is also dependent on the power supply, some people only see the problem very rarely so they think it is fixed… but it is not. With the existing BBB hardware design the PHY can randomly come up in a state where it cannot make an ethernet link without a hardware reset.

      • Bill Marriott

        Thanks. I had already ordered one of those $15 usb controlled 2-relay boards from Amazon to test how my new comm code handles Ethernet interruptions (by depowering a switch) but after reading your analysis, I will also use the second relay to introduce random board repowers into my test setup and try and trigger PHY issues too. I could potentially have a rack of 10 units to test at once and will let you know when I have solid test results.

        In terms of HW design, do you think simply tying reset switch (P9-10) to a GPIO is good enough or will I need R&C to ensure no false reset triggers during boot?

  13. EFE GmbH (@EFE_GmbH)

    Thanks for this great in-depth analysis of the PHY reset issue. We have a three-figure number of BBG deployed in the field and are experiencing another, yet related phy issue. In fact it looks like it can be a combination of both with a wide range of symptoms. Some of them leading to a board that needs to be power-cycled to get the phy back working!
    I will try to explain what the symptoms are, what seems to be causing and also solving them, and how it relates to the known reset & power-supply problem.

    We have seen BBB & BBG (referred to as BBx here) with the following issues: Unreliable Ethernet, also depending on the power supply source (USB or header); no ethernet link on power-up or during runtime; MDIO address Ethernet problems, that could be solved by asserting the reset, and others that needed a power cycle. When we had to power-cycle the board, usually the PHY crystal did not oscillate and also would not start just by resetting the system. Sometimes, when the ethernet link went down for no obvious reason, we were seeing errors in the NAND flash as well. Last but not least, some BBx have been returned from the field with resistor R136 being destroyed by an over-current (i.e. burnt or vaporized). It took us months until we were able to make the connection between these issues.
    We developed a software diagnosis tool to detect various of these symptoms in order to make a power-cycle (fortunately, there is some external hardware hooked up to a GPIO). While trying to cure the symptoms, we wanted to understand the underlying reason for this behaviour. Quite early in the process we have noticed that the 3.3V LDO supplying PHY, flash, and SD card (VDD_3V3B) does not feature a PGOOD output and “bypasses” the PMIC and SYS_RESETn. This seemed to explain some of the failures, in particular those related to undetermined reset behaviour at start-up, and (assuming that there might be voltage drops on VDD_3V3B because of too much power drawn by PHY, NAND flash, SD card, and external circuits) why the PHY had lost its links during runtime altogether with errors in the flash. An average SD card can easily draw more than 200 mA, and VDD_3V3B also provides VDD_PHYA which has to drive even long ethernet lines. Perhaps the voltage drop has not been seen on the VDD_3V3A rail, leaving reset deasserted and the PHY in an undetermined state. These issues can be diagnosed in SW and solved by “manually” issuing a reset by the BBx.
    There have been situations which left the PHY in an unrecoverable state (link down, LEDs off), mostly at power-up (reportedly during runtime as well, unconfirmed though). The only solution to this was power-cycling the BBx. In most or even all cases, the crystal oscillator was not running and there was no way to kick it off. Further down the road, this seemed to coincide with the PHY’s internal 1.2V regulator not being enabled (while the REGOFF strap pin should present a logical 0 here). We forced an external 1.2V here which brought the oscillator and the LEDs to life. So something was going on with the strap pins. Eventually, we found that there are 4 EMI capacitors connected to the LED’s anode (i.e. REGOFF & nINTSEL strap pins), and the second pin of the capacitors is connected to the shield. This by itself is debatable, but not a big deal since R136 is more or less a short between GND and SHIELD. (On a side note, these capacitors do not exist on the original BB, as well as the additional 3.3V LDO.) We then checked the R136 on a problematic board to find that it was destroyed. In other words, whatever is going on at SHIELD potential will eventually be seen by the PHY strap pins REGOFF and nINTSEL, only filtered by the 4 caps. It is no secret that there can be large ground potential differences, especially in big LANs. A big compensation current will first destroy R136 (as we have seen several times), and then perhaps compromise the PHY functionality. At least what we could see is that, with a destroyed R136, the ethernet cable was influencing the proper detection of the two strap pins. Furthermore, the USB shield also connects to this so-called ESD ring on the board, which could explain why some of the PHY issues seem to vary with the power source.
    Based on the aforementioned findings, we are considering 3 options that (some of them or all together) will hopefully solve our problems with the BBx PHY.
    The first option is a hardware add-on hooked-up to the extension header that monitors VDD_3V3B and releases SYS_RESETn (via an open-drain FET) only if VDD_3V3B is within reasonable limits. The second option is a PHY monitoring software (which partly already exists) which either issues a reset or, to be on the safe side, power-cycles the board in case of a non-recoverable error. (Power is supplied by our hardware extension board and can be controlled by the BBx, so it’s not a big deal for us, while it might be too much effort in other cases.) In addition to option 1 and 2, instead of drawing current from the VDD_3V3B rail on our external hardware, we could even try to buffer or stabilize this voltage in order to prevent voltage drops under heavy load. Third option is to make some minor changes on the BBx board, namely either removing C163 to C166, or adding one larger cap (~ 1uF) on REGOFF and nINTSEL, respectively, connecting them to DGND instead of SHIELD. That way, even if R136 gets destroyed, the potential on these two strap pins will be better defined compared to the stock version. It could also make sense to replace R136 with a larger version (e.g. 1206 SMT) that can handle more power instead of vanishing into thin air. Some sources recommend using a capacitor of 2n2 rated at 2kV to connect ethernet shield to digital ground, others a SMT ferrite bead or an additional ESD protection diode. This has to be evaluated.

    Conclusion:
    We have seen a broad range of PHY issues on our BBxs, some of them because of the power source and the poor reset circuitry (either at start-up or because of the maximum power drawn on the VDD_3V3B rail), and others related to the way the ethernet cable is connected to the PHY and its strap pins. Sometimes it seemed to be a combination of both, leading to various symptoms, in some cases only recoverable by making a power cycle. We found that there is a significant difference compared to the original BB in the way the VDD_3V3B rail and the reset signal are generated, as well as how the REGOFF and nINTSEL strap pins of the PHY are connected. We’ve had several cases where R136 had been destroyed, probably because of large compensation currents, consequently leading to more stress on the PHY via C163 to C166. As of now, we are working on various SW and HW fixes in order to get rid of the problems. I tried to describe everything to the best of my knowledge and in as much detail as possible to give you some insight into what our company did in order to track down the problems. I cannot give any warranty whatsoever for the correctness or accuracy of the information provided. Maybe it will be helpful to others or lead to better solutions to work around the issues.

  14. zotditzmyo

    Thank you for an EXCELLENT article! We are launching a product with Beaglebone Greens in it and the PHY issue has started to show up. Thanks to all the info on this page (the discussions are so awesome!) I managed to track down the possible causes in my case.

    We are using SeedStudio BeagleboneGreens with three capes on top. The OS is Debian 8.7 2017-03-19 4GB SD IoT (kernel 4.4.54-ti-r93) straight from the Beagleboard repos. Sometimes at power-up there is no link. Some systems seem to be more susceptible than others and I would eyeball the frequency of problems at about 10% (some systems NEVER have problems), making it hard to diagnose.

    I finally got my hands on one board that has the link problem about 25% of the time with the only extra hardware being a USB webcam. Thanks to the stats, I powered it from the P9 header to increase my chances.

    Using the PhyReg tool in scan mode, I was able to look at it in more detail and compare the good vs bad boots. Here are my findings:

    Good boot with link:

    [code lang=text]
    ALIVE ADDRESSES:0000-0000-0000-0001
    LINK ADDRESSES:0000-0000-0000-0001
    [/code]

    Bad boot without link:

    [code lang=text]
    ALIVE ADDRESSES:0000-0000-0000-0100
    LINK ADDRESSES:0000-0000-0000-0000
    [/code]
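
A minimal sketch of how the two masks above can be decoded into PHY addresses. It assumes the rightmost digit is bit 0 and that bit n being set means a PHY answered at MDIO address n, which matches phyreg reporting "First PHY found at address 2" for a mask ending in 0100 elsewhere in this thread. The function names are made up for illustration and are not part of phyreg.

```python
def parse_bits(field):
    """Convert a dash-grouped binary string such as '0000-0000-0000-0100' to an int."""
    return int(field.replace("-", ""), 2)


def set_bit_positions(field):
    """Return the positions of all set bits, with the rightmost digit as bit 0."""
    value = parse_bits(field)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# ALIVE mask from the good boot: a PHY answers at address 0.
print(set_bit_positions("0000-0000-0000-0001"))  # [0]

# ALIVE mask from the bad boot: a PHY answers at address 2.
print(set_bit_positions("0000-0000-0000-0100"))  # [2]

# LINK mask from the bad boot: no address reports a link.
print(set_bit_positions("0000-0000-0000-0000"))  # []
```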

    So in my case, the only problem seems to be that the PHY was assigned to address 2 instead of 0. Furthermore, I tried looking at the Phy registers from u-boot:

    Good boot with link:

    [code lang=text]
    => mdio list
    cpsw:
    0 - SMSC LAN8710/LAN8720 <--> cpsw
    => mii info
    PHY 0x00: OUI = 0x01F0, Model = 0x0F, Rev = 0x01, 100baseT, FDX
    [/code]

    Bad boot without link:

    [code lang=text]
    => mdio list
    cpsw:
    => mii info
    PHY 0x02: OUI = 0x01F0, Model = 0x0F, Rev = 0x01, 10baseT, FDX
    =>
    => mdio read cpsw 2 0
    Reading from bus cpsw
    PHY at address 2:
    0 - 0x100
    => mdio read cpsw 2 12
    Reading from bus cpsw
    PHY at address 2:
    18 - 0x6022
    => mdio write cpsw 2 12 6020
    => mii info
    PHY 0x00: OUI = 0x01F0, Model = 0x0F, Rev = 0x01, 10baseT, FDX
    => mdio list
    cpsw:
    => mdio read
    Reading from bus cpsw
    PHY at address 2:
    Error
    [/code]

    As you can see I even tried, without success, to reset the address using register 18 (0x12).

    I have tried issuing a long reset using the bbbrtc tool (I indeed had to use unlock first) but upon reset both address bits were set in the register and the link never came up.

    In my case at least, the problem is clearly the address of the PHY. Instead of trying to apply the kernel patch that iterates through the addresses, I gave Debian 9.7 a go and managed to get the PHY up and running, even when it was at address 2.

    Remaining questions:
    – Should I be worried about other kinds of problems than a change of address?
    – Why does my PHY address switch between 0 and 2 solely?

    I will make a small rig for some continuous testing and report back here.

    • zotditzmyo

      Well, after further testing, it turns out that the Debian 8.7 I am using does work on address 2 also. So it’s probably a coincidence that on THAT particular board, when the PHY is assigned address 2, it is not working.

      I tried using phyreg with the test parameter but have problems getting consistent results. When the PHY is up, it always reports OK, but when it is down it reports that it is up about 4 times out of 5. So instead of receiving the expected answer:

      [code lang=text]
      root@dragon:~# /data/bbbphyfix/phyreg/phyreg test 18 13
      Alive bits:0000-0000-0000-0100
      First PHY found at address 2.
      PHY=02 REG=18 : IDLE READ ACK 0110-0000-0010-0010
      1
      [/code]

      I get the following:

      [code lang=text]
      root@dragon:~# /data/bbbphyfix/phyreg/phyreg test 18 13
      Alive bits:0000-0000-0000-0100
      First PHY found at address 2.
      PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
      0
      [/code]

      I am not too sure where to go from here?
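
The difference between those two readings comes down to a single bit. Here is a minimal sketch, assuming that `phyreg test 18 13` simply prints bit 13 of register 18 (counting the rightmost digit as bit 0), which is consistent with the outputs quoted above. The helper names are illustrative and not part of phyreg.

```python
def parse_bits(field):
    """Convert a dash-grouped binary string such as '0110-0000-0010-0010' to an int."""
    return int(field.replace("-", ""), 2)


def bit_is_set(field, bit):
    """Return 1 if the given bit (rightmost digit = bit 0) is set, else 0."""
    return (parse_bits(field) >> bit) & 1


flagged = "0110-0000-0010-0010"      # reading that made phyreg print 1
not_flagged = "0000-0000-0000-0010"  # reading that made phyreg print 0

print(hex(parse_bits(flagged)), bit_is_set(flagged, 13))          # 0x6022 1
print(hex(parse_bits(not_flagged)), bit_is_set(not_flagged, 13))  # 0x2 0
```

The first value, 0x6022, is the same one u-boot returned for register 0x12 (decimal 18) on the bad boot earlier in this comment thread.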

  15. bigjosh2

    No coincidence, the mangled address is a symptom of the same issue of the PHY coming up in an ill-defined state. While the recent kernels can still find a PHY that happens to come up at the wrong address and potentially talk to it, I do not know of any way to recover a PHY that has the clocking bits in reg 18 scrambled except hardware reset. Believe me, I have tried EVERYTHING to try and get that PHY into a known good state by updating the registers and I just do not think it is possible. I think this bit in reg 18 is actually just an artifact showing that an oscillator in the chip never finished starting up correctly.

    When this happens, you should be able to eventually reset the PHY using the RESET pin on the headers. It may take a couple of tries, but when it finally comes up then it should work fine. If you are able to, the best solution is to connect this RESET pin to an IO pin so the BBB can reset itself when a startup script detects the PHY as bad. If you can’t get access to the header, then I think the BBBRTC reset presented above is the only option.

    Have you tried letting bbbrtc continue to reset the board until the PHY comes up correctly? Sometimes it can take more than one try.

    • zotditzmyo

      After further testing, I can confirm that the BBBRTC reset method is working just great! I still have problems in detecting if the PHY is up correctly.

      As written above, I have to run the phyreg test quite a few times to be sure to get at least one negative answer, even if I can see that the link light is off!

      Do you know of a more reliable way to detect this? Or is this just my newer kernel playing with me?

      Thanks again for everything!

      • bigjosh2

        “I have to run the phyreg test quite a few times to be sure to get at least one negative answer, even if I can see that the link light is off!”

        I would want to look into this more. I have done a lot of testing and never seen a case where the PHY was in an unrecoverable state that the PHY register 18 bit test did not catch.

        Are you sure there is not some other networking issue here as well? Does this always happen on the same unit or on multiple units? Have you tried replacing the cable and switch with known good ones?

        Thanks!

        • zotditzmyo

          This is all on the same unit, which exhibits the problem more frequently than others. Bench power supply through P9 header, serial console cable and 2 USB webcams through a powered hub to replicate end load.
          Here is a sequence of runs:

          [code lang=text]
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0110-0000-0010-0010
          1root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          root@dragon:/data# ./phyreg test 18 13
          Alive bits:0000-0000-0000-0100
          First PHY found at address 2.
          PHY=02 REG=18 : IDLE READ ACK 0000-0000-0000-0010
          0
          [/code]

          As shown above, the missing “\n” can be seen here: https://github.com/bigjosh/phyreg/blob/83c8c8963a06ac3b12ec87a99780d559dd15328d/phyreg.c#L305

          As a substitute I found that using mii-tool was always spot on.

          I can carry out tests if you’d like, no problems.

          thanks!

          • bigjosh2

            And so the PHY was in an unrecoverable state in each of the above cases? And you are sure that it was really that the PHY was unrecoverable and not some other factor that might have prevented a network link? Thanks!

          • zotditzmyo

            I think I have reached the maximum thread comment depth, replying one level up :-)
            I have had the exact same symptoms from the beginning. I can vouch for the network cable and the network switch I am connected to. Here is a summary of things so far:
            About 1 in 10 times (sorry, no precise metrics) the network link will not come up. Every time this happens, the PHY address is 2 instead of 0. I have a single board that exhibits this problem on a regular basis, while with others the frequency of problems drops to the point of not being able to reproduce it easily.
            I found no way of bringing up the link with any service restart, by writing values in the PHY registers through u-boot or phyreg, or even with mii-tool -R (this resets the chip, brings the link LED up for a few seconds and then returns to the not-working state after that).
            Since testing register 18 of the PHY does not yield repeatable results when the link is not up and that our end system contains a BeagleBone Green connected to a network switch, I will probably use the mii-tool eth0 command after boot to test the link and not the phy, and issue a reboot command with the bbbrtc tool.

            Thanks for all the fish !

          • bigjosh2

            Have you tried doing a straight reboot (not power cycle, not bbbrtc) when you see no link but reg 18 bit is 0? Just wondering if something else is making the link not come up in this case. Maybe failed autonegotiate or something like that?

      • Mattias

        Yeah, we are seeing this as well, phyreg is missing faulty state (on some devices)

        One peculiar observation is that IF I use mii-tool to detect link BEFORE running phyreg, phyreg ALWAYS detects state correctly (at least with my limited testing)

        • zotditzmyo

          That’s really good to know! I will add that to my tests!
          I have had a board cycling for a few days now. One of the last services tests the link with mii-tool and reboots if it’s good. I wanted it to stop automatically when the link wasn’t up, but IT NEVER DID!!! That board usually failed 1 time out of ten…
          At least I’m on other projects for the time being, else I wouldn’t be able to sleep!

        • Philipp

          I can report the exact same findings. I had the same unreliable detections and readouts from phyreg you described above. Running on kernel 4.14.71-ti-r80. Executing “mii-tool eth0” immediately before phyreg has solved the issue for now (no problem after 1300 reboots).

          Interesting to note: running phyreg in a loop shows that it has some rhythm to it. It reads fast about 6 times, then blocks for a second or two. When running mii-tool and phyreg alternating in a loop, it continues to run fast forever. So mii-tool has at least some “provable” effects.
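
The ordering reported in these replies (run mii-tool first, then phyreg) can be wrapped in a small check. This is a hedged sketch, not part of the bbbphyfix package. It assumes that `mii-tool eth0` prints a line containing `link ok` when the link is up and `no link` otherwise, and that `phyreg test 18 13` ends its output with the bit value as in the transcripts above. The `run` parameter, the default `phyreg` path, and the function names are hypothetical, added so the logic can be exercised without hardware.

```python
import subprocess


def link_is_up(mii_tool_output):
    """Assumed mii-tool convention: 'link ok' appears only when the link is up."""
    return "link ok" in mii_tool_output


def canary_bit(phyreg_output):
    """Take the last non-empty line of `phyreg test` output as the bit value.

    Stripping first also copes with the missing trailing newline noted above.
    """
    return int(phyreg_output.strip().splitlines()[-1])


def check_phy(run=None, iface="eth0", phyreg="phyreg"):
    """Query mii-tool first, then phyreg, and return (link_up, reg18_bit13)."""
    if run is None:
        # Default runner: execute the command and return its stdout as text.
        def run(argv):
            return subprocess.run(argv, capture_output=True, text=True).stdout

    mii_output = run(["mii-tool", iface])          # deliberately run first
    reg_output = run([phyreg, "test", "18", "13"])
    return link_is_up(mii_output), canary_bit(reg_output)
```

On a board this would be called as `check_phy(phyreg="/data/bbbphyfix/phyreg/phyreg")` or similar, and a boot script could then decide to reset when the link is down or the bit is 1. Why running mii-tool first makes phyreg more consistent is not explained anywhere in this thread, so treat the ordering as an empirical observation only.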

  16. Yona Appletree

    Josh, thanks so much for looking deeper into this and finding a solution that’s more reliable than mine. I’m just now refurbishing my installation, and I did notice previously that occasionally a board slips by my script.

    I’ll give your script a shot. I’m curious about using this rtc-reboot trick. All of my boards have the GPIO60-RESET hack in hardware, and that seems to work very well. Any reasons not to use that?

    Thanks again.

    • bigjosh2

      Yona!

      Your GPIO reset jumper should be good as long as the reset stays low long enough to reliably reset the PHY as well, which it seems to. I only came up with the crazy rtc-reboot hack because there was no good way for me to add a jumper to my setups.

      I do think that the PHY register 18 bit will be a more reliable indicator of the problem than your mdio:03 grep.

      LMK your results- you are one of the few people around who has enough BBB’s in service to care about this stuff too! :)

  17. Mohamed N

    Hi Josh,
    I have my own custom board that uses the OSD335X, which is basically what is inside the BeagleBone. I am having similar issues where the Ethernet sometimes won’t connect. However, I tried restarting the PHY by pulling RST to ground, which didn’t do anything. I can see the PHY resetting because of the LED behavior; however, I still can’t talk to it through Ethernet. Also, I have tried manually assigning it an IP address, which allows me to ping the processor, but that is about it. I can’t do anything else besides pinging.

    • bigjosh2

      If you isolate the PHY’s reset line, ground it, and let it return to Vcc after the Vcc and the strapping pins are stable (not long after the power supply is stable) and that does not reset the PHY, then I think you have a different problem. This problem has to do with that chip coming out of reset before the power supply and strapping lines are stable.

      Did you check the above canary bit on the PHY when you are having the problem?

  18. Matthijs van Duin

    This software workaround seems like a very very bad idea. You are entering the defeatured RTC-only mode (and iirc measurements I’ve done in the past suggest current flowing through the am335x in strange ways when you enter RTC-only mode anyway on BBB rev A6A or later), which also keeps SYS_5V powered which triggers the 3.3V regulator bug: VDD_3V3B remains supplied while VDD_3V3A is shut off (because enough current leaks to 3V3A to keep the regulator’s enable-pin high), creating excellent opportunity for violating absolute maximum ratings and destroying I/O cells. This hazard becomes especially prominent if there’s external 3V3B-powered logic driving pins of the BBB, but even the on-board hardware (notably the console serial port) can cause this problem.

    In one test with powering off with SYS_5V remaining active while having a serial console cable attached (causing UART0_RXD to be driven high by the 3V3B-powered buffer) I saw 45 mA more current than normal (flowing through UART0_RXD’s protection diode to 3V3A, yikes). The current injected into 3V3A also caused it to exceed the (powered-off) 1.8V supplies by more than 2V, a situation the am335x datasheet repeatedly and emphatically warns should be avoided under all circumstances.

    • bigjosh2

      As mentioned in the article, this is a last resort solution. If you have access to the BBB then you should manually reboot it or add a jumper to the reset pin. But in my case that is not possible so no other choice but this software only solution. For the record, I’ve been running this on 80 BBB units for over a year and many hundreds of reboots without any problems so far. Definitely always wise to not have any pins with externally driven voltage on them when the BBB is powered down. Thanks!

      • Matthijs van Duin

        Like I said it will depend on the hardware situation how dangerous the state you’re inducing is. With no external connections I measured 35 mA, but this would have been spread across (the protection diodes of) the 25 I/Os that have pull-ups to VDD_3V3B, and the current thus injected to VDD_3V3A was only sufficient to raise it to about 1.4V. I don’t know for sure how healthy this is for the AM335x in the long run, but it seems quite possible it might be able to tolerate this.

        But if even a single pin is driven high by some driver powered from VDD_3V3B you’ve got a serious situation, and this is exactly the 3.3V supply available on the P9 header, which is typically what capes and other external electronics will use to power things that will drive signals into the BBB. This supply is normally shut down when the BBB is powered off (since the 3V3B regulator is powered from SYS_5V, which is cut off at the start of the poweroff sequence), but that doesn’t happen in the RTC-only state you’re using.

  19. Aaron Lockton

    Hi Josh

    Can I ask what state are the Ethernet LEDs on the RJ45 when you are seeing this Ethernet/PHY issue? I’ve seen a small proportion of boards (< 1% probably) with an Ethernet connection problem which is characterised by inverted logic on the LEDs (LEDs on when cable unplugged). Sometimes power cycling makes a difference, as does whether the board is powered with or without cable plugged in, but generally these specific boards just seem “bad” and basically almost always show some symptoms whereas I have never seen the inverted LED behaviour in the general population (although of course it’s a tricky thing to monitor in deployed populations). I am mainly using the e14 industrial version.

    Thanks, Aaron

    • bigjosh2

      My production BBBs are not physically accessible, so I don’t generally get to see the LEDs when the network does not come up. But I will say that in the 2+ years since I put this fix into place, I have not had a single BBB network fail to come up (eventually). Previously about 10% to 20% of units would never come up after a power cycle.

    • Matthijs van Duin

      I’ve seen the inverted link led thing as well. It suggests the logic level of the led pin was somehow incorrectly recorded at reset, which is also the strapping option for REGOFF, hence the phy will not work in that case. In general, all of the phy problems (ranging from having an incorrect phy address to not working at all) appear to be due to incorrect strapping options being latched at reset.

      Based on testing I’ve done the primary cause seems to be the slow rise of the reset line, which is caused by a 2.2μF capacitor on it (C24), apparently to ensure the phy’s specified reset timing is met, and to lesser extent by a 0.1μF capacitor (C30).

      I’ve done some tests on a beaglebone (known to be susceptible to the phy issue) with a reset extender added to ensure reset timing is met and additional pull-up to increase the rise time on reset deassertion. The impact on the phy failure rate was pretty clear:

      2.4% (34/1431) with no external pull-up (just the on-board 10K).
      1.0% (12/1189) with 1K pull-up.
      0.4% (5/1153) with 240Ω pull-up.
      0.15% (2/1354) with 1K pull-up and C24 removed.
      0 failures in 16901 power cycles with both caps (C24 and C30) removed.

      In other words, the faster the reset rise time, the less frequently it failed.
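
To put rough numbers on "faster rise time", here is a back-of-the-envelope sketch. It assumes a plain first-order RC model (time constant = R × C), with the external pull-up in parallel with the on-board 10K and only C24 and C30 loading the line. The real reset net has other loads, so these are order-of-magnitude estimates only and not measurements from the tests above. The zero-failure bound uses the standard exact binomial limit (close to the "rule of three", 3/n). All names are illustrative.

```python
ONBOARD_PULLUP_OHMS = 10e3  # on-board 10K pull-up
C24_FARADS = 2.2e-6         # 2.2 uF capacitor on the reset line
C30_FARADS = 0.1e-6         # 0.1 uF capacitor on the reset line


def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def time_constant(external_pullup_ohms, capacitance_farads):
    """RC time constant in seconds; external pull-up may be None."""
    r = ONBOARD_PULLUP_OHMS
    if external_pullup_ohms is not None:
        r = parallel(r, external_pullup_ohms)
    return r * capacitance_farads


def zero_failure_upper_bound(trials, confidence=0.95):
    """Upper confidence bound on the failure rate after 0 failures in `trials` runs."""
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


# (label, external pull-up, capacitance, failures, power cycles) from the list above
CASES = [
    ("on-board 10K only", None, C24_FARADS + C30_FARADS, 34, 1431),
    ("1K pull-up", 1e3, C24_FARADS + C30_FARADS, 12, 1189),
    ("240 ohm pull-up", 240.0, C24_FARADS + C30_FARADS, 5, 1153),
    ("1K pull-up, C24 removed", 1e3, C30_FARADS, 2, 1354),
]

for label, pullup, cap, failures, trials in CASES:
    tau_ms = time_constant(pullup, cap) * 1e3
    rate = 100.0 * failures / trials
    print(f"{label:<24} tau ~ {tau_ms:7.3f} ms   failed {rate:.2f}%")

bound = 100.0 * zero_failure_upper_bound(16901)
print(f"both caps removed: 0/16901 failures, 95% upper bound ~ {bound:.3f}%")
```

Under this model the time constant falls from roughly 23 ms with the stock parts to under 0.1 ms with the 1K pull-up and C24 removed, in step with the falling failure rate. With both capacitors removed the time constant depends on stray capacitance that is not given here, so that case is left out of the table.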

      How or why the phy is managing to misread the strapping options is still a mystery to me. We tried shorting the link led to make REGOFF pulled down more convincingly and reduce the opportunity for noise pickup, but it did absolutely nothing. Adding 0.25s delay between bootrom and U-Boot SPL, just in case the AM335x is released from reset earlier than the phy, likewise had zero impact. Perhaps the phy is just really intolerant of a slow-rising reset, but that seems very odd given that the datasheet actually suggests using an RC-circuit on the reset input to generate the required reset timing.

      Our current workaround consists of:
      1. a voltage supervisor (TPS3839K) with a push-pull output connected to the BBB reset (P9.10) via 330Ω series resistor (to limit current), both to ensure reset is asserted for longer and to make it rise faster after deassertion. This should greatly reduce (but not eliminate) phy issues.
      2. a Schottky diode from the pmic power button input (P9.09) to the VDD_3V3 to ensure that if the beaglebone powers off, it automatically powers back on. This allows software to power-cycle the board if the phy issue is detected.

      • Matthijs van Duin

        > additional pull-up to increase the rise time on reset

        I meant decrease the rise time of course

        • Hamish

          We ended up using a Schmitt trigger between the RC (slightly adjusted values) to generate a clean and sharp reset signal. I think the reason for a lot of these problems is that the RC reset delay circuit is shared with the processor reset line and is effectively a high output impedance source. Any load on this as the processor switches causes dips on the reset line. I think the phy datasheet does not consider that the suggested circuit would also feed other logic in parallel. Nor have long traces off to the pushbutton either…
          Nice work everyone to identify the causes!

          • Matthijs van Duin

            I like this observation, though I don’t think this is it either. The problem with this hypothesis is that the AM335x’s extremely brief reset output pulse gets completely absorbed by the large capacitance on the reset net. It doesn’t show up on the scope at all. This is also why a soft reboot (which also generates this output pulse) doesn’t reset the phy: even if I configure the reset output pulse length to its maximum length (10.625 μs), it barely makes a ripple.

  20. kurtnelle

    I’m trying to get the RTC wake to work on the PocketBeagle (an OSD335x-SM device). As far as I understand the thread, RTC-Only is a dangerous mode, that requires external pins and power to the RTC so that the system will even turn back on. A note from Octavo suggests that using an external RTC module is the best bet?! (https://octavosystems.com/forums/topic/no-need-for-external-rtc-on-c-sip/?highlight=rtc)

    Is that the gist of it? Further, can the system be suspended to RAM for a fast resume, or has all this conversation solely been for the RTC-Only mode?

  21. Ant

    HI Josh,
    your entry really helped us find and fight the problem. However, what seems to work for us is to check /sys/class/net/eth0/operstate and /carrier 3 seconds after the network.target is done. I don’t see this approach mentioned, and forgive me if I overlooked it, but have you by any chance tried it? And have you found anything problematic about it?
    Cheers

    • bigjosh2

      My register-based test has been working for me for 2 years without a single misfire in either direction, so I am sticking with it – but more options are always better!

      Have you checked to see if your test fires a false positive when there is no ethernet cable connected?

      • Ant

        Agreed! If it works, why change it!
        And I am currently running a false positive test. So far 100 power cycles and I am successfully stuck in a power cycle loop.

    • Matthijs van Duin

      The criterion I used back when I did some testing was whether or not /sys/bus/mdio_bus/devices/4a101000.mdio:00 existed. I did not thoroughly validate whether this is a reliable indicator though.
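
The sysfs-based checks mentioned in this sub-thread can be sketched as follows. This is only an illustration under assumptions: the interface is `eth0`, the MDIO device node name is the one quoted above, and "link up" is taken to mean `operstate` reads `up` and `carrier` reads `1`. The `sys_root` parameter and function names are hypothetical, added so the code can be tried against a fake directory tree. As noted above, neither check has been thoroughly validated against the register 18 test.

```python
from pathlib import Path


def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read.

    Reading `carrier` fails with an error while the interface is
    administratively down, so a read failure is treated as "unknown".
    """
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def link_state(iface="eth0", sys_root="/sys"):
    """Collect operstate and carrier for an interface and derive a simple verdict."""
    base = Path(sys_root) / "class" / "net" / iface
    operstate = read_sysfs(base / "operstate")
    carrier = read_sysfs(base / "carrier")
    return {
        "operstate": operstate,
        "carrier": carrier,
        "up": operstate == "up" and carrier == "1",
    }


def mdio_device_present(name="4a101000.mdio:00", sys_root="/sys"):
    """Check whether the PHY is registered at the expected MDIO address."""
    return (Path(sys_root) / "bus" / "mdio_bus" / "devices" / name).exists()


if __name__ == "__main__":
    print(link_state())
    print(mdio_device_present())
```

Note that a link-based test cannot tell an unplugged cable from a bad PHY, which is exactly the false-positive question raised above.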

  22. Dave L.

    I’m wondering if a pulse stretcher that sits on the reset line would mitigate this issue?

    I’m thinking about designing a cape for the BBB for my application. I’m thinking that a one-shot that is falling-edge triggered could sense a falling edge on the reset and hold it low for 25 ms (or whatever it needs to be), which should satisfy the PHY reset requirements.

    Does this sound reasonable? Am I missing something? It might interfere with JTAG, since I believe there is some twiddling of the reset during JTAG, but I’m not planning on using JTAG.

    This would also be an easy plug-in hardware fix for someone that really needs a more deterministic fix.

    Anyone have thoughts on this?

    • bigjosh2

      If you are making a cape, then all you have to do is tie the RESET pin to any GPIO pin.

      Then have a script check the PHY on bootup. If it is in a sad state, then have the script reset the board using that GPIO pin. Then the board will come back up with the PHY correctly reset.
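
A rough sketch of the GPIO side of that idea, using the legacy sysfs GPIO interface. Everything specific here is an assumption: GPIO number 60 is borrowed from the GPIO60-RESET jumper mentioned earlier in the comments, the hold time is arbitrary, and newer kernels deprecate this sysfs interface in favor of the GPIO character device. The `gpio_root` parameter and function name are hypothetical.

```python
import time
from pathlib import Path


def pulse_reset_low(gpio=60, hold_s=0.1, gpio_root="/sys/class/gpio"):
    """Drive a GPIO wired to the RESET pin low, then release it.

    On a real board the processor is expected to reset as soon as the line
    goes low, so the lines after the first write may never run.
    """
    root = Path(gpio_root)
    pin = root / f"gpio{gpio}"

    # Export the pin only if it is not already exported.
    if not pin.exists():
        (root / "export").write_text(str(gpio))

    # Writing "low" to direction configures the pin as an output driven low.
    (pin / "direction").write_text("low")
    time.sleep(hold_s)

    # Return the pin to an input so the pull-up can raise the line again.
    (pin / "direction").write_text("in")
```

A boot script would call this only after its PHY test reports a bad state. It must run as root, and the pin has to be muxed as a GPIO first.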

      • Matthijs van Duin

        The problem with this idea is that you’ll only be able to discharge the reset line until the AM335x goes into reset, while it’s not guaranteed that it will reach the phy’s V_IL(min) while doing so. It would be better to use a gpio to trigger a pulse generator (e.g. the manual reset input of a reset supervisor).

        Or, what we did, since the BBB is always supposed to be powered up in our product: connect a schottky diode from the power button pin (P9.09) to 3.3V so that the BBB gets powered on whenever it powers off. This lets you use poweroff as a way to power cycle the BBB, which you can use if you detect the phy is in a dark and sad place.

      • Dave L.

        Thanks for the feedback, @bigjosh2.

        This would reset the processor, too, right? When the processor goes into reset, won’t the outputs go tri-state? …and if that is true, that doesn’t guarantee a 25 ms reset to the PHY. The one-shot could guarantee the minimum reset time for the PHY.

        Sorry, I’m kind of new to the issue and may not have the complete picture of the problem yet.

        • bigjosh2

          In practice the reset pin seems to get held long enough to reset the PHY correctly every time based on testing. Even if there was a time when it did not get reset correctly, the system self corrects and would just reset again when the script saw that the PHY was bad again on booting.

    • Matthijs van Duin

      We had the same idea (using a voltage supervisor to stretch the reset pulse), but the idea does not work. The phy is getting a sufficiently long reset pulse; the problem seems to be the slow rising edge. If you scroll up a bit in the comments section you can find a long comment where I explain the tests I’ve done and their results.

  23. Hartley Sweeten

    Has anyone figured out how to make this work with a 4.x kernel?

    The bbb-check-phy script seems to work fine. At least I’m seeing the “chkphy:eth0 good” message. I have not yet had a power cycle today where the PHY does not work.

    But the bbb-long-reset script does not make my board reset. It runs without any errors, that I can tell, but does not reboot.

    This PHY issue is really annoying!

    • bigjosh2

      Hmmm. I thought the 4.x problems were fixed by others but I do not have a 4.x setup to test on so you will have to try to narrow things down on your side.

      This article explains the strategy, and includes all the software you need to step through the process yourself and see where things might be going wrong. If you can pinpoint the issue then I bet we can figure out how to fix it!

      • bigjosh2

        For example, you can use the `bbbrtc` utility to set up the reset and then use the `dump` command to check that the registers actually got set correctly, and if not, which ones did not.

        • Hartley Sweeten

          I’m not sure what I am supposed to be looking for. But the bbb-long-reset output is:

          $ sudo bbb-long-reset
          Opening /dev/mem…opened.
          Mappng in 0x44e3e000…mapped at address 0xb6f01000.
          Unlocking Kick Registers
          Unmaping memory block…unmaped.
          Closing fd…closed.
          Opening /dev/mem…opened.
          Mappng in 0x44e3e000…mapped at address 0xb6fa3000.
          Stopping RTC…waiting for stop…took 75 tries.
          Checking RTC chip rev…checks good.
          Setting SLEEP to 1615405633…set.
          Setting WAKE to 1615405641…set.
          Enable PWR_ENABLE_EN to be controlled ON->OFF by interrupts…
          enabled.
          Enable ALARM interrupt bit… enabled.
          Enable IRQ WAKE ENABLE bit…
          enabled.
          Enable ALARM2 interrupt bit… enabled.
          Reading clocks and printing to stdout…
          NOW: 1615405631
          SLEEP: 1615430833
          WAKE: 1615430841
          done.
          Starting RTC…
          waiting for it to start…took 29 tries.
          Unmaping memory block…unmaped.
          Closing fd…closed.

          $ sudo bbbrtc dump
          Opening /dev/mem…opened.
          Mappng in 0x44e3e000…mapped at address 0xb6fe1000.
          Stopping RTC…waiting for stop…took 104 tries.
          Checking RTC chip rev…checks good.
          Dumping RTC:
          00 00000045
          04 00000050
          08 00000012
          0c 00000010
          10 00000002
          14 00000021
          18 00000000
          1c 00000000
          20 00000021
          24 00000047
          28 00000019
          2c 00000010
          30 00000002
          34 00000021
          38 00000000
          3c 00000000
          40 00000000
          44 00000000
          48 00000018
          4c 00000000
          50 00000000
          54 00000048
          58 00000000
          5c 00000000
          60 8011036c
          64 00000000
          68 010001b0
          6c 010001b0
          70 010001b0
          74 4eb01106
          78 00000003
          7c 00000002
          80 00000013
          84 00000047
          88 00000019
          8c 00000010
          90 00000002
          94 00000021
          98 00010000
          Starting RTC…
          waiting for it to start…took 49 tries.
          Unmaping memory block…unmaped.
          Closing fd…closed.
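
To make the dump above easier to read, here is a minimal sketch that decodes the time-of-day registers. It assumes the AM335x RTC layout (clock at offset 0x00, ALARM at 0x20, ALARM2 at 0x80, one packed-BCD field per 32-bit register); that layout comes from my reading of the TRM register map, not from anything stated in this thread, and all function names are made up for illustration.

```python
# Assumed AM335x RTC register layout (from the TRM register map, not from this thread).
TIME_FIELDS = ("seconds", "minutes", "hours", "day", "month", "year")
BLOCKS = {"clock": 0x00, "alarm": 0x20, "alarm2": 0x80}


def bcd(value):
    """Decode a two-digit packed BCD byte, e.g. 0x47 -> 47."""
    return ((value >> 4) & 0xF) * 10 + (value & 0xF)


def parse_dump(text):
    """Turn 'offset value' hex pairs into an {offset: value} dict."""
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            regs[int(parts[0], 16)] = int(parts[1], 16)
    return regs


def decode(regs):
    """Decode the clock and both alarms into plain integers."""
    return {
        name: {field: bcd(regs[base + 4 * i]) for i, field in enumerate(TIME_FIELDS)}
        for name, base in BLOCKS.items()
    }


# Only the time-related lines of the dump above, copied verbatim.
DUMP = """\
00 00000045
04 00000050
08 00000012
0c 00000010
10 00000002
14 00000021
20 00000021
24 00000047
28 00000019
2c 00000010
30 00000002
34 00000021
80 00000013
84 00000047
88 00000019
8c 00000010
90 00000002
94 00000021
"""

if __name__ == "__main__":
    for name, fields in decode(parse_dump(DUMP)).items():
        print(name, "%02d:%02d:%02d" % (fields["hours"], fields["minutes"], fields["seconds"]))
```

If the layout assumption holds, the clock reads 12:50:45 while both alarms sit at 19:47, roughly seven hours ahead of the clock. That would match the `bbb-long-reset` log, where SLEEP was set to 1615405633 but read back as 1615430833, and it would explain why the board did not reset within a few seconds.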

          • Vlad

            Hi Josh,

            bbbphyfix works on 4.19.94-ti-r68, 5.4.106-ti-r35 and 5.10.59-ti-r20 kernels running in PRU-RPROC mode, provided you have:

            1. i2c-tools installed (with ‘apt install i2c-tools’); this is because the bbb-long-reset calls “i2cset -f -y 0 0x24 0x0a 0x00” at line 4,

            2. set the UTC timezone (with ‘timedatectl set-timezone UTC’); if, for example, I set the timezone as Europe/Zurich (UTC+2), then “NOW” is at UTC-2 (in the past), instead of UTC+2. This will not work as the reboot would be scheduled two hours in the past.

            I also want to thank you for your work and post which are really, really useful to the whole BeagleBone community!

            Vlad

  24. zerodegrekelvin

    I know it’s 2022 already, but I just want to say: very neat hack to check for that phy “reserved” bit.
    I had a similar issue, though not on a BBB. I was working with an NXP LS1043A chip and a Vitesse PHY, so I implemented something similar. During kernel boot, I first try resetting the MAC within the LS1043 (dpaa-eth) up to 10 times, checking the “Good” status after every MAC reset. If I reach 10 tries, I just reset the whole board. That solved the issue: the testers no longer report the Ethernet problem when they run their power-cycle test 8-)
    People said it is a big hack, but between the rhetoric and real life, where you have to deliver a real product, the code stayed in 8-)
    Cheers!
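
The retry-then-escalate approach described in this comment can be sketched roughly as follows. This is not the commenter's kernel code: `reset_mac`, `link_is_good` and `reset_board` are hypothetical hooks, and `FakeBoard` only simulates hardware so the control flow can be exercised.

```python
def bring_up_link(reset_mac, link_is_good, reset_board, max_tries=10):
    """Reset the MAC up to max_tries times, then fall back to a board reset.

    All three callables are hypothetical hooks supplied by the caller.
    Returns the number of MAC resets used, or None if the board was reset.
    """
    for attempt in range(1, max_tries + 1):
        reset_mac()
        if link_is_good():
            return attempt
    reset_board()
    return None


class FakeBoard:
    """Simulated hardware: the link turns good after `good_after` MAC resets."""

    def __init__(self, good_after):
        self.good_after = good_after
        self.mac_resets = 0
        self.board_resets = 0

    def reset_mac(self):
        self.mac_resets += 1

    def link_is_good(self):
        return self.mac_resets >= self.good_after

    def reset_board(self):
        self.board_resets += 1


if __name__ == "__main__":
    board = FakeBoard(good_after=3)
    print(bring_up_link(board.reset_mac, board.link_is_good, board.reset_board))  # 3
```

The cheap recovery is bounded, so a PHY that never recovers cannot stall the boot forever; the expensive full reset only happens after the cheap one has clearly failed.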

  25. Matthijs van Duin

    Note that the current BBB hardware revision (C3) revised the reset circuit of the Ethernet PHY and also added the ability to reset it via a GPIO, so if the new reset circuit doesn’t already fix the problem then it should be completely fixable by adding some u-boot code that checks if the phy is in a bad state and resets it until it works.

    • bigjosh2

      This is good and long awaited news, thanks for letting us know!

      Looking at the rev notes on beagleboard.org, I only see the new “reset option on (GPIO1_8) for Ethernet PHY to avoid possible start-up issue”.

      Sadly, it looks like their git is down so I can not see the actual issue report. Do you know if there is any new software support for detecting a bad PHY state and triggering the reset? Have you tried any of these new boards?

  26. fryman

    The original version of bbbrtc has a bug when used with o/s timezones.
    The author used mktime with gmtime in the get/set functions. “gmtime” does not use time zones, but “mktime” does! This caused bbbrtc to not work properly when the o/s uses time zones, as the get/set operations were mismatched relating to the current o/s timezone. The fix is to replace “mktime” with “timegm”, which does NOT use time zones!

    Just fyi….
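
The mismatch described here is easy to reproduce outside of bbbrtc. The sketch below uses Python's `time.gmtime`, `time.mktime` and `calendar.timegm`, which mirror the C functions of the same names; it is an illustration, not the bbbrtc source. The UTC-7 zone is my assumption, chosen because it reproduces the numbers in the `bbb-long-reset` log from comment 23. `time.tzset()` is only available on Unix-like systems.

```python
import calendar
import os
import time


def read_back_with_mktime(epoch):
    """Mismatched pair: split with gmtime() (UTC), rebuild with mktime() (local)."""
    return int(time.mktime(time.gmtime(epoch)))


def read_back_with_timegm(epoch):
    """Matched pair: gmtime() and timegm() both work in UTC."""
    return calendar.timegm(time.gmtime(epoch))


# Assumption: a fixed UTC-7 zone, written as a POSIX TZ string so no tzdata is needed.
os.environ["TZ"] = "MST7"
time.tzset()

sleep_at = 1615405633  # the "Setting SLEEP to" value from the log in comment 23

print(read_back_with_mktime(sleep_at))  # 1615430833, i.e. 25200 s (7 hours) off
print(read_back_with_timegm(sleep_at))  # 1615405633, round trip is exact
```

With the matched pair the round trip is exact whatever the o/s time zone is, which is why setting the zone to UTC also works as a workaround for the original version.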
