[v2022.1+] TP-Link WDR4300 hangs during reboot #2904
Comments
Does this also happen with the very similar WDR3600? |
Probably. We had a few isolated cases where a WDR3600 needed a power cycle after an upgrade but it is not clear if this is at all related to the problem described here. We don't have enough (failing) devices to have a confident answer. |
It might be worth mentioning that the special symbol at the end of the log is printed during a boot as well. I'm not sure if this is printed before or after the bootloader loaded though. EDIT:
|
We also had reports in our community when I rolled out 2022.1 but thought it was random, and we didn't have proper logs or anything else. #2655 |
We observed this when transitioning from 2022.1.2 to 2022.1.4 on WDR4300 and more frequently on Ubiquiti AC lite. In our observation, the update was fine when the machine was rebooted just prior to the update, which may suggest an out-of-memory issue. |
@smoe Just to clarify, we were able to reproduce the issue on a freshly booted device as well. |
One thing that comes to my mind is the usage of the newer ar934x SPI controller driver; at least no device reported in this issue uses the older ar71xx driver. This driver was first shipped with OpenWrt 21.02, matching the observation that it does not break with releases based on OpenWrt 19.07 and older. If you are still able to reproduce this issue, you can modify the ar934x DTSI to use the compatible for the ar71xx SPI controller. Ping me in case I should provide you with a patch. If this fixes the reboot issue, we have a better idea of where to look next. |
@blocktrron thank you for looking into this. To avoid misunderstandings: are you suggesting making this change here in OpenWrt?
diff --git a/target/linux/ath79/dts/ar934x.dtsi b/target/linux/ath79/dts/ar934x.dtsi
index d88c7bfabc..15201b197e 100644
--- a/target/linux/ath79/dts/ar934x.dtsi
+++ b/target/linux/ath79/dts/ar934x.dtsi
@@ -199,15 +199,17 @@
};
spi: spi@1f000000 {
- compatible = "qca,ar934x-spi";
- reg = <0x1f000000 0x1c>;
+ compatible = "qca,ar7240-spi",
+ "qca,ar7100-spi";
+ reg = <0x1f000000 0x10>;
clocks = <&pll ATH79_CLK_AHB>;
+ clock-names = "ahb";
+
+ status = "disabled";
#address-cells = <1>;
#size-cells = <0>;
-
- status = "disabled";
};
}; |
@grische Almost. Just revert this commit in the file: openwrt/openwrt@ebf0d8d#diff-45ad725f9ec8cc2da88738047b1d5c4d1e69df19194bd22394d3736e03093613 |
@blocktrron I was able to reproduce a hang after reboot even with the above commit reverted using Gluon v2023.1: Here is the respective branch: https://github.com/grische/site-ffm/commits/test/revert-ath79-add-new-ar934x-spi-driver/ |
@grische Are these hangs only reproducible after writing an upgrade image or does a regular reboot invocation also trigger a spurious hang? |
I have a test WDR4300 device where I can reproduce the hangs during a reboot every other time. Surprisingly often actually. |
On the exact same setup, I tested it with
|
Add a cache-barrier after the reset-register write. This fixes spurious reboot issues on TP-Link WDR3600 and WDR4300 devices with Zentel DDR2 DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Signed-off-by: David Bauer <mail@david-bauer.net>
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <mail@david-bauer.net>
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: #13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 2fe8ecd)
The bug was fixed upstream in
|
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 2fe8ecd880396b5ae25fe9583aaa1d71be0b8468)
We have lost several WDR3600 on a recent upgrade to 2023.2.4. I attached a WDR3600 to a serial console and used this script to reboot it in a loop: https://gitlab.freifunk-stuttgart.de/-/snippets/8 I was able to observe failing reboots after 5, 20 and 250 tries. With a patch like this, I have >1500 successful reboots now:
The printk can never be seen, but I suppose that's because there is never a chance to flush out the buffer to the console. It's not clear to me why this works, but neither is the solution of reading back the register (ioremap should already disable cache). |
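With reboot loops this long, it can help to check the captured serial log afterwards for the reboot that never came back. The following is only a rough sketch, not the script linked above. It assumes the kernel prints "Restarting system" when rebooting and "Linux version" early in the next boot; the function name `count_hung_reboots` and the log file name in the usage comment are made up for illustration.

```shell
# Count reboots in a serial log and flag restarts that were never
# followed by a new kernel banner.
# Assumption: "Restarting system" marks a reboot and "Linux version"
# marks the start of the next boot; adjust both markers to the real log.
count_hung_reboots() {
    awk '
        /Restarting system/ {
            if (pending) hung++    # previous restart never reached the banner
            pending = 1
            total++
        }
        /Linux version/ { pending = 0 }
        END {
            if (pending) hung++    # log ends right after a restart
            printf "reboots=%d hung=%d\n", total, hung
        }
    ' "$1"
}

# Usage (hypothetical file name):
#   count_hung_reboots serial.log
```

A log that was simply cut off right after a restart is also counted as hung, so the result should be checked against the tail of the log.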
I'm pretty certain we have seen this on a few wdr4300 in Aachen, too. thanks for reporting a fix ❤️ |
oh wait, I'm wrong. This code is run on boot so it fixes it immediately, right? (not before rebooting) |
@nrbffs can you provide the exact Gluon commit of the version the devices ran before the failed upgrade, and the commit hashes of the versions on which you have seen automatic reboots fail? |
At least the "i" should be incremented :-) |
ec72498 (v2023.2.3)
I have also seen the failures on the commit above as well as on v2023.2.4. My proposed fix is in #3397 (tested on main, 1487 successful reboots) |
I can confirm that the new patch seems to properly fix the issue at last!
|
I talked with @nrbffs at 38c3 about this topic and we agreed on working out an upstream-compatible version of this patch. Please test https://github.com/freifunk-gluon/gluon/tree/ar9344-reset and post the dmesg output when running this firmware. I don't have compatible hardware at hand for now. |
Thanks, dmesg at https://paste.debian.net/1342540/ |
According to datasheet, on AR9344 the switch and switch analog need to be reset first before initiating a full reset. Resetting these systems fixes spurious reset hangs on Atheros AR9344 SoCs. Link: freifunk-gluon/gluon#2904 Signed-off-by: David Bauer <mail@david-bauer.net>
According to datasheet, on AR9344 the switch and switch analog need to be reset first before initiating a full reset. Resetting these systems fixes spurious reset hangs on Atheros AR9344 SoCs. Link: freifunk-gluon/gluon#2904 Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 916af73)
I also ran the reboot tests on main and on the ar9344-reset branch on a TP-Link TL-WDR4300 v1:
Serial console output can be found here: https://gist.github.com/grische/be44b330d7f8c7fff88939f979be9d32 |
I might have bad news: the ar9344-reset branch got stuck during a reboot on my WDR4300 at the 705th reboot. I added the last two reboots into the above gist: https://gist.githubusercontent.com/grische/be44b330d7f8c7fff88939f979be9d32/raw/634a35543445ee220a2c77e8ee6a3b7e285e72d3/ttyUSB_2025-01-05T16%25EF%2580%25BA39%25EF%2580%25BA52+01%25EF%2580%25BA00.0.tail |
The hanging reboot is 10 seconds faster and these bits here are not happening:
|
The timings across all reboots are not very reliable as I attempt a reboot "every 30 seconds", so the 10s shift could have been coincidental. Some reboot time stats:
Minimum: 82 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{ print $2 }' | sort -n | head -n 3
82.410005
82.529950
82.848242
Maximum: 185 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{ print $2 }' | sort -n | tail -n 3
179.763697
179.999500
185.094121
Average: 93.6 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{s+=$2}END{print "average:",s/NR}'
average: 93.6536
When I grep the log of almost 700 reboots (see gist above), I can find the following parts
|
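The three pipelines for the reboot time stats could also be folded into a single awk pass. This is only a sketch: it assumes "Restarting" lines of the form `[   82.410005] reboot: Restarting system`, with the kernel timestamp in brackets at the start of the line, and the function name `reboot_stats` is made up for illustration.

```shell
# Print minimum, maximum and average kernel timestamp of all
# "Restarting" lines in a serial log, in one pass.
# Assumption: each matching line starts with "[<padding><seconds>]".
reboot_stats() {
    awk '
        /Restarting/ {
            t = $0
            sub(/^\[ */, "", t)    # drop the leading "[" and padding
            sub(/\].*$/, "", t)    # drop "]" and the rest of the line
            t += 0                 # force a numeric value
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            sum += t
            n++
        }
        END {
            if (n) printf "min=%.2f max=%.2f avg=%.2f n=%d\n", min, max, sum / n, n
        }
    ' "$1"
}

# Usage (hypothetical file name):
#   reboot_stats serial.log
```

As with the original commands, the numbers are the uptime at the moment of each restart, so they inherit the same "every 30 seconds" jitter.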
@rotanid the tested patch d3f2342 and the upstreamed patch openwrt/openwrt@0c52c9d are identical. And yes, this would need a backport to Gluon v2023.2.x. At least the first version of the patch was easily backported to kernel 5.15. |
This bump includes two major changes / fixes:
- c06d4df974 mac80211: set basic-rate for mesh interfaces
  See freifunk-gluon/gluon#3185
- 0c52c9d6fc ath79: reset ETH switch for AR9344
  freifunk-gluon/gluon#2904
Bug report
What is the problem?
Occasionally, WDR4300 devices (>10% of all devices) hang after an autoupdate and need a manual power cycle to come back up.
I managed to reproduce this while a serial cable was attached:
I am not sure if this is related to #185, but we were not able to reproduce it (yet) with a reboot.
What is the expected behaviour?
That the WDR4300 comes back up after an update.
Gluon Version:
v2022.1.2 and v2022.1.3
Probably also earlier v2022.x
We experienced similar behaviour during the initial v2022.1 deployment, but dismissed it as "random".
It was more severe with the v2022.1.3 deployment (probably just by chance) and I was able to reproduce it with a serial cable attached when upgrading from v2022.1.3 to v2022.1.4.
Site Configuration:
https://github.com/freifunkMUC/site-ffm/blob/833829e68f97e4781f175bdd688d7f498a7efe53/site.conf
Custom patches:
https://github.com/freifunkMUC/site-ffm/tree/833829e68f97e4781f175bdd688d7f498a7efe53/patches