UART slows down #1170
Comments
Screencast.from.2024-12-02.17-17-13.webm |
Thank you for the report! Do you have the technical capability to run this simulation in a Questa GUI (or possibly VCS or Verilator) and look for a root cause? The UART is in src/uncore/uartPC16550D. It responds normally for us during Linux boot in hardware and simulation. Perhaps you are using it in a different way that exercises a bug in the UART itself, or a bug related to a UART interrupt handler? |
Hello Mr. Harris, I don’t have professional tools; I am a hobbyist who learned the basics from your two edX courses and even developed a Linux/XV6 SoC with the knowledge from those courses. Thank you very much. |
This sounds like a tricky one to debug. I will try to reproduce on my end with the VCU108 board so we can have more debug signals. I'm very concerned about the reset. |
Interesting that this isn't reproducing on the VCU108. I tried playing around with various baud rates, and I'm trying the Arty A7 now. If I had to guess, you probably found a bug in a UART FIFO: it's reporting that the transmit FIFO is always full, so rather than writing multiple bytes per interrupt it's writing 1 byte, hence the slowdown. |
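To put a rough number on that guess, here is a minimal sketch of how many transmit interrupts a burst of output would need in each case. It assumes a 16-byte transmit FIFO (the depth of a 16550-style UART) and a driver that refills the whole FIFO on every interrupt; `tx_interrupts` and the 4096-byte burst are made up for illustration, and this is only the hypothesis above, not a confirmed root cause.

```python
import math

FIFO_DEPTH = 16  # transmit FIFO depth of a 16550-style UART


def tx_interrupts(n_bytes, bytes_per_interrupt):
    """Number of transmit interrupts needed to send n_bytes."""
    return math.ceil(n_bytes / bytes_per_interrupt)


# Assumed workload: a 4096-byte burst of console output.
healthy = tx_interrupts(4096, FIFO_DEPTH)  # driver refills the whole FIFO each time
degraded = tx_interrupts(4096, 1)          # FIFO always looks full: 1 byte per interrupt

print(healthy, degraded)  # 256 4096
```

Under these assumptions the degraded case takes 16 times as many interrupts for the same output, which would be consistent with a visible slowdown.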
After 8 minutes, it slows down. I also tested it with an external power supply, but it didn't fix the issue. |
I am able to reproduce on the Arty A7 but the VCU108 is not triggering the bug unfortunately. I'm working on an ILA script to debug right now. |
That’s wacky that it depends on the FPGA.
|
It's not too unreasonable to think it could depend on the FPGA, because the two FPGAs have different hardware configurations. The Arty A7 has a 20 MHz clock and 256 MiB of DDR3 memory, and the VCU108 has a 50 MHz clock and 2 GiB of DDR4 memory. This inherently makes the timing of interrupts different, so it's possible we just can't hit the bug on the VCU108. It's also possible splinedrive's suggestion is correct and it's related to memory, since the VCU108 has more. |
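As a rough illustration of how different the timing is, this sketch computes how many CPU clock cycles pass per UART byte on each board. The clock frequencies are the ones quoted above; the 115200 baud rate and the 10-bit frame (8N1) are assumptions, and `cycles_per_uart_byte` is a made-up helper.

```python
def cycles_per_uart_byte(clock_hz, baud, bits_per_frame=10):
    """CPU clock cycles that elapse while one UART frame is on the wire."""
    return clock_hz * bits_per_frame / baud


BAUD = 115200  # assumed baud rate, not taken from the thread

arty = cycles_per_uart_byte(20_000_000, BAUD)    # Arty A7, 20 MHz
vcu108 = cycles_per_uart_byte(50_000_000, BAUD)  # VCU108, 50 MHz

print(round(arty), round(vcu108))  # 1736 4340
```

With the same baud rate, every UART event lands 2.5 times further apart in CPU cycles on the VCU108, so interrupts line up differently against page table walks and cache misses.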
I deleted my comment about the memory, but that still seems the most plausible. Still, wait a while and try running find / for longer than 10 minutes. However, the xxd /dev/urandom command always reproduces the issue. |
Interesting: the UART's INTR bit is always high. This is causing the OS to trap back into the trap handler immediately on exiting, starving all other processes. |
Even more interesting, the CPU is waiting on a wfi instruction while INTR is high. |
But why does the system reboot when tio is closed? |
That part I haven't reproduced. I've been using screen and it's not rebooting the CPU. What is the tio command you are using? |
|
I think the problem is with either
|
Interesting. I've narrowed the failure down to this section of kernel code.
The sw normally clears the intInProgress bits, but for some reason this is not happening. I'm trying to isolate whether this is because it's not being called at all or because the stack pointer is corrupted. There are at least two threads using this function, which is complicating debugging. We can't just trigger on this instruction's address. |
A couple of interesting things to note. The two threads accessing the above function only experience the failure if the effective addresses of the ld at ffffffff801d3856 are a specific number of bytes apart. Sometimes the bug never occurs. For example, the following runs of the same buildroot, differing only by reboot...
After the slowdown, the second interesting thing emerges: only one thread executes the plic_irq_eoi function. |
I have a hypothesis. I bet the HPTW messes up during an interrupt (or similar) and the address translation for the claim data, which should be written to the PLIC, gets corrupted. |
The memory for the ArtyA7 and VCU108 has the same size configured in the DTS, specifically 256 MiB. But are you suggesting that, because of the difference in actual memory size (2 GiB vs. 256 MiB), they might have different timings? Could that be causing issues with the hardware walker and interrupt logic? |
You might have an older device tree. The current device trees configure 256 MiB for the Arty and 2 GiB for the VCU108. The fpga/generator/Makefile modifies the zero stage bootloader to accommodate the different memory size based on the derived configs in config/deriv/fpgavcu108 and config/deriv/fpgaArtyA7. The timing is different because the clock speeds are different, so the timer interrupt will fire at different times relative to events like page table walks and cache misses. Last night I was able to trigger the ILA on the first event of the UART slowdown by triggering on the sret instruction (end of a trap) with InstrValidM == 1 and intInProgress == 0x200. This is a condition which should never occur, because intInProgress means the PLIC is still handling an external interrupt and the handler has yet to acknowledge it. Unfortunately I wasn't able to figure out exactly why the trap handler did not clear the interrupt. What I do know is that it just did not execute any of the code which would have ACKed the interrupt. This means that somewhere in the IRQ code Linux thinks there is no longer an external interrupt. When we figure that out we'll be closer to the real bug. |
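To show why a single missed ACK would be enough to produce these symptoms, here is a minimal sketch of PLIC claim/complete behavior for one interrupt source. It is a simplified software model under stated assumptions, not Wally's RTL or the Linux driver: `PlicSource` and `run_handlers` are hypothetical names, and the only rule modeled is that the PLIC does not forward a new request from a source while an earlier one has been claimed but not completed.

```python
class PlicSource:
    """Simplified model of one PLIC interrupt source (hypothetical, not Wally's RTL)."""

    def __init__(self):
        self.level = False        # device interrupt line, e.g. the UART's INTR
        self.in_progress = False  # claimed but not yet completed

    def hart_sees_interrupt(self):
        # A request is forwarded only while no earlier one is still in flight.
        return self.level and not self.in_progress

    def claim(self):
        # Models the handler reading the claim register.
        if not self.hart_sees_interrupt():
            return False
        self.in_progress = True
        return True

    def complete(self):
        # Models the handler writing the completion register (the ACK).
        self.in_progress = False


def run_handlers(src, n, skip_complete_at=None):
    """Raise the line n times; return how many requests reached the hart."""
    delivered = 0
    for i in range(n):
        src.level = True
        if src.claim():
            delivered += 1
            src.level = False      # servicing the device drops its line
            if i != skip_complete_at:
                src.complete()     # normal path: handler ACKs the interrupt
    return delivered


good = PlicSource()
print(run_handlers(good, 10))                     # 10

stuck = PlicSource()
print(run_handlers(stuck, 10, skip_complete_at=2))  # 3
print(stuck.level, stuck.in_progress)             # True True
```

In the second run the handler skips one completion; after that the device line stays high but nothing reaches the hart, which in this model matches INTR being high while the CPU sits in wfi.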
What happens if you configure the VCU108 with 256 MiB as well? Would the thread addresses then be closer to each other, as you described, potentially triggering the issue? |
After a large amount of output, the UART slows down, and commands like Ctrl-C or other controls stop working. Disconnecting the serial tool from the host triggers a reboot.
You can trigger it by running xxd /dev/urandom and waiting.
Screencast from 2024-12-02 17-17-13.webm