UART slows down #1170

Open
splinedrive opened this issue Dec 2, 2024 · 22 comments
Labels
bug Something isn't working

Comments

@splinedrive

splinedrive commented Dec 2, 2024

After a large amount of output, the UART slows down, and commands like Ctrl-C or other controls stop working. Disconnecting the serial tool from the host triggers a reboot.
You can trigger it by running xxd /dev/urandom and waiting.
Screencast from 2024-12-02 17-17-13.webm

@splinedrive
Author

Screencast.from.2024-12-02.17-17-13.webm

@davidharrishmc
Contributor

Thank you for the report! Do you have the technical capability to run this simulation in a Questa GUI (or possibly VCS or Verilator) and look for a root cause? The UART is in src/uncore/uartPC16550D. It responds normally for us during Linux boot in hardware and simulation. Perhaps you are using it in a different way that exercises a bug in the UART itself, or a bug related to a UART interrupt handler?

@splinedrive
Author

splinedrive commented Dec 2, 2024

Hello Mr. Harris,
I am a big fan of yours.
I just managed to get it running on my Artix-7 FPGA, and this is what I noticed. It happens within 10 minutes.
There are many possible root causes: the kernel driver (very unlikely) or the UART (makes sense). What I don’t understand is why the system reboots when I terminate tio.

I don’t have professional tools; I am a hobbyist who learned the basics from your two edX courses and even developed a Linux/XV6 SoC with the knowledge from those courses. Thank you very much.

https://github.com/splinedrive/kianRiscV

@rosethompson
Contributor

This sounds like a tricky one to debug. I will try to reproduce on my end with the VCU108 board so we can have more debug signals. I'm very concerned about the reset.

@rosethompson
Contributor

Interesting, this isn't reproducing on the VCU108. I tried playing around with various baud rates, and I'm trying the Arty A7 now. If I had to guess, you probably found a bug in a UART FIFO: it reports that the transmit FIFO is always full, so rather than writing multiple bytes per interrupt the driver writes 1 byte, hence the slowdown.
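
To put rough numbers on that guess, here is a small illustrative sketch. It assumes 8N1 framing, the 16-byte transmit FIFO of a 16550-style UART, and the 115200 baud rate that appears in the tio command in this thread. The function name is made up for illustration; this is not taken from the driver.

```python
# Illustrative arithmetic only; the framing and FIFO depth are assumptions.
BAUD = 115200        # baud rate from the tio command in this thread
BITS_PER_BYTE = 10   # assumes 8N1: 1 start + 8 data + 1 stop bit
FIFO_DEPTH = 16      # transmit FIFO depth of a 16550-style UART


def tx_interrupts_per_second(bytes_per_interrupt: int) -> float:
    """Interrupts per second needed to keep the serial line saturated."""
    bytes_per_second = BAUD / BITS_PER_BYTE
    return bytes_per_second / bytes_per_interrupt


full_refill = tx_interrupts_per_second(FIFO_DEPTH)  # whole FIFO refilled each time
single_byte = tx_interrupts_per_second(1)           # one byte written each time
print(f"full refill: {full_refill:.0f}/s, single byte: {single_byte:.0f}/s")
```

Under these assumptions, writing one byte per interrupt needs 16 times as many interrupts to sustain the same line rate, which is the kind of overhead the guess above points at.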

@splinedrive
Author

After 8 minutes, it slows down. I also tested it with an external power supply, but it didn't fix the issue.

@rosethompson
Contributor

I am able to reproduce on the Arty A7 but the VCU108 is not triggering the bug unfortunately. I'm working on an ILA script to debug right now.

@davidharrishmc
Contributor

davidharrishmc commented Dec 2, 2024 via email

@rosethompson
Contributor

It's not too unreasonable to think it could depend on the FPGA, because the two boards have different hardware configurations. The Arty A7 has a 20 MHz clock and 256 MiB of DDR3 memory, and the VCU108 has a 50 MHz clock and 2 GiB of DDR4 memory. This inherently makes the timing of interrupts different, so it's possible we just can't hit the bug on the VCU108.

It's also possible splinedrive's suggestion is correct and it's related to memory since the VCU108 has more.

@splinedrive
Author

@rosethompson

I deleted my comment about the memory, but that still seems the most plausible. Still, wait a while and try running find / for longer than 10 minutes. However, the xxd /dev/urandom command always reproduces the issue.

@rosethompson
Contributor

Interesting, the UART's INTR bit is always high. This causes the OS to trap back into the trap handler immediately after exiting it, starving all other processes.

@rosethompson
Contributor

Even more interesting, the CPU is waiting on a wfi instruction while INTR is high.

@splinedrive
Author

But why does the system reboot when tio is closed?

@rosethompson
Contributor

That part I haven't reproduced. I've been using screen and it's not rebooting the CPU. What is the tio command you are using?

@splinedrive
Author

splinedrive commented Dec 2, 2024

tio -m INLCRNL -o 1 /dev/serial/by-id/usb-Digilent_Digilent_USB_Device_210319AFED71-if01-port0 -b 115200

@rosethompson
Contributor

I think the problem is one of the following:

  1. How the driver is claiming the external interrupt. The PLIC's intInProgress bit 10 (UART interrupt) never goes low.
  2. The hardware has a bug that caused the above condition, and the hardware/software has no way to lower intInProgress.
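
For readers unfamiliar with the PLIC claim/complete handshake, here is a toy behavioral model of the in-progress bit. It is a sketch, not the Wally RTL or the Linux driver: the class and method names are hypothetical, priorities and enables are ignored, and it assumes sources are numbered from 1 so that source 10 corresponds to mask 0x200.

```python
class ToyPlic:
    """Toy single-context PLIC tracking pending and in-progress source bits."""

    def __init__(self) -> None:
        self.pending = 0      # sources that have raised a request
        self.in_progress = 0  # sources claimed but not yet completed

    @staticmethod
    def mask(source_id: int) -> int:
        # Assumption: sources are numbered from 1, so source 10 maps to 1 << 9.
        return 1 << (source_id - 1)

    def raise_irq(self, source_id: int) -> None:
        # A source that is still in progress is gated and cannot go pending again.
        if not self.in_progress & self.mask(source_id):
            self.pending |= self.mask(source_id)

    def claim(self) -> int:
        """Return the lowest pending source ID (0 if none) and mark it in progress."""
        if self.pending == 0:
            return 0
        lowest = self.pending & -self.pending  # isolate the lowest set bit
        self.pending &= ~lowest
        self.in_progress |= lowest
        return lowest.bit_length()

    def complete(self, source_id: int) -> None:
        """Model the write to the claim/complete register that ends handling."""
        self.in_progress &= ~self.mask(source_id)


UART_ID = 10
plic = ToyPlic()
plic.raise_irq(UART_ID)
claimed = plic.claim()            # handler claims the UART interrupt
during_handler = plic.in_progress
plic.raise_irq(UART_ID)           # gated while the source is in progress
gated_pending = plic.pending
plic.complete(UART_ID)            # the step that appears to be missing here
after_complete = plic.in_progress
print(claimed, hex(during_handler), gated_pending, after_complete)
```

In this model, skipping `complete` leaves the in-progress bit set indefinitely and no further UART request can become pending, which matches the symptom described in item 1.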

@rosethompson
Contributor

Interesting. I've narrowed the failure down to this section of kernel code.

ffffffff801d3844 <plic_irq_eoi>:
ffffffff801d3844:	1141                	addi	sp,sp,-16
ffffffff801d3846:	e422                	sd	s0,8(sp)
ffffffff801d3848:	0800                	addi	s0,sp,16
ffffffff801d384a:	0140000f          	fence	w,o
ffffffff801d384e:	04cbd797          	auipc	a5,0x4cbd
ffffffff801d3852:	5ca7b783          	ld	a5,1482(a5) # ffffffff84e90e18 <plic_handlers+0x8>
ffffffff801d3856:	6518                	ld	a4,8(a0)
ffffffff801d3858:	0791                	addi	a5,a5,4
ffffffff801d385a:	c398                	sw	a4,0(a5)
ffffffff801d385c:	6422                	ld	s0,8(sp)
ffffffff801d385e:	0141                	addi	sp,sp,16
ffffffff801d3860:	8082                	ret

The sw normally clears the intInProgress bits, but for some reason this is not happening. I'm trying to isolate whether this is because the function is not being called at all or because the stack pointer is corrupted. At least two threads use this function, which complicates debugging: we can't just trigger on this instruction's address.
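
As a reading aid for the listing above, here is a sketch of the store it performs. It is my reading of the disassembly, not the kernel source: a5 is loaded from plic_handlers+0x8 and incremented by 4, a4 is loaded from 8(a0), and the sw writes the low 32 bits of a4 to that address. The names `hart_base`, `hwirq`, and `CONTEXT_CLAIM` are assumptions about what those values represent, and the example base address is hypothetical.

```python
from typing import Tuple

CONTEXT_CLAIM = 4  # offset applied by "addi a5,a5,4" in the listing


def plic_irq_eoi_store(hart_base: int, hwirq: int) -> Tuple[int, int]:
    """Return (address, 32-bit value) of the store the listing performs."""
    address = hart_base + CONTEXT_CLAIM  # a5 after the addi
    value = hwirq & 0xFFFFFFFF           # sw writes only the low 32 bits of a4
    return address, value


# Hypothetical per-context base address, chosen only for illustration.
EXAMPLE_HART_BASE = 0x0C201000
address, value = plic_irq_eoi_store(EXAMPLE_HART_BASE, 10)
print(hex(address), value)
```

If the value loaded from 8(a0) were wrong, or the function were never reached, the completion write for source 10 would not happen and the in-progress bit would stay set.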

@rosethompson
Contributor

A couple of interesting things to note. The two threads accessing the above function only experience the failure if the effective addresses of the ld at ffffffff801d3856 are a specific number of bytes apart. Sometimes the bug never occurs. For example, the following are runs of the same buildroot image, just on different reboots:

  1. Thread 1 reads 0xffffaf800682a220
    Thread 2 reads 0xffffaf8006712e20
    which are 0x117400 bytes apart, and this never crashes (at least after about an hour).

  2. Thread 1 reads 0xffffaf800681a220
    Thread 2 reads 0xffffaf80066fae20
    which are 0x11F400 bytes apart, and this does crash in less than 10 minutes.

After the slowdown, the second interesting thing emerges: only one thread executes the plic_irq_eoi function.
This explains why the VCU108 never experiences this bug.
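
The distances quoted for the two runs can be checked with a few lines of Python. The addresses are copied from the list above; the dictionary labels are only descriptive.

```python
# Effective addresses read by the ld at ffffffff801d3856, as (thread 1, thread 2).
runs = {
    "run 1 (no failure after about an hour)": (0xFFFFAF800682A220, 0xFFFFAF8006712E20),
    "run 2 (failure in under 10 minutes)": (0xFFFFAF800681A220, 0xFFFFAF80066FAE20),
}

# Distance in bytes between the two threads' addresses for each run.
deltas = {label: t1 - t2 for label, (t1, t2) in runs.items()}
for label, delta in deltas.items():
    print(f"{label}: 0x{delta:X} bytes apart")

# How much the spacing differs between the passing and failing runs.
spacing_difference = abs(
    deltas["run 2 (failure in under 10 minutes)"]
    - deltas["run 1 (no failure after about an hour)"]
)
print(f"difference between the two spacings: 0x{spacing_difference:X}")
```

The two spacings differ by 0x8000 bytes. The sketch only confirms the arithmetic and does not say why that spacing matters.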

@rosethompson
Contributor

I have a hypothesis. I bet the HPTW (hardware page table walker) messes up during an interrupt (or similar), and the address translation for the claim data that should be written to the PLIC gets corrupted.

@splinedrive
Author

The memory for the ArtyA7 and VCU108 has the same size configured in the DTS, specifically 256 MiB. But are you suggesting that, because of the difference in actual memory size (2 GiB vs. 256 MiB), they might have different timings? Could that be causing issues with the hardware walker and interrupt logic?

@rosethompson
Contributor

You might have an older device tree. The current device trees configure 256 MiB for the Arty and 2 GiB for the VCU108. The fpga/generator/Makefile modifies the zero stage bootloader to accommodate the different memory sizes based on the derived configs in config/deriv/fpgavcu108 and config/deriv/fpgaArtyA7.

The timing is different because the clock speeds are different so the timer interrupt will fire at different times relative to events like page table walks and cache misses.

Last night I was able to trigger the ILA on the first event of the UART slowdown by triggering on the sret instruction (end of a trap) with InstrValidM == 1 and intInProgress == 0x200. This condition should never occur, because intInProgress means the PLIC is still handling an external interrupt and handling of that interrupt has yet to be acknowledged. Unfortunately, I wasn't able to figure out exactly why the trap handler did not clear the interrupt. What I do know is that it did not execute any of the code that would have ACKed the interrupt. This means that somewhere in the IRQ code Linux thinks there is no longer an external interrupt. When we figure that out, we'll be closer to the real bug.
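
The trigger condition described above can be written as a simple predicate. This is only a sketch in Python, not the actual ILA script. The parameter names are hypothetical (only InstrValidM and intInProgress come from the design), and 0x10200073 is the standard RISC-V encoding of sret.

```python
SRET_ENCODING = 0x10200073  # RISC-V encoding of the sret instruction
UART_IN_PROGRESS = 0x200    # intInProgress value used in the trigger


def should_trigger(instr: int, instr_valid_m: int, int_in_progress: int) -> bool:
    """True when a valid sret retires while the UART interrupt is still in progress."""
    return (
        instr == SRET_ENCODING
        and instr_valid_m == 1
        and int_in_progress == UART_IN_PROGRESS
    )


# A trap return with the UART interrupt still marked in progress fires the trigger.
print(should_trigger(0x10200073, 1, 0x200))
# A normal trap return, after the interrupt was completed, does not.
print(should_trigger(0x10200073, 1, 0x000))
```

Matching intInProgress for exact equality with 0x200 follows the condition as stated; a trigger on just that bit being set would be a looser variant.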

@splinedrive
Author

What happens if you configure the VCU108 with 256 MiB as well? Would the thread addresses then be closer to each other, as you described, potentially triggering the issue?

@jordancarlin jordancarlin added the bug Something isn't working label Dec 20, 2024