UART slows down #1170

splinedrive · 2024-12-02T16:27:12Z

After a large amount of output, the UART slows down, and commands like Ctrl-C or other controls stop working. Disconnecting the serial tool from the host triggers a reboot.
You can trigger it by running xxd /dev/urandom and waiting.
Screencast from 2024-12-02 17-17-13.webm

splinedrive · 2024-12-02T16:36:57Z

Screencast.from.2024-12-02.17-17-13.webm

davidharrishmc · 2024-12-02T17:05:57Z

Thank you for the report! Do you have the technical capability to run this simulation in a Questa GUI (or possibly VCS or Verilator) and look for a root cause? The UART is in src/uncore/uartPC16550D. It responds normally for us during Linux boot in hardware and simulation. Perhaps you are using it in a different way that exercises a bug in the UART itself, or a bug related to a UART interrupt handler?

splinedrive · 2024-12-02T17:20:20Z

Hello Mr. Harris,
I am a big fan of yours.
I just managed to get it running on my Artix7 FPGA, and this is what I noticed—it happens within 10 minutes.
There are many possible root causes: the kernel driver (very unlikely) or the UART (makes sense). What I don’t understand is why the system reboots when I terminate tio.

I don’t have professional tools; I am a hobbyist who learned the basics from your two edX courses and even developed a Linux/XV6 SoC with the knowledge from those courses. Thank you very much.

https://github.com/splinedrive/kianRiscV

rosethompson · 2024-12-02T17:59:06Z

This sounds like a tricky one to debug. I will try to reproduce on my end with the VCU108 board so we can have more debug signals. I'm very concerned about the reset.

rosethompson · 2024-12-02T18:52:52Z

Interesting, this isn't reproducing on the VCU108. I tried various baud rates, and I'm trying the Arty A7 now. If I had to guess, you found a bug in a UART FIFO: it reports that the transmit FIFO is always full, so rather than multiple bytes being written per interrupt, only 1 byte is written, hence the slowdown.
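
To put a rough number on the hypothesis above: a 16550-style UART has a 16-byte transmit FIFO, so refilling the whole FIFO on each transmit interrupt needs about 16 times fewer interrupts than writing one byte at a time. The sketch below only does that arithmetic. The function name and payload size are illustrative, and the 16-byte depth is the standard 16550 value, assumed rather than measured on this design.

```python
import math

FIFO_DEPTH = 16  # standard 16550 transmit FIFO depth (assumed for this design)


def tx_interrupts(n_bytes: int, bytes_per_irq: int) -> int:
    """Interrupts needed to send n_bytes when each interrupt queues bytes_per_irq."""
    return math.ceil(n_bytes / bytes_per_irq)


payload = 4096                                # illustrative amount of console output
healthy = tx_interrupts(payload, FIFO_DEPTH)  # whole FIFO refilled per interrupt
degraded = tx_interrupts(payload, 1)          # one byte written per interrupt
print(healthy, degraded, degraded // healthy)
```

If the FIFO-full hypothesis holds, the same output would cost roughly 16 times as many trips through the interrupt path.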

splinedrive · 2024-12-02T20:09:26Z

After 8 minutes, it slows down. I also tested it with an external power supply, but it didn't fix the issue.

rosethompson · 2024-12-02T20:32:25Z

I am able to reproduce on the Arty A7 but the VCU108 is not triggering the bug unfortunately. I'm working on an ILA script to debug right now.

davidharrishmc · 2024-12-02T20:33:54Z

That’s wacky that it depends on the FPGA.


rosethompson · 2024-12-02T21:01:22Z

It's not too unreasonable to think it could depend on the FPGA, because the two FPGAs have different hardware configurations. The Arty A7 has a 20 MHz clock and 256 MiB of DDR3 memory, and the VCU108 has a 50 MHz clock and 2 GiB of DDR4 memory. This inherently makes the timing of interrupts different, so it's possible we just can't hit the bug on the VCU108.

It's also possible splinedrive's suggestion is correct and it's related to memory since the VCU108 has more.

splinedrive · 2024-12-02T21:04:34Z

@rosethompson

I deleted my comment about the memory, but that still seems the most plausible. Still, wait a while and try running find / for longer than 10 minutes. However, the xxd /dev/urandom command always reproduces the issue.

rosethompson · 2024-12-02T21:37:46Z

Interesting, the UART's INTR bit is always high. This causes the OS to trap back into the trap handler immediately on exiting it, starving all other processes.

rosethompson · 2024-12-02T21:40:02Z

Even more interesting, the CPU is waiting on a wfi instruction while INTR is high.

splinedrive · 2024-12-02T21:41:49Z

But why does the system reboot when tio is closed?

rosethompson · 2024-12-02T21:44:04Z

That part I haven't reproduced. I've been using screen and it's not rebooting the CPU. What is the tio command you are using?

splinedrive · 2024-12-02T21:45:04Z

tio -m INLCRNL -o 1 /dev/serial/by-id/usb-Digilent_Digilent_USB_Device_210319AFED71-if01-port0 -b 115200

rosethompson · 2024-12-03T04:57:15Z

I think the problem is with either

How the driver is claiming the external interrupt. The PLIC's intInProgress bit 10 (UART interrupt) never goes low.
Or the hardware has a bug which caused the above condition, and the hardware/software has no way to lower intInProgress.
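
The stuck condition described above can be illustrated with a toy model of the PLIC claim/complete handshake. This is a simplified sketch, not the Wally RTL or the Linux driver: the class and method names are invented, it tracks sources in sets instead of register bits, it ignores priorities, and it assumes the usual RISC-V PLIC behavior that a source is gated while its interrupt is in progress.

```python
class ToyPlic:
    """Minimal single-context PLIC model (hypothetical, for illustration only)."""

    def __init__(self):
        self.pending = set()      # sources waiting to be claimed
        self.in_progress = set()  # sources claimed but not yet completed

    def raise_irq(self, source: int) -> bool:
        """Device asserts its line; the request is dropped while in progress."""
        if source in self.in_progress:
            return False
        self.pending.add(source)
        return True

    def claim(self) -> int:
        """Return a pending source (0 if none) and mark it in progress."""
        if not self.pending:
            return 0
        source = min(self.pending)  # real PLICs pick by priority; simplified here
        self.pending.remove(source)
        self.in_progress.add(source)
        return source

    def complete(self, source: int) -> None:
        """Model the store to the claim/complete register."""
        self.in_progress.discard(source)


UART_IRQ = 10  # source number taken from the comment above

plic = ToyPlic()
plic.raise_irq(UART_IRQ)
claimed = plic.claim()

# If the handler returns without completing, the source stays in progress
# and later UART requests never become pending.
accepted_while_stuck = plic.raise_irq(UART_IRQ)

plic.complete(UART_IRQ)
accepted_after_complete = plic.raise_irq(UART_IRQ)
print(claimed, accepted_while_stuck, accepted_after_complete)
```

In this model, either a missing completion write from software or hardware that ignores the write leaves the source stuck in progress, which matches the two alternatives listed above.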

rosethompson · 2024-12-03T20:17:01Z

Interesting. I've narrowed the failure down to this section of kernel code.

ffffffff801d3844 <plic_irq_eoi>:
ffffffff801d3844:	1141                	addi	sp,sp,-16
ffffffff801d3846:	e422                	sd	s0,8(sp)
ffffffff801d3848:	0800                	addi	s0,sp,16
ffffffff801d384a:	0140000f          	fence	w,o
ffffffff801d384e:	04cbd797          	auipc	a5,0x4cbd
ffffffff801d3852:	5ca7b783          	ld	a5,1482(a5) # ffffffff84e90e18 <plic_handlers+0x8>
ffffffff801d3856:	6518                	ld	a4,8(a0)
ffffffff801d3858:	0791                	addi	a5,a5,4
ffffffff801d385a:	c398                	sw	a4,0(a5)
ffffffff801d385c:	6422                	ld	s0,8(sp)
ffffffff801d385e:	0141                	addi	sp,sp,16
ffffffff801d3860:	8082                	ret

The sw normally clears the intInProgress bits, but for some reason this is not happening. I'm trying to isolate whether this is because the function is not being called at all or because the stack pointer is corrupted. At least two threads use this function, which complicates debugging: we can't just trigger on this instruction's address.
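
The address arithmetic in the listing above can be checked by hand. The sketch below redoes the auipc/ld computation using only the values shown in the disassembly; the helper name is made up for illustration. Reading the following addi a5,a5,4 as the offset of the PLIC claim/complete register is an interpretation based on the standard PLIC memory map, not something the listing itself states.

```python
MASK64 = (1 << 64) - 1


def auipc(pc: int, imm20: int) -> int:
    """RV64 auipc: pc plus the sign-extended 20-bit immediate shifted left by 12."""
    offset = imm20 << 12
    if imm20 & 0x80000:  # sign-extend the 32-bit offset
        offset -= 1 << 32
    return (pc + offset) & MASK64


a5 = auipc(0xFFFFFFFF801D384E, 0x4CBD)  # auipc a5,0x4cbd
slot = (a5 + 1482) & MASK64             # ld a5,1482(a5): address that gets read
print(hex(a5), hex(slot))
```

The computed slot agrees with the objdump annotation ffffffff84e90e18 <plic_handlers+0x8>. The pointer loaded from there, plus 4, is where the sw stores the value read from 8(a0).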

rosethompson · 2024-12-03T21:29:57Z

A couple of interesting things to note. The two threads accessing the above function only experience the failure if the effective addresses of the ld at ffffffff801d3856 are a specific number of bytes apart. Sometimes the bug never occurs. For example, here are two runs of the same buildroot, just different reboots:

Thread 1 reads 0xffffaf800682a220
Thread 2 reads 0xffffaf8006712e20
which are 0x117400 bytes apart and this never crashes (at least after about an hour).
Thread 1 reads 0xffffaf800681a220
Thread 2 reads 0xffffaf80066fae20
which are 0x11F400 bytes apart and this does crash in less than 10 minutes.

After the slowdown the second interesting thing emerges: only one thread executes the plic_irq_eoi function.
This explains why the VCU108 is never experiencing this bug.
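
The distances quoted for the two runs can be re-derived directly from the addresses. A small check, with illustrative labels and helper name:

```python
# Effective addresses read by the ld in each run: (thread 1, thread 2)
runs = {
    "no slowdown after about an hour": (0xFFFFAF800682A220, 0xFFFFAF8006712E20),
    "slowdown in under 10 minutes": (0xFFFFAF800681A220, 0xFFFFAF80066FAE20),
}


def distance(pair):
    """Byte distance between the two threads' effective addresses."""
    thread1, thread2 = pair
    return thread1 - thread2


for label, pair in runs.items():
    print(f"{label}: {distance(pair):#x}")

good, bad = (distance(pair) for pair in runs.values())
print(f"difference between the two distances: {bad - good:#x}")
```

The two distances differ by 0x8000 (32 KiB), and the low 12 bits of each thread's address are the same in both runs.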

rosethompson · 2024-12-03T21:35:37Z

I have a hypothesis. I bet the hptw messes up during an interrupt (or similar) and the address translation for the claim data, which should be written to the PLIC, gets corrupted.

splinedrive · 2024-12-04T15:22:18Z

The memory for the ArtyA7 and VCU108 has the same size configured in the DTS, specifically 256 MiB. But are you suggesting that, because of the difference in actual memory size (2 GiB vs. 256 MiB), they might have different timings? Could that be causing issues with the hardware walker and interrupt logic?

rosethompson · 2024-12-04T17:35:57Z

You might have an older device tree. The current device trees configure 256 MiB for the Arty and 2 GiB for the VCU108. The fpga/generator/Makefile modifies the zero stage bootloader to accommodate the different memory size based on the derived configs in config/deriv/fpgavcu108 and config/deriv/fpgaArtyA7.

The timing is different because the clock speeds are different so the timer interrupt will fire at different times relative to events like page table walks and cache misses.

Last night I was able to trigger the ILA on the first event of the UART slowdown by triggering on the sret instruction (end of a trap) with InstrValidM == 1 and intInProgress == 0x200. This condition should never occur, because intInProgress means the PLIC is still handling an external interrupt and the handling of that interrupt has yet to be acknowledged. Unfortunately I wasn't able to figure out exactly why the trap handler did not clear the interrupt. What I do know is that it did not execute any of the code which would have ACKed the interrupt. This means that somewhere in the IRQ code Linux thinks there is no longer an external interrupt. When we figure that out we'll be closer to the real bug.
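
The trigger condition above can be written as a predicate over captured samples. This is only a sketch of the logic, not an ILA script: the function and the sample tuples are hypothetical, while the sret encoding (0x10200073) and the 0x200 value come from the RISC-V privileged spec and the comment above respectively.

```python
SRET = 0x10200073  # RISC-V machine-code encoding of the sret instruction


def ila_trigger(instr: int, instr_valid_m: int, int_in_progress: int) -> bool:
    """True when a valid sret retires while the PLIC still shows 0x200 in progress."""
    return instr == SRET and instr_valid_m == 1 and int_in_progress == 0x200


# Made-up samples: (instruction, InstrValidM, intInProgress)
samples = [
    (0x10200073, 1, 0x000),  # normal trap exit: nothing left in progress
    (0x00000013, 1, 0x200),  # nop while the interrupt is still being handled
    (0x10200073, 1, 0x200),  # trap exit with the interrupt never completed
]
hits = [i for i, sample in enumerate(samples) if ila_trigger(*sample)]
print(hits)
```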

splinedrive · 2024-12-04T18:31:35Z

What happens if you configure the VCU108 with 256 MiB as well? Would the thread addresses then be closer to each other, as you described, potentially triggering the issue?

jordancarlin added the bug label on Dec 20, 2024
