
"rcu_preempt kthread starved for 10002 jiffies!" #1008


Description

@iximeow

@leftwo saw this on a slightly-but-not-very special build of Propolis on a Cosmo with slightly-but-not-very special host OS bits. on the Propolis side, it's commit 72a2f97, which is #1003. the host OS bits don't have any changes that would relate to VMs, but might relate to the VM's disks (nvme1..nvme10), excepting its boot disk. the VM was created with ./propolis-cli-1003 --server 0.0.0.0 new -c 192 -m 1048576 --config-toml ubuntu-instance-10-disks-wth.toml aln. I've collected the propolis-server, configuration, kstats, and more recently some guest measurements (/proc/stat and dmesg), and stashed them all at /staff/core/propolis-1008. nothing has crashed, so there's no core file :)

Here's the warning from the guest:

Ubuntu 24.04.3 LTS cosmomb2vm ttyS0

[  368.798927] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  368.800199] rcu:     31-...0: (4 ticks this GP) idle=430c/1/0x4000000000000000 softirq=616/617 fqs=11246
[  368.801970] rcu:              hardirqs   softirqs   csw/system
[  368.803049] rcu:      number:        0          0            0
[  368.804127] rcu:     cputime:        0          0            0   ==> 30004(ms)
[  368.805472] rcu:     (detected by 28, t=60009 jiffies, g=6089, q=961 ncpus=192)
[  378.808588] nmi_backtrace_stall_check: CPU 31: NMIs are not reaching exc_nmi() handler, last activity: 4295042560 jiffies ago.
[  378.810855] rcu: rcu_preempt kthread starved for 10002 jiffies! g6089 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=17
[  378.812863] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[  378.814617] rcu: RCU grace-period kthread stack dump:
[  378.815647] rcu: Stack dump where RCU GP kthread last ran:
[  392.997356] Uhhuh. NMI received for unknown reason 20 on CPU 31.
[  392.997360] Dazed and confused, but trying to continue

things that stand out to me here:

  • this happened after the VM was up for 368 seconds.
  • about 10 seconds after rcu_preempt detected stalls on CPUs/tasks, we get NMIs are not reaching exc_nmi() handler.
    • taken literally, "last activity: 4295042560 jiffies ago" is 4.29 billion jiffies, which at CONFIG_HZ=1000 would be roughly 49.7 days. that can't be right for a VM that's been up about six minutes. it might be better interpreted as a wrapped value, 2**32 - 4295042560 == -75264 jiffies, which makes a bit more sense at CONFIG_HZ=1000... except then the last activity was 75 seconds in the future? (see the quick arithmetic check after this list.)
  • 14.18 seconds later CPU 31 seems to finally wake up and see NMIs. I'm not sure where these came from (CPU 28 when it noticed the stall?). at this point CPU 31 seems to have been stuck, from the guest's perspective, for at least 24 seconds.
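
for reference, here's the arithmetic behind the last two bullets as a quick python check. CONFIG_HZ=1000 and the 32-bit wrap interpretation are assumptions on my part, not anything the guest has confirmed:

# timestamps lifted from the console output above
stall_detected   = 368.798927   # "rcu_preempt detected stalls on CPUs/tasks"
nmi_not_reaching = 378.808588   # "NMIs are not reaching exc_nmi() handler"
nmi_received     = 392.997356   # "NMI received for unknown reason 20 on CPU 31"

HZ = 1000                       # assumed guest CONFIG_HZ

print(nmi_not_reaching - stall_detected)   # ~10.01s from stall report to the NMI complaint
print(nmi_received - nmi_not_reaching)     # ~14.19s more before CPU 31 actually sees an NMI

# "last activity: 4295042560 jiffies ago", taken literally:
print(4295042560 / HZ / 86400)             # ~49.7 days, absurd for a ~6 minute uptime

# interpreted as a wrapped 32-bit quantity instead:
print((2**32 - 4295042560) / HZ)           # -75.264, i.e. "75 seconds in the future"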

just a bit later, some more console output Alan saw:

cosmomb2vm login: [ 1557.983791] I/O error, dev nvme3n1, sector 1896448 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983821] I/O error, dev nvme3n1, sector 280576 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1557.983824] I/O error, dev nvme3n1, sector 1472512 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983834] I/O error, dev nvme3n1, sector 1474560 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[ 1557.983862] I/O error, dev nvme3n1, sector 1476608 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983864] I/O error, dev nvme3n1, sector 1478656 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1557.983865] I/O error, dev nvme3n1, sector 1480704 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983866] I/O error, dev nvme3n1, sector 1482752 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
[ 1557.983883] I/O error, dev nvme3n1, sector 1484800 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983884] I/O error, dev nvme3n1, sector 1486848 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1558.045535] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 1558.047915] Dev nvme1n1: unable to read RDB block 0

not .. great. this is ~1189 seconds after the above warnings about stalls, and at this point it seems Linux has given up on all the zvol-backed disks. nvme list only shows nvme0, and running dtrace -n 'nvme_admin_cmd { printf("opcode: %08x, prp1: %08x, prp2: %08x\n", arg0, arg1, arg2); }' -p 798 concurrently with nvme list only prints one line(!), like opcode: 00000006, prp1: 46b826000, prp2: 00000000. likewise, nvme_doorbell_admin_cq only gets rung once, so it seems Linux has decided it's no longer worth trying to drive the admin queues on the other disks.
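
in case it's useful to anyone else poking at the saved guest dmesg, this is roughly how I'd tally the failed I/Os per device. the filename and the exact message format matched by the regex are assumptions, so adjust to taste:

# count "I/O error, dev <dev>, sector <n> op 0x<op>:(<NAME>) ..." lines per device and op
import re
from collections import Counter

pat = re.compile(r"I/O error, dev (\S+), sector \d+ op \S+:\((\w+)\)")

errors = Counter()
with open("dmesg.txt") as f:         # hypothetical name for the captured guest dmesg
    for line in f:
        m = pat.search(line)
        if m:
            errors[m.groups()] += 1  # key is (device, op name)

for (dev, op), n in sorted(errors.items()):
    print(f"{dev:10} {op:6} {n:6}")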

so this is where things get really odd. I'd initially thought #1003 was on a recent-enough base to include the recent CPU binding change that is now in master. it is not.

2J80WF9K # pgrep propolis-server
798
2J80WF9K # pbind -q 798/236-        # pstack tells me that LWP 236 is "vcpu-31", which would be bound to CPU 95 now
2J80WF9K #                          # .. nevertheless, it is not

so, vCPU threads are actually not bound, and should freely migrate to wherever they can be run. the stats and kstats are from easily a few hours after all this happened, but the VM and the wider system should have been idle in the meantime. I captured kstat vmm, stashed as kstats_after_rcu_preempt_detected_stalls, and that's what I'm looking at from here. the worst time_sched is pretty bad, but not tens of seconds: vcpu7 spent 1.407 seconds in time_sched. a smattering of other vCPUs were in time_sched for 413-440ms, and one for 795ms. time_emu_kern is more interesting: it ranges from 11 seconds(!) to 80 seconds(!!), and in fact is well above time_run across the board. time_run is around 26s per CPU; six of those seconds were me running a loop in python on each CPU as a spot check that vCPUs were... working. all CPUs seemed responsive an hour or three after the above events when I was looking.
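
for reference, this is roughly how I'm comparing the per-vCPU numbers. it's a sketch that assumes kstat -p vmm parseable output on stdin (module:instance:name:statistic, then the value) and that the time_* stats are cumulative nanoseconds, which I believe is the case but haven't double-checked:

# summarize per-vCPU time_run / time_sched / time_emu_kern
# usage: kstat -p vmm | python3 vcpu_times.py
import sys
from collections import defaultdict

NS = 1e9
vcpus = defaultdict(dict)   # (instance, "vcpuN") -> {stat: value}

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 2:
        continue
    key, value = fields
    parts = key.split(":")
    if len(parts) != 4:
        continue
    module, instance, name, stat = parts
    if module == "vmm" and name.startswith("vcpu") and stat.startswith("time_"):
        vcpus[(instance, name)][stat] = int(value)

for (instance, name), stats in sorted(vcpus.items()):
    run = stats.get("time_run", 0) / NS
    sched = stats.get("time_sched", 0) / NS
    emu_kern = stats.get("time_emu_kern", 0) / NS
    print(f"{name:8} run={run:9.3f}s  sched={sched:7.3f}s  emu_kern={emu_kern:9.3f}s")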

this is about the point I looked at the guest's dmesg. the buffer has wrapped, so history starts at 284 seconds into boot. almost all of the messages are about Linux aborting I/Os because of timeouts. there's this later on:

[ 1558.045651] fio: attempt to access beyond end of device

so fio was running and was presumably the source of all the failed I/Os. given that the vCPUs are not bound, don't have much time in time_sched, but do have significant time in kernel emulation, I'm not really sure what.. has transpired here.

for good measure, I grabbed pstack 798 and all 80 file backend threads are there and sitting in block_for_req as one would expect for idle disks. since Linux has reset the NVMe controllers a few times, I don't think there's any surviving interesting state in Propolis.

having recorded all this, my remaining theory is that maybe we started the VM while the zvols were still being initialized, so the I/Os were unexpectedly slow because we've bypassed waiting for sled-agent to ensure the volumes are ready? in which case restarting the VM should result in everything being OK, and destroying and recreating the volumes might reproduce the issue. I'd like to follow up with Alan to see if this makes any sense before tearing anything down, though.
