
"rcu_preempt kthread starved for 10002 jiffies!" #1008


Description

@iximeow

@leftwo saw this on a slightly-but-not-very special build of Propolis on a Cosmo with slightly-but-not-very special host OS bits. on the Propolis side, it's commit 72a2f97, which is #1003. the host OS bits don't have any changes that would relate to VMs, but might relate to the VM's disks (nvme1..nvme10), excepting its boot disk. the VM was created with ./propolis-cli-1003 --server 0.0.0.0 new -c 192 -m 1048576 --config-toml ubuntu-instance-10-disks-wth.toml aln. I've collected the propolis-server, configuration, kstats, and more recently some guest measurements (/proc/stat and dmesg), and stashed them all at /staff/core/propolis-1008. nothing has crashed, so there's no core file :)

Here's the warning from the guest:

Ubuntu 24.04.3 LTS cosmomb2vm ttyS0

[  368.798927] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  368.800199] rcu:     31-...0: (4 ticks this GP) idle=430c/1/0x4000000000000000 softirq=616/617 fqs=11246
[  368.801970] rcu:              hardirqs   softirqs   csw/system
[  368.803049] rcu:      number:        0          0            0
[  368.804127] rcu:     cputime:        0          0            0   ==> 30004(ms)
[  368.805472] rcu:     (detected by 28, t=60009 jiffies, g=6089, q=961 ncpus=192)
[  378.808588] nmi_backtrace_stall_check: CPU 31: NMIs are not reaching exc_nmi() handler, last activity: 4295042560 jiffies ago.
[  378.810855] rcu: rcu_preempt kthread starved for 10002 jiffies! g6089 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=17
[  378.812863] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[  378.814617] rcu: RCU grace-period kthread stack dump:
[  378.815647] rcu: Stack dump where RCU GP kthread last ran:
[  392.997356] Uhhuh. NMI received for unknown reason 20 on CPU 31.
[  392.997360] Dazed and confused, but trying to continue

things that stand out to me here:

  • this happened after the VM was up for 368 seconds.
  • about 10 seconds after rcu_preempt detected stalls on CPUs/tasks, we get NMIs are not reaching exc_nmi() handler.
    • taken literally, "last activity: 4295042560 jiffies ago" is 4.29 billion jiffies, which at CONFIG_HZ=1000 would be roughly 49.7 days. that can't be right for a VM that's been up about six minutes. it might be better interpreted as a wrapped value, 2**32 - 4295042560 == -75264 jiffies, which makes a bit more sense at CONFIG_HZ=1000... except then the last activity was 75 seconds in the future? (see the quick arithmetic check after this list.)
  • 14.18 seconds later CPU 31 seems to finally wake up and see NMIs. I'm not sure where these came from (CPU 28 when it noticed the stall?). at this point CPU 31 seems to have been stuck, from the guest's perspective, for at least 24 seconds.
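
for reference, here's the arithmetic behind the last two bullets as a quick python check. CONFIG_HZ=1000 and the 32-bit wrap interpretation are assumptions on my part, not anything the guest has confirmed:

# timestamps lifted from the console output above
stall_detected   = 368.798927   # "rcu_preempt detected stalls on CPUs/tasks"
nmi_not_reaching = 378.808588   # "NMIs are not reaching exc_nmi() handler"
nmi_received     = 392.997356   # "NMI received for unknown reason 20 on CPU 31"

HZ = 1000                       # assumed guest CONFIG_HZ

print(nmi_not_reaching - stall_detected)   # ~10.01s from stall report to the NMI complaint
print(nmi_received - nmi_not_reaching)     # ~14.19s more before CPU 31 actually sees an NMI

# "last activity: 4295042560 jiffies ago", taken literally:
print(4295042560 / HZ / 86400)             # ~49.7 days, absurd for a ~6 minute uptime

# interpreted as a wrapped 32-bit quantity instead:
print((2**32 - 4295042560) / HZ)           # -75.264, i.e. "75 seconds in the future"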

just a bit later, some more console output Alan saw:

cosmomb2vm login: [ 1557.983791] I/O error, dev nvme3n1, sector 1896448 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983821] I/O error, dev nvme3n1, sector 280576 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1557.983824] I/O error, dev nvme3n1, sector 1472512 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983834] I/O error, dev nvme3n1, sector 1474560 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[ 1557.983862] I/O error, dev nvme3n1, sector 1476608 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983864] I/O error, dev nvme3n1, sector 1478656 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1557.983865] I/O error, dev nvme3n1, sector 1480704 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983866] I/O error, dev nvme3n1, sector 1482752 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
[ 1557.983883] I/O error, dev nvme3n1, sector 1484800 op 0x1:(WRITE) flags 0xc800 phys_seg 2 prio class 0
[ 1557.983884] I/O error, dev nvme3n1, sector 1486848 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
[ 1558.045535] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 1558.047915] Dev nvme1n1: unable to read RDB block 0

not .. great. this is ~1189 seconds after the above warnings about stalls, and at this point it seems Linux has given up on all the zvol-backed disks. nvme list only shows nvme0, and running dtrace -n 'nvme_admin_cmd { printf("opcode: %08x, prp1: %08x, prp2: %08x\n", arg0, arg1, arg2); }' -p 798 concurrently with nvme list only prints one line(!), like opcode: 00000006, prp1: 46b826000, prp2: 00000000. likewise, nvme_doorbell_admin_cq only gets rung once, so it seems Linux has decided it's no longer worth trying to drive the admin queues on the other disks.
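
in case it's useful to anyone else poking at the saved guest dmesg, this is roughly how I'd tally the failed I/Os per device. the filename and the exact message format matched by the regex are assumptions, so adjust to taste:

# count "I/O error, dev <dev>, sector <n> op 0x<op>:(<NAME>) ..." lines per device and op
import re
from collections import Counter

pat = re.compile(r"I/O error, dev (\S+), sector \d+ op \S+:\((\w+)\)")

errors = Counter()
with open("dmesg.txt") as f:         # hypothetical name for the captured guest dmesg
    for line in f:
        m = pat.search(line)
        if m:
            errors[m.groups()] += 1  # key is (device, op name)

for (dev, op), n in sorted(errors.items()):
    print(f"{dev:10} {op:6} {n:6}")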

so this is where things get really odd. I'd initially thought #1003 was on a recent-enough base to include the recent CPU binding change that is now in master. it is not.

2J80WF9K # pgrep propolis-server
798
2J80WF9K # pbind -q 798/236-        # pstack tells me that LWP 236 is "vcpu-31", which would be bound to CPU 95 now
2J80WF9K #                          # .. nevertheless, it is not

so, vCPU threads are actually not bound, and should freely migrate to wherever they can be run. the stats and kstats are from easily a few hours after all this happened, but the VM and the wider system should have been idle in the meantime. I captured kstat vmm, stashed as kstats_after_rcu_preempt_detected_stalls, and that's what I'm looking at from here. the worst time_sched is pretty bad, but not tens of seconds: vcpu7 spent 1.407 seconds in time_sched. a smattering of other vCPUs were in time_sched for 413-440ms, and one for 795ms. time_emu_kern is more interesting: it ranges from 11 seconds(!) to 80 seconds(!!), and in fact is well above time_run across the board. time_run is around 26s per CPU; six of those seconds were me running a loop in python on each CPU as a spot check that vCPUs were... working. all CPUs seemed responsive an hour or three after the above events when I was looking.
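
for reference, this is roughly how I'm comparing the per-vCPU numbers. it's a sketch that assumes kstat -p vmm parseable output on stdin (module:instance:name:statistic, then the value) and that the time_* stats are cumulative nanoseconds, which I believe is the case but haven't double-checked:

# summarize per-vCPU time_run / time_sched / time_emu_kern
# usage: kstat -p vmm | python3 vcpu_times.py
import sys
from collections import defaultdict

NS = 1e9
vcpus = defaultdict(dict)   # (instance, "vcpuN") -> {stat: value}

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 2:
        continue
    key, value = fields
    parts = key.split(":")
    if len(parts) != 4:
        continue
    module, instance, name, stat = parts
    if module == "vmm" and name.startswith("vcpu") and stat.startswith("time_"):
        vcpus[(instance, name)][stat] = int(value)

for (instance, name), stats in sorted(vcpus.items()):
    run = stats.get("time_run", 0) / NS
    sched = stats.get("time_sched", 0) / NS
    emu_kern = stats.get("time_emu_kern", 0) / NS
    print(f"{name:8} run={run:9.3f}s  sched={sched:7.3f}s  emu_kern={emu_kern:9.3f}s")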

this is about the point I looked at the guest's dmesg. the buffer has wrapped, so history starts at 284 seconds into boot. almost all of the messages are about Linux aborting I/Os because of timeouts. there's this later on:

[ 1558.045651] fio: attempt to access beyond end of device

so fio was running and was presumably the source of all the failed I/Os. given that the vCPUs are not bound, don't have much time in time_sched, but do have significant time in kernel emulation, I'm not really sure what.. has transpired here.

for good measure, I grabbed pstack 798 and all 80 file backend threads are there and sitting in block_for_req as one would expect for idle disks. since Linux has reset the NVMe controllers a few times, I don't think there's any surviving interesting state in Propolis.

having recorded all this, my remaining theory is that maybe we started the VM while the zvols were still being initialized, so the I/Os were unexpectedly slow because we've bypassed waiting for sled-agent to ensure the volumes are ready? in which case restarting the VM should result in everything being OK, and destroying and recreating the volumes might reproduce the issue. I'd like to follow up with Alan to see if this makes any sense before tearing anything down, though.
