Kernel bug with PTP locking mutex in an interrupt #43

@omion

# uname -a
Linux minty 6.14.0-27-generic #27~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 22 17:38:49 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
# cat /sys/kernel/debug/sched/preempt
none (voluntary) full lazy
# modinfo ice
filename:       /lib/modules/6.14.0-27-generic/updates/drivers/net/ethernet/intel/ice/ice.ko
firmware:       updates/intel/ice/ddp/ice.pkg
version:        2.3.10
license:        GPL v2
description:    Intel(R) Ethernet Connection E800 Series Linux Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     D930395D8B0791D679E4981
alias:          pci:v00008086d00001888sv*sd*bc*sc*i*
alias:          pci:v00008086d000012DEsv*sd*bc*sc*i*
alias:          pci:v00008086d000012DAsv*sd*bc*sc*i*
alias:          pci:v00008086d000012DDsv*sd*bc*sc*i*
alias:          pci:v00008086d000012D8sv*sd*bc*sc*i*
alias:          pci:v00008086d000012DCsv*sd*bc*sc*i*
alias:          pci:v00008086d000012D5sv*sd*bc*sc*i*
alias:          pci:v00008086d000012D3sv*sd*bc*sc*i*
alias:          pci:v00008086d000012D2sv*sd*bc*sc*i*
alias:          pci:v00008086d000012D1sv*sd*bc*sc*i*
alias:          pci:v00008086d0000579Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000579Esv*sd*bc*sc*i*
alias:          pci:v00008086d0000579Dsv*sd*bc*sc*i*
alias:          pci:v00008086d0000579Csv*sd*bc*sc*i*
alias:          pci:v00008086d0000151Dsv*sd*bc*sc*i*
alias:          pci:v00008086d0000124Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000124Esv*sd*bc*sc*i*
alias:          pci:v00008086d0000124Dsv*sd*bc*sc*i*
alias:          pci:v00008086d0000124Csv*sd*bc*sc*i*
alias:          pci:v00008086d0000189Asv*sd*bc*sc*i*
alias:          pci:v00008086d00001899sv*sd*bc*sc*i*
alias:          pci:v00008086d00001898sv*sd*bc*sc*i*
alias:          pci:v00008086d00001897sv*sd*bc*sc*i*
alias:          pci:v00008086d00001894sv*sd*bc*sc*i*
alias:          pci:v00008086d00001893sv*sd*bc*sc*i*
alias:          pci:v00008086d00001892sv*sd*bc*sc*i*
alias:          pci:v00008086d00001891sv*sd*bc*sc*i*
alias:          pci:v00008086d00001890sv*sd*bc*sc*i*
alias:          pci:v00008086d0000188Esv*sd*bc*sc*i*
alias:          pci:v00008086d0000188Dsv*sd*bc*sc*i*
alias:          pci:v00008086d0000188Csv*sd*bc*sc*i*
alias:          pci:v00008086d0000188Bsv*sd*bc*sc*i*
alias:          pci:v00008086d0000188Asv*sd*bc*sc*i*
alias:          pci:v00008086d0000159Bsv*sd*bc*sc*i*
alias:          pci:v00008086d0000159Asv*sd*bc*sc*i*
alias:          pci:v00008086d00001599sv*sd*bc*sc*i*
alias:          pci:v00008086d00001593sv*sd*bc*sc*i*
alias:          pci:v00008086d00001592sv*sd*bc*sc*i*
alias:          pci:v00008086d00001591sv*sd*bc*sc*i*
depends:        gnss
name:           ice
retpoline:      Y
vermagic:       6.14.0-27-generic SMP preempt mod_unload modversions
parm:           debug:netif level (0=none,...,16=all) (int)

dmesg info:

[Sep19 14:35] BUG: scheduling while atomic: ptp0/396/0x00010001
[  +0.000009] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore cmac nls_utf8 cif>
[  +0.000080]  dm_log hid_generic usbhid cdc_ether usbnet hid mii sdhci_pci nvme polyval_clmulni ice(OE) polyval_generic ghash_clmulni_intel sdhci_uhs2 sha256_ssse3 nv>
[  +0.000020] CPU: 6 UID: 0 PID: 396 Comm: ptp0 Tainted: G           OE      6.14.0-27-generic #27~24.04.1-Ubuntu
[  +0.000003] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  +0.000001] Hardware name: To Be Filled By O.E.M. D1749NTD4U-4Q/D1749NTD4U-4Q, BIOS P1.00 08/17/2023
[  +0.000001] Call Trace:
[  +0.000001]  <IRQ>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000007]  dump_stack+0x10/0x20
[  +0.000002]  __schedule_bug+0x64/0x80
[  +0.000005]  schedule_debug.isra.0+0xd1/0x120
[  +0.000002]  __schedule+0x70/0x640
[  +0.000005]  schedule+0x29/0xd0
[  +0.000002]  schedule_preempt_disabled+0x15/0x30
[  +0.000001]  __mutex_lock.constprop.0+0x722/0x7b0
[  +0.000002]  ? sched_clock+0x10/0x30
[  +0.000004]  __mutex_lock_slowpath+0x13/0x20
[  +0.000002]  mutex_lock+0x3b/0x50
[  +0.000002]  ice_ptp_process_ts+0x6b/0x1b0 [ice]
[  +0.000072]  ice_ptp_ts_irq+0x7e/0x170 [ice]
[  +0.000039]  ? profile_tick+0x34/0x50
[  +0.000004]  ? timerqueue_add+0xa6/0xd0
[  +0.000001]  ice_misc_intr+0x221/0x360 [ice]
[  +0.000035]  __handle_irq_event_percpu+0x4c/0x1b0
[  +0.000003]  ? sched_clock_noinstr+0x9/0x10
[  +0.000002]  handle_irq_event+0x39/0x80
[  +0.000001]  handle_edge_irq+0x8c/0x250
[  +0.000002]  __common_interrupt+0x4e/0x110
[  +0.000003]  common_interrupt+0xb1/0xe0
[  +0.000002]  </IRQ>
[  +0.000000]  <TASK>
[  +0.000001]  asm_common_interrupt+0x27/0x40
[  +0.000003] RIP: 0010:ice_ptp_update_cached_phctime_all+0x56/0x1e0 [ice]
[  +0.000038] Code: 70 f8 ff ff 49 89 c4 49 8b 45 08 48 8b 58 10 48 8d 50 10 48 39 d3 0f 84 08 01 00 00 41 be d0 07 00 00 48 83 bb b8 00 00 00 00 <48> 8b 8b a8 f4 ff f>
[  +0.000001] RSP: 0018:ff6e277d8254bdb0 EFLAGS: 00000202
[  +0.000002] RAX: ff1f73dfc289c000 RBX: ff1f73dfe0128cf8 RCX: ff1f73dfc455d000
[  +0.000001] RDX: ff1f73dfc289c010 RSI: 00000000000000c0 RDI: 0000000000000000
[  +0.000000] RBP: ff6e277d8254bdf8 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 1866cd9d71f32b80
[  +0.000001] R13: ff1f73dfe01281a0 R14: 00000000000007d0 R15: 0000000000000000
[  +0.000002]  ? finish_task_switch.isra.0+0x9c/0x310
[  +0.000003]  ice_ptp_periodic_work+0x4a/0x310 [ice]
[  +0.000041]  ptp_aux_kworker+0x1d/0x50
[  +0.000004]  kthread_worker_fn+0x97/0x220
[  +0.000004]  ? __pfx_ptp_aux_kworker+0x10/0x10
[  +0.000002]  ? __pfx_kthread_worker_fn+0x10/0x10
[  +0.000001]  kthread+0xfb/0x230
[  +0.000002]  ? __pfx_kthread+0x10/0x10
[  +0.000002]  ret_from_fork+0x44/0x70
[  +0.000002]  ? __pfx_kthread+0x10/0x10
[  +0.000001]  ret_from_fork_asm+0x1a/0x30
[  +0.000004]  </TASK>

I know very little about how the kernel is designed, but my initial guess is that ice_ptp_process_ts may call ice_ptp_tx_tstamp_owner, which calls mutex_lock. In this case the mutex was already held, so __mutex_lock_slowpath tried to put the current task to sleep. However, all of this runs in interrupt context (the frames between <IRQ> and </IRQ> in the trace above), where sleeping is not allowed, so the kernel reports "BUG: scheduling while atomic".
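
To make that reading of the trace easier to check, here is a rough Python sketch that scans dmesg-style lines and lists calls that can sleep when they appear between <IRQ> and </IRQ>. The function name, the regular expression, and the SLEEPING_CALLS list are my own assumptions for illustration (the list is not exhaustive), and frames marked with "?" are skipped because the unwinder flags them as unreliable. The sample lines are copied from the trace above.

```python
import re

# Illustrative, non-exhaustive set of kernel symbols that may sleep (assumption).
SLEEPING_CALLS = {"mutex_lock", "schedule", "msleep", "down"}

# Matches a dmesg stack frame such as "[  +0.000002]  mutex_lock+0x3b/0x50".
# Group 1 is the optional "?" marker, group 2 is the symbol name.
FRAME_RE = re.compile(r"\]\s+(\??)\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def sleeping_calls_in_irq(lines):
    """Return sleeping calls found between <IRQ> and </IRQ>, in trace order."""
    in_irq = False
    hits = []
    for line in lines:
        if "</IRQ>" in line:
            in_irq = False
            continue
        if "<IRQ>" in line:
            in_irq = True
            continue
        if not in_irq:
            continue
        match = FRAME_RE.search(line)
        if match is None or match.group(1) == "?":
            continue  # not a frame, or an unreliable "?" frame
        if match.group(2) in SLEEPING_CALLS:
            hits.append(match.group(2))
    return hits


# A few lines taken from the trace in this report.
SAMPLE = [
    "[  +0.000001]  <IRQ>",
    "[  +0.000005]  schedule+0x29/0xd0",
    "[  +0.000001]  __mutex_lock.constprop.0+0x722/0x7b0",
    "[  +0.000002]  ? sched_clock+0x10/0x30",
    "[  +0.000002]  mutex_lock+0x3b/0x50",
    "[  +0.000002]  ice_ptp_process_ts+0x6b/0x1b0 [ice]",
    "[  +0.000002]  </IRQ>",
    "[  +0.000004]  kthread_worker_fn+0x97/0x220",
]

if __name__ == "__main__":
    print(sleeping_calls_in_irq(SAMPLE))  # prints ['schedule', 'mutex_lock']
```

Both hits sit inside the <IRQ> section, which matches the guess that mutex_lock was reached from the interrupt handler and then tried to schedule.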

Edit:
This is running on a Xeon D-1749NT CPU with built-in networking. The motherboard has 4x SFP28 ports; the one currently in use is running at 10 Gbps.
