Skip to content

i40e: Fix kernel crash from PTP work queue corruption during resets#33

Open
stephenchengCloud wants to merge 1 commit into
intel:mainfrom
stephenchengCloud:fix-kernel-crash-from-PTP-work-queue
Open

i40e: Fix kernel crash from PTP work queue corruption during resets#33
stephenchengCloud wants to merge 1 commit into
intel:mainfrom
stephenchengCloud:fix-kernel-crash-from-PTP-work-queue

Conversation

@stephenchengCloud

Copy link
Copy Markdown

The driver crashes with NULL pointer dereferences in process_one_work() when PTP is enabled and the device undergoes reset operations:

[ 189646.890619] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 189646.890632] INFO: PGD 0 P4D 0
[ 189646.890638] WARN: Oops: 0000 [#1] SMP NOPTI
[ 189646.890645] WARN: CPU: 14 PID: 2948135 Comm: kworker/14:3 Tainted: G O 4.19.0+1 #1
[ 189646.890654] WARN: Hardware name: FUJITSU PRIMERGY RX2530 M4/D3383-A1, BIOS V5.0.0.12 R1.64.0 for D3383-A1x 03/06/2025
[ 189646.890676] WARN: Workqueue: (null) (events)
[ 189646.890688] WARN: RIP: e030:process_one_work+0x2e/0x370
[ 189646.890694] WARN: Code: 00 41 57 41 56 41 55 45 31 ed 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 48 8b 06 4c 8b 67 40 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6 44
[ 189646.890711] WARN: RSP: e02b:ffffc9004ec9fe80 EFLAGS: 00010046
[ 189646.890717] WARN: RAX: 0000000fffffffe0 RBX: ffff8882b3f535d8 RCX: ffff8882bc7a4f60
[ 189646.890725] WARN: RDX: 0000000fffffff00 RSI: ffff8882b3f535d8 RDI: ffff88820469b680
[ 189646.890733] WARN: RBP: ffff88820469b680 R08: ffff8882b55a0a00 R09: 8080808080808080
[ 189646.890741] WARN: R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff8882bc7a10c0
[ 189646.890749] WARN: R13: 0000000000000000 R14: ffff88820469b680 R15: ffff88820469b6a8
[ 189646.890766] WARN: FS: 0000000000000000(0000) GS:ffff8882bc780000(0000) knlGS:0000000000000000
[ 189646.890774] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 189646.890781] WARN: CR2: 00000000000000b0 CR3: 00000002b67a2000 CR4: 0000000000040660
[ 189646.890794] WARN: Call Trace:
[ 189646.890801] WARN: worker_thread+0x49/0x3e0
[ 189646.890809] WARN: kthread+0xf8/0x130
[ 189646.890815] WARN: ? rescuer_thread+0x310/0x310
[ 189646.890821] WARN: ? kthread_bind+0x10/0x10
[ 189646.890829] WARN: ret_from_fork+0x35/0x40

Root Cause - INIT_WORK() Called Multiple Times:

The work structure ptp_extts0_work is initialized via INIT_WORK() in i40e_ptp_set_timestamp_mode(), which is called from i40e_ptp_init() during:

  1. Initial probe (i40e_probe -> i40e_setup_pf_switch -> i40e_ptp_init)
  2. Every rebuild after reset (i40e_rebuild -> i40e_setup_pf_switch -> i40e_ptp_init)

When a PF reset occurs, the following race corrupts the work_struct:

  1. i40e_rebuild() re-enables PTP interrupts via i40e_setup_misc_vector()

  2. PTP timestamp interrupt fires immediately

    • ISR schedules work: schedule_work(&pf->ptp_extts0_work)
    • Work is queued with valid list head pointers
  3. i40e_rebuild() continues -> i40e_ptp_init() is called - INIT_WORK(&pf->ptp_extts0_work, ...) overwrites list pointers - work_struct.entry.next = 0x0 (from INIT_WORK macro) - BUT work is still queued in workqueue!

  4. worker_thread processes queued work

    • Reads corrupted next pointer (0x0)
    • NULL pointer dereference in process_one_work()

The hardware does not require link or traffic for this crash. A PF reset alone is sufficient, as it re-enables interrupts before software reconfiguration completes, creating a window where interrupts can fire while INIT_WORK() is about to execute.

Primary Fix - Initialize Work Once:

Move INIT_WORK(&pf->ptp_extts0_work, i40e_ptp_extts0_work) from i40e_ptp_set_timestamp_mode() to i40e_probe(). This ensures the work_struct is initialized exactly once during driver lifetime, never while work is queued or running.

Additionally, add cancel_work_sync(&pf->ptp_extts0_work) at the beginning of i40e_ptp_init(). This safely drains any pending work before PTP hardware reconfiguration during rebuilds.

Secondary Fix - Avoid Race During Driver Removal:

Move i40e_ptp_stop() to after i40e_vsi_release() in i40e_remove(). Previously, i40e_ptp_stop() ran before the VSIs were closed, which left a window where a PTP interrupt could fire after the PTP clock was unregistered:

T0: ptp_clock_unregister() called
T1: pf->ptp_clock = NULL
T2: PTP interrupt fires (still enabled in hardware)
T3: ISR schedules ptp_extts0_work
T4: Worker runs -> ptp_clock_event(NULL, ...) -> potential crash

By moving i40e_ptp_stop() after the VSI release loop, i40e_vsi_close() -> i40e_down() disables the misc interrupt vector first, eliminating this race.

However, this reordering means that by the time i40e_ptp_stop() runs, all VSIs have already been released and pf->vsi[pf->lan_vsi] is NULL. The original "removed PHC from " log message dereferenced pf->vsi[pf->lan_vsi]->netdev->name, which would now cause its own NULL pointer dereference. Drop the interface name from the log message and print only "removed PHC", since the netdev is no longer accessible at this point in the cleanup sequence.

Additionally, in i40e_ptp_stop(), move the PTP interrupt disable and add cancel_work_sync() to the beginning of the function. This provides defense in depth: PTP-specific interrupts are explicitly disabled and any pending work is drained before unregistering the PTP clock, regardless of the surrounding cleanup ordering.

Testing:

Before the fix, one of our customer's kernel crashed repeatedly after running for approximately one day under normal production workload. After applying the fix, no crashes have been observed.

The driver crashes with NULL pointer dereferences in process_one_work()
when PTP is enabled and the device undergoes reset operations:

[ 189646.890619]  ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 189646.890632]   INFO: PGD 0 P4D 0
[ 189646.890638]   WARN: Oops: 0000 [intel#1] SMP NOPTI
[ 189646.890645]   WARN: CPU: 14 PID: 2948135 Comm: kworker/14:3 Tainted: G           O      4.19.0+1 intel#1
[ 189646.890654]   WARN: Hardware name: FUJITSU PRIMERGY RX2530 M4/D3383-A1, BIOS V5.0.0.12 R1.64.0 for D3383-A1x                    03/06/2025
[ 189646.890676]   WARN: Workqueue:            (null) (events)
[ 189646.890688]   WARN: RIP: e030:process_one_work+0x2e/0x370
[ 189646.890694]   WARN: Code: 00 41 57 41 56 41 55 45 31 ed 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 48 8b 06 4c 8b 67 40 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6 44
[ 189646.890711]   WARN: RSP: e02b:ffffc9004ec9fe80 EFLAGS: 00010046
[ 189646.890717]   WARN: RAX: 0000000fffffffe0 RBX: ffff8882b3f535d8 RCX: ffff8882bc7a4f60
[ 189646.890725]   WARN: RDX: 0000000fffffff00 RSI: ffff8882b3f535d8 RDI: ffff88820469b680
[ 189646.890733]   WARN: RBP: ffff88820469b680 R08: ffff8882b55a0a00 R09: 8080808080808080
[ 189646.890741]   WARN: R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff8882bc7a10c0
[ 189646.890749]   WARN: R13: 0000000000000000 R14: ffff88820469b680 R15: ffff88820469b6a8
[ 189646.890766]   WARN: FS:  0000000000000000(0000) GS:ffff8882bc780000(0000) knlGS:0000000000000000
[ 189646.890774]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 189646.890781]   WARN: CR2: 00000000000000b0 CR3: 00000002b67a2000 CR4: 0000000000040660
[ 189646.890794]   WARN: Call Trace:
[ 189646.890801]   WARN:  worker_thread+0x49/0x3e0
[ 189646.890809]   WARN:  kthread+0xf8/0x130
[ 189646.890815]   WARN:  ? rescuer_thread+0x310/0x310
[ 189646.890821]   WARN:  ? kthread_bind+0x10/0x10
[ 189646.890829]   WARN:  ret_from_fork+0x35/0x40

Root Cause - INIT_WORK() Called Multiple Times:
-----------------------------------------------
The work structure ptp_extts0_work is initialized via INIT_WORK() in
i40e_ptp_set_timestamp_mode(), which is called from i40e_ptp_init()
during:
  1. Initial probe (i40e_probe -> i40e_setup_pf_switch -> i40e_ptp_init)
  2. Every rebuild after reset (i40e_rebuild -> i40e_setup_pf_switch
     -> i40e_ptp_init)

When a PF reset occurs, the following race corrupts the work_struct:

  1. i40e_rebuild() re-enables PTP interrupts via
     i40e_setup_misc_vector()

  2. PTP timestamp interrupt fires immediately
     - ISR schedules work: schedule_work(&pf->ptp_extts0_work)
     - Work is queued with valid list head pointers

  3. i40e_rebuild() continues -> i40e_ptp_init() is called
     - INIT_WORK(&pf->ptp_extts0_work, ...) overwrites list pointers
     - work_struct.entry.next = 0x0 (from INIT_WORK macro)
     - BUT work is still queued in workqueue!

  4. worker_thread processes queued work
     - Reads corrupted next pointer (0x0)
     - NULL pointer dereference in process_one_work()

The hardware does not require link or traffic for this crash. A PF
reset alone is sufficient, as it re-enables interrupts before software
reconfiguration completes, creating a window where interrupts can fire
while INIT_WORK() is about to execute.

Primary Fix - Initialize Work Once:
------------------------------------
Move INIT_WORK(&pf->ptp_extts0_work, i40e_ptp_extts0_work) from
i40e_ptp_set_timestamp_mode() to i40e_probe(). This ensures the
work_struct is initialized exactly once during driver lifetime, never
while work is queued or running.

Additionally, add cancel_work_sync(&pf->ptp_extts0_work) at the
beginning of i40e_ptp_init(). This safely drains any pending work
before PTP hardware reconfiguration during rebuilds.

Secondary Fix - Avoid Race During Driver Removal:
-------------------------------------------------
Move i40e_ptp_stop() to after i40e_vsi_release() in i40e_remove().
Previously, i40e_ptp_stop() ran before the VSIs were closed, which
left a window where a PTP interrupt could fire after the PTP clock
was unregistered:

  T0: ptp_clock_unregister() called
  T1: pf->ptp_clock = NULL
  T2: PTP interrupt fires (still enabled in hardware)
  T3: ISR schedules ptp_extts0_work
  T4: Worker runs -> ptp_clock_event(NULL, ...) -> potential crash

By moving i40e_ptp_stop() after the VSI release loop, i40e_vsi_close()
-> i40e_down() disables the misc interrupt vector first, eliminating
this race.

However, this reordering means that by the time i40e_ptp_stop() runs,
all VSIs have already been released and pf->vsi[pf->lan_vsi] is NULL.
The original "removed PHC from <interface>" log message dereferenced
pf->vsi[pf->lan_vsi]->netdev->name, which would now cause its own
NULL pointer dereference. Drop the interface name from the log message
and print only "removed PHC", since the netdev is no longer accessible
at this point in the cleanup sequence.

Additionally, in i40e_ptp_stop(), move the PTP interrupt disable and
add cancel_work_sync() to the beginning of the function. This provides
defense in depth: PTP-specific interrupts are explicitly disabled and
any pending work is drained before unregistering the PTP clock,
regardless of the surrounding cleanup ordering.

Testing:
--------
Before the fix, one of our customer's kernel crashed repeatedly
after running for approximately one day under normal production workload.
After applying the fix, no crashes have been observed.

Signed-off-by: Stephen Cheng <stephen.cheng@citrix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant