i40e: Fix kernel crash from PTP work queue corruption during resets#33
Open
stephenchengCloud wants to merge 1 commit into
Open
i40e: Fix kernel crash from PTP work queue corruption during resets#33stephenchengCloud wants to merge 1 commit into
stephenchengCloud wants to merge 1 commit into
Conversation
The driver crashes with NULL pointer dereferences in process_one_work() when PTP is enabled and the device undergoes reset operations: [ 189646.890619] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 189646.890632] INFO: PGD 0 P4D 0 [ 189646.890638] WARN: Oops: 0000 [intel#1] SMP NOPTI [ 189646.890645] WARN: CPU: 14 PID: 2948135 Comm: kworker/14:3 Tainted: G O 4.19.0+1 intel#1 [ 189646.890654] WARN: Hardware name: FUJITSU PRIMERGY RX2530 M4/D3383-A1, BIOS V5.0.0.12 R1.64.0 for D3383-A1x 03/06/2025 [ 189646.890676] WARN: Workqueue: (null) (events) [ 189646.890688] WARN: RIP: e030:process_one_work+0x2e/0x370 [ 189646.890694] WARN: Code: 00 41 57 41 56 41 55 45 31 ed 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 48 8b 06 4c 8b 67 40 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6 44 [ 189646.890711] WARN: RSP: e02b:ffffc9004ec9fe80 EFLAGS: 00010046 [ 189646.890717] WARN: RAX: 0000000fffffffe0 RBX: ffff8882b3f535d8 RCX: ffff8882bc7a4f60 [ 189646.890725] WARN: RDX: 0000000fffffff00 RSI: ffff8882b3f535d8 RDI: ffff88820469b680 [ 189646.890733] WARN: RBP: ffff88820469b680 R08: ffff8882b55a0a00 R09: 8080808080808080 [ 189646.890741] WARN: R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff8882bc7a10c0 [ 189646.890749] WARN: R13: 0000000000000000 R14: ffff88820469b680 R15: ffff88820469b6a8 [ 189646.890766] WARN: FS: 0000000000000000(0000) GS:ffff8882bc780000(0000) knlGS:0000000000000000 [ 189646.890774] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 189646.890781] WARN: CR2: 00000000000000b0 CR3: 00000002b67a2000 CR4: 0000000000040660 [ 189646.890794] WARN: Call Trace: [ 189646.890801] WARN: worker_thread+0x49/0x3e0 [ 189646.890809] WARN: kthread+0xf8/0x130 [ 189646.890815] WARN: ? rescuer_thread+0x310/0x310 [ 189646.890821] WARN: ? kthread_bind+0x10/0x10 [ 189646.890829] WARN: ret_from_fork+0x35/0x40 Root Cause - INIT_WORK() Called Multiple Times: ----------------------------------------------- The work structure ptp_extts0_work is initialized via INIT_WORK() in i40e_ptp_set_timestamp_mode(), which is called from i40e_ptp_init() during: 1. Initial probe (i40e_probe -> i40e_setup_pf_switch -> i40e_ptp_init) 2. Every rebuild after reset (i40e_rebuild -> i40e_setup_pf_switch -> i40e_ptp_init) When a PF reset occurs, the following race corrupts the work_struct: 1. i40e_rebuild() re-enables PTP interrupts via i40e_setup_misc_vector() 2. PTP timestamp interrupt fires immediately - ISR schedules work: schedule_work(&pf->ptp_extts0_work) - Work is queued with valid list head pointers 3. i40e_rebuild() continues -> i40e_ptp_init() is called - INIT_WORK(&pf->ptp_extts0_work, ...) overwrites list pointers - work_struct.entry.next = 0x0 (from INIT_WORK macro) - BUT work is still queued in workqueue! 4. worker_thread processes queued work - Reads corrupted next pointer (0x0) - NULL pointer dereference in process_one_work() The hardware does not require link or traffic for this crash. A PF reset alone is sufficient, as it re-enables interrupts before software reconfiguration completes, creating a window where interrupts can fire while INIT_WORK() is about to execute. Primary Fix - Initialize Work Once: ------------------------------------ Move INIT_WORK(&pf->ptp_extts0_work, i40e_ptp_extts0_work) from i40e_ptp_set_timestamp_mode() to i40e_probe(). This ensures the work_struct is initialized exactly once during driver lifetime, never while work is queued or running. Additionally, add cancel_work_sync(&pf->ptp_extts0_work) at the beginning of i40e_ptp_init(). This safely drains any pending work before PTP hardware reconfiguration during rebuilds. Secondary Fix - Avoid Race During Driver Removal: ------------------------------------------------- Move i40e_ptp_stop() to after i40e_vsi_release() in i40e_remove(). Previously, i40e_ptp_stop() ran before the VSIs were closed, which left a window where a PTP interrupt could fire after the PTP clock was unregistered: T0: ptp_clock_unregister() called T1: pf->ptp_clock = NULL T2: PTP interrupt fires (still enabled in hardware) T3: ISR schedules ptp_extts0_work T4: Worker runs -> ptp_clock_event(NULL, ...) -> potential crash By moving i40e_ptp_stop() after the VSI release loop, i40e_vsi_close() -> i40e_down() disables the misc interrupt vector first, eliminating this race. However, this reordering means that by the time i40e_ptp_stop() runs, all VSIs have already been released and pf->vsi[pf->lan_vsi] is NULL. The original "removed PHC from <interface>" log message dereferenced pf->vsi[pf->lan_vsi]->netdev->name, which would now cause its own NULL pointer dereference. Drop the interface name from the log message and print only "removed PHC", since the netdev is no longer accessible at this point in the cleanup sequence. Additionally, in i40e_ptp_stop(), move the PTP interrupt disable and add cancel_work_sync() to the beginning of the function. This provides defense in depth: PTP-specific interrupts are explicitly disabled and any pending work is drained before unregistering the PTP clock, regardless of the surrounding cleanup ordering. Testing: -------- Before the fix, one of our customer's kernel crashed repeatedly after running for approximately one day under normal production workload. After applying the fix, no crashes have been observed. Signed-off-by: Stephen Cheng <stephen.cheng@citrix.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The driver crashes with NULL pointer dereferences in process_one_work() when PTP is enabled and the device undergoes reset operations:
[ 189646.890619] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 189646.890632] INFO: PGD 0 P4D 0
[ 189646.890638] WARN: Oops: 0000 [#1] SMP NOPTI
[ 189646.890645] WARN: CPU: 14 PID: 2948135 Comm: kworker/14:3 Tainted: G O 4.19.0+1 #1
[ 189646.890654] WARN: Hardware name: FUJITSU PRIMERGY RX2530 M4/D3383-A1, BIOS V5.0.0.12 R1.64.0 for D3383-A1x 03/06/2025
[ 189646.890676] WARN: Workqueue: (null) (events)
[ 189646.890688] WARN: RIP: e030:process_one_work+0x2e/0x370
[ 189646.890694] WARN: Code: 00 41 57 41 56 41 55 45 31 ed 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 48 8b 06 4c 8b 67 40 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6 44
[ 189646.890711] WARN: RSP: e02b:ffffc9004ec9fe80 EFLAGS: 00010046
[ 189646.890717] WARN: RAX: 0000000fffffffe0 RBX: ffff8882b3f535d8 RCX: ffff8882bc7a4f60
[ 189646.890725] WARN: RDX: 0000000fffffff00 RSI: ffff8882b3f535d8 RDI: ffff88820469b680
[ 189646.890733] WARN: RBP: ffff88820469b680 R08: ffff8882b55a0a00 R09: 8080808080808080
[ 189646.890741] WARN: R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff8882bc7a10c0
[ 189646.890749] WARN: R13: 0000000000000000 R14: ffff88820469b680 R15: ffff88820469b6a8
[ 189646.890766] WARN: FS: 0000000000000000(0000) GS:ffff8882bc780000(0000) knlGS:0000000000000000
[ 189646.890774] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 189646.890781] WARN: CR2: 00000000000000b0 CR3: 00000002b67a2000 CR4: 0000000000040660
[ 189646.890794] WARN: Call Trace:
[ 189646.890801] WARN: worker_thread+0x49/0x3e0
[ 189646.890809] WARN: kthread+0xf8/0x130
[ 189646.890815] WARN: ? rescuer_thread+0x310/0x310
[ 189646.890821] WARN: ? kthread_bind+0x10/0x10
[ 189646.890829] WARN: ret_from_fork+0x35/0x40
Root Cause - INIT_WORK() Called Multiple Times:
The work structure ptp_extts0_work is initialized via INIT_WORK() in i40e_ptp_set_timestamp_mode(), which is called from i40e_ptp_init() during:
When a PF reset occurs, the following race corrupts the work_struct:
i40e_rebuild() re-enables PTP interrupts via i40e_setup_misc_vector()
PTP timestamp interrupt fires immediately
i40e_rebuild() continues -> i40e_ptp_init() is called - INIT_WORK(&pf->ptp_extts0_work, ...) overwrites list pointers - work_struct.entry.next = 0x0 (from INIT_WORK macro) - BUT work is still queued in workqueue!
worker_thread processes queued work
The hardware does not require link or traffic for this crash. A PF reset alone is sufficient, as it re-enables interrupts before software reconfiguration completes, creating a window where interrupts can fire while INIT_WORK() is about to execute.
Primary Fix - Initialize Work Once:
Move INIT_WORK(&pf->ptp_extts0_work, i40e_ptp_extts0_work) from i40e_ptp_set_timestamp_mode() to i40e_probe(). This ensures the work_struct is initialized exactly once during driver lifetime, never while work is queued or running.
Additionally, add cancel_work_sync(&pf->ptp_extts0_work) at the beginning of i40e_ptp_init(). This safely drains any pending work before PTP hardware reconfiguration during rebuilds.
Secondary Fix - Avoid Race During Driver Removal:
Move i40e_ptp_stop() to after i40e_vsi_release() in i40e_remove(). Previously, i40e_ptp_stop() ran before the VSIs were closed, which left a window where a PTP interrupt could fire after the PTP clock was unregistered:
T0: ptp_clock_unregister() called
T1: pf->ptp_clock = NULL
T2: PTP interrupt fires (still enabled in hardware)
T3: ISR schedules ptp_extts0_work
T4: Worker runs -> ptp_clock_event(NULL, ...) -> potential crash
By moving i40e_ptp_stop() after the VSI release loop, i40e_vsi_close() -> i40e_down() disables the misc interrupt vector first, eliminating this race.
However, this reordering means that by the time i40e_ptp_stop() runs, all VSIs have already been released and pf->vsi[pf->lan_vsi] is NULL. The original "removed PHC from " log message dereferenced pf->vsi[pf->lan_vsi]->netdev->name, which would now cause its own NULL pointer dereference. Drop the interface name from the log message and print only "removed PHC", since the netdev is no longer accessible at this point in the cleanup sequence.
Additionally, in i40e_ptp_stop(), move the PTP interrupt disable and add cancel_work_sync() to the beginning of the function. This provides defense in depth: PTP-specific interrupts are explicitly disabled and any pending work is drained before unregistering the PTP clock, regardless of the surrounding cleanup ordering.
Testing:
Before the fix, one of our customer's kernel crashed repeatedly after running for approximately one day under normal production workload. After applying the fix, no crashes have been observed.