Commit 83214a7

kaihuang authored and hansendc committed
x86/sme: Use percpu boolean to control WBINVD during kexec
TL;DR: Prepare to unify how TDX and SME do cache flushing during kexec by making a percpu boolean control whether to do the WBINVD.

-- Background --

On SME platforms, dirty cacheline aliases with and without the encryption bit can coexist, and the CPU can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel, otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to the different encryption property. TDX also needs a cache flush during kexec for the same reason. It would be good to have a generic way to flush the cache instead of scattering checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory, including the kernel itself, and a simple memory write from the kernel could dirty cachelines. Currently, the kernel uses WBINVD to flush the cache for SME during kexec in two places:

1) in stop_this_cpu(), for all remote CPUs, when the kexec-ing CPU stops them;
2) in relocate_kernel(), where the kexec-ing CPU jumps to the new kernel.

-- Solution --

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when SEAMCALLs are performed). Since there are no more SEAMCALLs after the aforementioned WBINVDs, leverage this for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate that the cache may be in an incoherent state and needs flushing during kexec, and set the boolean for SME. TDX can then leverage it.

While SME could use a global flag (since it is enabled at early boot and enabled on all CPUs), the percpu flag fits TDX better: the percpu flag can be set when a CPU makes a SEAMCALL, and cleared when another WBINVD on that CPU obviates the need for a kexec-time WBINVD. Avoiding the kexec-time WBINVD is valuable, because there is an existing race[*] where kexec could proceed while another CPU is active. WBINVD could make this race worse, so it is worth skipping it when possible.

-- Side effect to SME --

Today the first WBINVD in stop_this_cpu() is performed when SME is *supported* by the platform, while the second WBINVD in relocate_kernel() is done when SME is *activated* by the kernel. Make things simple by doing the second WBINVD also when the platform supports SME. This allows the kernel to simply turn on the percpu boolean when bringing up a CPU, by checking whether the platform supports SME.

No other functional change intended.

[*] The aforementioned race:

During kexec, native_stop_other_cpus() is called to stop all remote CPUs before jumping to the new kernel. native_stop_other_cpus() first sends normal REBOOT-vector IPIs to stop the remote CPUs and waits for them to stop. If that times out, it sends NMIs to the CPUs that are still alive. The race happens when native_stop_other_cpus() has to send NMIs, and it could potentially result in a system hang (for more information please see [1]).

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [1]
Link: https://lore.kernel.org/all/20250901160930.1785244-3-pbonzini%40redhat.com
1 parent 744b02f

6 files changed: 52 additions, 22 deletions
arch/x86/include/asm/kexec.h (2 additions, 2 deletions)

@@ -17,8 +17,8 @@

 #include <linux/bits.h>

-#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
-#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)
+#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
+#define RELOC_KERNEL_CACHE_INCOHERENT		BIT(1)

 #endif

arch/x86/include/asm/processor.h (2 additions)

@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);

+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
arch/x86/kernel/cpu/amd.c (17 additions)

@@ -545,6 +545,23 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;

+	/*
+	 * Mark using WBINVD is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because with mem_encrypt=off the
+	 * BSP will clear the X86_FEATURE_SME bit and the APs will not
+	 * see it set after that.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
arch/x86/kernel/machine_kexec_64.c (10 additions, 4 deletions)

@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>

 #ifdef CONFIG_ACPI
 /*
@@ -426,11 +427,11 @@ void __nocfi machine_kexec(struct kimage *image)
 		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;

 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments() since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
-		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;
+	if (this_cpu_read(cache_state_incoherent))
+		relocate_kernel_flags |= RELOC_KERNEL_CACHE_INCOHERENT;

 	/*
 	 * The segment registers are funny things, they have both a
@@ -441,6 +442,11 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0. Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
 	 */
 	load_segments();
arch/x86/kernel/process.c (11 additions, 13 deletions)

@@ -88,6 +88,16 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);

+/*
+ * The cache may be in an incoherent state and needs flushing during kexec.
+ * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
+ * encryption bit(s) can coexist and the cache needs to be flushed before
+ * booting to the new kernel to avoid the silent memory corruption due to
+ * dirty cachelines with different encryption property being written back
+ * to the memory.
+ */
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +837,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);

-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();

 	/*

arch/x86/kernel/relocate_kernel_64.S

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -198,14 +198,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
198198
movq %r9, %cr3
199199

200200
/*
201+
* If the memory cache is in incoherent state, e.g., due to
202+
* memory encryption, do WBINVD to flush cache.
203+
*
201204
* If SME is active, there could be old encrypted cache line
202205
* entries that will conflict with the now unencrypted memory
203206
* used by kexec. Flush the caches before copying the kernel.
207+
*
208+
* Note SME sets this flag to true when the platform supports
209+
* SME, so the WBINVD is performed even SME is not activated
210+
* by the kernel. But this has no harm.
204211
*/
205-
testb $RELOC_KERNEL_HOST_MEM_ENC_ACTIVE, %r11b
206-
jz .Lsme_off
212+
testb $RELOC_KERNEL_CACHE_INCOHERENT, %r11b
213+
jz .Lnowbinvd
207214
wbinvd
208-
.Lsme_off:
215+
.Lnowbinvd:
209216

210217
call swap_pages
211218
