
Commit 15bfba1

ryanhrob authored and ctmarinas committed
arm64: mm: Handle invalid large leaf mappings correctly
It has been possible for a long time to mark ptes in the linear map as invalid. This is done for secretmem, kfence, realm dma memory un/share, and others, by simply clearing the PTE_VALID bit. But until commit a166563 ("arm64: mm: support large block mapping when rodata=full"), large leaf mappings were never made invalid in this way. It turns out various parts of the code base are not equipped to handle invalid large leaf mappings (in the way they are currently encoded) and I've observed a kernel panic while booting a realm guest on a BBML2_NOABORT system as a result:

[   15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
[   15.476896] Unable to handle kernel paging request at virtual address ffff000019600000
[   15.513762] Mem abort info:
[   15.527245]   ESR = 0x0000000096000046
[   15.548553]   EC = 0x25: DABT (current EL), IL = 32 bits
[   15.572146]   SET = 0, FnV = 0
[   15.592141]   EA = 0, S1PTW = 0
[   15.612694]   FSC = 0x06: level 2 translation fault
[   15.640644] Data abort info:
[   15.661983]   ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
[   15.694875]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[   15.723740]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000
[   15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704
[   15.855046] Internal error: Oops: 0000000096000046 [#1] SMP
[   15.886394] Modules linked in:
[   15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT
[   15.935258] Hardware name: linux,dummy-virt (DT)
[   15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   15.986009] pc : __pi_memcpy_generic+0x128/0x22c
[   16.006163] lr : swiotlb_bounce+0xf4/0x158
[   16.024145] sp : ffff80008000b8f0
[   16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000
[   16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000
[   16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0
[   16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc
[   16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974
[   16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[   16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018
[   16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000
[   16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000
[   16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000
[   16.349071] Call trace:
[   16.360143]  __pi_memcpy_generic+0x128/0x22c (P)
[   16.380310]  swiotlb_tbl_map_single+0x154/0x2b4
[   16.400282]  swiotlb_map+0x5c/0x228
[   16.415984]  dma_map_phys+0x244/0x2b8
[   16.432199]  dma_map_page_attrs+0x44/0x58
[   16.449782]  virtqueue_map_page_attrs+0x38/0x44
[   16.469596]  virtqueue_map_single_attrs+0xc0/0x130
[   16.490509]  virtnet_rq_alloc.isra.0+0xa4/0x1fc
[   16.510355]  try_fill_recv+0x2a4/0x584
[   16.526989]  virtnet_open+0xd4/0x238
[   16.542775]  __dev_open+0x110/0x24c
[   16.558280]  __dev_change_flags+0x194/0x20c
[   16.576879]  netif_change_flags+0x24/0x6c
[   16.594489]  dev_change_flags+0x48/0x7c
[   16.611462]  ip_auto_config+0x258/0x1114
[   16.628727]  do_one_initcall+0x80/0x1c8
[   16.645590]  kernel_init_freeable+0x208/0x2f0
[   16.664917]  kernel_init+0x24/0x1e0
[   16.680295]  ret_from_fork+0x10/0x20
[   16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c)
[   16.723106] ---[ end trace 0000000000000000 ]---
[   16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000
[   16.818966] PHYS_OFFSET: 0xfff1000080000000
[   16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f
[   16.862904] Memory Limit: none
[   16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

This panic occurs because the swiotlb memory was previously shared to the host (__set_memory_enc_dec()), which involves transitioning the (large) leaf mappings to invalid, sharing to the host, then marking the mappings valid again. But pageattr_p[mu]d_entry() would only update the entry if it was a section mapping, since otherwise it concluded it must be a table entry that shouldn't be modified. And p[mu]d_sect() only returns true if the entry is valid. So the result was that the large leaf entry was made invalid in the first pass then ignored in the second pass. It remains invalid until the above code tries to access it and blows up.

The simple fix would be to update pageattr_pmd_entry() to use !pmd_table() instead of pmd_sect(). That would solve this problem. But the ptdump code also suffers from a similar issue: it checks pmd_leaf() and doesn't call into the arch-specific note_page() machinery if it returns false. As a result, ptdump wasn't even able to show the invalid large leaf mappings; they looked like they were valid, which made this super fun to debug. The ptdump code is core-mm and pmd_table() is arm64-specific, so we can't use the same trick to solve that. But we already support the concept of "present-invalid" for user space entries. And even better, pmd_leaf() will return true for a leaf mapping that is marked present-invalid. So let's just use that encoding for present-invalid kernel mappings too. Then we can use pmd_leaf() where we previously used pmd_sect() and everything is magically fixed.

Additionally, from inspection, kernel_page_present() was broken in a similar way, so I'm also updating that to use pmd_leaf().

The transitional page tables component was also similarly broken; it creates a copy of the kernel page tables, making RO leaf mappings RW in the process. It also makes invalid (but-not-none) pte mappings valid. But it was not doing this for large leaf mappings. This could have resulted in crashes at kexec or hibernate time. This code is fixed to flip "present-invalid" mappings back to "present-valid" at all levels.

Finally, I have hardened split_pmd()/split_pud() so that if either is passed a "present-invalid" leaf, it will maintain that property in the split leaves, since I wasn't able to convince myself that it would only ever be called for "present-valid" leaves.

Fixes: a166563 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
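To see why the two passes in __set_memory_enc_dec() went wrong, here is a minimal stand-alone model of the descriptor checks involved. It is an illustration only, not part of the commit; the bit positions are assumptions taken from the commit text (VALID at bit 0, the table bit at bit 1, the nG bit reused as PTE_PRESENT_INVALID at bit 11). It shows that clearing VALID makes a pmd_sect()-style check stop matching, while a pmd_leaf()-style check that understands present-invalid still matches:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Assumed bit layout, mirroring the arm64 definitions for illustration only. */
#define VALID_BIT        (1UL << 0)               /* PTE_VALID */
#define TABLE_BIT        (1UL << 1)               /* distinguishes table from block */
#define TYPE_MASK        (VALID_BIT | TABLE_BIT)  /* PMD_TYPE_MASK */
#define TYPE_SECT        VALID_BIT                /* PMD_TYPE_SECT */
#define TYPE_TABLE       (VALID_BIT | TABLE_BIT)  /* PMD_TYPE_TABLE */
#define PRESENT_INVALID  (1UL << 11)              /* nG reused when !VALID */

static bool pmd_sect(uint64_t pmd)  { return (pmd & TYPE_MASK) == TYPE_SECT; }
static bool pmd_table(uint64_t pmd) { return (pmd & TYPE_MASK) == TYPE_TABLE; }
static bool pmd_present(uint64_t pmd)
{
	return (pmd & VALID_BIT) || (pmd & PRESENT_INVALID);
}
static bool pmd_leaf(uint64_t pmd)  { return pmd_present(pmd) && !pmd_table(pmd); }

int main(void)
{
	uint64_t block = 0x199600000UL | TYPE_SECT;                 /* illustrative valid block */
	uint64_t invalid = (block & ~VALID_BIT) | PRESENT_INVALID;  /* first pass: made invalid */

	printf("valid block:   sect=%d leaf=%d\n", pmd_sect(block), pmd_leaf(block));
	printf("invalid block: sect=%d leaf=%d\n", pmd_sect(invalid), pmd_leaf(invalid));
	/* sect=0 for the invalid block is why pageattr_pmd_entry() skipped it in the
	 * second pass; leaf=1 is why switching to pmd_leaf() fixes it. */
	return 0;
}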
1 parent f12b435 commit 15bfba1

5 files changed

Lines changed: 48 additions & 59 deletions


arch/arm64/include/asm/pgtable-prot.h

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,8 @@
  */
 #define PTE_PRESENT_INVALID	(PTE_NG)		/* only when !PTE_VALID */
 
+#define PTE_PRESENT_VALID_KERNEL	(PTE_VALID | PTE_MAYBE_NG)
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define PTE_UFFD_WP		(_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
 #define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 3)	 /* only for swp ptes */
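Because PTE_PRESENT_INVALID above reuses the nG bit, and PTE_PRESENT_VALID_KERNEL may also include nG (via PTE_MAYBE_NG), the set and clear masks used by the page-attribute walker can alias each other. A small sketch of why the walker must clear before it sets; this is illustrative only and assumes a configuration where kernel mappings really are nG (KPTI-style), so PTE_MAYBE_NG resolves to the nG bit:

#include <stdint.h>
#include <assert.h>

#define NG_BIT           (1UL << 11)
#define VALID_BIT        (1UL << 0)
#define PRESENT_INVALID  NG_BIT               /* aliases nG, only meaningful when !VALID */
#define VALID_KERNEL     (VALID_BIT | NG_BIT) /* assumed: kernel mappings are nG here */

/* Clear first, then set: the order set_pageattr_masks() relies on. */
static uint64_t apply_clear_then_set(uint64_t val, uint64_t set, uint64_t clear)
{
	val &= ~clear;
	val |= set;
	return val;
}

/* The wrong order strips the nG bit that the set mask just installed. */
static uint64_t apply_set_then_clear(uint64_t val, uint64_t set, uint64_t clear)
{
	val |= set;
	val &= ~clear;
	return val;
}

int main(void)
{
	uint64_t pte = PRESENT_INVALID;	/* a present-invalid kernel mapping */

	/* Make it valid again: set VALID_KERNEL, clear PRESENT_INVALID. */
	assert(apply_clear_then_set(pte, VALID_KERNEL, PRESENT_INVALID) == VALID_KERNEL);
	assert(apply_set_then_clear(pte, VALID_KERNEL, PRESENT_INVALID) == VALID_BIT); /* nG lost */
	return 0;
}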

arch/arm64/include/asm/pgtable.h

Lines changed: 7 additions & 2 deletions
@@ -322,9 +322,11 @@ static inline pte_t pte_mknoncont(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(PTE_CONT));
 }
 
-static inline pte_t pte_mkvalid(pte_t pte)
+static inline pte_t pte_mkvalid_k(pte_t pte)
 {
-	return set_pte_bit(pte, __pgprot(PTE_VALID));
+	pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
+	pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL));
+	return pte;
 }
 
 static inline pte_t pte_mkinvalid(pte_t pte)
@@ -594,6 +596,7 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#define pmd_mkvalid_k(pmd)	pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd)	pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define pmd_uffd_wp(pmd)	pte_uffd_wp(pmd_pte(pmd))
@@ -635,6 +638,8 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd)
 
 #define pud_young(pud)		pte_young(pud_pte(pud))
 #define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite_novma(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkvalid_k(pud)	pte_pud(pte_mkvalid_k(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
 static inline pud_t pud_mkhuge(pud_t pud)

arch/arm64/mm/mmu.c

Lines changed: 4 additions & 0 deletions
@@ -602,6 +602,8 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
 		tableprot |= PMD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE);
+	if (!pmd_valid(pmd))
+		prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
@@ -647,6 +649,8 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
 		tableprot |= PUD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
+	if (!pud_valid(pud))
+		prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
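The two-line guards added to split_pmd()/split_pud() above preserve the present-invalid property across a split: if the block being split is present-invalid, the smaller entries it is replaced with stay present-invalid rather than silently becoming valid. A rough stand-alone sketch of the idea, using the same assumed bit layout as before (not the kernel code itself):

#include <stdint.h>
#include <assert.h>
#include <stdbool.h>

#define VALID_BIT        (1UL << 0)
#define PRESENT_INVALID  (1UL << 11)	/* nG reused when !VALID */

static bool valid(uint64_t desc) { return desc & VALID_BIT; }

/* Sketch of pte_mkinvalid(): drop VALID, record present-invalid. */
static uint64_t mkinvalid(uint64_t prot)
{
	return (prot & ~VALID_BIT) | PRESENT_INVALID;
}

/* Sketch of the relevant part of split_pmd(): derive the child prot from the
 * parent block, preserving present-invalid if the parent was invalid. */
static uint64_t child_prot(uint64_t parent)
{
	uint64_t prot = parent | VALID_BIT;	/* the page/sect type implies VALID */

	if (!valid(parent))
		prot = mkinvalid(prot);
	return prot;
}

int main(void)
{
	uint64_t valid_block = VALID_BIT | (1UL << 10);	/* e.g. AF also set */
	uint64_t invalid_block = mkinvalid(valid_block);

	assert(valid(child_prot(valid_block)));
	assert(!valid(child_prot(invalid_block)));	/* children stay present-invalid */
	return 0;
}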

arch/arm64/mm/pageattr.c

Lines changed: 28 additions & 22 deletions
@@ -25,6 +25,11 @@ static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
 {
 	struct page_change_data *masks = walk->private;
 
+	/*
+	 * Some users clear and set bits which alias each other (e.g. PTE_NG and
+	 * PTE_PRESENT_INVALID). It is therefore important that we always clear
+	 * first then set.
+	 */
 	val &= ~(pgprot_val(masks->clear_mask));
 	val |= (pgprot_val(masks->set_mask));
 
@@ -36,7 +41,7 @@ static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
 {
 	pud_t val = pudp_get(pud);
 
-	if (pud_sect(val)) {
+	if (pud_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
 			return -EINVAL;
 		val = __pud(set_pageattr_masks(pud_val(val), walk));
@@ -52,7 +57,7 @@ static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
 {
 	pmd_t val = pmdp_get(pmd);
 
-	if (pmd_sect(val)) {
+	if (pmd_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
 			return -EINVAL;
 		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
@@ -132,11 +137,12 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	ret = update_range_prot(start, size, set_mask, clear_mask);
 
 	/*
-	 * If the memory is being made valid without changing any other bits
-	 * then a TLBI isn't required as a non-valid entry cannot be cached in
-	 * the TLB.
+	 * If the memory is being switched from present-invalid to valid without
+	 * changing any other bits then a TLBI isn't required as a non-valid
+	 * entry cannot be cached in the TLB.
 	 */
-	if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
+	if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL ||
+	    pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
 		flush_tlb_kernel_range(start, start + size);
 	return ret;
 }
@@ -237,18 +243,18 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 {
 	if (enable)
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(PTE_VALID),
-					__pgprot(0));
+					__pgprot(PTE_PRESENT_VALID_KERNEL),
+					__pgprot(PTE_PRESENT_INVALID));
 	else
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(0),
-					__pgprot(PTE_VALID));
+					__pgprot(PTE_PRESENT_INVALID),
+					__pgprot(PTE_PRESENT_VALID_KERNEL));
 }
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
-	pgprot_t clear_mask = __pgprot(PTE_VALID);
-	pgprot_t set_mask = __pgprot(0);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -259,8 +265,8 @@ int set_direct_map_invalid_noflush(struct page *page)
 
 int set_direct_map_default_noflush(struct page *page)
 {
-	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
-	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -296,8 +302,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 	 * entries or Synchronous External Aborts caused by RIPAS_EMPTY
 	 */
 	ret = __change_memory_common(addr, PAGE_SIZE * numpages,
-				     __pgprot(set_prot),
-				     __pgprot(clear_prot | PTE_VALID));
+				     __pgprot(set_prot | PTE_PRESENT_INVALID),
+				     __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL));
 
 	if (ret)
 		return ret;
@@ -311,8 +317,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 		return ret;
 
 	return __change_memory_common(addr, PAGE_SIZE * numpages,
-				      __pgprot(PTE_VALID),
-				      __pgprot(0));
+				      __pgprot(PTE_PRESENT_VALID_KERNEL),
+				      __pgprot(PTE_PRESENT_INVALID));
 }
 
 static int realm_set_memory_encrypted(unsigned long addr, int numpages)
@@ -404,15 +410,15 @@ bool kernel_page_present(struct page *page)
 	pud = READ_ONCE(*pudp);
 	if (pud_none(pud))
 		return false;
-	if (pud_sect(pud))
-		return true;
+	if (pud_leaf(pud))
+		return pud_valid(pud);
 
 	pmdp = pmd_offset(pudp, addr);
 	pmd = READ_ONCE(*pmdp);
 	if (pmd_none(pmd))
 		return false;
-	if (pmd_sect(pmd))
-		return true;
+	if (pmd_leaf(pmd))
+		return pmd_valid(pmd);
 
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(__ptep_get(ptep));
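For the kernel_page_present() change at the end of this file, the new semantics can be modelled the same way: at a large leaf the walk now reports the entry's own validity instead of unconditionally treating any section as present (the old pmd_sect()-style check would not match a present-invalid block at all and the walk would wrongly carry on as if it were a table). A brief, illustrative sketch under the same assumed bit layout:

#include <stdint.h>
#include <assert.h>
#include <stdbool.h>

#define VALID_BIT        (1UL << 0)
#define TABLE_BIT        (1UL << 1)
#define PRESENT_INVALID  (1UL << 11)

static bool pmd_valid(uint64_t pmd)   { return pmd & VALID_BIT; }
static bool pmd_table(uint64_t pmd)   { return (pmd & (VALID_BIT | TABLE_BIT)) == (VALID_BIT | TABLE_BIT); }
static bool pmd_present(uint64_t pmd) { return (pmd & VALID_BIT) || (pmd & PRESENT_INVALID); }
static bool pmd_leaf(uint64_t pmd)    { return pmd_present(pmd) && !pmd_table(pmd); }

/* New behaviour: a large leaf reports its own validity rather than always
 * claiming the page is present or descending as if the entry were a table. */
static bool page_present_at_pmd(uint64_t pmd)
{
	if (pmd_leaf(pmd))
		return pmd_valid(pmd);
	return false;	/* the real walk descends to the pte level here */
}

int main(void)
{
	assert(page_present_at_pmd(VALID_BIT));         /* valid block -> present */
	assert(!page_present_at_pmd(PRESENT_INVALID));  /* present-invalid block -> not present */
	return 0;
}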

arch/arm64/mm/trans_pgd.c

Lines changed: 7 additions & 35 deletions
@@ -31,36 +31,6 @@ static void *trans_alloc(struct trans_pgd_info *info)
 	return info->trans_alloc_page(info->trans_alloc_arg);
 }
 
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
-	pte_t pte = __ptep_get(src_ptep);
-
-	if (pte_valid(pte)) {
-		/*
-		 * Resume will overwrite areas that may be marked
-		 * read only (code, rodata). Clear the RDONLY bit from
-		 * the temporary mappings we use during restore.
-		 */
-		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
-	} else if (!pte_none(pte)) {
-		/*
-		 * debug_pagealloc will removed the PTE_VALID bit if
-		 * the page isn't in use by the resume kernel. It may have
-		 * been in use by the original kernel, in which case we need
-		 * to put it back in our copy to do the restore.
-		 *
-		 * Other cases include kfence / vmalloc / memfd_secret which
-		 * may call `set_direct_map_invalid_noflush()`.
-		 *
-		 * Before marking this entry valid, check the pfn should
-		 * be mapped.
-		 */
-		BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-		__set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
-	}
-}
-
 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 		    pmd_t *src_pmdp, unsigned long start, unsigned long end)
 {
@@ -76,7 +46,11 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 
 	src_ptep = pte_offset_kernel(src_pmdp, start);
 	do {
-		_copy_pte(dst_ptep, src_ptep, addr);
+		pte_t pte = __ptep_get(src_ptep);
+
+		if (pte_none(pte))
+			continue;
+		__set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte)));
 	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
@@ -109,8 +83,7 @@ static int copy_pmd(struct trans_pgd_info *info, pud_t *dst_pudp,
 			if (copy_pte(info, dst_pmdp, src_pmdp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pmd(dst_pmdp,
-				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
+			set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd)));
 		}
 	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
 
@@ -145,8 +118,7 @@ static int copy_pud(struct trans_pgd_info *info, p4d_t *dst_p4dp,
 			if (copy_pmd(info, dst_pudp, src_pudp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pud(dst_pudp,
-				__pud(pud_val(pud) & ~PUD_SECT_RDONLY));
+			set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud)));
 		}
 	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
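The simplified copy logic above applies one rule at every level: any non-none source entry is copied as a writable, valid entry, whether the source was valid or present-invalid. A compact model of that rule; illustrative only, with PTE_RDONLY assumed at bit 7 and pte_mkwrite_novma()/pte_mkvalid_k() reduced to the bit flips relevant here (the real pte_mkvalid_k() also applies PTE_MAYBE_NG):

#include <stdint.h>
#include <assert.h>

#define VALID_BIT        (1UL << 0)
#define RDONLY_BIT       (1UL << 7)	/* stand-in for PTE_RDONLY */
#define PRESENT_INVALID  (1UL << 11)

/* Sketch of pte_mkvalid_k(): clear the invalid marker, force VALID. */
static uint64_t mkvalid_k(uint64_t pte)
{
	return (pte & ~PRESENT_INVALID) | VALID_BIT;
}

/* Sketch of pte_mkwrite_novma() for the temporary restore mappings. */
static uint64_t mkwrite(uint64_t pte)
{
	return pte & ~RDONLY_BIT;
}

static uint64_t copy_entry(uint64_t src)
{
	if (src == 0)			/* pte_none(): nothing to copy */
		return 0;
	return mkvalid_k(mkwrite(src));	/* valid + writable in the copy */
}

int main(void)
{
	uint64_t ro_valid = VALID_BIT | RDONLY_BIT;	/* e.g. rodata */
	uint64_t invalid = PRESENT_INVALID;		/* e.g. secretmem/kfence page */

	assert(copy_entry(ro_valid) == VALID_BIT);
	assert(copy_entry(invalid) == VALID_BIT);
	assert(copy_entry(0) == 0);
	return 0;
}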
