
aarch64: boot + render on real hardware (UTM / bare metal) — UEFI ISO, crash-free #12

Merged
squirelboy360 merged 6 commits into main from feat/arm-uefi-render2
Jun 11, 2026


@squirelboy360

Contributor

Takes the aarch64 port from "renders only in TCG emulation" to booting and rendering the full Flutter shell on real ARM hardware (Apple HVF / UTM / bare metal / Raspberry Pi), crash-free, and ships a real bootable UEFI ISO.

Fixes (each root-caused + verified)

  1. UEFI boot (SPSel): Limine hands off at EL1t (SPSel=0), so the kernel sp aliased SP_EL0 and clobbered the userspace stack. Fix: msr spsel, #1 at entry. The ISO now boots under edk2/UTM.
  2. daifset resume corruption: eret_to_el0* reset SP before reading the resume image off the abandoned yield frame, so an IRQ arriving in that window overwrote it (~25% of long runs). Fix: mask IRQs across the asm. Verified: 0/12 runs crashed.
  3. crash#2 (cond_timedwait timeout): finish_cond_timedout_return advanced the resume PC by a hardcoded 2 bytes (the x86 syscall length); aarch64 svc is 4 bytes, so the PC became misaligned → EC=0x22, deterministically ~280 frames in. Fix: use the cfg'd SYSCALL_INSN_LEN. Verified: 0/3 long boots crashed (~2685 frames each; previously 2/2 crashed).
  4. SP alignment (real-hardware blocker): spawned threads got an SP with SP%16==8 (the x86 RSP%16==8 convention). TCG tolerates it; real silicon faults with EC=0x26 instantly. Fix: 16-byte-align on aarch64. This is why it never ran under HVF/UTM.
  5. Virtual timer (real-hardware render): the kernel ticked the EL1 physical timer (CNTP/PPI30), which a hypervisor (HVF) doesn't deliver to the guest → no vsync → no present. Fix: switch to the architected virtual timer (CNTV/PPI27), which is delivered on HVF, bare metal, and TCG.

Verified

  • TCG (-cpu cortex-a72): renders, crash-free (~2685 frames/boot).
  • HVF (-cpu host, real Apple Silicon, the UTM path): present_callback reached 2063 (-kernel boot) and 600+ (final UEFI ISO), crash=0.
  • x86_64 kernel still compiles (CI-safe; all changes are aarch64-cfg'd).

CI / release

release.yml now also builds the aarch64 UEFI ISO (build-iso-aarch64.sh, LIMINE_DIR=$HOME/limine) and publishes oscortex-aarch64-<tag>.iso + .sha256 alongside the raw kernel — the first UTM/VM/bare-metal-bootable ARM image. Merging to main cuts v0.0.6.

Re-port of the Limine UEFI scaffolding (boot_limine.rs, aarch64-limine.ld,
mmu::limine_setup_ttbr0, kernel_main_arch_limine, limine-boot feature,
build-iso-aarch64.sh) onto origin/main — which has ALL the render fixes (epoll/
va_list ABI, x3 arg restore, cooperative scheduling, virtio-input). The earlier
attempt was mistakenly branched off stale develop (pre-fixes).
…the shell

Limine hands off at EL1t (SPSel=0), where `sp` aliases SP_EL0. Every kernel
stack write then went through SP_EL0, and enter_user's kernel-stack reclaim
(`mov sp, <syscall stack top>`) clobbered the user SP it had just delivered:
pid1 entered EL0 with SP pointing at the kernel syscall stack and its first
push faulted (EC=0x24, FAR=SP-0xa0). The QEMU -kernel path sets up EL1h
itself, which is why the identical kernel worked there.

One instruction: msr spsel, #1 before the boot-stack switch. Validated under
edk2: present_callback=281 on a full run.
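A small host-runnable Rust model of the EL1t/EL1h distinction, not the kernel's code: `mode_name` decodes the PSTATE.M[3:0] field (as saved in SPSR_ELx) using the Arm-defined encodings, and `active_sp` reports which stack pointer `sp` names for a given exception level and SPSel value. Both helper names are hypothetical.

```rust
/// Decode AArch64 PSTATE.M[3:0] (as saved in SPSR_ELx).
/// Suffix "t" means `sp` is SP_EL0; suffix "h" means `sp` is the EL's own SP_ELx.
fn mode_name(spsr: u64) -> &'static str {
    match spsr & 0xf {
        0b0000 => "EL0t",
        0b0100 => "EL1t",
        0b0101 => "EL1h",
        0b1000 => "EL2t",
        0b1001 => "EL2h",
        0b1100 => "EL3t",
        0b1101 => "EL3h",
        _ => "reserved",
    }
}

/// Which physical stack pointer `sp` refers to at exception level `el`.
/// EL0 always uses SP_EL0; above EL0, SPSel=0 selects SP_EL0, SPSel=1 selects SP_ELx.
fn active_sp(el: u8, spsel: u8) -> String {
    if el == 0 || spsel == 0 {
        "SP_EL0".to_string()
    } else {
        format!("SP_EL{el}")
    }
}

// The fix itself lives in the assembly entry stub, before any Rust runs:
//   msr spsel, #1      // switch EL1t -> EL1h so `sp` means SP_EL1
//   <load boot stack>  // only now is it safe to write `sp`
```

Under Limine's handoff `active_sp(1, 0)` is `"SP_EL0"`, the same register enter_user later hands to EL0, which is why the kernel-stack reclaim clobbered the user SP.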
…% resume crash

eret_to_el0/eret_to_el0_fp reset SP_EL1 then read the img/fp resume arrays that
still live on the abandoned cooperative-yield frame. A wait loop that yields
with IRQs unmasked leaves them unmasked here; a timer IRQ pushes a TrapFrame
over img/fp → corrupted resume image (intermittent data abort in eret_to_el0_fp
at the SPSR load with x18=0). Fix: msr daifset, #0xf before the SP reset; the
eret restores SPSR_EL0T so EL0 runs with interrupts enabled. 12/12 -kernel
boots now come up clean (previously ~2/8 crashed).
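The `daifset` immediate is easy to misread, so here is a minimal Rust model (hypothetical helper names, not the kernel's code) of how `msr daifset, #imm` and `msr daifclr, #imm` update the DAIF mask bits. Immediate bits 3..0 select D, A, I and F, which sit at bits 9..6 of DAIF, so `#0xf` masks everything and `#0x2` would mask IRQs alone. The trailing comment outlines the ordering the fix enforces; the exact loads and registers in it are assumed.

```rust
// Positions of the mask bits in the DAIF register (as read by `mrs xN, daif`).
const DAIF_D: u64 = 1 << 9; // debug exceptions
const DAIF_A: u64 = 1 << 8; // SError (asynchronous abort)
const DAIF_I: u64 = 1 << 7; // IRQ
const DAIF_F: u64 = 1 << 6; // FIQ

/// `msr daifset, #imm`: imm bit3=D, bit2=A, bit1=I, bit0=F, ORed into bits 9..6.
fn daifset(daif: u64, imm: u8) -> u64 {
    daif | ((u64::from(imm) & 0xf) << 6)
}

/// `msr daifclr, #imm`: the same mapping, but the selected bits are cleared.
fn daifclr(daif: u64, imm: u8) -> u64 {
    daif & !((u64::from(imm) & 0xf) << 6)
}

/// True when IRQs cannot be taken (PSTATE.I set).
fn irqs_masked(daif: u64) -> bool {
    (daif & DAIF_I) != 0
}

// Ordering enforced by the fix (simplified; loads and registers assumed):
//   msr daifset, #0xf   // mask first: no IRQ can push a TrapFrame now
//   mov sp, <top>       // reset SP_EL1; img/fp now lie below SP, unguarded
//   ...                 // load the resume image (incl. SPSR) from img/fp
//   eret                // PSTATE comes from the loaded SPSR: EL0t, DAIF clear
```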
…UEFI ISO in CI

Kernel: finish_cond_timedout_return advanced the resume PC by a hardcoded 2
bytes — correct for x86's 2-byte `syscall`, but on aarch64 `svc #0` is 4 bytes,
so the parked cond_timedwait waiter resumed at svc+2 (a misaligned PC) →
EC=0x22 PC-alignment fault, deterministically ~280 frames into a run (every
cond_timedwait timeout was a live grenade). Use the cfg'd SYSCALL_INSN_LEN
(4 on aarch64) and set aarch64_ret_in_x0 so ETIMEDOUT actually reaches x0.
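A hedged sketch of the resume-PC fix, assuming (as the "svc+2" above implies) that a parked waiter's saved PC points at its syscall instruction. The `Arch` enum, `SavedCtx`, and `finish_timedout` are hypothetical stand-ins for the kernel's cfg'd SYSCALL_INSN_LEN and finish_cond_timedout_return, written so both architectures can be exercised on one host. ETIMEDOUT = 110 is the Linux value and only an assumption about OSCortex's errno numbering.

```rust
// Compile-time form, as a kernel would select it (only x86_64/aarch64 assumed).
#[allow(dead_code)]
#[cfg(target_arch = "aarch64")]
const SYSCALL_INSN_LEN: u64 = 4; // `svc #0`: every A64 instruction is 4 bytes
#[allow(dead_code)]
#[cfg(not(target_arch = "aarch64"))]
const SYSCALL_INSN_LEN: u64 = 2; // x86-64 `syscall` is 0F 05

// Runtime form of the same table so both cases can run side by side.
#[derive(Clone, Copy)]
enum Arch {
    X86_64,
    Aarch64,
}

const fn syscall_insn_len(arch: Arch) -> u64 {
    match arch {
        Arch::X86_64 => 2,
        Arch::Aarch64 => 4,
    }
}

/// Hypothetical saved user context of a parked waiter.
struct SavedCtx {
    pc: u64,  // assumed: address of the syscall instruction itself
    ret: i64, // return register: x0 on aarch64, rax on x86-64
}

/// Assumption: Linux errno numbering, returned negated.
const ETIMEDOUT: i64 = 110;

/// Resume a timed-out waiter just past its syscall, reporting -ETIMEDOUT.
fn finish_timedout(ctx: &mut SavedCtx, arch: Arch) {
    ctx.pc += syscall_insn_len(arch);
    ctx.ret = -ETIMEDOUT;
}

/// A64 instructions must be 4-byte aligned; otherwise EC=0x22 (PC alignment fault).
fn pc_aligned(arch: Arch, pc: u64) -> bool {
    match arch {
        Arch::X86_64 => true,
        Arch::Aarch64 => pc % 4 == 0,
    }
}
```

With the old hardcoded 2, an aarch64 waiter parked at 0x401000 would resume at 0x401002, which `pc_aligned` rejects; the cfg'd length yields 0x401004.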

CI: release.yml now builds the aarch64 UEFI/Limine ISO via build-iso-aarch64.sh
(LIMINE_DIR=$HOME/limine) and publishes oscortex-aarch64-<tag>.iso + .sha256
alongside the raw -kernel ELF — a real UTM/VM/bare-metal-bootable ARM image.
build-iso-aarch64.sh: LIMINE_DIR is now overridable for CI.
…e (HVF/bare-metal) SP-alignment fault

The x86 SysV convention seeds entry SP at stack_top-8 (RSP%16==8, return addr
on the stack). AArch64 is the opposite: SP must be 16-byte aligned at ALL times
(AAPCS64; hardware-enforced via SCTLR_EL1.SA0 at EL0 and SCTLR_EL1.SA at EL1) and the return address lives in x30/LR (already
seeded via p.user_lr=thread_return_trampoline_va). TCG doesn't enforce SP
alignment so it silently worked; real silicon (HVF / UTM / Raspberry Pi) faults
EC=0x26 the instant a spawned engine thread touches its stack — which is why it
rendered headless but never on real hardware. cfg-split spawn_thread and
spawn_with_bootstrap to align down to 16 on aarch64; x86 keeps the -8.

Verified under HVF (cpu=host, real Apple Silicon): engine now spawns all worker
threads with no SP fault and reaches the render event loop.
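The cfg split in spawn_thread/spawn_with_bootstrap boils down to one computation. The Rust sketch below uses hypothetical helpers, not the kernel's code, with an `Arch` enum so both conventions can be compared on one host; it assumes `stack_top` is already 16-byte aligned on x86_64, matching the stack_top-8 described above.

```rust
#[derive(Clone, Copy)]
enum Arch {
    X86_64,
    Aarch64,
}

/// Initial SP for a freshly spawned thread whose stack ends at `stack_top`.
fn initial_sp(arch: Arch, stack_top: u64) -> u64 {
    match arch {
        // SysV x86-64: entry sees RSP % 16 == 8, as if `call` had pushed a return address.
        Arch::X86_64 => stack_top - 8,
        // AArch64: align down to 16; the return address goes in x30 (LR), not on the stack.
        Arch::Aarch64 => stack_top & !0xf,
    }
}

/// What the hardware SP alignment check requires on AArch64.
fn aarch64_sp_ok(sp: u64) -> bool {
    sp % 16 == 0
}
```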
…PI27) — renders on real hardware

The kernel ticked the EL1 PHYSICAL timer (CNTP_CTL_EL0, PPI 30). On QEMU -M virt
TCG that interrupt is delivered, so it worked headless. But OSCortex runs at EL1
and under a hypervisor — Apple's HVF (what UTM uses), KVM, Xen — EL2 owns the
physical timer: EL1 CNTP accesses are trapped and PPI 30 is NOT delivered to the
guest. So under HVF the tick never fired → no vsync baton → the engine reached
its event loop, scheduled frames, and stalled with present=0 (nothing ever
rendered on real silicon, only in TCG emulation).

Switch to the architected EL1-guest VIRTUAL timer (CNTV_CTL_EL0 / CNTV_TVAL_EL0 /
CNTVCT_EL0, PPI 27). It is delivered under HVF, on bare metal (CNTVOFF=0 → virtual
time == physical time), AND on TCG virt — strictly more portable. vsync_due
already measured cadence against CNTVCT, so this is now consistent.

Verified under HVF (cpu=host, real Apple Silicon): present_callback reaches 2063
frames, crash=0 — the full Flutter shell renders on real hardware.
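A minimal sketch of driving the virtual timer, under stated assumptions: the CTL bit layout (ENABLE, IMASK, ISTATUS) is the architected one shared by CNTP and CNTV, while INTIDs 27 and 30 are the values QEMU virt uses (and SBSA recommends); a platform reports its own via device tree or ACPI. `tval_for`, `timer_fired`, `vsync_due`, and `arm_virtual_timer` are hypothetical names; the `asm!` part only compiles on aarch64 and must run at EL1.

```rust
// CNTV_CTL_EL0 / CNTP_CTL_EL0 bit layout (identical for both timers).
const CTL_ENABLE: u64 = 1 << 0;
const CTL_IMASK: u64 = 1 << 1;
const CTL_ISTATUS: u64 = 1 << 2;

// GIC INTIDs of the per-CPU timers (PPIs occupy INTIDs 16..=31).
const INTID_VIRT_TIMER: u32 = 27; // CNTV, used after this PR
const INTID_EL1_PHYS_TIMER: u32 = 30; // CNTP, not delivered to the guest under HVF

/// CNTV_TVAL_EL0 reload for `hz` ticks per second, given CNTFRQ_EL0.
/// TVAL is a signed 32-bit down-counter, so clamp to i32::MAX.
fn tval_for(cntfrq: u64, hz: u64) -> u32 {
    assert!(hz != 0, "tick rate must be nonzero");
    (cntfrq / hz).min(i32::MAX as u64) as u32
}

/// The timer asserts its interrupt when enabled, unmasked, and its condition is met.
fn timer_fired(ctl: u64) -> bool {
    (ctl & CTL_ENABLE) != 0 && (ctl & CTL_IMASK) == 0 && (ctl & CTL_ISTATUS) != 0
}

/// Vsync cadence measured on the virtual counter (CNTVCT_EL0); wrap-safe.
fn vsync_due(now: u64, last: u64, period_ticks: u64) -> bool {
    now.wrapping_sub(last) >= period_ticks
}

/// Arm the virtual timer `tval` ticks from now. Intended for EL1 only
/// (EL0 access is normally disabled via CNTKCTL_EL1), so never called here.
#[cfg(target_arch = "aarch64")]
#[allow(dead_code)]
unsafe fn arm_virtual_timer(tval: u32) {
    unsafe {
        core::arch::asm!(
            "msr cntv_tval_el0, {t}",
            "msr cntv_ctl_el0, {c}",
            "isb",
            t = in(reg) u64::from(tval),
            c = in(reg) CTL_ENABLE,
            options(nostack),
        );
    }
}
```

Because vsync_due already compares against CNTVCT, ticking CNTV keeps the interrupt source and the cadence clock on the same counter.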
squirelboy360 merged commit e9fd481 into main on Jun 11, 2026
3 checks passed