
aarch64: Flutter shell bring-up — engine runs AOT on ARM (reaches ICU init + worker spawn)#9

Merged
squirelboy360 merged 37 commits into develop from feat/arch-aarch64-shell
Jun 9, 2026

squirelboy360 commented Jun 9, 2026


What this brings

The aarch64 Flutter shell path: from one source tree, the same Material shell that renders on x86 now builds and boots into the Flutter engine on ARM. It boots directly via qemu-system-aarch64 -M virt -kernel, with a RAMFB display, GICv2 + generic-timer preemption, and EL0 syscalls. scripts/build-aarch64-shell.sh stages the arm64 engine and the AOT shell snapshot into the initramfs (artifacts are published as release oscortex-engine-1 and prebuilt in the oscx-engine container).

Boot reaches: dlopen engine → FlutterEngineInitialize OK → AOT confirmed on ARM (FlutterEngineRunsAOTCompiledDartCode=true) → snapshot ptrs resolve → RunInitialized → ICU init (icudtl mmap) → worker-thread spawn.

Kernel fixes (final commit b98ff19) — single-core register/lock coherence

  1. Eager user-GPR capture at the SVC boundary: a port of the proven x86 GPRS_CAPTURED fix. On aarch64 the capture runs inside the IRQ-masked SVC window in the vector dispatch, because the ARM path unmasks IRQs before the syscall handler runs. It stops a timer-preempted sibling's syscall from clobbering the shared per-CPU snapshot and leaking stale callee-saved regs / x30 into a yielding thread's resume context.
  2. IRQ-masked outer page-table critical section: the reentrant lock_page_table depth counter desynced from the spinlock when the timer preempted a lock holder, deadlocking the demand pager. The outer section now runs with IRQs masked (cfg-gated to aarch64; x86 untouched).
  3. The crash reporter scans the EL0 stack for engine return addresses, so a crash can still be symbolized when the FP chain is empty (x30/FP=0).

x86 safety

All shared-code changes are cfg-gated to aarch64. x86_64 kernel verified to compile (cargo check --target x86_64-unknown-none).

Status / not-done

Engine reaches ICU init + worker spawn. A residual cooperative-yield corruption remains (nondeterministic: sometimes a ret-to-0, sometimes an abort), the same class of bug the x86 port hardened over several commits. Next targets: the lazy rip/rsp handling in save_return_context, and build_image not preserving x24–x28 across cooperative-yield resume. Not yet rendering on ARM; merging now to keep the integrated ARM path moving, with further hardening to follow.

squirelboy360 and others added 30 commits June 8, 2026 23:38
…surface

Replace the 35-line aarch64 stub with a module layout that mirrors the
x86_64 backend so the rest of the kernel's crate::arch::* surface resolves
on aarch64. Every function/type/const the shared kernel imports through the
arch facade now exists here with the matching signature; bodies are
compilable scaffolds (no-ops / sane defaults) pending the real ARM port.

Modules added under arch/aarch64/:
  cpu        FP/SIMD enable, TPIDR_EL0 TLS (set/get_fs_base), DAIF
             interrupt-mask save/restore, DMB/yield fences, xstate
             save/restore placeholders, hypervisor detection.
  memory     read/write_cr3 mapped onto TTBR0_EL1 (+ TLB maintenance).
  gdt        nominal USER_CS/USER_DS selector constants (no segmentation).
  idt        VBAR_EL1 exception-vector install entry points; re-exports
             the cross-arch InterruptFrame.
  apic       GIC + generic-timer scaffold: eoi, init_bsp/ap,
             finish_xapic_init, local_apic_id (from MPIDR_EL1),
             send_resched_ipi, and the vsync cadence accessors.
  syscall    SVC fast-path scaffold: per-CPU user-GPR scratch with the
             UserGprSnapshot type and user_rsp/rip/r9/rbp/user_gprs
             accessors, set_active_stack_top, init/init_ap.
  smp        per-CPU table + this_cpu()/current_cpu_id() via MPIDR_EL1,
             CPU_COUNT, broadcast_resched_ipi (single-core scaffold;
             PSCI CPU_ON wake left as a TODO).
  acpi       RSDP/MADT lookup + PSCI SYSTEM_OFF shutdown placeholders.
  interrupts disable_pic no-op (no legacy PIC on aarch64).
  mod        early_init/ap_init/smp_init/halt/halt_forever/enable/
             disable_interrupts/rdtsc (CNTVCT_EL0) and AAPCS64
             context_switch + task_entry naked trampolines.

This is the scaffolding step of the ARM port: it does NOT boot. Real MMU
page tables, the EL1 exception vectors, GIC/timer programming, the SVC
entry path, and PSCI SMP bring-up are stubbed for follow-up work.
The shared kernel embedded raw x86 instructions and x86-only limine flags
directly in architecture-neutral files, which broke the aarch64-unknown-none
build. Move every such site behind crate::arch::* so both x86_64 and aarch64
compile, with x86_64 codegen left byte-for-byte identical.

User-mode entry: extract the IRETQ/SYSRETQ ring-3 transition asm out of
process::enter_user_by_pid_noreturn{,_try} into a new arch::enter_user hook
(EnterUserRegs + enter_user_iret/enter_user_sysret). The x86_64 backend keeps
the exact prior asm (verified: identical instruction/operand sequences); the
aarch64 backend stubs them pending the real EL1->EL0 ERET path. The shared
process layer keeps all PTABLE/CR3/errno logic and only calls the hook for the
final transfer.

Other shared sites rerouted through the facade:
- main.rs: gate limine MP_FLAG_X2APIC behind cfg(x86_64), 0 otherwise
- dispatch/poll/posix: _rdtsc()/inline rdtsc -> arch::rdtsc()
- posix/poll/fd/futex/engine/ipc_display: sti;hlt;cli -> arch::enable_and_halt()
- futex/posix: pause -> arch::spin_pause()
- process: cli -> arch::interrupts_disable(); mov cr3 -> arch::memory::write_cr3()
- paging: read cr3 -> arch::memory::read_cr3()
- panic: read rbp -> arch::read_frame_pointer()
- fd: poweroff loop -> arch::acpi_shutdown() (already arch-neutral)

New arch hooks: enter_user_{iret,sysret}+EnterUserRegs, interrupts_disable,
enable_and_halt, read_frame_pointer (x86_64 real, aarch64 mirror/stub).

Both targets build clean:
- aarch64-unknown-none --features arch-aarch64: 0 errors
- x86_64-unknown-none release: 0 errors
Direct QEMU -kernel boot path for the ARM port. The assembly _start
(boot.rs) parks secondary CPUs, drops EL2->EL1 if needed, enables
FP/AdvSIMD (CPACR_EL1.FPEN), sets up the boot stack, zeroes BSS, and
calls into the Rust bring-up sequence. A minimal polled PL011 UART
driver (uart.rs) at 0x09000000 is the serial debug lifeline.

The bring-up (bringup.rs) prints CurrentEL, SCTLR_EL1, MPIDR_EL1 and
CNTFRQ_EL0 over serial, confirming we land at EL1 with the MMU off and
a 62.5MHz generic timer. New aarch64.ld links the kernel for QEMU
-M virt RAM at 0x40080000 with _start as the ELF entry.

Boots cleanly under qemu-system-aarch64 -M virt -cpu cortex-a72.
…stone 2)

ARMv8-A MMU bring-up (mmu.rs): builds 512-entry L1 translation tables
with 1 GiB block descriptors for both TTBR0 (low/user) and TTBR1
(kernel high half), programs MAIR_EL1 (Normal WB + Device-nGnRE),
TCR_EL1 and enables SCTLR_EL1.M/C/I.

Two bring-up subtleties resolved on real QEMU:
- T0SZ/T1SZ = 25 (39-bit VA) so the 4 KiB-granule walk starts at L1,
  letting a single L1 table cover the whole space with 1 GiB blocks
  (T0SZ=16/48-bit would have required an L0 table -> level-0 fault).
- Kernel identity map uses AP=EL1-only; an EL0-writable kernel-code
  mapping implicitly forces PXN, which tripped a level-1 permission
  fault on the first translated instruction fetch.
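
The T0SZ arithmetic above can be checked with a small host-side sketch. The function names are illustrative and not taken from mmu.rs; the only facts assumed are the architectural ones for the 4 KiB granule (12 offset bits, 9 index bits per level, level 3 as the last level).

```rust
/// Virtual-address bits translated for a given TCR_EL1.TnSZ value.
pub const fn va_bits(tnsz: u32) -> u32 {
    64 - tnsz
}

/// Initial lookup level with the 4 KiB granule.
/// The page offset consumes 12 bits and each level resolves 9 more,
/// counting down to level 3 as the final level.
pub const fn start_level_4k(tnsz: u32) -> u32 {
    let index_bits = va_bits(tnsz) - 12;
    let levels = (index_bits + 8) / 9; // ceiling division by 9
    4 - levels
}

/// Number of 1 GiB (2^30 byte) blocks needed to span the VA space.
/// Only meaningful when the VA space is at least 30 bits wide.
pub const fn gib_blocks(tnsz: u32) -> u64 {
    1u64 << (va_bits(tnsz) - 30)
}
```

With TnSZ = 25 the walk starts at level 1 and one 512-entry table of 1 GiB blocks covers the whole 39-bit space; with TnSZ = 16 the walk would start at level 0.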

Verified on qemu-system-aarch64: SCTLR_EL1=0xc5183d (M/C/I on),
a RAM read/write probe round-trips, and execution continues
translated past the enable.
Install a 16-entry, 2KB-aligned VBAR_EL1 vector table (vectors.rs)
covering all four groups (Current-EL SP0/SPx, Lower-EL AArch64/32) x
four kinds (Sync/IRQ/FIQ/SError). Each entry saves the full integer
register file (x0-x30) plus SP_EL0, ELR_EL1, SPSR_EL1 and ESR_EL1 into
a TrapFrame, calls a Rust dispatcher tagged with the exception kind,
then restores everything and erets.
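
As a rough illustration of the table layout described above, the sketch below computes where each of the 16 entries sits relative to VBAR_EL1. The enum and function names are hypothetical; the 0x80-byte entry size is the architectural one.

```rust
/// The four vector groups, in table order.
#[derive(Clone, Copy)]
pub enum Group {
    CurrentSp0 = 0,
    CurrentSpx = 1,
    LowerAarch64 = 2,
    LowerAarch32 = 3,
}

/// The four exception kinds within each group, in table order.
#[derive(Clone, Copy)]
pub enum Kind {
    Sync = 0,
    Irq = 1,
    Fiq = 2,
    SError = 3,
}

/// Each vector entry holds up to 32 instructions (0x80 bytes).
pub const ENTRY_SIZE: usize = 0x80;
/// 16 entries give a 2 KiB table, which is also its required alignment.
pub const TABLE_SIZE: usize = 16 * ENTRY_SIZE;

/// Byte offset of an entry from the table base.
pub const fn vector_offset(group: Group, kind: Kind) -> usize {
    (group as usize) * 4 * ENTRY_SIZE + (kind as usize) * ENTRY_SIZE
}
```

A synchronous exception from EL0 in AArch64 state (the SVC path) therefore lands at offset 0x400.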

The dispatcher decodes ESR_EL1.EC and routes SVC64 to an installable
handler, IRQs to an installable IRQ handler, and reports+parks on any
unhandled exception (printing ESR/ELR/SPSR/FAR over serial).

Verified on QEMU: a deliberate synchronous exception taken from EL1 round-trips through
save/dispatch/restore/eret (SYNC_EL1 0->1, handler hit once) and a
callee-saved sentinel register survives the trap intact.
GICv2 driver (gic.rs) for QEMU -M virt: enables the distributor and
this core's CPU interface, with per-IRQ enable/priority/routing plus
IAR acknowledge and EOIR end-of-interrupt.

Generic-timer driver (timer.rs) programs the EL1 physical timer
(CNTP_CTL/CNTP_TVAL_EL0) for a periodic tick, routed via PPI 30 and
re-armed each interrupt. CNTFRQ_EL0 (62.5 MHz on virt) gives the rate.

The bring-up installs an IRQ handler that acknowledges at the GIC,
services the timer PPI, and EOIs. With IRQs unmasked the timer ticks:
verified 5 ticks / 5 serviced IRQs in ~61ms at the requested 100 Hz.
This is the same scheduler-tick source the x86 APIC timer drives.
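
The reload value for a periodic tick follows directly from the counter frequency. A minimal sketch, with hypothetical helper names (timer.rs may structure this differently):

```rust
/// Value to write to CNTP_TVAL_EL0 for one tick at `tick_hz`,
/// given the counter frequency read from CNTFRQ_EL0.
pub const fn tval_for_rate(cntfrq_hz: u64, tick_hz: u64) -> u64 {
    cntfrq_hz / tick_hz
}

/// Convert a counter delta back to microseconds (for tick diagnostics).
pub const fn ticks_to_micros(ticks: u64, cntfrq_hz: u64) -> u64 {
    ticks * 1_000_000 / cntfrq_hz
}
```

At 62.5 MHz and 100 Hz this gives a reload of 625,000 counts, i.e. one interrupt every 10 ms.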
The headline ARM milestone: the kernel drops to EL0, runs a userspace
program, and services its svc #0 syscalls end to end.

enter_user.rs: real EL1->EL0 transition. Builds the full x0..x30 +
SP_EL0 + ELR_EL1 + SPSR_EL1 image from the shared (x86-named)
EnterUserRegs, mapping rdi/rsi/.. onto x0/x1/.. per the Linux aarch64
ABI, then erets into EL0. Both the IRET (timer-preempt) and SYSRET
(syscall-yield) hooks the shared process layer calls route here.
enter_el0_at() is the bring-up's direct launch path.

mmu.rs: map_user_page() installs an L1->L2->L3 walk for an EL0-RW,
EL0-executable (PXN=1/UXN=0) 4 KiB page at a free VA, backing it with
a kernel-writable physical page so the kernel can stage user code.

bringup_user.rs: assembles a tiny EL0 program (MOVZ/MOVK + svc #0
sequences) that issues write(64)/getpid(172)/exit(93) syscalls. The
SVC handler reads x8 (nr) + x0..x2 (args) from the saved TrapFrame,
services the call, writes the return into x0, and erets back to EL0.

psci.rs: PSCI SYSTEM_OFF / CPU_ON helpers (conduit-gated on EL>=2).

Verified on qemu-system-aarch64 -M virt -cpu cortex-a72: the EL0
program prints over serial via syscalls, getpid returns a value, and
exit(7) is observed (writes=3, exit_code=7). Full chain boots:
EL1 -> MMU -> vectors -> GIC+timer -> EL0 -> syscalls -> exit.
The aarch64 SVC handler now calls syscall::capture_from_trap(), which
stashes the trapping EL0 thread's registers into the per-CPU scratch
that backs the architecture-neutral accessors arch::syscall::user_rip
/ user_rsp / user_gprs the shared dispatch_fast path reads. AArch64
registers map onto the x86-named slots per the Linux aarch64 ABI
(x0..x5 -> arg regs, SP_EL0 -> user_rsp, ELR_EL1 -> user_rip,
x19..x23/x29 -> callee-saved slots).

Verified on QEMU: after the first EL0 svc, the shared accessors report
user_rip=0x400000018 (past the svc) and user_rsp=0x400000f00 (the EL0
stack) — the exact contract dispatch_fast consumes. x86_64 build
unaffected.
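
A host-side sketch of the register mapping described above. The struct layouts and the order of the callee-saved slots are assumptions for illustration; only the register-to-slot correspondence (x0..x5, SP_EL0, ELR_EL1, x19..x23/x29) comes from the commit message.

```rust
/// Hypothetical subset of the saved EL0 trap frame.
pub struct TrapFrame {
    pub x: [u64; 31], // x0..x30
    pub sp_el0: u64,
    pub elr_el1: u64,
}

/// Hypothetical arch-neutral snapshot using the x86-style slot names.
#[derive(Debug, Default, PartialEq)]
pub struct UserSnapshot {
    pub user_rip: u64,
    pub user_rsp: u64,
    pub args: [u64; 6],
    pub callee_saved: [u64; 6],
}

/// Copy the trapping thread's registers into the neutral snapshot.
pub fn capture_from_trap(tf: &TrapFrame) -> UserSnapshot {
    let mut snap = UserSnapshot::default();
    snap.user_rip = tf.elr_el1; // ELR_EL1 already points past the svc
    snap.user_rsp = tf.sp_el0;
    snap.args.copy_from_slice(&tf.x[0..6]); // x0..x5 are the syscall args
    snap.callee_saved[..5].copy_from_slice(&tf.x[19..24]); // x19..x23
    snap.callee_saved[5] = tf.x[29]; // frame pointer
    snap
}
```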
Best-effort SMP bring-up (bringup_smp.rs): issues PSCI CPU_ON over the
QEMU -M virt HVC conduit to start CPU 1 at a physical secondary entry
stub (__secondary_entry) which enables FP, sets up its own stack, and
calls into Rust to record its MPIDR and report online over serial.

psci.rs cpu_on now uses HVC (the active virt conduit). The EL1 sync
vector treats an Undefined-instruction trap (EC=0, e.g. a PSCI probe
where no conduit is active) as a graceful no-op: it sets x0 to PSCI
NOT_SUPPORTED and steps ELR past the faulting instruction instead of
parking, so SMP probing never wedges the single-core path.
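
The graceful no-op amounts to two writes on the saved frame. A minimal sketch, assuming a hypothetical two-field frame (the real TrapFrame carries the full register file):

```rust
/// PSCI return code for an unsupported call.
pub const PSCI_NOT_SUPPORTED: i64 = -1;
/// Every A64 instruction is 4 bytes long.
pub const A64_INSN_LEN: u64 = 4;

/// Hypothetical slice of the saved frame touched by the fix-up.
pub struct Frame {
    pub x0: u64,
    pub elr: u64,
}

/// Make a trapped conduit probe look like a call that returned
/// NOT_SUPPORTED, then resume at the following instruction.
pub fn skip_unsupported_probe(frame: &mut Frame) {
    frame.x0 = PSCI_NOT_SUPPORTED as u64;
    frame.elr += A64_INSN_LEN;
}
```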

Verified on QEMU -smp 2: PSCI CPU_ON returns 0, CPU 1 comes online and
reports MPIDR=1; with -smp 1 the probe returns gracefully and the
single-core EL0 path still completes. Full per-CPU scheduling on the
secondary is a follow-up; the BSP already runs userspace.

x86_64 build unaffected.
Convenience wrapper that builds the aarch64 kernel and boots it with
qemu-system-aarch64 -M virt -cpu cortex-a72 (single core, or smp2 to
exercise the PSCI CPU_ON path), wiring PL011 serial to the terminal.
The ARM port previously ran only the self-contained bring-up demo. This wires
the proven arch primitives into the SAME kernel_main the x86 path uses, so the
real subsystem init runs on ARM.

  * FDT/DTB reader (arch/aarch64/fdt.rs): a small hand-rolled flattened
    device-tree parser (no external crate) that finds the /memory node(s) for
    RAM discovery. _start now preserves the x0 DTB pointer (saved in x20) and
    passes it through aarch64_start.

  * Neutral boot memory map (mm::BootMemMap): mm::init no longer reads Limine
    directly. Both arches translate their source — x86 from the Limine memmap,
    aarch64 from the device tree — into a fixed-capacity region list consumed by
    the shared mm::init_from_regions. x86 behaviour is byte-equivalent.

  * Production ARM boot path (arch/aarch64/boot_prod.rs): brings up PL011 serial,
    the MMU (identity map), EL1 exception vectors, and the GIC, parses the DTB,
    then calls the shared kernel_main_arch.

  * kernel_main split: the Limine-coupled prologue stays in the x86-only
    kernel_main; the arch-neutral subsystem init + init-process spawn + idle loop
    move to shared_init_and_run, called by both arches. Limine request statics
    are now x86-only so the ARM kernel carries no dead boot section.

  * PL011 wired into the shared logger so early_print + the log framework reach
    serial on aarch64.

Verified: production kernel_main runs end-to-end on qemu-system-aarch64 -M virt
through MMU, frame allocator (2048 MiB discovered), heap, scheduler, security,
IPC, WM, VFS, drivers, and the full Cortex AI runtime, reaching the init-process
spawn step. x86_64 kernel still builds (release ELF unchanged in behaviour).
Completes the ARM port's headline goal: the PRODUCTION kernel_main now spawns a
userspace process and services its first syscalls through the SHARED dispatcher.

  * aarch64 user page-table walker (mm/paging.rs): a real 4 KiB-granule L1→L3
    TTBR0 walker replaces the non-x86 stub — alloc_user_pml4, map_page_in,
    translate/update/unmap, free. Each per-process root is seeded with a copy of
    the kernel's identity-map L1 block descriptors so the kernel stays mapped
    after write_cr3 loads the process root (the bring-up kernel runs from the
    TTBR0 low half). free_user_pml4 skips those shared block descriptors so it
    never frees kernel RAM.

  * SVC → shared dispatch (arch/aarch64/syscall.rs): the production SVC handler
    captures the EL0 trap frame, calls syscall::dispatch_fast(x8, x0..x4), writes
    the result back to x0, and on exit/exit_group hands off to the next runnable
    process (or parks). Installed in arch::early_init.

  * ELF loader (process/elf.rs): accept EM_AARCH64 on ARM via an arch-selected
    EM_NATIVE machine check.

  * Process spawn (process/mod.rs): arch-conditional USER_STACK_TOP placed inside
    the 39-bit aarch64 VA window; the x86 glibc POSIX trampoline mapping is
    skipped on ARM (bare EL0 program, no libc).

  * Scheduler (sched/mod.rs): spawn_kernel_task builds an AAPCS64 first-run frame
    on aarch64 (x19..x30 + fn-ptr slot matching context_switch/task_entry),
    instead of the x86 register layout.

  * Native /init (build.rs): for aarch64 builds, embed a tiny EL0 ELF (linked at
    64 GiB to clear the kernel identity blocks) that does write + getpid + write +
    exit using the OSCortex syscall numbers.

  * main.rs: on aarch64 the init is entered directly via enter_user_by_pid_noreturn
    (the real write_cr3 → ERET path) since the ARM preemptive timer ISR is a
    follow-on; x86 keeps the schedule_user_launch hand-off.
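
For the L1→L3 walk in a 39-bit space, the table indices are plain 9-bit fields of the VA. The helpers below are an illustrative sketch, not the code in mm/paging.rs:

```rust
/// Index into the L1 table (each entry spans 1 GiB).
pub const fn l1_index(va: u64) -> usize {
    ((va >> 30) & 0x1ff) as usize
}

/// Index into the L2 table (each entry spans 2 MiB).
pub const fn l2_index(va: u64) -> usize {
    ((va >> 21) & 0x1ff) as usize
}

/// Index into the L3 table (each entry maps one 4 KiB page).
pub const fn l3_index(va: u64) -> usize {
    ((va >> 12) & 0x1ff) as usize
}

/// True if `va` fits the 39-bit TTBR0 window.
pub const fn in_39bit_window(va: u64) -> bool {
    va < (1u64 << 39)
}
```

An /init linked at 64 GiB uses L1 slot 64, while the kernel image at 0x40080000 sits in L1 slot 1, so the two never share an L1 entry.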

Verified on qemu-system-aarch64 -M virt: init spawns (pid=1, EL0 entry), prints
two lines via write(1,...), getpid returns, exit(0) reaps cleanly and the kernel
parks — all through the shared syscall layer, no faults. x86_64 debug + release
still build green; no /init is injected on x86.
QEMU's -M virt passes the DTB pointer in x0 for a flat/Image boot, but for an
ELF -kernel it may leave x0 = 0 and place the blob at an image-dependent address.
boot_prod now: (1) uses x0 when it points at a valid FDT (the spec mechanism,
also correct on real hardware), (2) probes the RAM base + a bounded low window
for the FDT magic, (3) falls back to the QEMU virt default (2 GiB @ 0x40000000,
matching -m) when neither yields a /memory node. Adds fdt::scan_for_dtb for the
magic-scan path. The parser itself is unchanged and validated against the
machine's real device tree.
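
A magic scan of this kind could look like the sketch below. It is not the actual fdt::scan_for_dtb: the function names, the 8-byte scan stride, and the totalsize sanity check are assumptions made for illustration. The magic value and the 40-byte header size come from the devicetree specification.

```rust
/// Flattened device tree magic, stored big-endian at offset 0.
pub const FDT_MAGIC: u32 = 0xd00d_feed;
/// Size of the FDT header (ten big-endian u32 fields).
pub const FDT_HEADER_LEN: usize = 40;

/// True if `bytes` starts with a plausible FDT header.
pub fn is_fdt_header(bytes: &[u8]) -> bool {
    if bytes.len() < 8 {
        return false;
    }
    let magic = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let totalsize = u32::from_be_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    magic == FDT_MAGIC && totalsize as usize >= FDT_HEADER_LEN
}

/// Scan a bounded window at 8-byte steps and return the offset of the
/// first plausible FDT header, if any.
pub fn scan_for_fdt(window: &[u8]) -> Option<usize> {
    (0..window.len())
        .step_by(8)
        .find(|&off| is_fdt_header(&window[off..]))
}
```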
The ARM `-kernel` boot now routes into the shared kernel_main and spawns
userspace; update the script header to describe that instead of only the
self-contained bring-up demo.
Ensures a fragmented x86 Limine memory map never drops usable RAM when
translating into the architecture-neutral region list.
QEMU -M virt ships no Limine framebuffer (Limine is the x86 boot protocol),
so the ARM port had no display. Add a fw_cfg DMA reader/writer and use it to
configure QEMU's -device ramfb:

  * kernel/src/arch/aarch64/ramfb.rs: walk the fw_cfg file directory to find
    the etc/ramfb selector, allocate a 1280x800 XRGB8888 framebuffer in
    identity-mapped RAM, and DMA-write the big-endian RAMFBCfg into it.
    Two gotchas handled: SELECT and WRITE must be separate DMAs (a combined
    control word is rejected), and RamfbCfg must be repr(C, packed) so the
    write length is exactly the 28-byte file size QEMU expects.

  * drivers/fb.rs: add init_raw(addr,w,h,pitch) — an arch-neutral entry point
    so the existing fb console/compositor (fed from Limine on x86) works
    unchanged on ARM from a raw buffer.

  * main.rs (kernel_main_arch): bring ramfb up after the frame allocator is
    online, publish it to the shared fb, and paint a boot fill.

  * scripts/run-aarch64.sh: add -device ramfb; DISPLAY_MODE=none|cocoa|gtk|sdl
    with a headless monitor socket for screendump capture.

Validated on qemu-system-aarch64: monitor screendump is now 1280x800 (was the
640x480 default), with the painted teal top band (20,184,166), navy body
(26,26,46) and white boot-console glyphs all present — pixels the kernel wrote
are scanned out by ramfb.
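
The 28-byte size constraint can be demonstrated on the host. The struct below is a sketch with assumed field names and field order (addr, fourcc, flags, width, height, stride); it is not copied from ramfb.rs. The unpacked twin shows why repr(C) alone would produce the wrong length.

```rust
/// DRM fourcc for XRGB8888 ("XR24").
pub const FOURCC_XRGB8888: u32 = u32::from_le_bytes(*b"XR24");

/// Packed layout: no padding after the u64, so exactly 28 bytes.
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct RamfbCfg {
    pub addr: u64,
    pub fourcc: u32,
    pub flags: u32,
    pub width: u32,
    pub height: u32,
    pub stride: u32,
}

/// Same fields without `packed`: tail padding rounds the size up to 32.
#[repr(C)]
pub struct RamfbCfgUnpacked {
    pub addr: u64,
    pub fourcc: u32,
    pub flags: u32,
    pub width: u32,
    pub height: u32,
    pub stride: u32,
}

pub const RAMFB_CFG_LEN: usize = core::mem::size_of::<RamfbCfg>();

impl RamfbCfg {
    /// Serialize every field big-endian, as the device expects.
    pub fn to_be_bytes(&self) -> [u8; 28] {
        let cfg = *self; // copy out so no reference to a packed field is taken
        let (addr, fourcc, flags) = (cfg.addr, cfg.fourcc, cfg.flags);
        let (width, height, stride) = (cfg.width, cfg.height, cfg.stride);
        let mut out = [0u8; 28];
        out[0..8].copy_from_slice(&addr.to_be_bytes());
        out[8..12].copy_from_slice(&fourcc.to_be_bytes());
        out[12..16].copy_from_slice(&flags.to_be_bytes());
        out[16..20].copy_from_slice(&width.to_be_bytes());
        out[20..24].copy_from_slice(&height.to_be_bytes());
        out[24..28].copy_from_slice(&stride.to_be_bytes());
        out
    }
}
```

For the 1280x800 XRGB8888 mode the stride is 1280 * 4 = 5120 bytes.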
Wire the ARM generic-timer IRQ to the SAME shared cooperative-scheduler
hand-off the x86 APIC-timer ISR uses, so EL0 threads (the Flutter engine's
threads, ultimately) are timer-preempted and round-robined — collapsing the
ARM 'enter init directly' path onto x86's timer-driven model.

  * arch/aarch64/apic.rs: production_irq_handler — on a generic-timer PPI taken
    from EL0, apply the same quantum + input/focus preempt policy as x86, map
    the live EL1 vector TrapFrame to the shared UserRegs, call the shared
    process::timer_preempt_switch_try (arch-neutral: PTABLE bookkeeping, CR3/
    TTBR0, fs_base, xstate, current_pid), then write the next thread's frame
    back into the TrapFrame so the vector stub's RESTORE_FRAME+eret enters it.
    init_bsp only installs the handler; start_scheduler_tick arms the timer.

  * Full-fidelity context: the shared UserRegs carries only the x86-named
    register subset, so each thread's complete ARM frame (x0..x30/SP_EL0/ELR/
    SPSR) round-trips through a new per-process arch_trapframe slot
    (process::arch_store/take_trapframe, try_lock — ISR-safe). Scratch
    registers (x6/x7, x11–x18, x24–x28, x30) survive arbitrary-instruction
    preemption.

  * IRQ enablement: IRQs are NOT unmasked at EL1; the eret into EL0 (SPSR=EL0t,
    I/F clear) enables them at the userspace boundary. This closes a sporadic
    boot hang where a timer fired mid enter_user_by_pid_noreturn (during the
    CR3 switch / image build) — boots went from ~50% hanging to 5/5 clean.

  * mm: reserve the kernel image in the frame allocator on ARM. QEMU -kernel
    loads the image (0x40080000) into RAM the device tree reports usable from
    0x40000000; the allocator marked it all free, so the first large contiguous
    alloc (the 64 MiB heap, then the compositor's 4 MiB double-buffer) overran
    live kernel code — a latent corruption that hung the first big alloc once a
    framebuffer made the compositor enable double-buffering. frame_allocator::
    reserve_range carves [RAM base..__kernel_end] before the heap is built.

  * build.rs + main.rs: the ARM /init program is now a compute-bound loop
    (busy-spins with NO syscall between ticks, prints a per-instance tag for a
    bounded budget, then exits). main.rs spawns TWO instances (A=x0:0, B=x0:1);
    since neither yields cooperatively, interleaved A/B output on serial can
    ONLY come from the timer preempting + switching between them.

Validated on qemu-system-aarch64 (5/5 boots): both EL0 threads make progress
with fair, interleaved ticks (e.g. A=12 B=12, 7-8 A<->B switches) then exit
cleanly. The longer compute-budget variant showed thousands of balanced ticks
(A=3341/B=3350, 2148 switches). x86 builds clean (no regression).
Build oscortex-host for aarch64-unknown-none. The syscall stubs (sys.rs)
gain an SVC #0 backend (nr in x8, args x0..x5, return x0) alongside the
existing x86 SYSCALL path; the typed wrappers stay architecture-neutral.

_start gains an AArch64 naked entry: zero x29/x30, 16-byte-align SP via a
scratch GPR, preserve the kernel's bootstrap args (x0=host_mode, x1=app_id,
x2=aot_va) across the breadcrumb write, then call main_embedder(x0,x1,x2).
main_embedder now takes those three as C arguments (dropping the brittle
read-rdi/rsi/rdx asm), and the monotonic clock reads CNTVCT_EL0/CNTFRQ_EL0
on ARM instead of RDTSC. A user_aarch64.ld linker script + an
aarch64-unknown-none .cargo target (static reloc) load it at 0x400000.

Produces a 0x400000-based static AArch64 ELF, mirroring the x86 host.
dl.rs: accept the aarch64 engine .so. Add R_AARCH64_RELATIVE/ABS64/
GLOB_DAT/JUMP_SLOT alongside the x86 relocation set (same RELA encoding,
selected by type number), gate the ELF e_machine check on the build arch,
and arch-gate the x86-only engine byte-patch block (x86 opcodes/offsets
must never run against the ARM engine). One dlopen path now loads either
the x86 or the ARM libflutter_engine.so.
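
A simplified view of how the four AArch64 relocation types resolve, as a sketch rather than the loader's real code. The Rela layout and type numbers are the standard ELF64 ones; the `resolve` helper and its `sym_value` parameter (an already-relocated symbol address) are hypothetical.

```rust
pub const R_AARCH64_ABS64: u32 = 257;
pub const R_AARCH64_GLOB_DAT: u32 = 1025;
pub const R_AARCH64_JUMP_SLOT: u32 = 1026;
pub const R_AARCH64_RELATIVE: u32 = 1027;

/// ELF64 relocation entry with explicit addend.
pub struct Rela {
    pub r_offset: u64,
    pub r_info: u64,
    pub r_addend: i64,
}

impl Rela {
    /// Relocation type: low 32 bits of r_info.
    pub fn r_type(&self) -> u32 {
        self.r_info as u32
    }
    /// Symbol table index: high 32 bits of r_info.
    pub fn r_sym(&self) -> u32 {
        (self.r_info >> 32) as u32
    }
}

/// Value to store at `base + r_offset`, or None if the type is not
/// handled here or a required symbol is missing.
pub fn resolve(rela: &Rela, base: u64, sym_value: Option<u64>) -> Option<u64> {
    let addend = rela.r_addend as u64;
    match rela.r_type() {
        R_AARCH64_RELATIVE => Some(base.wrapping_add(addend)),
        R_AARCH64_ABS64 | R_AARCH64_GLOB_DAT | R_AARCH64_JUMP_SLOT => {
            sym_value.map(|s| s.wrapping_add(addend))
        }
        _ => None,
    }
}
```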

vectors.rs: route EL0 data/instruction aborts (EC 0x24/0x20) through the
shared demand pager. Reads FAR_EL1 for the faulting VA and decodes the ESR
fault-status code into the x86-style present bit so demand_page behaves
identically. The Flutter engine demand-pages its Dart heaps and AOT exec
regions, so without this every such access parked the core.

paging.rs: flush the stale TLB entry on aarch64 in demand_page's
already-mapped fast path (the x86 path uses invlpg).
…as pid1

Deliver the arm64 Flutter stack through the initramfs (no Limine on -M virt):
the arm64 engine .so, the aarch64 oscortex-host embedder (as /init), the arm64
AOT shell snapshot (libapp.so), icudtl.dat and flutter_assets. build.rs keeps a
staged real /init and ships libflutter_engine.so in the initramfs on aarch64.
kernel_main spawns the host as pid 1 with HOST_MODE_SHELL. mm/paging splits the
seeded kernel identity block when overlaying user pages; vectors report FAR_EL1;
diagnostics trace pid1 syscalls.
…gine from initramfs

Two bugs blocked the Flutter host on ARM:

1. ELF loader read 0xFF for the host's .rodata. translate_user_page/walk treated
   the kernel identity-map BLOCK descriptors (seeded into every per-process root)
   as table pointers, dereferencing their output PA as a table — returning bogus
   'already mapped' identity PAs. The loader then reused the identity PA (device
   memory) instead of allocating a real frame, and splitting a block to overlay
   one page left the rest of the block as valid identity sub-pages, poisoning
   every later page too. Fix: walk() returns None at a block descriptor (no 4 KiB
   leaf there); the ELF loader tracks frames IT allocated this load (BTreeMap)
   rather than querying the identity-polluted page table.

2. dlopen of libflutter_engine.so hard-failed when no Limine module was present.
   x86 ships the engine as a Limine boot module; aarch64 has no Limine and ships
   it in the initramfs. Fix: fall back to the VFS lookup when the module is absent.

Host now prints its breadcrumbs and reaches the engine dlopen on ARM.
… paging fires

The Flutter engine's DT_INIT_ARRAY jumps into the POSIX trampoline page to reach
libc symbols; that page was (a) skipped entirely on aarch64 and (b) encoded with
x86 machine code. Add an AArch64 encode_stub (movz x8,#nr; svc #0; ret and the
RetU32/RetAddr/FloatZero/MathSyscallN/SyscallRetAsArg0 shapes), enable
map_system_pages on both arches, fill stub padding with RET, and do I-cache
maintenance (dc cvau / ic ivau / isb) on the freshly written code pages.
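
The basic stub shape (movz x8,#nr; svc #0; ret) can be encoded with three fixed A64 patterns. The helper names below are illustrative, not the real encode_stub; the opcodes are the architectural encodings.

```rust
/// MOVZ Xd, #imm16 (64-bit, no shift).
pub const fn movz_x(rd: u32, imm16: u32) -> u32 {
    0xD280_0000 | ((imm16 & 0xffff) << 5) | (rd & 0x1f)
}

/// SVC #imm16.
pub const fn svc(imm16: u32) -> u32 {
    0xD400_0001 | ((imm16 & 0xffff) << 5)
}

/// RET (returns via x30).
pub const RET: u32 = 0xD65F_03C0;

/// Emit the three-instruction syscall stub as little-endian bytes.
pub fn encode_syscall_stub(nr: u16) -> [u8; 12] {
    let words = [movz_x(8, nr as u32), svc(0), RET];
    let mut out = [0u8; 12];
    for (chunk, word) in out.chunks_exact_mut(4).zip(words.iter()) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    out
}
```

Filling unused stub slots with the RET word means a stray jump into padding returns to the caller instead of running into the next stub.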

Also fix EC_DABT_LOWER: a data abort from a lower EL (EL0) is EC=0x25, not 0x25's
same-EL sibling 0x24. With the wrong constant every EL0 mmap/heap demand fault
fell through to report_unhandled instead of the demand pager. The engine now runs
its init array through dozens of pthread/locale syscalls and demand-faults its
anonymous Dart heap correctly.
… (0x25) through pager

The ARM ARM (D17.2.37) ESR_EL1.EC encodings are:
  0x24 = Data Abort from a lower EL   (EL0 user fault)
  0x25 = Data Abort without EL change (EL1 kernel touching a user VA)

efe6c25 set EC_DABT_LOWER=0x25, which is the same-EL (EL1) variant, so EL0
demand faults from the Flutter engine never matched and the pager never ran.
Fix EC_DABT_LOWER to 0x24 and add EC_DABT_CURR=0x25 routed through the same
demand pager so a syscall handler dereferencing a not-yet-paged user pointer
also resolves.
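
With the corrected constants, the exception-class routing can be sketched as follows. The `Route` enum and `route` function are hypothetical stand-ins for the dispatcher in vectors.rs; the EC values are the architectural encodings.

```rust
pub const EC_SVC64: u32 = 0x15;
pub const EC_IABT_LOWER: u32 = 0x20;
pub const EC_PC_ALIGN: u32 = 0x22;
pub const EC_DABT_LOWER: u32 = 0x24; // data abort from EL0
pub const EC_DABT_CURR: u32 = 0x25; // data abort taken without an EL change

/// Exception class: bits [31:26] of ESR_EL1.
pub const fn esr_ec(esr: u64) -> u32 {
    ((esr >> 26) & 0x3f) as u32
}

#[derive(Debug, PartialEq)]
pub enum Route {
    Syscall,
    DemandPage,
    Unhandled,
}

/// Decide which handler a synchronous exception goes to.
pub fn route(esr: u64) -> Route {
    match esr_ec(esr) {
        EC_SVC64 => Route::Syscall,
        EC_IABT_LOWER | EC_DABT_LOWER | EC_DABT_CURR => Route::DemandPage,
        _ => Route::Unhandled,
    }
}
```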
…ns for engine thread_local

The arm64 Flutter engine compiles its C++ thread_local accesses with the
TLSDESC model: 'ldr x1,[desc#0]; blr x1' where the descriptor's first word is
a resolver and the second its argument. The loader applied ABS64/GLOB_DAT/
JUMP_SLOT/RELATIVE but ignored the 14 R_AARCH64_TLSDESC relocations, so the
descriptor slots stayed zero and the very first thread_local access in
fml::MessageLoop::EnsureInitializedForCurrentThread branched to address 0
(EL0 sync abort, EC=0, ELR=0).

Resolve them statically for the single, fixed-load module:
  - emit a tiny '_dl_tlsdesc_return' stub (ldr x0,[x0,#8]; ret) into the last
    slot of the executable trampoline page (TLSDESC_RESOLVER_VA)
  - point every descriptor's word0 at it and write the variant-I TP-relative
    offset (TLS_TP_OFFSET=16 + module offset) into word1
Also handle TPREL64/DTPREL64/DTPMOD64 for completeness. x86 path untouched.
AArch64 keeps return addresses in the link register, not on the stack, so a
thread that yields inside a syscall (epoll_wait/futex/cond_wait) must resume
with x30 intact. The cooperative re-entry path (build_image) hard-wired x30=0
('zero on first entry'), so after FlutterEngineInitialize the engine's first
thread resumed from an epoll-block and its next ret branched to address 0
(EL0 sync abort, EC=0, ELR=0, x30=0).

Capture x30 at the SVC boundary into a per-CPU/per-process slot and restore it
on SYSRET re-entry, carried through the shared enter path in the otherwise-
unused rflags slot (build_image maps rflags->x30 on aarch64; SPSR is constant).
Also print x30/x1/x16/x17/pid in the unhandled-exception report for diagnosis.
x86 path untouched (rflags stays real FLAGS there).
…r, harden FB console

Three blockers past the LR fix, plus one console hardening change:
1. Embedder passed --dart-flags=--old_gen_heap_size=512,... which the engine's
   switches.cc IsAllowedDartVMFlag denylist FML_LOG(FATAL)s. Drop the switch;
   the AOT VM uses its defaults (the heap sizing was a stale x86-JIT workaround).
2. The blocking syscalls rewound the saved user PC by a hardcoded 2 bytes to
   re-execute on resume. That is correct for x86's 2-byte syscall instruction,
   but aarch64's svc is 4 bytes, so -2 left the PC mid-instruction → EC=0x22
   PC-alignment fault. Add process::SYSCALL_INSN_LEN (2 on x86, 4 on arm) + a
   save_return_context_reexec() helper and route all 14 re-exec sites through it.
3. pthread_create gave the newborn an immediate slice by entering the child
   from inside the creator's syscall (never returning, delivering r=0 later).
   On aarch64 this nested enter-while-in-syscall corrupted the creator's resume
   so the 2nd fml::Thread's pthread_create returned non-zero (thread.cc:80).
   Gate that path to x86; on arm the child runs via normal cooperative sched.
4. Harden blit_char against col/row overflow so the serial-mirror text console
   can never panic the kernel (was: a u32 overflow during heavy
   thread spawn). x86 paths unchanged.
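
The re-exec rewind from item 2 can be illustrated on the host. In the kernel the length is presumably a cfg-selected constant; here a hypothetical `Arch` enum stands in for that so both cases can be exercised in one build.

```rust
#[derive(Clone, Copy)]
pub enum Arch {
    X86_64,
    Aarch64,
}

/// Length of the syscall-entry instruction: x86 `syscall` is 2 bytes,
/// aarch64 `svc` is 4 bytes.
pub const fn syscall_insn_len(arch: Arch) -> u64 {
    match arch {
        Arch::X86_64 => 2,
        Arch::Aarch64 => 4,
    }
}

/// Rewind the saved user PC so the blocking syscall re-executes on resume.
pub fn rewind_for_reexec(saved_pc: u64, arch: Arch) -> u64 {
    saved_pc - syscall_insn_len(arch)
}
```

Rewinding an aarch64 PC by the x86 length leaves it 2 bytes off a 4-byte boundary, which is the alignment fault described above.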

Engine now spawns tid=2/3/4 and runs ~1100 serial lines before the FB-console
panic that this also fixes.
…d thread-enter off

Make blit_char/scroll_up/write_byte bail cleanly on a degenerate or corrupted
FB geometry (rows/cols == 0) instead of panicking the kernel with arithmetic
overflow — the serial-mirror console is never worth a panic. With this the
engine spawns its full UI/raster/IO thread set (tid=2/3/4) without the kernel
dying. Document why the x86 immediate-child-enter slice stays disabled on arm
(it corrupts the creator's cooperative resume → thread.cc:80 abort).
… robustness for single-core ISR; re-enable scheduler tick for engine host; preempt diagnostics
…l thread states, scan tids 1-12, raise log caps
squirelboy360 and others added 7 commits June 9, 2026 20:36
… scheduler tick can fire mid-syscall (mirrors x86 sti-on-entry); fixes engine bring-up deadlock where a spinning/cooperatively-yielding syscall masked the timer and froze preemption
…utex_waiter_remove_try) — fixes single-core IRQ-masked self-deadlock where a timer tick during spawn_thread/futex syscall spun forever on a held lock
… — fixes demand-pager self-deadlock when a fault re-enters map_user_page while the page-table lock is held (exposed by aarch64 IRQs-on-during-syscall)
…s (the aarch64-only resolver trampoline; arm unchanged) — keeps x86 green
…e engine bring-up past the early ret-to-0 / pager deadlock

Two single-core register/lock-coherence fixes that move the ARM Flutter
engine from crashing immediately after FlutterEngineInitialize to running
through FlutterEngineRunInitialized into ICU init and worker-thread spawn.

1) Eager user-GPR capture at the SVC boundary (process::mod, vectors).
   `save_full_user_gprs` read the per-CPU user-GPR snapshot LAZILY at yield
   time. Syscalls run with IRQs unmasked, so a generic-timer tick can preempt
   a thread mid-handler, switch to a sibling whose own SVC overwrites the
   shared snapshot, then switch back — and the yielding thread then persists
   the sibling's callee-saved regs / x30 into its own context (resume → `ret`
   to a stale address, e.g. x30=0 → branch to 0). Port the proven x86 fix
   (GPRS_CAPTURED flag + capture_user_gprs_at_entry): snapshot ONCE, eagerly,
   while fresh, and make later yield-time saves no-ops. On aarch64 the eager
   capture runs from inside the IRQ-masked SVC window in the vector dispatch
   (x86 stays masked until the handler sti's), so the snapshot can't be
   clobbered before it is persisted.

2) IRQ-masked outer page-table critical section (mm::paging).
   The reentrant PAGE_TABLE_LOCK depth counter is only sound if the section
   can't be interleaved by another thread. With IRQs unmasked during syscalls,
   a timer tick could preempt a lock holder mid-section; a sibling then saw a
   non-zero depth and proceeded as a bogus "nested" writer, desynchronising the
   counter from the real lock and eventually stranding the lock held while
   depth read 0 — a later outer acquire then spun forever with IRQs masked in
   the demand-abort handler (single-core deadlock, observed freezing right at
   worker-thread stack setup). Mask IRQs for the whole outer section so it is
   genuinely uninterruptible; nesting then only ever means true same-stack
   re-entry (a demand fault during a page-table walk, already IRQs-masked).
   x86 behaviour is unchanged (cfg-gated to aarch64).
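
The eager-capture idea in fix 1 can be modelled on the host without any interrupts. Everything below is a simplified sketch with hypothetical types: a `captured` flag plays the role of GPRS_CAPTURED, and the sibling's SVC is simulated by overwriting the per-CPU snapshot.

```rust
/// Simplified user register set that must survive a yield.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Gprs {
    pub callee_saved: [u64; 6],
    pub x30: u64,
}

/// Per-CPU scratch that every SVC entry overwrites.
#[derive(Default)]
pub struct PerCpu {
    pub snapshot: Gprs,
}

/// Per-thread saved context plus the "already captured" flag.
#[derive(Default)]
pub struct Thread {
    pub saved: Gprs,
    pub captured: bool,
}

impl Thread {
    /// Eager path: copy the snapshot once, at SVC entry, while it is fresh.
    pub fn capture_at_entry(&mut self, cpu: &PerCpu) {
        self.saved = cpu.snapshot;
        self.captured = true;
    }

    /// Yield-time save: a no-op if the eager capture already ran.
    pub fn save_at_yield(&mut self, cpu: &PerCpu) {
        if !self.captured {
            self.saved = cpu.snapshot;
        }
    }

    /// Clear the flag when the thread returns to user mode.
    pub fn on_return_to_user(&mut self) {
        self.captured = false;
    }
}
```

With the flag set, a sibling's syscall that overwrites the shared snapshot no longer changes what the yielding thread persists; without it, the lazy save would pick up the sibling's x30.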

Also: the unhandled-exception reporter now scans the EL0 stack for engine
return addresses (the FP chain is empty when x30/FP are 0), so a ret-to-0 can
be symbolised against libflutter_engine.so.

Status: engine now reaches ICU init / worker spawn; a residual cooperative-
yield corruption remains (nondeterministic ret-to-0 vs abort) — same class the
x86 port hardened over several commits. Not yet rendering on ARM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…shell

# Conflicts:
#	kernel/src/process/mod.rs
#	kernel/src/syscall/dispatch.rs
@squirelboy360 squirelboy360 merged commit aea673f into develop Jun 9, 2026
3 checks passed
squirelboy360 added a commit that referenced this pull request Jun 9, 2026
… init + worker spawn) (#9)

* arch/aarch64: scaffold a full backend that satisfies the shared arch surface

Replace the 35-line aarch64 stub with a module layout that mirrors the
x86_64 backend so the rest of the kernel's crate::arch::* surface resolves
on aarch64. Every function/type/const the shared kernel imports through the
arch facade now exists here with the matching signature; bodies are
compilable scaffolds (no-ops / sane defaults) pending the real ARM port.

Modules added under arch/aarch64/:
  cpu        FP/SIMD enable, TPIDR_EL0 TLS (set/get_fs_base), DAIF
             interrupt-mask save/restore, DMB/yield fences, xstate
             save/restore placeholders, hypervisor detection.
  memory     read/write_cr3 mapped onto TTBR0_EL1 (+ TLB maintenance).
  gdt        nominal USER_CS/USER_DS selector constants (no segmentation).
  idt        VBAR_EL1 exception-vector install entry points; re-exports
             the cross-arch InterruptFrame.
  apic       GIC + generic-timer scaffold: eoi, init_bsp/ap,
             finish_xapic_init, local_apic_id (from MPIDR_EL1),
             send_resched_ipi, and the vsync cadence accessors.
  syscall    SVC fast-path scaffold: per-CPU user-GPR scratch with the
             UserGprSnapshot type and user_rsp/rip/r9/rbp/user_gprs
             accessors, set_active_stack_top, init/init_ap.
  smp        per-CPU table + this_cpu()/current_cpu_id() via MPIDR_EL1,
             CPU_COUNT, broadcast_resched_ipi (single-core scaffold;
             PSCI CPU_ON wake left as a TODO).
  acpi       RSDP/MADT lookup + PSCI SYSTEM_OFF shutdown placeholders.
  interrupts disable_pic no-op (no legacy PIC on aarch64).
  mod        early_init/ap_init/smp_init/halt/halt_forever/enable/
             disable_interrupts/rdtsc (CNTVCT_EL0) and AAPCS64
             context_switch + task_entry naked trampolines.

This is the scaffolding step of the ARM port: it does NOT boot. Real MMU
page tables, the EL1 exception vectors, GIC/timer programming, the SVC
entry path, and PSCI SMP bring-up are stubbed for follow-up work.

* arch: route x86-specific code in shared kernel through the arch facade

The shared kernel embedded raw x86 instructions and x86-only limine flags
directly in architecture-neutral files, which broke the aarch64-unknown-none
build. Move every such site behind crate::arch::* so both x86_64 and aarch64
compile, with x86_64 codegen left byte-for-byte identical.

User-mode entry: extract the IRETQ/SYSRETQ ring-3 transition asm out of
process::enter_user_by_pid_noreturn{,_try} into a new arch::enter_user hook
(EnterUserRegs + enter_user_iret/enter_user_sysret). The x86_64 backend keeps
the exact prior asm (verified: identical instruction/operand sequences); the
aarch64 backend stubs them pending the real EL1->EL0 ERET path. The shared
process layer keeps all PTABLE/CR3/errno logic and only calls the hook for the
final transfer.

Other shared sites rerouted through the facade:
- main.rs: gate limine MP_FLAG_X2APIC behind cfg(x86_64), 0 otherwise
- dispatch/poll/posix: _rdtsc()/inline rdtsc -> arch::rdtsc()
- posix/poll/fd/futex/engine/ipc_display: sti;hlt;cli -> arch::enable_and_halt()
- futex/posix: pause -> arch::spin_pause()
- process: cli -> arch::interrupts_disable(); mov cr3 -> arch::memory::write_cr3()
- paging: read cr3 -> arch::memory::read_cr3()
- panic: read rbp -> arch::read_frame_pointer()
- fd: poweroff loop -> arch::acpi_shutdown() (already arch-neutral)

New arch hooks: enter_user_{iret,sysret}+EnterUserRegs, interrupts_disable,
enable_and_halt, read_frame_pointer (x86_64 real, aarch64 mirror/stub).

Both targets build clean:
- aarch64-unknown-none --features arch-aarch64: 0 errors
- x86_64-unknown-none release: 0 errors

* aarch64: boot to EL1 with PL011 serial (milestone 1)

Direct QEMU -kernel boot path for the ARM port. The assembly _start
(boot.rs) parks secondary CPUs, drops EL2->EL1 if needed, enables
FP/AdvSIMD (CPACR_EL1.FPEN), sets up the boot stack, zeroes BSS, and
calls into the Rust bring-up sequence. A minimal polled PL011 UART
driver (uart.rs) at 0x09000000 is the serial debug lifeline.

The bring-up (bringup.rs) prints CurrentEL, SCTLR_EL1, MPIDR_EL1 and
CNTFRQ_EL0 over serial, confirming we land at EL1 with the MMU off and
a 62.5MHz generic timer. New aarch64.ld links the kernel for QEMU
-M virt RAM at 0x40080000 with _start as the ELF entry.

Boots cleanly under qemu-system-aarch64 -M virt -cpu cortex-a72.

* aarch64: enable the MMU with identity-mapped translation tables (milestone 2)

ARMv8-A MMU bring-up (mmu.rs): builds 512-entry L1 translation tables
with 1 GiB block descriptors for both TTBR0 (low/user) and TTBR1
(kernel high half), programs MAIR_EL1 (Normal WB + Device-nGnRE),
TCR_EL1 and enables SCTLR_EL1.M/C/I.

Two bring-up subtleties resolved on real QEMU:
- T0SZ/T1SZ = 25 (39-bit VA) so the 4 KiB-granule walk starts at L1,
  letting a single L1 table cover the whole space with 1 GiB blocks
  (T0SZ=16/48-bit would have required an L0 table -> level-0 fault).
- Kernel identity map uses AP=EL1-only; an EL0-writable kernel-code
  mapping implicitly forces PXN, which tripped a level-1 permission
  fault on the first translated instruction fetch.

Verified on qemu-system-aarch64: SCTLR_EL1=0xc5183d (M/C/I on),
a RAM read/write probe round-trips, and execution continues
translated past the enable.
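
The T0SZ arithmetic behind the first subtlety can be checked with a short calculation. A minimal sketch, assuming the 4 KiB granule (12 offset bits, 9 index bits per level); the function names are illustrative and not taken from mmu.rs.

```rust
/// Number of virtual-address bits selected by a TnSZ field.
pub fn va_bits(tnsz: u32) -> u32 {
    64 - tnsz
}

/// Starting lookup level for the 4 KiB granule: 12 page-offset bits,
/// then 9 bits resolved per level, with level 3 as the final level.
/// Intended for the usual TnSZ range (16..=39).
pub fn start_level_4k(tnsz: u32) -> u32 {
    let index_bits = va_bits(tnsz) - 12;
    let levels = (index_bits + 8) / 9; // ceiling division
    4 - levels
}

/// Address span of one 512-entry L1 table made of 1 GiB blocks.
pub fn l1_span_bytes() -> u64 {
    512u64 << 30
}
```

With T0SZ=25 the walk starts at level 1 and one L1 table spans the full 39-bit space; with T0SZ=16 the same calculation gives level 0.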

* aarch64: EL1 exception vectors + trap frame round-trip (milestone 3)

Install a 16-entry, 2KB-aligned VBAR_EL1 vector table (vectors.rs)
covering all four groups (Current-EL SP0/SPx, Lower-EL AArch64/32) x
four kinds (Sync/IRQ/FIQ/SError). Each entry saves the full integer
register file (x0-x30) plus SP_EL0, ELR_EL1, SPSR_EL1 and ESR_EL1 into
a TrapFrame, calls a Rust dispatcher tagged with the exception kind,
then restores everything and erets.

The dispatcher decodes ESR_EL1.EC and routes SVC64 to an installable
handler, IRQs to an installable IRQ handler, and reports+parks on any
unhandled exception (printing ESR/ELR/SPSR/FAR over serial).

Verified on QEMU: a deliberate synchronous trap from EL1 round-trips through
save/dispatch/restore/eret (SYNC_EL1 0->1, handler hit once) and a
callee-saved sentinel register survives the trap intact.

* aarch64: GICv2 + generic-timer periodic tick (milestones 4+5)

GICv2 driver (gic.rs) for QEMU -M virt: enables the distributor and
this core's CPU interface, with per-IRQ enable/priority/routing plus
IAR acknowledge and EOIR end-of-interrupt.

Generic-timer driver (timer.rs) programs the EL1 physical timer
(CNTP_CTL/CNTP_TVAL_EL0) for a periodic tick, routed via PPI 30 and
re-armed each interrupt. CNTFRQ_EL0 (62.5 MHz on virt) gives the rate.

The bring-up installs an IRQ handler that acknowledges at the GIC,
services the timer PPI, and EOIs. With IRQs unmasked the timer ticks:
verified 5 ticks / 5 serviced IRQs in ~61ms at the requested 100 Hz.
This is the same scheduler-tick source the x86 APIC timer drives.
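
The reload value follows directly from the counter frequency. A small sketch of the arithmetic, with hypothetical helper names (the real timer.rs may compute this differently):

```rust
/// Counter ticks to load into CNTP_TVAL_EL0 for one period at `tick_hz`.
pub fn reload_ticks(cntfrq_hz: u64, tick_hz: u64) -> u64 {
    cntfrq_hz / tick_hz
}

/// Convert a counter delta back to whole milliseconds.
pub fn ticks_to_ms(ticks: u64, cntfrq_hz: u64) -> u64 {
    ticks * 1000 / cntfrq_hz
}
```

At 62.5 MHz and 100 Hz this gives 625,000 counts per period, so five periods are nominally 50 ms of counter time.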

* aarch64: SVC syscall entry + EL0 user process servicing (milestones 6+7)

The headline ARM milestone: the kernel drops to EL0, runs a userspace
program, and services its svc #0 syscalls end to end.

enter_user.rs: real EL1->EL0 transition. Builds the full x0..x30 +
SP_EL0 + ELR_EL1 + SPSR_EL1 image from the shared (x86-named)
EnterUserRegs, mapping rdi/rsi/.. onto x0/x1/.. per the Linux aarch64
ABI, then erets into EL0. Both the IRET (timer-preempt) and SYSRET
(syscall-yield) hooks the shared process layer calls route here.
enter_el0_at() is the bring-up's direct launch path.

mmu.rs: map_user_page() installs an L1->L2->L3 walk for an EL0-RW,
EL0-executable (PXN=1/UXN=0) 4 KiB page at a free VA, backing it with
a kernel-writable physical page so the kernel can stage user code.

bringup_user.rs: assembles a tiny EL0 program (MOVZ/MOVK + svc #0
sequences) that issues write(64)/getpid(172)/exit(93) syscalls. The
SVC handler reads x8 (nr) + x0..x2 (args) from the saved TrapFrame,
services the call, writes the return into x0, and erets back to EL0.

psci.rs: PSCI SYSTEM_OFF / CPU_ON helpers (conduit-gated on EL>=2).

Verified on qemu-system-aarch64 -M virt -cpu cortex-a72: the EL0
program prints over serial via syscalls, getpid returns a value, and
exit(7) is observed (writes=3, exit_code=7). Full chain boots:
EL1 -> MMU -> vectors -> GIC+timer -> EL0 -> syscalls -> exit.
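
The SVC servicing step described for bringup_user.rs can be modelled on the host as below. This is a sketch, not the kernel's handler: `TrapFrame`, `SvcOutcome` and `handle_svc` are hypothetical, user memory is modelled as a byte slice indexed by offset, and the pid and error values are placeholders.

```rust
/// Hypothetical saved integer register file (x0..x30).
pub struct TrapFrame {
    pub x: [u64; 31],
}

// Linux aarch64 syscall numbers quoted in the commit message.
pub const SYS_WRITE: u64 = 64;
pub const SYS_EXIT: u64 = 93;
pub const SYS_GETPID: u64 = 172;

#[derive(Debug, PartialEq)]
pub enum SvcOutcome {
    Resume,
    Exit(u64),
}

/// Reads nr from x8 and args from x0..x2, services the call, and writes
/// the return value into x0. ELR already points past the svc, so the
/// return address needs no adjustment.
pub fn handle_svc(tf: &mut TrapFrame, console: &mut Vec<u8>, user_mem: &[u8]) -> SvcOutcome {
    match tf.x[8] {
        SYS_WRITE => {
            let (ptr, len) = (tf.x[1] as usize, tf.x[2] as usize);
            // Bounds-check the user buffer before copying from it.
            match ptr.checked_add(len).and_then(|end| user_mem.get(ptr..end)) {
                Some(buf) => {
                    console.extend_from_slice(buf);
                    tf.x[0] = len as u64;
                }
                None => tf.x[0] = (-14i64) as u64, // -EFAULT
            }
            SvcOutcome::Resume
        }
        SYS_GETPID => {
            tf.x[0] = 1; // placeholder pid
            SvcOutcome::Resume
        }
        SYS_EXIT => SvcOutcome::Exit(tf.x[0]),
        _ => {
            tf.x[0] = (-38i64) as u64; // -ENOSYS
            SvcOutcome::Resume
        }
    }
}
```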

* aarch64: wire SVC capture into the shared per-CPU user-GPR snapshot

The aarch64 SVC handler now calls syscall::capture_from_trap(), which
stashes the trapping EL0 thread's registers into the per-CPU scratch
that backs the architecture-neutral accessors arch::syscall::user_rip
/ user_rsp / user_gprs the shared dispatch_fast path reads. AArch64
registers map onto the x86-named slots per the Linux aarch64 ABI
(x0..x5 -> arg regs, SP_EL0 -> user_rsp, ELR_EL1 -> user_rip,
x19..x23/x29 -> callee-saved slots).

Verified on QEMU: after the first EL0 svc, the shared accessors report
user_rip=0x400000018 (past the svc) and user_rsp=0x400000f00 (the EL0
stack) — the exact contract dispatch_fast consumes. x86_64 build
unaffected.

* aarch64: wake a secondary core via PSCI CPU_ON (milestone 8)

Best-effort SMP bring-up (bringup_smp.rs): issues PSCI CPU_ON over the
QEMU -M virt HVC conduit to start CPU 1 at a physical secondary entry
stub (__secondary_entry) which enables FP, sets up its own stack, and
calls into Rust to record its MPIDR and report online over serial.

psci.rs cpu_on now uses HVC (the active virt conduit). The EL1 sync
vector treats an Undefined-instruction trap (EC=0, e.g. a PSCI probe
where no conduit is active) as a graceful no-op: it sets x0 to PSCI
NOT_SUPPORTED and steps ELR past the faulting instruction instead of
parking, so SMP probing never wedges the single-core path.

Verified on QEMU -smp 2: PSCI CPU_ON returns 0, CPU 1 comes online and
reports MPIDR=1; with -smp 1 the probe returns gracefully and the
single-core EL0 path still completes. Full per-CPU scheduling on the
secondary is a follow-up; the BSP already runs userspace.

x86_64 build unaffected.

* aarch64: add run-aarch64.sh to build + boot the ARM bring-up under QEMU

Convenience wrapper that builds the aarch64 kernel and boots it with
qemu-system-aarch64 -M virt -cpu cortex-a72 (single core, or smp2 to
exercise the PSCI CPU_ON path), wiring PL011 serial to the terminal.

* aarch64: route -kernel boot into the shared production kernel_main

The ARM port previously ran only the self-contained bring-up demo. This wires
the proven arch primitives into the SAME kernel_main the x86 path uses, so the
real subsystem init runs on ARM.

  * FDT/DTB reader (arch/aarch64/fdt.rs): a small hand-rolled flattened
    device-tree parser (no external crate) that finds the /memory node(s) for
    RAM discovery. _start now preserves the x0 DTB pointer (saved in x20) and
    passes it through aarch64_start.

  * Neutral boot memory map (mm::BootMemMap): mm::init no longer reads Limine
    directly. Both arches translate their source — x86 from the Limine memmap,
    aarch64 from the device tree — into a fixed-capacity region list consumed by
    the shared mm::init_from_regions. x86 behaviour is byte-equivalent.

  * Production ARM boot path (arch/aarch64/boot_prod.rs): brings up PL011 serial,
    the MMU (identity map), EL1 exception vectors, and the GIC, parses the DTB,
    then calls the shared kernel_main_arch.

  * kernel_main split: the Limine-coupled prologue stays in the x86-only
    kernel_main; the arch-neutral subsystem init + init-process spawn + idle loop
    move to shared_init_and_run, called by both arches. Limine request statics
    are now x86-only so the ARM kernel carries no dead boot section.

  * PL011 wired into the shared logger so early_print + the log framework reach
    serial on aarch64.

Verified: production kernel_main runs end-to-end on qemu-system-aarch64 -M virt
through MMU, frame allocator (2048 MiB discovered), heap, scheduler, security,
IPC, WM, VFS, drivers, and the full Cortex AI runtime, reaching the init-process
spawn step. x86_64 kernel still builds (release ELF unchanged in behaviour).

* aarch64: spawn the init process at EL0 and service its syscalls

Completes the ARM port's headline goal: the PRODUCTION kernel_main now spawns a
userspace process and services its first syscalls through the SHARED dispatcher.

  * aarch64 user page-table walker (mm/paging.rs): a real 4 KiB-granule L1→L3
    TTBR0 walker replaces the non-x86 stub — alloc_user_pml4, map_page_in,
    translate/update/unmap, free. Each per-process root is seeded with a copy of
    the kernel's identity-map L1 block descriptors so the kernel stays mapped
    after write_cr3 loads the process root (the bring-up kernel runs from the
    TTBR0 low half). free_user_pml4 skips those shared block descriptors so it
    never frees kernel RAM.

  * SVC → shared dispatch (arch/aarch64/syscall.rs): the production SVC handler
    captures the EL0 trap frame, calls syscall::dispatch_fast(x8, x0..x4), writes
    the result back to x0, and on exit/exit_group hands off to the next runnable
    process (or parks). Installed in arch::early_init.

  * ELF loader (process/elf.rs): accept EM_AARCH64 on ARM via an arch-selected
    EM_NATIVE machine check.

  * Process spawn (process/mod.rs): arch-conditional USER_STACK_TOP placed inside
    the 39-bit aarch64 VA window; the x86 glibc POSIX trampoline mapping is
    skipped on ARM (bare EL0 program, no libc).

  * Scheduler (sched/mod.rs): spawn_kernel_task builds an AAPCS64 first-run frame
    on aarch64 (x19..x30 + fn-ptr slot matching context_switch/task_entry),
    instead of the x86 register layout.

  * Native /init (build.rs): for aarch64 builds, embed a tiny EL0 ELF (linked at
    64 GiB to clear the kernel identity blocks) that does write + getpid + write +
    exit using the OSCortex syscall numbers.

  * main.rs: on aarch64 the init is entered directly via enter_user_by_pid_noreturn
    (the real write_cr3 → ERET path) since the ARM preemptive timer ISR is a
    follow-on; x86 keeps the schedule_user_launch hand-off.

Verified on qemu-system-aarch64 -M virt: init spawns (pid=1, EL0 entry), prints
two lines via write(1,...), getpid returns, exit(0) reaps cleanly and the kernel
parks — all through the shared syscall layer, no faults. x86_64 debug + release
still build green; no /init is injected on x86.

* aarch64: robust DTB discovery — x0, RAM probe, then arch default

QEMU's -M virt passes the DTB pointer in x0 for a flat/Image boot, but for an
ELF -kernel it may leave x0 = 0 and place the blob at an image-dependent address.
boot_prod now: (1) uses x0 when it points at a valid FDT (the spec mechanism,
also correct on real hardware), (2) probes the RAM base + a bounded low window
for the FDT magic, (3) falls back to the QEMU virt default (2 GiB @ 0x40000000,
matching -m) when neither yields a /memory node. Adds fdt::scan_for_dtb for the
magic-scan path. The parser itself is unchanged and validated against the
machine's real device tree.
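
The magic-scan fallback can be sketched as follows. The FDT header starts with the big-endian magic 0xd00dfeed followed by a big-endian totalsize; everything else here (the signature, the step parameter, the sanity check) is an assumption for illustration and not the actual fdt::scan_for_dtb.

```rust
/// Flattened-device-tree header magic (stored big-endian).
pub const FDT_MAGIC: u32 = 0xd00d_feed;
/// Size of the FDT header (ten 32-bit fields).
pub const FDT_HEADER_LEN: usize = 40;

/// Read a big-endian u32 at `off`, or None if it does not fit.
fn be32(b: &[u8], off: usize) -> Option<u32> {
    let s = b.get(off..off.checked_add(4)?)?;
    Some(u32::from_be_bytes([s[0], s[1], s[2], s[3]]))
}

/// Scan `window` at `step`-byte strides for a plausible FDT header and
/// return its offset. A hit must also carry a totalsize that covers the
/// header and fits inside the window.
pub fn scan_for_dtb(window: &[u8], step: usize) -> Option<usize> {
    if step == 0 {
        return None;
    }
    let mut off = 0;
    while off < window.len() {
        if be32(window, off) == Some(FDT_MAGIC) {
            if let Some(total) = be32(window, off + 4) {
                let total = total as usize;
                let fits = off
                    .checked_add(total)
                    .map_or(false, |end| end <= window.len());
                if total >= FDT_HEADER_LEN && fits {
                    return Some(off);
                }
            }
        }
        off += step;
    }
    None
}
```

Checking totalsize as well as the magic is a cheap way to reject stray 0xd00dfeed words in RAM; a fuller validation would also check the version fields.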

* aarch64: document the production boot path in run-aarch64.sh

The ARM `-kernel` boot now routes into the shared kernel_main and spawns
userspace; update the script header to describe that instead of only the
self-contained bring-up demo.

* mm: widen BootMemMap capacity to 128 regions

Ensures a fragmented x86 Limine memory map never drops usable RAM when
translating into the architecture-neutral region list.
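
A fixed-capacity region list of this kind might look like the sketch below. The type and method names are hypothetical (the real mm::BootMemMap is not shown in this PR text); the point is that a full list has to count what it drops rather than lose it silently.

```rust
/// One usable RAM region (hypothetical layout).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Region {
    pub base: u64,
    pub len: u64,
}

/// Fixed-capacity, allocation-free region list with capacity N.
pub struct BootMemMap<const N: usize> {
    regions: [Region; N],
    count: usize,
    dropped: usize,
}

impl<const N: usize> BootMemMap<N> {
    pub const fn new() -> Self {
        Self {
            regions: [Region { base: 0, len: 0 }; N],
            count: 0,
            dropped: 0,
        }
    }

    /// Append a region; returns false (and counts the drop) when full.
    pub fn push(&mut self, r: Region) -> bool {
        if self.count == N {
            self.dropped += 1;
            return false;
        }
        self.regions[self.count] = r;
        self.count += 1;
        true
    }

    /// Total bytes across the regions that were kept.
    pub fn usable_bytes(&self) -> u64 {
        self.regions[..self.count].iter().map(|r| r.len).sum()
    }

    /// Number of regions that did not fit.
    pub fn dropped(&self) -> usize {
        self.dropped
    }
}
```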

* aarch64: ramfb framebuffer via fw_cfg — display-capable on -M virt

QEMU -M virt ships no Limine framebuffer (Limine is the x86 boot protocol),
so the ARM port had no display. Add a fw_cfg DMA reader/writer and use it to
configure QEMU's -device ramfb:

  * kernel/src/arch/aarch64/ramfb.rs: walk the fw_cfg file directory to find
    the etc/ramfb selector, allocate a 1280x800 XRGB8888 framebuffer in
    identity-mapped RAM, and DMA-write the big-endian RAMFBCfg into it.
    Two gotchas handled: SELECT and WRITE must be separate DMAs (a combined
    control word is rejected), and RamfbCfg must be repr(C, packed) so the
    write length is exactly the 28-byte file size QEMU expects.

  * drivers/fb.rs: add init_raw(addr,w,h,pitch) — an arch-neutral entry point
    so the existing fb console/compositor (fed from Limine on x86) works
    unchanged on ARM from a raw buffer.

  * main.rs (kernel_main_arch): bring ramfb up after the frame allocator is
    online, publish it to the shared fb, and paint a boot fill.

  * scripts/run-aarch64.sh: add -device ramfb; DISPLAY_MODE=none|cocoa|gtk|sdl
    with a headless monitor socket for screendump capture.

Validated on qemu-system-aarch64: monitor screendump is now 1280x800 (was the
640x480 default), with the painted teal top band (20,184,166), navy body
(26,26,46) and white boot-console glyphs all present — pixels the kernel wrote
are scanned out by ramfb.
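
The 28-byte layout constraint can be demonstrated on the host. A sketch assuming the field order of QEMU's ramfb configuration (addr, fourcc, flags, width, height, stride, all big-endian); the constructor and `as_bytes` helper are illustrative, not the code in ramfb.rs.

```rust
/// ramfb configuration blob. Without `packed`, repr(C) would pad the
/// struct to 32 bytes because of the leading u64.
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct RamfbCfg {
    addr: u64,
    fourcc: u32,
    flags: u32,
    width: u32,
    height: u32,
    stride: u32,
}

/// DRM fourcc for XRGB8888 ('X','R','2','4', little-endian packed).
pub const FOURCC_XRGB8888: u32 = u32::from_le_bytes(*b"XR24");

impl RamfbCfg {
    /// Build the config with every field stored big-endian.
    pub fn new(addr: u64, width: u32, height: u32) -> Self {
        Self {
            addr: addr.to_be(),
            fourcc: FOURCC_XRGB8888.to_be(),
            flags: 0,
            width: width.to_be(),
            height: height.to_be(),
            stride: (width * 4).to_be(), // 4 bytes per XRGB8888 pixel
        }
    }

    /// Raw bytes as they would be handed to the fw_cfg DMA write.
    pub fn as_bytes(&self) -> [u8; 28] {
        // Sound here: the struct is packed, 28 bytes, integers only.
        unsafe { core::mem::transmute::<RamfbCfg, [u8; 28]>(*self) }
    }
}
```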

* aarch64: timer-driven preemption — shared cooperative scheduler hand-off

Wire the ARM generic-timer IRQ to the SAME shared cooperative-scheduler
hand-off the x86 APIC-timer ISR uses, so EL0 threads (the Flutter engine's
threads, ultimately) are timer-preempted and round-robined — collapsing the
ARM 'enter init directly' path onto x86's timer-driven model.

  * arch/aarch64/apic.rs: production_irq_handler — on a generic-timer PPI taken
    from EL0, apply the same quantum + input/focus preempt policy as x86, map
    the live EL1 vector TrapFrame to the shared UserRegs, call the shared
    process::timer_preempt_switch_try (arch-neutral: PTABLE bookkeeping, CR3/
    TTBR0, fs_base, xstate, current_pid), then write the next thread's frame
    back into the TrapFrame so the vector stub's RESTORE_FRAME+eret enters it.
    init_bsp only installs the handler; start_scheduler_tick arms the timer.

  * Full-fidelity context: the shared UserRegs carries only the x86-named
    register subset, so each thread's complete ARM frame (x0..x30/SP_EL0/ELR/
    SPSR) round-trips through a new per-process arch_trapframe slot
    (process::arch_store/take_trapframe, try_lock — ISR-safe). Scratch
    registers (x6/x7, x11–x18, x24–x28, x30) survive arbitrary-instruction
    preemption.

  * IRQ enablement: IRQs are NOT unmasked at EL1; the eret into EL0 (SPSR=EL0t,
    I/F clear) enables them at the userspace boundary. This closes a sporadic
    boot hang where a timer fired mid enter_user_by_pid_noreturn (during the
    CR3 switch / image build) — boots went from ~50% hanging to 5/5 clean.

  * mm: reserve the kernel image in the frame allocator on ARM. QEMU -kernel
    loads the image (0x40080000) into RAM the device tree reports usable from
    0x40000000; the allocator marked it all free, so the first large contiguous
    alloc (the 64 MiB heap, then the compositor's 4 MiB double-buffer) overran
    live kernel code — a latent corruption that hung the first big alloc once a
    framebuffer made the compositor enable double-buffering. frame_allocator::
    reserve_range carves [RAM base..__kernel_end] before the heap is built.

  * build.rs + main.rs: the ARM /init program is now a compute-bound loop
    (busy-spins with NO syscall between ticks, prints a per-instance tag for a
    bounded budget, then exits). main.rs spawns TWO instances (A=x0:0, B=x0:1);
    since neither yields cooperatively, interleaved A/B output on serial can
    ONLY come from the timer preempting + switching between them.

Validated on qemu-system-aarch64 (5/5 boots): both EL0 threads make progress
with fair, interleaved ticks (e.g. A=12 B=12, 7-8 A<->B switches) then exit
cleanly. The longer compute-budget variant showed thousands of balanced ticks
(A=3341/B=3350, 2148 switches). x86 builds clean (no regression).

* aarch64: port the Flutter embedder host to the ARM syscall ABI

Build oscortex-host for aarch64-unknown-none. The syscall stubs (sys.rs)
gain an SVC #0 backend (nr in x8, args x0..x5, return x0) alongside the
existing x86 SYSCALL path; the typed wrappers stay architecture-neutral.

_start gains an AArch64 naked entry: zero x29/x30, 16-byte-align SP via a
scratch GPR, preserve the kernel's bootstrap args (x0=host_mode, x1=app_id,
x2=aot_va) across the breadcrumb write, then call main_embedder(x0,x1,x2).
main_embedder now takes those three as C arguments (dropping the brittle
read-rdi/rsi/rdx asm), and the monotonic clock reads CNTVCT_EL0/CNTFRQ_EL0
on ARM instead of RDTSC. A user_aarch64.ld linker script + an
aarch64-unknown-none .cargo target (static reloc) load it at 0x400000.

Produces a 0x400000-based static AArch64 ELF, mirroring the x86 host.

* aarch64: dynamic-loader relocations + EL0 demand-paging for the engine

dl.rs: accept the aarch64 engine .so. Add R_AARCH64_RELATIVE/ABS64/
GLOB_DAT/JUMP_SLOT alongside the x86 relocation set (same RELA encoding,
selected by type number), gate the ELF e_machine check on the build arch,
and arch-gate the x86-only engine byte-patch block (x86 opcodes/offsets
must never run against the ARM engine). One dlopen path now loads either
the x86 or the ARM libflutter_engine.so.
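
The four aarch64 relocation types named above all reduce to simple sums. A minimal sketch using the standard ELF type numbers; the `Rela` struct and `reloc_value` helper are hypothetical and do not reflect dl.rs's actual structure.

```rust
// AArch64 ELF relocation type numbers.
pub const R_AARCH64_ABS64: u32 = 257;
pub const R_AARCH64_GLOB_DAT: u32 = 1025;
pub const R_AARCH64_JUMP_SLOT: u32 = 1026;
pub const R_AARCH64_RELATIVE: u32 = 1027;

/// One Elf64_Rela entry.
pub struct Rela {
    pub offset: u64,
    pub info: u64,
    pub addend: i64,
}

impl Rela {
    /// Relocation type: low 32 bits of r_info.
    pub fn r_type(&self) -> u32 {
        (self.info & 0xffff_ffff) as u32
    }
    /// Symbol index: high 32 bits of r_info.
    pub fn r_sym(&self) -> u32 {
        (self.info >> 32) as u32
    }
}

/// Value to store at `load_base + offset`, or None for a type this
/// sketch does not handle. `sym_addr` is the resolved symbol address.
pub fn reloc_value(r: &Rela, load_base: u64, sym_addr: u64) -> Option<u64> {
    match r.r_type() {
        R_AARCH64_RELATIVE => Some(load_base.wrapping_add(r.addend as u64)),
        R_AARCH64_ABS64 | R_AARCH64_GLOB_DAT | R_AARCH64_JUMP_SLOT => {
            Some(sym_addr.wrapping_add(r.addend as u64))
        }
        _ => None,
    }
}
```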

vectors.rs: route EL0 data/instruction aborts (EC 0x24/0x20) through the
shared demand pager. Reads FAR_EL1 for the faulting VA and decodes the ESR
fault-status code into the x86-style present bit so demand_page behaves
identically. The Flutter engine demand-pages its Dart heaps and AOT exec
regions, so without this every such access parked the core.

paging.rs: flush the stale TLB entry on aarch64 in demand_page's
already-mapped fast path (the x86 path uses invlpg).

* aarch64 shell: stage engine+host+snapshot into initramfs, spawn host as pid1

Deliver the arm64 Flutter stack through the initramfs (no Limine on -M virt):
the arm64 engine .so, the aarch64 oscortex-host embedder (as /init), the arm64
AOT shell snapshot (libapp.so), icudtl.dat and flutter_assets. build.rs keeps a
staged real /init and ships libflutter_engine.so in the initramfs on aarch64.
kernel_main spawns the host as pid 1 with HOST_MODE_SHELL. mm/paging splits the
seeded kernel identity block when overlaying user pages; vectors report FAR_EL1;
diagnostics trace pid1 syscalls.

* aarch64: fix user-page mapping over seeded identity blocks + serve engine from initramfs

Two bugs blocked the Flutter host on ARM:

1. ELF loader read 0xFF for the host's .rodata. translate_user_page/walk treated
   the kernel identity-map BLOCK descriptors (seeded into every per-process root)
   as table pointers, dereferencing their output PA as a table — returning bogus
   'already mapped' identity PAs. The loader then reused the identity PA (device
   memory) instead of allocating a real frame, and splitting a block to overlay
   one page left the rest of the block as valid identity sub-pages, poisoning
   every later page too. Fix: walk() returns None at a block descriptor (no 4 KiB
   leaf there); the ELF loader tracks frames IT allocated this load (BTreeMap)
   rather than querying the identity-polluted page table.

2. dlopen of libflutter_engine.so hard-failed when no Limine module was present.
   x86 ships the engine as a Limine boot module; aarch64 has no Limine and ships
   it in the initramfs. Fix: fall back to the VFS lookup when the module is absent.
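
The block-versus-table distinction behind the first bug comes down to the low two descriptor bits. A sketch for the 4 KiB granule; `classify` and `next_table` are illustrative names, not the walker in mm/paging.rs.

```rust
#[derive(Debug, PartialEq)]
pub enum Desc {
    Invalid,
    Block,
    Table,
    Page,
}

/// Classify a stage-1 descriptor at `level` (4 KiB granule).
/// Bit 0 is "valid"; bit 1 selects table/page (1) versus block (0).
pub fn classify(desc: u64, level: u32) -> Desc {
    let bits = desc & 0b11;
    if bits & 1 == 0 {
        return Desc::Invalid;
    }
    if bits == 0b11 {
        return if level == 3 { Desc::Page } else { Desc::Table };
    }
    // bits == 0b01: a block, which this sketch accepts at L1 and L2 only.
    if level == 1 || level == 2 {
        Desc::Block
    } else {
        Desc::Invalid
    }
}

/// Next-level table address, or None when there is no table to follow.
/// A block descriptor's output address is RAM, never a table pointer.
pub fn next_table(desc: u64, level: u32) -> Option<u64> {
    match classify(desc, level) {
        Desc::Table => Some(desc & 0x0000_ffff_ffff_f000),
        _ => None,
    }
}
```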

Host now prints its breadcrumbs and reaches the engine dlopen on ARM.

* aarch64: native syscall trampolines + fix EL0 data-abort EC so demand paging fires

The Flutter engine's DT_INIT_ARRAY jumps into the POSIX trampoline page to reach
libc symbols; that page was (a) skipped entirely on aarch64 and (b) encoded with
x86 machine code. Add an AArch64 encode_stub (movz x8,#nr; svc #0; ret and the
RetU32/RetAddr/FloatZero/MathSyscallN/SyscallRetAsArg0 shapes), enable
map_system_pages on both arches, fill stub padding with RET, and do I-cache
maintenance (dc cvau / ic ivau / isb) on the freshly written code pages.
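
The basic stub shape (movz x8,#nr; svc #0; ret) encodes to three fixed-width words. A sketch of that encoding only; the function names are hypothetical and the other stub shapes (RetU32, RetAddr and so on) are not modelled.

```rust
/// Encode `movz x<rd>, #imm16` (64-bit form, no shift).
pub fn movz_x(rd: u32, imm16: u16) -> u32 {
    0xD280_0000 | ((imm16 as u32) << 5) | (rd & 0x1f)
}

/// `svc #0`
pub const SVC_0: u32 = 0xD400_0001;
/// `ret` (returns via x30)
pub const RET: u32 = 0xD65F_03C0;

/// Little-endian bytes of: movz x8,#nr; svc #0; ret.
pub fn encode_syscall_stub(nr: u16) -> [u8; 12] {
    let words = [movz_x(8, nr), SVC_0, RET];
    let mut out = [0u8; 12];
    for (i, w) in words.iter().enumerate() {
        out[i * 4..i * 4 + 4].copy_from_slice(&w.to_le_bytes());
    }
    out
}

/// Fill unused stub space with RET so a stray call returns immediately.
pub fn fill_padding_with_ret(padding: &mut [u8]) {
    for chunk in padding.chunks_exact_mut(4) {
        chunk.copy_from_slice(&RET.to_le_bytes());
    }
}
```

On real hardware the written bytes still need the cache maintenance described above before they are executed; that step has no host-side equivalent in this sketch.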

Also fix EC_DABT_LOWER: a data abort from a lower EL (EL0) is EC=0x25, not 0x25's
same-EL sibling 0x24. With the wrong constant every EL0 mmap/heap demand fault
fell through to report_unhandled instead of the demand pager. The engine now runs
its init array through dozens of pthread/locale syscalls and demand-faults its
anonymous Dart heap correctly.

* aarch64: correct EL0 data-abort EC to 0x24 + route EL1 user-VA aborts (0x25) through pager

The ARM ARM (D17.2.37) ESR_EL1.EC encodings are:
  0x24 = Data Abort from a lower EL   (EL0 user fault)
  0x25 = Data Abort without EL change (EL1 kernel touching a user VA)

efe6c25 set EC_DABT_LOWER=0x25, which is the same-EL (EL1) variant, so EL0
demand faults from the Flutter engine never matched and the pager never ran.
Fix EC_DABT_LOWER to 0x24 and add EC_DABT_CURR=0x25 routed through the same
demand pager so a syscall handler dereferencing a not-yet-paged user pointer
also resolves.
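
The corrected routing can be summarised in a few lines. A sketch only: the EC values are the architectural encodings quoted above, while `Route`, `route` and `is_permission_fault` are hypothetical names, and mapping a permission fault to the x86-style present bit is an assumption about how the pager consumes it.

```rust
// ESR_EL1.EC values (bits [31:26]).
pub const EC_SVC64: u32 = 0x15;
pub const EC_IABT_LOWER: u32 = 0x20;
pub const EC_PC_ALIGN: u32 = 0x22;
pub const EC_DABT_LOWER: u32 = 0x24;
pub const EC_DABT_CURR: u32 = 0x25;

#[derive(Debug, PartialEq)]
pub enum Route {
    Syscall,
    DemandPage,
    Unhandled,
}

/// Extract the exception class from ESR_EL1.
pub fn ec(esr: u64) -> u32 {
    ((esr >> 26) & 0x3f) as u32
}

/// Route a synchronous exception by its class.
pub fn route(esr: u64) -> Route {
    match ec(esr) {
        EC_SVC64 => Route::Syscall,
        EC_IABT_LOWER | EC_DABT_LOWER | EC_DABT_CURR => Route::DemandPage,
        _ => Route::Unhandled,
    }
}

/// True when the fault-status code (ISS bits [5:0]) is a permission
/// fault, i.e. the page was mapped but the access was not allowed.
pub fn is_permission_fault(esr: u64) -> bool {
    matches!(esr & 0x3f, 0x0c..=0x0f)
}
```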

* aarch64: implement R_AARCH64_TLSDESC (+TPREL/DTPREL/DTPMOD) relocations for engine thread_local

The arm64 Flutter engine compiles its C++ thread_local accesses with the
TLSDESC model: 'ldr x1,[desc#0]; blr x1' where the descriptor's first word is
a resolver and the second its argument. The loader applied ABS64/GLOB_DAT/
JUMP_SLOT/RELATIVE but ignored the 14 R_AARCH64_TLSDESC relocations, so the
descriptor slots stayed zero and the very first thread_local access in
fml::MessageLoop::EnsureInitializedForCurrentThread branched to address 0
(EL0 sync abort, EC=0, ELR=0).

Resolve them statically for the single, fixed-load module:
  - emit a tiny '_dl_tlsdesc_return' stub (ldr x0,[x0,#8]; ret) into the last
    slot of the executable trampoline page (TLSDESC_RESOLVER_VA)
  - point every descriptor's word0 at it and write the variant-I TP-relative
    offset (TLS_TP_OFFSET=16 + module offset) into word1
Also handle TPREL64/DTPREL64/DTPMOD64 for completeness. x86 path untouched.
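
The static resolution scheme can be modelled without any assembly. In this sketch `TlsDesc`, `fill_static` and `tls_address` are hypothetical names, `offset_in_module` stands for the variable's offset inside the module's TLS block, and the resolver is modelled as a plain function rather than code in the trampoline page.

```rust
/// Variant-I offset from the thread pointer to the module's TLS block.
pub const TLS_TP_OFFSET: u64 = 16;

/// Two-word TLS descriptor: resolver pointer, then its argument.
#[repr(C)]
#[derive(Default)]
pub struct TlsDesc {
    pub resolver: u64,
    pub arg: u64,
}

/// Static resolution for a single fixed-load module: point word0 at the
/// shared return stub and precompute the TP-relative offset in word1.
pub fn fill_static(desc: &mut TlsDesc, resolver_va: u64, offset_in_module: u64) {
    desc.resolver = resolver_va;
    desc.arg = TLS_TP_OFFSET + offset_in_module;
}

/// Models the stub `ldr x0,[x0,#8]; ret`: return the descriptor's arg.
pub fn tlsdesc_return(desc: &TlsDesc) -> u64 {
    desc.arg
}

/// Address the compiled access sequence ends up with: TP + offset.
pub fn tls_address(tp: u64, desc: &TlsDesc) -> u64 {
    tp.wrapping_add(tlsdesc_return(desc))
}
```

A descriptor left at zero has `resolver == 0`, which is exactly the branch-to-0 crash described above.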

* aarch64: preserve user x30 (LR) across cooperative syscall yields

AArch64 keeps return addresses in the link register, not on the stack, so a
thread that yields inside a syscall (epoll_wait/futex/cond_wait) must resume
with x30 intact. The cooperative re-entry path (build_image) hard-wired x30=0
('zero on first entry'), so after FlutterEngineInitialize the engine's first
thread resumed from an epoll-block and its next `ret` branched to address 0
(EL0 sync abort, EC=0, ELR=0, x30=0).

Capture x30 at the SVC boundary into a per-CPU/per-process slot and restore it
on SYSRET re-entry, carried through the shared enter path in the otherwise-
unused rflags slot (build_image maps rflags->x30 on aarch64; SPSR is constant).
Also print x30/x1/x16/x17/pid in the unhandled-exception report for diagnosis.
x86 path untouched (rflags stays real FLAGS there).

* aarch64: arch-correct syscall re-exec rewind, drop nested thread-enter, harden FB console

Four issues past the LR fix:
1. Embedder passed --dart-flags=--old_gen_heap_size=512,... which the engine's
   switches.cc IsAllowedDartVMFlag denylist FML_LOG(FATAL)s. Drop the switch;
   the AOT VM uses its defaults (the heap sizing was a stale x86-JIT workaround).
2. The blocking syscalls rewound the saved user PC by a hardcoded 2 bytes to
   re-execute on resume: correct for x86's 2-byte `syscall`, but aarch64's
   `svc #0` is 4 bytes, so -2 left the PC mid-instruction → EC=0x22 PC-alignment
   fault. Add process::SYSCALL_INSN_LEN (2 on x86, 4 on arm) + a
   save_return_context_reexec() helper and route all 14 re-exec sites through it.
3. pthread_create gave the newborn an immediate slice by entering the child
   from inside the creator's syscall (never returning, delivering r=0 later).
   On aarch64 this nested enter-while-in-syscall corrupted the creator's resume
   so the 2nd fml::Thread's pthread_create returned non-zero (thread.cc:80).
   Gate that path to x86; on arm the child runs via normal cooperative sched.
4. Harden blit_char against col/row overflow so the serial-mirror text console
   can never panic the kernel (was: a u32 overflow during heavy
   thread spawn). x86 paths unchanged.
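
The rewind in item 2 is a one-line subtraction once the instruction length is a parameter. A sketch with hypothetical helper names (the kernel's `save_return_context_reexec` is not reproduced here):

```rust
/// Length of x86-64 `syscall` (0F 05).
pub const X86_SYSCALL_LEN: u64 = 2;
/// Length of AArch64 `svc #0` (all A64 instructions are 4 bytes).
pub const AARCH64_SVC_LEN: u64 = 4;

/// Move a saved user PC back onto the trapping instruction so the
/// syscall is re-executed on resume.
pub fn rewind_pc(pc_after: u64, insn_len: u64) -> u64 {
    pc_after.wrapping_sub(insn_len)
}

/// A64 instruction fetch requires a 4-byte-aligned PC.
pub fn is_a64_aligned(pc: u64) -> bool {
    pc % 4 == 0
}
```

Rewinding an aarch64 PC by the x86 length yields a misaligned address, which is the PC-alignment fault described above.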

Engine now spawns tid=2/3/4 and runs ~1100 serial lines before the FB-console
panic that this also fixes.

* aarch64: harden FB text console against geometry overflow; keep nested thread-enter off

Make blit_char/scroll_up/write_byte bail cleanly on a degenerate or corrupted
FB geometry (rows/cols == 0) instead of panicking the kernel with arithmetic
overflow — the serial-mirror console is never worth a panic. With this the
engine spawns its full UI/raster/IO thread set (tid=2/3/4) without the kernel
dying. Document why the x86 immediate-child-enter slice stays disabled on arm
(it corrupts the creator's cooperative resume → thread.cc:80 abort).
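
One way to make the glyph-offset arithmetic panic-free is to use checked operations and bail with `None`. This is a sketch under assumptions: the `Geometry` fields and `glyph_origin` are hypothetical, and 4 bytes per pixel is assumed from the XRGB8888 framebuffer.

```rust
/// Hypothetical text-console geometry.
pub struct Geometry {
    pub cols: u32,
    pub rows: u32,
    pub glyph_w: u32,
    pub glyph_h: u32,
    /// Bytes per scanline.
    pub pitch: u32,
}

/// Byte offset of the top-left pixel of cell (col, row), or None when
/// the geometry is degenerate, the cell is out of range, or the
/// arithmetic would overflow.
pub fn glyph_origin(g: &Geometry, col: u32, row: u32) -> Option<usize> {
    if g.cols == 0 || g.rows == 0 || col >= g.cols || row >= g.rows {
        return None;
    }
    let y = row.checked_mul(g.glyph_h)? as usize;
    let x = col.checked_mul(g.glyph_w)? as usize;
    y.checked_mul(g.pitch as usize)?
        .checked_add(x.checked_mul(4)?) // 4 bytes per pixel (assumed)
}
```

A caller that gets `None` simply skips the draw, which matches the stated goal that the serial-mirror console is never worth a panic.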

* wip(aarch64): timer-ISR timerfd wakes + pending-wake delivery; cpu-id robustness for single-core ISR; re-enable scheduler tick for engine host; preempt diagnostics

* wip(aarch64): walk EL0 FP chain in unhandled-exception report to symbolise null-jump crash

* wip(aarch64): expand scheduler diagnostics — heartbeat ticks with full thread states, scan tids 1-12, raise log caps

* wip(aarch64): dump raw PTABLE slot pid/state for idx 1-12 to find the vanished worker thread

* aarch64: unmask IRQs during SVC syscall handling so the generic-timer scheduler tick can fire mid-syscall (mirrors x86 sti-on-entry); fixes engine bring-up deadlock where a spinning/cooperatively-yielding syscall masked the timer and froze preemption

* aarch64: make timer-ISR cond-expiry use try-lock for FUTEX_WAITERS (futex_waiter_remove_try) — fixes single-core IRQ-masked self-deadlock where a timer tick during spawn_thread/futex syscall spun forever on a held lock

* mm: make PAGE_TABLE_LOCK reentrant on a single core (lock_page_table) — fixes demand-pager self-deadlock when a fault re-enters map_user_page while the page-table lock is held (exposed by aarch64 IRQs-on-during-syscall)
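
A depth-counted reentrant lock of this kind could look like the sketch below. It is a host-side model with hypothetical names, not the kernel's `lock_page_table`, and it bakes in the single-core assumption: a non-zero depth is taken to mean same-stack re-entry, which only holds if the outer section cannot be interleaved by another thread.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Single-core reentrant lock: a spin flag plus a nesting depth.
pub struct PageTableLock {
    held: AtomicBool,
    depth: AtomicU32,
}

impl PageTableLock {
    pub const fn new() -> Self {
        Self {
            held: AtomicBool::new(false),
            depth: AtomicU32::new(0),
        }
    }

    /// Acquire. Returns true for the outer acquire, false for a nested
    /// re-entry (for example a demand fault during a page-table walk).
    pub fn lock(&self) -> bool {
        if self.depth.load(Ordering::Relaxed) > 0 {
            // Assumed to be the current holder re-entering.
            self.depth.fetch_add(1, Ordering::Relaxed);
            return false;
        }
        while self
            .held
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        self.depth.store(1, Ordering::Relaxed);
        true
    }

    /// Release one level; the spin flag is dropped only at depth 0.
    pub fn unlock(&self) {
        let prev = self.depth.fetch_sub(1, Ordering::Relaxed);
        debug_assert!(prev > 0);
        if prev == 1 {
            self.held.store(false, Ordering::Release);
        }
    }

    pub fn depth(&self) -> u32 {
        self.depth.load(Ordering::Relaxed)
    }

    pub fn is_held(&self) -> bool {
        self.held.load(Ordering::Relaxed)
    }
}
```

If a different thread could run between the outer `lock` and `unlock`, it would also see a non-zero depth and be treated as nested, so the model is only sound when the outer section is uninterruptible.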

* dl: cfg-gate TLSDESC_RESOLVER_VA reference so the x86_64 kernel builds (the aarch64-only resolver trampoline; arm unchanged) — keeps x86 green

* aarch64: eager user-GPR capture + IRQ-masked page-table lock — advance engine bring-up past the early ret-to-0 / pager deadlock


---------

Co-authored-by: Tahiru Agbanwa <tahiru@users.noreply.github.com>
Co-authored-by: Tahiru Agbanwa <tahiru@oscortex.dev>
Co-authored-by: Tahiru Agbanwa <tahiru@dotcorr.com>
@squirelboy360 squirelboy360 deleted the feat/arch-aarch64-shell branch June 9, 2026 21:18