aarch64: Flutter shell bring-up — engine runs AOT on ARM (reaches ICU init + worker spawn) #9
Merged
…surface
Replace the 35-line aarch64 stub with a module layout that mirrors the
x86_64 backend so the rest of the kernel's crate::arch::* surface resolves
on aarch64. Every function/type/const the shared kernel imports through the
arch facade now exists here with the matching signature; bodies are
compilable scaffolds (no-ops / sane defaults) pending the real ARM port.
Modules added under arch/aarch64/:
  cpu         FP/SIMD enable, TPIDR_EL0 TLS (set/get_fs_base), DAIF
              interrupt-mask save/restore, DMB/yield fences, xstate
              save/restore placeholders, hypervisor detection.
  memory      read/write_cr3 mapped onto TTBR0_EL1 (+ TLB maintenance).
  gdt         nominal USER_CS/USER_DS selector constants (no segmentation).
  idt         VBAR_EL1 exception-vector install entry points; re-exports
              the cross-arch InterruptFrame.
  apic        GIC + generic-timer scaffold: eoi, init_bsp/ap,
              finish_xapic_init, local_apic_id (from MPIDR_EL1),
              send_resched_ipi, and the vsync cadence accessors.
  syscall     SVC fast-path scaffold: per-CPU user-GPR scratch with the
              UserGprSnapshot type and user_rsp/rip/r9/rbp/user_gprs
              accessors, set_active_stack_top, init/init_ap.
  smp         per-CPU table + this_cpu()/current_cpu_id() via MPIDR_EL1,
              CPU_COUNT, broadcast_resched_ipi (single-core scaffold;
              PSCI CPU_ON wake left as a TODO).
  acpi        RSDP/MADT lookup + PSCI SYSTEM_OFF shutdown placeholders.
  interrupts  disable_pic no-op (no legacy PIC on aarch64).
  mod         early_init/ap_init/smp_init/halt/halt_forever/enable/
              disable_interrupts/rdtsc (CNTVCT_EL0) and AAPCS64
              context_switch + task_entry naked trampolines.
This is the scaffolding step of the ARM port: it does NOT boot. Real MMU
page tables, the EL1 exception vectors, GIC/timer programming, the SVC
entry path, and PSCI SMP bring-up are stubbed for follow-up work.
The shared kernel embedded raw x86 instructions and x86-only limine flags
directly in architecture-neutral files, which broke the aarch64-unknown-none
build. Move every such site behind crate::arch::* so both x86_64 and aarch64
compile, with x86_64 codegen left byte-for-byte identical.
User-mode entry: extract the IRETQ/SYSRETQ ring-3 transition asm out of
process::enter_user_by_pid_noreturn{,_try} into a new arch::enter_user hook
(EnterUserRegs + enter_user_iret/enter_user_sysret). The x86_64 backend keeps
the exact prior asm (verified: identical instruction/operand sequences); the
aarch64 backend stubs them pending the real EL1->EL0 ERET path. The shared
process layer keeps all PTABLE/CR3/errno logic and only calls the hook for the
final transfer.
Other shared sites rerouted through the facade:
- main.rs: gate limine MP_FLAG_X2APIC behind cfg(x86_64), 0 otherwise
- dispatch/poll/posix: _rdtsc()/inline rdtsc -> arch::rdtsc()
- posix/poll/fd/futex/engine/ipc_display: sti;hlt;cli -> arch::enable_and_halt()
- futex/posix: pause -> arch::spin_pause()
- process: cli -> arch::interrupts_disable(); mov cr3 -> arch::memory::write_cr3()
- paging: read cr3 -> arch::memory::read_cr3()
- panic: read rbp -> arch::read_frame_pointer()
- fd: poweroff loop -> arch::acpi_shutdown() (already arch-neutral)
New arch hooks: enter_user_{iret,sysret}+EnterUserRegs, interrupts_disable,
enable_and_halt, read_frame_pointer (x86_64 real, aarch64 mirror/stub).
Both targets build clean:
- aarch64-unknown-none --features arch-aarch64: 0 errors
- x86_64-unknown-none release: 0 errors
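A minimal sketch of the facade pattern described above, assuming a single `arch` module whose functions are selected by `cfg(target_arch)`. The names `rdtsc` and `spin_pause` come from the commit message; the bodies and the `busy_wait` caller are illustrative only and are not the kernel's actual code.

```rust
// Shared code calls arch::* and never embeds raw instructions itself.
pub mod arch {
    /// Monotonic tick source: RDTSC on x86_64.
    #[cfg(target_arch = "x86_64")]
    pub fn rdtsc() -> u64 {
        // SAFETY: RDTSC only reads the time-stamp counter.
        #[allow(unused_unsafe)]
        unsafe {
            core::arch::x86_64::_rdtsc()
        }
    }

    /// Monotonic tick source: the generic-timer virtual counter on aarch64.
    #[cfg(target_arch = "aarch64")]
    pub fn rdtsc() -> u64 {
        let ticks: u64;
        // SAFETY: reads CNTVCT_EL0; assumes counter access is permitted
        // at the current exception level.
        unsafe {
            core::arch::asm!("mrs {}, cntvct_el0", out(reg) ticks, options(nomem, nostack));
        }
        ticks
    }

    /// Fallback so the sketch still compiles on other hosts.
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    pub fn rdtsc() -> u64 {
        0
    }

    /// PAUSE on x86, YIELD/ISB-style hint on ARM, via the portable hint.
    pub fn spin_pause() {
        core::hint::spin_loop();
    }
}

/// Hypothetical architecture-neutral caller: spins and reports elapsed ticks.
pub fn busy_wait(spins: u32) -> u64 {
    let start = arch::rdtsc();
    for _ in 0..spins {
        arch::spin_pause();
    }
    arch::rdtsc().wrapping_sub(start)
}
```

Because the caller only names `arch::rdtsc()` and `arch::spin_pause()`, the same source compiles for either target without `cfg` attributes leaking into shared files.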
Direct QEMU -kernel boot path for the ARM port. The assembly _start (boot.rs) parks secondary CPUs, drops EL2->EL1 if needed, enables FP/AdvSIMD (CPACR_EL1.FPEN), sets up the boot stack, zeroes BSS, and calls into the Rust bring-up sequence. A minimal polled PL011 UART driver (uart.rs) at 0x09000000 is the serial debug lifeline. The bring-up (bringup.rs) prints CurrentEL, SCTLR_EL1, MPIDR_EL1 and CNTFRQ_EL0 over serial, confirming we land at EL1 with the MMU off and a 62.5MHz generic timer. New aarch64.ld links the kernel for QEMU -M virt RAM at 0x40080000 with _start as the ELF entry. Boots cleanly under qemu-system-aarch64 -M virt -cpu cortex-a72.
…stone 2)
ARMv8-A MMU bring-up (mmu.rs): builds 512-entry L1 translation tables with 1 GiB block descriptors for both TTBR0 (low/user) and TTBR1 (kernel high half), programs MAIR_EL1 (Normal WB + Device-nGnRE), TCR_EL1 and enables SCTLR_EL1.M/C/I.
Two bring-up subtleties resolved on real QEMU:
- T0SZ/T1SZ = 25 (39-bit VA) so the 4 KiB-granule walk starts at L1, letting a single L1 table cover the whole space with 1 GiB blocks (T0SZ=16/48-bit would have required an L0 table -> level-0 fault).
- Kernel identity map uses AP=EL1-only; an EL0-writable kernel-code mapping implicitly forces PXN, which tripped a level-1 permission fault on the first translated instruction fetch.
Verified on qemu-system-aarch64: SCTLR_EL1=0xc5183d (M/C/I on), a RAM read/write probe round-trips, and execution continues translated past the enable.
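The T0SZ arithmetic above can be checked with a short sketch. It assumes the 4 KiB granule (12-bit page offset, 9 index bits per level) and builds an EL1-only 1 GiB block descriptor; the function names and the chosen attribute bits are illustrative, not taken from mmu.rs.

```rust
const PAGE_SHIFT: u32 = 12; // 4 KiB granule
const BITS_PER_LEVEL: u32 = 9; // 512 entries per table

/// Virtual-address width implied by TCR_EL1.T0SZ / T1SZ.
pub const fn va_bits(txsz: u32) -> u32 {
    64 - txsz
}

/// First lookup level of the walk (0..=3) for the 4 KiB granule.
pub const fn start_level(txsz: u32) -> u32 {
    let index_bits = va_bits(txsz) - PAGE_SHIFT;
    let levels = (index_bits + BITS_PER_LEVEL - 1) / BITS_PER_LEVEL;
    4 - levels
}

/// Index into the L1 table: each entry covers 1 GiB (bits [38:30]).
pub const fn l1_index(va: u64) -> usize {
    ((va >> 30) & 0x1ff) as usize
}

const DESC_BLOCK: u64 = 0b01; // valid, block (not table) at L1/L2
const DESC_SH_INNER: u64 = 0b11 << 8; // inner shareable
const DESC_AF: u64 = 1 << 10; // access flag set, so no AF fault
const DESC_UXN: u64 = 1 << 54; // never executable from EL0
const L1_OA_MASK: u64 = 0x0000_FFFF_C000_0000; // output address bits [47:30]

/// 1 GiB block with AP=0b00 (EL1 read/write, no EL0 access).
/// `attr_index` selects a MAIR_EL1 slot; which slot is Normal vs Device
/// is an assumption left to the caller.
pub const fn l1_kernel_block(pa: u64, attr_index: u64) -> u64 {
    (pa & L1_OA_MASK) | DESC_BLOCK | DESC_AF | DESC_SH_INNER | DESC_UXN | ((attr_index & 0x7) << 2)
}
```

With T0SZ=25 the walk starts at level 1, so one 512-entry table of 1 GiB blocks spans the 39-bit space; with T0SZ=16 the same function reports level 0, which is the missing-L0-table case the commit describes.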
Install a 16-entry, 2 KiB-aligned VBAR_EL1 vector table (vectors.rs) covering all four groups (Current-EL SP0/SPx, Lower-EL AArch64/32) x four kinds (Sync/IRQ/FIQ/SError). Each entry saves the full integer register file (x0-x30) plus SP_EL0, ELR_EL1, SPSR_EL1 and ESR_EL1 into a TrapFrame, calls a Rust dispatcher tagged with the exception kind, then restores everything and erets. The dispatcher decodes ESR_EL1.EC and routes SVC64 to an installable handler, IRQs to an installable IRQ handler, and reports+parks on any unhandled exception (printing ESR/ELR/SPSR/FAR over serial). Verified on QEMU: a deliberate synchronous exception from EL1 round-trips through save/dispatch/restore/eret (SYNC_EL1 0->1, handler hit once) and a callee-saved sentinel register survives the trap intact.
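As a sketch of the table layout described above: each of the 16 entries is 0x80 bytes, so the table is 0x800 bytes, which is where the 2 KiB alignment comes from. The enum and function names below are hypothetical; only the architectural offsets are fixed.

```rust
/// Where the exception was taken from (row of the vector table).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Group {
    CurrentSp0 = 0,
    CurrentSpx = 1,
    LowerA64 = 2,
    LowerA32 = 3,
}

/// What kind of exception it is (column of the vector table).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Sync = 0,
    Irq = 1,
    Fiq = 2,
    SError = 3,
}

/// Each vector slot holds up to 32 instructions.
pub const ENTRY_SIZE: usize = 0x80;
/// 4 groups x 4 kinds x 0x80 bytes.
pub const TABLE_SIZE: usize = 16 * ENTRY_SIZE;

/// Byte offset of a vector from VBAR_EL1.
pub const fn vector_offset(group: Group, kind: Kind) -> usize {
    (group as usize * 4 + kind as usize) * ENTRY_SIZE
}
```

For example, an `svc` from an AArch64 EL0 program lands at `vector_offset(Group::LowerA64, Kind::Sync)`, which is VBAR_EL1 + 0x400.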
GICv2 driver (gic.rs) for QEMU -M virt: enables the distributor and this core's CPU interface, with per-IRQ enable/priority/routing plus IAR acknowledge and EOIR end-of-interrupt. Generic-timer driver (timer.rs) programs the EL1 physical timer (CNTP_CTL/CNTP_TVAL_EL0) for a periodic tick, routed via PPI 30 and re-armed each interrupt. CNTFRQ_EL0 (62.5 MHz on virt) gives the rate. The bring-up installs an IRQ handler that acknowledges at the GIC, services the timer PPI, and EOIs. With IRQs unmasked the timer ticks: verified 5 ticks / 5 serviced IRQs in ~61ms at the requested 100 Hz. This is the same scheduler-tick source the x86 APIC timer drives.
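The reload value for the periodic tick follows directly from the numbers above. This is a small arithmetic sketch with hypothetical helper names; it does not touch CNTP_TVAL_EL0 itself.

```rust
/// Counter ticks to load into the timer for one period at `tick_hz`.
pub fn reload_ticks(cntfrq_hz: u64, tick_hz: u64) -> u64 {
    cntfrq_hz / tick_hz
}

/// Convert counter ticks to microseconds, widening to avoid overflow.
pub fn ticks_to_micros(ticks: u64, cntfrq_hz: u64) -> u64 {
    (ticks as u128 * 1_000_000 / cntfrq_hz as u128) as u64
}
```

At CNTFRQ_EL0 = 62.5 MHz and a 100 Hz tick, the reload value is 625,000 counts, i.e. a 10 ms period.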
The headline ARM milestone: the kernel drops to EL0, runs a userspace program, and services its svc #0 syscalls end to end.
enter_user.rs: real EL1->EL0 transition. Builds the full x0..x30 + SP_EL0 + ELR_EL1 + SPSR_EL1 image from the shared (x86-named) EnterUserRegs, mapping rdi/rsi/.. onto x0/x1/.. per the Linux aarch64 ABI, then erets into EL0. Both the IRET (timer-preempt) and SYSRET (syscall-yield) hooks the shared process layer calls route here. enter_el0_at() is the bring-up's direct launch path.
mmu.rs: map_user_page() installs an L1->L2->L3 walk for an EL0-RW, EL0-executable (PXN=1/UXN=0) 4 KiB page at a free VA, backing it with a kernel-writable physical page so the kernel can stage user code.
bringup_user.rs: assembles a tiny EL0 program (MOVZ/MOVK + svc #0 sequences) that issues write(64)/getpid(172)/exit(93) syscalls. The SVC handler reads x8 (nr) + x0..x2 (args) from the saved TrapFrame, services the call, writes the return into x0, and erets back to EL0.
psci.rs: PSCI SYSTEM_OFF / CPU_ON helpers (conduit-gated on EL>=2).
Verified on qemu-system-aarch64 -M virt -cpu cortex-a72: the EL0 program prints over serial via syscalls, getpid returns a value, and exit(7) is observed (writes=3, exit_code=7). Full chain boots: EL1 -> MMU -> vectors -> GIC+timer -> EL0 -> syscalls -> exit.
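The SVC servicing step can be sketched as a pure function over a saved frame. Everything here is a stand-in: `TrapFrame`, `Outcome` and `handle_svc` are hypothetical, the "user memory" is a plain byte slice indexed by offset rather than a real user pointer, and the fixed pid is an assumption. Only the register convention (nr in x8, args in x0..x2, result in x0) and the three syscall numbers come from the commit message.

```rust
/// Saved EL0 state, reduced to what the sketch needs.
#[derive(Default)]
pub struct TrapFrame {
    pub x: [u64; 31], // x0..x30
    pub sp_el0: u64,
    pub elr_el1: u64,
    pub spsr_el1: u64,
}

pub const SYS_WRITE: u64 = 64;
pub const SYS_EXIT: u64 = 93;
pub const SYS_GETPID: u64 = 172;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// eret back to EL0 with the result in x0.
    Resume,
    /// The program asked to exit with this code.
    Exit(i32),
}

/// Service one syscall from the saved frame.
/// `user_mem` stands in for the process address space; x1 is treated
/// as an offset into it (an assumption made to keep the sketch runnable).
pub fn handle_svc(frame: &mut TrapFrame, console: &mut Vec<u8>, user_mem: &[u8]) -> Outcome {
    let nr = frame.x[8];
    let (a0, a1, a2) = (frame.x[0], frame.x[1], frame.x[2]);
    match nr {
        SYS_WRITE => {
            let start = a1 as usize;
            let end = start.saturating_add(a2 as usize);
            frame.x[0] = match user_mem.get(start..end) {
                Some(bytes) => {
                    console.extend_from_slice(bytes);
                    a2 // bytes written
                }
                None => (-14i64) as u64, // -EFAULT
            };
            Outcome::Resume
        }
        SYS_GETPID => {
            frame.x[0] = 1; // illustrative fixed pid
            Outcome::Resume
        }
        SYS_EXIT => Outcome::Exit(a0 as i32),
        _ => {
            frame.x[0] = (-38i64) as u64; // -ENOSYS
            Outcome::Resume
        }
    }
}
```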
The aarch64 SVC handler now calls syscall::capture_from_trap(), which stashes the trapping EL0 thread's registers into the per-CPU scratch that backs the architecture-neutral accessors arch::syscall::user_rip / user_rsp / user_gprs the shared dispatch_fast path reads. AArch64 registers map onto the x86-named slots per the Linux aarch64 ABI (x0..x5 -> arg regs, SP_EL0 -> user_rsp, ELR_EL1 -> user_rip, x19..x23/x29 -> callee-saved slots). Verified on QEMU: after the first EL0 svc, the shared accessors report user_rip=0x400000018 (past the svc) and user_rsp=0x400000f00 (the EL0 stack) — the exact contract dispatch_fast consumes. x86_64 build unaffected.
Best-effort SMP bring-up (bringup_smp.rs): issues PSCI CPU_ON over the QEMU -M virt HVC conduit to start CPU 1 at a physical secondary entry stub (__secondary_entry) which enables FP, sets up its own stack, and calls into Rust to record its MPIDR and report online over serial. psci.rs cpu_on now uses HVC (the active virt conduit). The EL1 sync vector treats an Undefined-instruction trap (EC=0, e.g. a PSCI probe where no conduit is active) as a graceful no-op: it sets x0 to PSCI NOT_SUPPORTED and steps ELR past the faulting instruction instead of parking, so SMP probing never wedges the single-core path. Verified on QEMU -smp 2: PSCI CPU_ON returns 0, CPU 1 comes online and reports MPIDR=1; with -smp 1 the probe returns gracefully and the single-core EL0 path still completes. Full per-CPU scheduling on the secondary is a follow-up; the BSP already runs userspace. x86_64 build unaffected.
Convenience wrapper that builds the aarch64 kernel and boots it with qemu-system-aarch64 -M virt -cpu cortex-a72 (single core, or smp2 to exercise the PSCI CPU_ON path), wiring PL011 serial to the terminal.
The ARM port previously ran only the self-contained bring-up demo. This wires
the proven arch primitives into the SAME kernel_main the x86 path uses, so the
real subsystem init runs on ARM.
* FDT/DTB reader (arch/aarch64/fdt.rs): a small hand-rolled flattened
device-tree parser (no external crate) that finds the /memory node(s) for
RAM discovery. _start now preserves the x0 DTB pointer (saved in x20) and
passes it through aarch64_start.
* Neutral boot memory map (mm::BootMemMap): mm::init no longer reads Limine
directly. Both arches translate their source — x86 from the Limine memmap,
aarch64 from the device tree — into a fixed-capacity region list consumed by
the shared mm::init_from_regions. x86 behaviour is byte-equivalent.
* Production ARM boot path (arch/aarch64/boot_prod.rs): brings up PL011 serial,
the MMU (identity map), EL1 exception vectors, and the GIC, parses the DTB,
then calls the shared kernel_main_arch.
* kernel_main split: the Limine-coupled prologue stays in the x86-only
kernel_main; the arch-neutral subsystem init + init-process spawn + idle loop
move to shared_init_and_run, called by both arches. Limine request statics
are now x86-only so the ARM kernel carries no dead boot section.
* PL011 wired into the shared logger so early_print + the log framework reach
serial on aarch64.
Verified: production kernel_main runs end-to-end on qemu-system-aarch64 -M virt
through MMU, frame allocator (2048 MiB discovered), heap, scheduler, security,
IPC, WM, VFS, drivers, and the full Cortex AI runtime, reaching the init-process
spawn step. x86_64 kernel still builds (release ELF unchanged in behaviour).
Completes the ARM port's headline goal: the PRODUCTION kernel_main now spawns a
userspace process and services its first syscalls through the SHARED dispatcher.
* aarch64 user page-table walker (mm/paging.rs): a real 4 KiB-granule L1→L3
TTBR0 walker replaces the non-x86 stub — alloc_user_pml4, map_page_in,
translate/update/unmap, free. Each per-process root is seeded with a copy of
the kernel's identity-map L1 block descriptors so the kernel stays mapped
after write_cr3 loads the process root (the bring-up kernel runs from the
TTBR0 low half). free_user_pml4 skips those shared block descriptors so it
never frees kernel RAM.
* SVC → shared dispatch (arch/aarch64/syscall.rs): the production SVC handler
captures the EL0 trap frame, calls syscall::dispatch_fast(x8, x0..x4), writes
the result back to x0, and on exit/exit_group hands off to the next runnable
process (or parks). Installed in arch::early_init.
* ELF loader (process/elf.rs): accept EM_AARCH64 on ARM via an arch-selected
EM_NATIVE machine check.
* Process spawn (process/mod.rs): arch-conditional USER_STACK_TOP placed inside
the 39-bit aarch64 VA window; the x86 glibc POSIX trampoline mapping is
skipped on ARM (bare EL0 program, no libc).
* Scheduler (sched/mod.rs): spawn_kernel_task builds an AAPCS64 first-run frame
on aarch64 (x19..x30 + fn-ptr slot matching context_switch/task_entry),
instead of the x86 register layout.
* Native /init (build.rs): for aarch64 builds, embed a tiny EL0 ELF (linked at
64 GiB to clear the kernel identity blocks) that does write + getpid + write +
exit using the OSCortex syscall numbers.
* main.rs: on aarch64 the init is entered directly via enter_user_by_pid_noreturn
(the real write_cr3 → ERET path) since the ARM preemptive timer ISR is a
follow-on; x86 keeps the schedule_user_launch hand-off.
Verified on qemu-system-aarch64 -M virt: init spawns (pid=1, EL0 entry), prints
two lines via write(1,...), getpid returns, exit(0) reaps cleanly and the kernel
parks — all through the shared syscall layer, no faults. x86_64 debug + release
still build green; no /init is injected on x86.
QEMU's -M virt passes the DTB pointer in x0 for a flat/Image boot, but for an ELF -kernel it may leave x0 = 0 and place the blob at an image-dependent address. boot_prod now: (1) uses x0 when it points at a valid FDT (the spec mechanism, also correct on real hardware), (2) probes the RAM base + a bounded low window for the FDT magic, (3) falls back to the QEMU virt default (2 GiB @ 0x40000000, matching -m) when neither yields a /memory node. Adds fdt::scan_for_dtb for the magic-scan path. The parser itself is unchanged and validated against the machine's real device tree.
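The magic-scan fallback can be illustrated over an in-memory window. This is a sketch, not the `fdt::scan_for_dtb` implementation: the function name, the step parameter and the sanity checks are assumptions. The flattened device tree header itself is standard: a big-endian magic 0xd00dfeed followed by a big-endian total size, in a 40-byte header.

```rust
pub const FDT_MAGIC: u32 = 0xd00d_feed;
const FDT_HEADER_LEN: usize = 40; // ten big-endian u32 fields

/// Read a big-endian u32 at `off`, or None if out of bounds.
fn be32(bytes: &[u8], off: usize) -> Option<u32> {
    let b = bytes.get(off..off.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Probe `window` at `step`-byte strides for a plausible FDT header.
/// Returns (offset, totalsize) of the first candidate whose size is
/// at least a header and fits inside the window.
pub fn scan_for_fdt(window: &[u8], step: usize) -> Option<(usize, u32)> {
    let step = step.max(4);
    let mut off = 0usize;
    while off + 8 <= window.len() {
        if be32(window, off) == Some(FDT_MAGIC) {
            let total = be32(window, off + 4)?;
            let fits = (total as usize) >= FDT_HEADER_LEN
                && off.checked_add(total as usize).map_or(false, |end| end <= window.len());
            if fits {
                return Some((off, total));
            }
        }
        off += step;
    }
    None
}
```

Checking the size field as well as the magic keeps a stray 0xd00dfeed word in RAM from being mistaken for a device tree.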
The ARM `-kernel` boot now routes into the shared kernel_main and spawns userspace; update the script header to describe that instead of only the self-contained bring-up demo.
Ensures a fragmented x86 Limine memory map never drops usable RAM when translating into the architecture-neutral region list.
QEMU -M virt ships no Limine framebuffer (Limine is the x86 boot protocol),
so the ARM port had no display. Add a fw_cfg DMA reader/writer and use it to
configure QEMU's -device ramfb:
* kernel/src/arch/aarch64/ramfb.rs: walk the fw_cfg file directory to find
the etc/ramfb selector, allocate a 1280x800 XRGB8888 framebuffer in
identity-mapped RAM, and DMA-write the big-endian RAMFBCfg into it.
Two gotchas handled: SELECT and WRITE must be separate DMAs (a combined
control word is rejected), and RamfbCfg must be repr(C, packed) so the
write length is exactly the 28-byte file size QEMU expects.
* drivers/fb.rs: add init_raw(addr,w,h,pitch) — an arch-neutral entry point
so the existing fb console/compositor (fed from Limine on x86) works
unchanged on ARM from a raw buffer.
* main.rs (kernel_main_arch): bring ramfb up after the frame allocator is
online, publish it to the shared fb, and paint a boot fill.
* scripts/run-aarch64.sh: add -device ramfb; DISPLAY_MODE=none|cocoa|gtk|sdl
with a headless monitor socket for screendump capture.
Validated on qemu-system-aarch64: monitor screendump is now 1280x800 (was the
640x480 default), with the painted teal top band (20,184,166), navy body
(26,26,46) and white boot-console glyphs all present — pixels the kernel wrote
are scanned out by ramfb.
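A sketch of the 28-byte configuration blob described above, assuming the field order QEMU's ramfb device uses (address, fourcc, flags, width, height, stride, all big-endian). The constructor and `to_bytes` helper are hypothetical; the fw_cfg DMA transfer itself is not shown.

```rust
/// ramfb configuration as written to the etc/ramfb fw_cfg file.
/// Fields hold big-endian bit patterns; packed so the size is exactly 28.
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct RamfbCfg {
    addr: u64,
    fourcc: u32,
    flags: u32,
    width: u32,
    height: u32,
    stride: u32,
}

/// DRM fourcc 'XR24' (XRGB8888), little-endian character packing.
pub const FOURCC_XRGB8888: u32 = u32::from_le_bytes(*b"XR24");

impl RamfbCfg {
    /// Describe a 32-bit-per-pixel framebuffer at physical address `addr`.
    pub fn new(addr: u64, width: u32, height: u32) -> Self {
        Self {
            addr: addr.to_be(),
            fourcc: FOURCC_XRGB8888.to_be(),
            flags: 0,
            width: width.to_be(),
            height: height.to_be(),
            stride: (width * 4).to_be(),
        }
    }

    /// Wire bytes: fields are already big-endian, so native byte order
    /// of each stored value is the on-wire order.
    pub fn to_bytes(&self) -> [u8; 28] {
        // Copy packed fields to locals; references to them are not allowed.
        let (addr, fourcc, flags) = (self.addr, self.fourcc, self.flags);
        let (width, height, stride) = (self.width, self.height, self.stride);
        let mut out = [0u8; 28];
        out[0..8].copy_from_slice(&addr.to_ne_bytes());
        out[8..12].copy_from_slice(&fourcc.to_ne_bytes());
        out[12..16].copy_from_slice(&flags.to_ne_bytes());
        out[16..20].copy_from_slice(&width.to_ne_bytes());
        out[20..24].copy_from_slice(&height.to_ne_bytes());
        out[24..28].copy_from_slice(&stride.to_ne_bytes());
        out
    }
}
```

Without `packed`, `repr(C)` would round the struct up to 32 bytes for the `u64` alignment, which is the length mismatch the commit calls out.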
Wire the ARM generic-timer IRQ to the SAME shared cooperative-scheduler
hand-off the x86 APIC-timer ISR uses, so EL0 threads (the Flutter engine's
threads, ultimately) are timer-preempted and round-robined — collapsing the
ARM 'enter init directly' path onto x86's timer-driven model.
* arch/aarch64/apic.rs: production_irq_handler — on a generic-timer PPI taken
from EL0, apply the same quantum + input/focus preempt policy as x86, map
the live EL1 vector TrapFrame to the shared UserRegs, call the shared
process::timer_preempt_switch_try (arch-neutral: PTABLE bookkeeping, CR3/
TTBR0, fs_base, xstate, current_pid), then write the next thread's frame
back into the TrapFrame so the vector stub's RESTORE_FRAME+eret enters it.
init_bsp only installs the handler; start_scheduler_tick arms the timer.
* Full-fidelity context: the shared UserRegs carries only the x86-named
register subset, so each thread's complete ARM frame (x0..x30/SP_EL0/ELR/
SPSR) round-trips through a new per-process arch_trapframe slot
(process::arch_store/take_trapframe, try_lock — ISR-safe). Scratch
registers (x6/x7, x11–x18, x24–x28, x30) survive arbitrary-instruction
preemption.
* IRQ enablement: IRQs are NOT unmasked at EL1; the eret into EL0 (SPSR=EL0t,
I/F clear) enables them at the userspace boundary. This closes a sporadic
boot hang where a timer fired mid enter_user_by_pid_noreturn (during the
CR3 switch / image build) — boots went from ~50% hanging to 5/5 clean.
* mm: reserve the kernel image in the frame allocator on ARM. QEMU -kernel
loads the image (0x40080000) into RAM the device tree reports usable from
0x40000000; the allocator marked it all free, so the first large contiguous
alloc (the 64 MiB heap, then the compositor's 4 MiB double-buffer) overran
live kernel code — a latent corruption that hung the first big alloc once a
framebuffer made the compositor enable double-buffering. frame_allocator::
reserve_range carves [RAM base..__kernel_end] before the heap is built.
* build.rs + main.rs: the ARM /init program is now a compute-bound loop
(busy-spins with NO syscall between ticks, prints a per-instance tag for a
bounded budget, then exits). main.rs spawns TWO instances (A=x0:0, B=x0:1);
since neither yields cooperatively, interleaved A/B output on serial can
ONLY come from the timer preempting + switching between them.
Validated on qemu-system-aarch64 (5/5 boots): both EL0 threads make progress
with fair, interleaved ticks (e.g. A=12 B=12, 7-8 A<->B switches) then exit
cleanly. The longer compute-budget variant showed thousands of balanced ticks
(A=3341/B=3350, 2148 switches). x86 builds clean (no regression).
Build oscortex-host for aarch64-unknown-none. The syscall stubs (sys.rs) gain an SVC #0 backend (nr in x8, args x0..x5, return x0) alongside the existing x86 SYSCALL path; the typed wrappers stay architecture-neutral. _start gains an AArch64 naked entry: zero x29/x30, 16-byte-align SP via a scratch GPR, preserve the kernel's bootstrap args (x0=host_mode, x1=app_id, x2=aot_va) across the breadcrumb write, then call main_embedder(x0,x1,x2). main_embedder now takes those three as C arguments (dropping the brittle read-rdi/rsi/rdx asm), and the monotonic clock reads CNTVCT_EL0/CNTFRQ_EL0 on ARM instead of RDTSC. A user_aarch64.ld linker script + an aarch64-unknown-none .cargo target (static reloc) load it at 0x400000. Produces a 0x400000-based static AArch64 ELF, mirroring the x86 host.
dl.rs: accept the aarch64 engine .so. Add R_AARCH64_RELATIVE/ABS64/GLOB_DAT/JUMP_SLOT alongside the x86 relocation set (same RELA encoding, selected by type number), gate the ELF e_machine check on the build arch, and arch-gate the x86-only engine byte-patch block (x86 opcodes/offsets must never run against the ARM engine). One dlopen path now loads either the x86 or the ARM libflutter_engine.so.
vectors.rs: route EL0 data/instruction aborts (EC 0x24/0x20) through the shared demand pager. Reads FAR_EL1 for the faulting VA and decodes the ESR fault-status code into the x86-style present bit so demand_page behaves identically. The Flutter engine demand-pages its Dart heaps and AOT exec regions, so without this every such access parked the core.
paging.rs: flush the stale TLB entry on aarch64 in demand_page's already-mapped fast path (the x86 path uses invlpg).
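A sketch of how the four AArch64 dynamic relocations named above resolve to a value. The relocation type numbers are the ELF for AArch64 ABI values; the `Rela` struct mirrors Elf64_Rela, and the symbol-lookup closure and function names are hypothetical stand-ins for whatever dl.rs actually does.

```rust
pub const R_AARCH64_ABS64: u32 = 257;
pub const R_AARCH64_GLOB_DAT: u32 = 1025;
pub const R_AARCH64_JUMP_SLOT: u32 = 1026;
pub const R_AARCH64_RELATIVE: u32 = 1027;

/// Same layout as Elf64_Rela.
pub struct Rela {
    pub r_offset: u64,
    pub r_info: u64,
    pub r_addend: i64,
}

impl Rela {
    /// Low 32 bits of r_info: relocation type.
    pub fn r_type(&self) -> u32 {
        self.r_info as u32
    }
    /// High 32 bits of r_info: symbol table index.
    pub fn r_sym(&self) -> u32 {
        (self.r_info >> 32) as u32
    }
}

/// Compute the 64-bit value to store at `load_base + r_offset`.
/// `sym_addr` resolves a symbol index to its run-time address.
/// Returns None for unsupported types or unresolved symbols.
pub fn reloc_value(
    rela: &Rela,
    load_base: u64,
    sym_addr: impl Fn(u32) -> Option<u64>,
) -> Option<u64> {
    let addend = rela.r_addend as u64;
    match rela.r_type() {
        // B + A
        R_AARCH64_RELATIVE => Some(load_base.wrapping_add(addend)),
        // S + A
        R_AARCH64_ABS64 | R_AARCH64_GLOB_DAT | R_AARCH64_JUMP_SLOT => {
            Some(sym_addr(rela.r_sym())?.wrapping_add(addend))
        }
        _ => None,
    }
}
```

Since both architectures use the RELA encoding, a loader can share the iteration and differ only in this type-number match.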
…as pid1 Deliver the arm64 Flutter stack through the initramfs (no Limine on -M virt): the arm64 engine .so, the aarch64 oscortex-host embedder (as /init), the arm64 AOT shell snapshot (libapp.so), icudtl.dat and flutter_assets. build.rs keeps a staged real /init and ships libflutter_engine.so in the initramfs on aarch64. kernel_main spawns the host as pid 1 with HOST_MODE_SHELL. mm/paging splits the seeded kernel identity block when overlaying user pages; vectors report FAR_EL1; diagnostics trace pid1 syscalls.
…gine from initramfs
Two bugs blocked the Flutter host on ARM:
1. ELF loader read 0xFF for the host's .rodata. translate_user_page/walk treated the kernel identity-map BLOCK descriptors (seeded into every per-process root) as table pointers, dereferencing their output PA as a table — returning bogus 'already mapped' identity PAs. The loader then reused the identity PA (device memory) instead of allocating a real frame, and splitting a block to overlay one page left the rest of the block as valid identity sub-pages, poisoning every later page too. Fix: walk() returns None at a block descriptor (no 4 KiB leaf there); the ELF loader tracks frames IT allocated this load (BTreeMap) rather than querying the identity-polluted page table.
2. dlopen of libflutter_engine.so hard-failed when no Limine module was present. x86 ships the engine as a Limine boot module; aarch64 has no Limine and ships it in the initramfs. Fix: fall back to the VFS lookup when the module is absent.
Host now prints its breadcrumbs and reaches the engine dlopen on ARM.
… paging fires The Flutter engine's DT_INIT_ARRAY jumps into the POSIX trampoline page to reach libc symbols; that page was (a) skipped entirely on aarch64 and (b) encoded with x86 machine code. Add an AArch64 encode_stub (movz x8,#nr; svc #0; ret and the RetU32/RetAddr/FloatZero/MathSyscallN/SyscallRetAsArg0 shapes), enable map_system_pages on both arches, fill stub padding with RET, and do I-cache maintenance (dc cvau / ic ivau / isb) on the freshly written code pages. Also fix EC_DABT_LOWER: a data abort from a lower EL (EL0) is EC=0x25, not 0x25's same-EL sibling 0x24. With the wrong constant every EL0 mmap/heap demand fault fell through to report_unhandled instead of the demand pager. The engine now runs its init array through dozens of pthread/locale syscalls and demand-faults its anonymous Dart heap correctly.
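The simplest stub shape named above (movz x8,#nr; svc #0; ret, with RET padding) can be sketched as follows. The instruction encodings are the fixed A64 ones; the function name, the slot-size policy and the boolean return are assumptions, and the I-cache maintenance step is not shown.

```rust
/// `svc #0`
pub const SVC_0: u32 = 0xD400_0001;
/// `ret` (returns via x30)
pub const RET: u32 = 0xD65F_03C0;

/// `movz x8, #imm16`: the syscall number goes in x8.
pub const fn movz_x8(imm16: u16) -> u32 {
    0xD280_0008 | ((imm16 as u32) << 5)
}

/// Write a syscall stub into `slot` and pad the remainder with RET,
/// so a stray jump into padding returns instead of running garbage.
/// Returns false if the slot is too small or not a multiple of 4 bytes.
pub fn encode_syscall_stub(nr: u16, slot: &mut [u8]) -> bool {
    let words = [movz_x8(nr), SVC_0, RET];
    if slot.len() < words.len() * 4 || slot.len() % 4 != 0 {
        return false;
    }
    for (i, chunk) in slot.chunks_exact_mut(4).enumerate() {
        let word = words.get(i).copied().unwrap_or(RET);
        // A64 instructions are stored little-endian.
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    true
}
```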
… (0x25) through pager
The ARM ARM (D17.2.37) ESR_EL1.EC encodings are:
0x24 = Data Abort from a lower EL (EL0 user fault)
0x25 = Data Abort without EL change (EL1 kernel touching a user VA)
efe6c25 set EC_DABT_LOWER=0x25, which is the same-EL (EL1) variant, so EL0 demand faults from the Flutter engine never matched and the pager never ran. Fix EC_DABT_LOWER to 0x24 and add EC_DABT_CURR=0x25 routed through the same demand pager so a syscall handler dereferencing a not-yet-paged user pointer also resolves.
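A compact sketch of the exception-class decode discussed above. The EC field position (ESR bits [31:26]) and the listed class values are architectural; the enum, the routing predicate and the present-bit helper are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Trap {
    Unknown,          // EC 0x00
    Svc64,            // EC 0x15
    InsnAbortLower,   // EC 0x20
    PcAlignment,      // EC 0x22
    DataAbortLower,   // EC 0x24, fault taken from EL0
    DataAbortCurrent, // EC 0x25, fault taken at EL1
    Other(u8),
}

/// Exception class: ESR_EL1 bits [31:26].
pub const fn esr_ec(esr: u64) -> u8 {
    ((esr >> 26) & 0x3f) as u8
}

pub fn classify(esr: u64) -> Trap {
    match esr_ec(esr) {
        0x00 => Trap::Unknown,
        0x15 => Trap::Svc64,
        0x20 => Trap::InsnAbortLower,
        0x22 => Trap::PcAlignment,
        0x24 => Trap::DataAbortLower,
        0x25 => Trap::DataAbortCurrent,
        ec => Trap::Other(ec),
    }
}

/// Which traps this sketch routes to a demand pager.
pub fn goes_to_pager(trap: Trap) -> bool {
    matches!(
        trap,
        Trap::InsnAbortLower | Trap::DataAbortLower | Trap::DataAbortCurrent
    )
}

/// x86-style "present" bit from the fault status code (ESR bits [5:0]):
/// translation faults (0b0001xx) mean no mapping existed.
pub const fn page_was_present(esr: u64) -> bool {
    (esr & 0x3c) != 0x04
}
```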
…ns for engine thread_local
The arm64 Flutter engine compiles its C++ thread_local accesses with the TLSDESC model: 'ldr x1,[desc#0]; blr x1' where the descriptor's first word is a resolver and the second its argument. The loader applied ABS64/GLOB_DAT/JUMP_SLOT/RELATIVE but ignored the 14 R_AARCH64_TLSDESC relocations, so the descriptor slots stayed zero and the very first thread_local access in fml::MessageLoop::EnsureInitializedForCurrentThread branched to address 0 (EL0 sync abort, EC=0, ELR=0).
Resolve them statically for the single, fixed-load module:
- emit a tiny '_dl_tlsdesc_return' stub (ldr x0,[x0,#8]; ret) into the last slot of the executable trampoline page (TLSDESC_RESOLVER_VA)
- point every descriptor's word0 at it and write the variant-I TP-relative offset (TLS_TP_OFFSET=16 + module offset) into word1
Also handle TPREL64/DTPREL64/DTPMOD64 for completeness. x86 path untouched.
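The static TLSDESC resolution above can be sketched as two small helpers. The two instruction encodings are fixed A64 encodings and TLS_TP_OFFSET=16 is taken from the commit message; the function names, the two-word array model of a descriptor and the addend handling are assumptions.

```rust
/// `ldr x0, [x0, #8]`: load the descriptor's second word.
pub const LDR_X0_X0_8: u32 = 0xF940_0400;
/// `ret`
pub const RET: u32 = 0xD65F_03C0;
/// Variant-I gap between the thread pointer and the module's TLS block.
pub const TLS_TP_OFFSET: u64 = 16;

/// Machine code for a static resolver: it is called with x0 pointing at
/// the descriptor and returns the precomputed TP-relative offset in x0.
pub fn tlsdesc_return_stub() -> [u8; 8] {
    let mut code = [0u8; 8];
    code[0..4].copy_from_slice(&LDR_X0_X0_8.to_le_bytes());
    code[4..8].copy_from_slice(&RET.to_le_bytes());
    code
}

/// Fill one two-word descriptor for a symbol at `module_offset` in the
/// TLS block of the single, fixed-load module.
pub fn fill_tlsdesc(desc: &mut [u64; 2], resolver_va: u64, module_offset: u64, addend: u64) {
    desc[0] = resolver_va; // word0: function the access sequence calls
    desc[1] = TLS_TP_OFFSET + module_offset + addend; // word1: its argument
}
```

The caller then adds the returned offset to TPIDR_EL0 to reach the variable, so no lazy resolution is needed when there is exactly one module.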
AArch64 keeps return addresses in the link register, not on the stack, so a
thread that yields inside a syscall (epoll_wait/futex/cond_wait) must resume
with x30 intact. The cooperative re-entry path (build_image) hard-wired x30=0
('zero on first entry'), so after FlutterEngineInitialize the engine's first
thread resumed from an epoll-block and its next ret branched to address 0
(EL0 sync abort, EC=0, ELR=0, x30=0).
Capture x30 at the SVC boundary into a per-CPU/per-process slot and restore it
on SYSRET re-entry, carried through the shared enter path in the otherwise-
unused rflags slot (build_image maps rflags->x30 on aarch64; SPSR is constant).
Also print x30/x1/x16/x17/pid in the unhandled-exception report for diagnosis.
x86 path untouched (rflags stays real FLAGS there).
…r, harden FB console
Three blockers past the LR fix, plus a console hardening (item 4):
1. Embedder passed --dart-flags=--old_gen_heap_size=512,... which the engine's
switches.cc IsAllowedDartVMFlag denylist FML_LOG(FATAL)s. Drop the switch;
the AOT VM uses its defaults (the heap sizing was a stale x86-JIT workaround).
2. The blocking syscalls rewound the saved user PC by a hardcoded 2 bytes to
re-execute on resume — correct for x86's 2-byte syscall, but aarch64's svc
is 4 bytes, so -2 left the PC mid-instruction → EC=0x22 PC-alignment
fault. Add process::SYSCALL_INSN_LEN (2 on x86, 4 on arm) + a
save_return_context_reexec() helper and route all 14 re-exec sites through it.
3. pthread_create gave the newborn an immediate slice by entering the child
from inside the creator's syscall (never returning, delivering r=0 later).
On aarch64 this nested enter-while-in-syscall corrupted the creator's resume
so the 2nd fml::Thread's pthread_create returned non-zero (thread.cc:80).
Gate that path to x86; on arm the child runs via normal cooperative sched.
4. Harden blit_char against col/row overflow so the serial-mirror text console
can never panic the kernel (was: u32 overflow during heavy
thread spawn). x86 paths unchanged.
Engine now spawns tid=2/3/4 and runs ~1100 serial lines before the FB-console
panic that this also fixes.
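The re-execute rewind from item 2 reduces to a per-architecture constant. In this sketch the architecture is a runtime enum so both cases can be exercised on one host; in a kernel it would more likely be a `cfg`-selected constant such as the SYSCALL_INSN_LEN the commit names. The enum and function names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arch {
    X86_64,
    Aarch64,
}

/// Length of the syscall-entry instruction: `syscall` is 0F 05 (2 bytes),
/// `svc #imm` is one fixed-width A64 instruction (4 bytes).
pub const fn syscall_insn_len(arch: Arch) -> u64 {
    match arch {
        Arch::X86_64 => 2,
        Arch::Aarch64 => 4,
    }
}

/// PC to resume at so a blocked syscall is issued again.
/// `return_pc` is the saved user PC, which points just past the instruction.
pub const fn reexec_pc(return_pc: u64, arch: Arch) -> u64 {
    return_pc - syscall_insn_len(arch)
}
```

Using the x86 length on ARM yields a PC that is not 4-byte aligned, which is exactly the EC=0x22 fault described above.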
…d thread-enter off Make blit_char/scroll_up/write_byte bail cleanly on a degenerate or corrupted FB geometry (rows/cols == 0) instead of panicking the kernel with arithmetic overflow — the serial-mirror console is never worth a panic. With this the engine spawns its full UI/raster/IO thread set (tid=2/3/4) without the kernel dying. Document why the x86 immediate-child-enter slice stays disabled on arm (it corrupts the creator's cooperative resume → thread.cc:80 abort).
… robustness for single-core ISR; re-enable scheduler tick for engine host; preempt diagnostics
…olise null-jump crash
…l thread states, scan tids 1-12, raise log caps
… vanished worker thread
… scheduler tick can fire mid-syscall (mirrors x86 sti-on-entry); fixes engine bring-up deadlock where a spinning/cooperatively-yielding syscall masked the timer and froze preemption
…utex_waiter_remove_try) — fixes single-core IRQ-masked self-deadlock where a timer tick during spawn_thread/futex syscall spun forever on a held lock
… — fixes demand-pager self-deadlock when a fault re-enters map_user_page while the page-table lock is held (exposed by aarch64 IRQs-on-during-syscall)
…s (the aarch64-only resolver trampoline; arm unchanged) — keeps x86 green
…e engine bring-up past the early ret-to-0 / pager deadlock
Two single-core register/lock-coherence fixes that move the ARM Flutter engine from crashing immediately after FlutterEngineInitialize to running through FlutterEngineRunInitialized into ICU init and worker-thread spawn.
1) Eager user-GPR capture at the SVC boundary (process::mod, vectors). `save_full_user_gprs` read the per-CPU user-GPR snapshot LAZILY at yield time. Syscalls run with IRQs unmasked, so a generic-timer tick can preempt a thread mid-handler, switch to a sibling whose own SVC overwrites the shared snapshot, then switch back — and the yielding thread then persists the sibling's callee-saved regs / x30 into its own context (resume → `ret` to a stale address, e.g. x30=0 → branch to 0). Port the proven x86 fix (GPRS_CAPTURED flag + capture_user_gprs_at_entry): snapshot ONCE, eagerly, while fresh, and make later yield-time saves no-ops. On aarch64 the eager capture runs from inside the IRQ-masked SVC window in the vector dispatch (x86 stays masked until the handler sti's), so the snapshot can't be clobbered before it is persisted.
2) IRQ-masked outer page-table critical section (mm::paging). The reentrant PAGE_TABLE_LOCK depth counter is only sound if the section can't be interleaved by another thread. With IRQs unmasked during syscalls, a timer tick could preempt a lock holder mid-section; a sibling then saw a non-zero depth and proceeded as a bogus "nested" writer, desynchronising the counter from the real lock and eventually stranding the lock held while depth read 0 — a later outer acquire then spun forever with IRQs masked in the demand-abort handler (single-core deadlock, observed freezing right at worker-thread stack setup). Mask IRQs for the whole outer section so it is genuinely uninterruptible; nesting then only ever means true same-stack re-entry (a demand fault during a page-table walk, already IRQs-masked). x86 behaviour is unchanged (cfg-gated to aarch64).
Also: the unhandled-exception reporter now scans the EL0 stack for engine return addresses (the FP chain is empty when x30/FP are 0), so a ret-to-0 can be symbolised against libflutter_engine.so.
Status: engine now reaches ICU init / worker spawn; a residual cooperative-yield corruption remains (nondeterministic ret-to-0 vs abort) — same class the x86 port hardened over several commits. Not yet rendering on ARM.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…shell # Conflicts: # kernel/src/process/mod.rs # kernel/src/syscall/dispatch.rs
squirelboy360 added a commit that referenced this pull request on Jun 9, 2026
… init + worker spawn) (#9) * arch/aarch64: scaffold a full backend that satisfies the shared arch surface Replace the 35-line aarch64 stub with a module layout that mirrors the x86_64 backend so the rest of the kernel's crate::arch::* surface resolves on aarch64. Every function/type/const the shared kernel imports through the arch facade now exists here with the matching signature; bodies are compilable scaffolds (no-ops / sane defaults) pending the real ARM port. Modules added under arch/aarch64/: cpu FP/SIMD enable, TPIDR_EL0 TLS (set/get_fs_base), DAIF interrupt-mask save/restore, DMB/yield fences, xstate save/restore placeholders, hypervisor detection. memory read/write_cr3 mapped onto TTBR0_EL1 (+ TLB maintenance). gdt nominal USER_CS/USER_DS selector constants (no segmentation). idt VBAR_EL1 exception-vector install entry points; re-exports the cross-arch InterruptFrame. apic GIC + generic-timer scaffold: eoi, init_bsp/ap, finish_xapic_init, local_apic_id (from MPIDR_EL1), send_resched_ipi, and the vsync cadence accessors. syscall SVC fast-path scaffold: per-CPU user-GPR scratch with the UserGprSnapshot type and user_rsp/rip/r9/rbp/user_gprs accessors, set_active_stack_top, init/init_ap. smp per-CPU table + this_cpu()/current_cpu_id() via MPIDR_EL1, CPU_COUNT, broadcast_resched_ipi (single-core scaffold; PSCI CPU_ON wake left as a TODO). acpi RSDP/MADT lookup + PSCI SYSTEM_OFF shutdown placeholders. interrupts disable_pic no-op (no legacy PIC on aarch64). mod early_init/ap_init/smp_init/halt/halt_forever/enable/ disable_interrupts/rdtsc (CNTVCT_EL0) and AAPCS64 context_switch + task_entry naked trampolines. This is the scaffolding step of the ARM port: it does NOT boot. Real MMU page tables, the EL1 exception vectors, GIC/timer programming, the SVC entry path, and PSCI SMP bring-up are stubbed for follow-up work. 
* arch: route x86-specific code in shared kernel through the arch facade The shared kernel embedded raw x86 instructions and x86-only limine flags directly in architecture-neutral files, which broke the aarch64-unknown-none build. Move every such site behind crate::arch::* so both x86_64 and aarch64 compile, with x86_64 codegen left byte-for-byte identical. User-mode entry: extract the IRETQ/SYSRETQ ring-3 transition asm out of process::enter_user_by_pid_noreturn{,_try} into a new arch::enter_user hook (EnterUserRegs + enter_user_iret/enter_user_sysret). The x86_64 backend keeps the exact prior asm (verified: identical instruction/operand sequences); the aarch64 backend stubs them pending the real EL1->EL0 ERET path. The shared process layer keeps all PTABLE/CR3/errno logic and only calls the hook for the final transfer. Other shared sites rerouted through the facade: - main.rs: gate limine MP_FLAG_X2APIC behind cfg(x86_64), 0 otherwise - dispatch/poll/posix: _rdtsc()/inline rdtsc -> arch::rdtsc() - posix/poll/fd/futex/engine/ipc_display: sti;hlt;cli -> arch::enable_and_halt() - futex/posix: pause -> arch::spin_pause() - process: cli -> arch::interrupts_disable(); mov cr3 -> arch::memory::write_cr3() - paging: read cr3 -> arch::memory::read_cr3() - panic: read rbp -> arch::read_frame_pointer() - fd: poweroff loop -> arch::acpi_shutdown() (already arch-neutral) New arch hooks: enter_user_{iret,sysret}+EnterUserRegs, interrupts_disable, enable_and_halt, read_frame_pointer (x86_64 real, aarch64 mirror/stub). Both targets build clean: - aarch64-unknown-none --features arch-aarch64: 0 errors - x86_64-unknown-none release: 0 errors * aarch64: boot to EL1 with PL011 serial (milestone 1) Direct QEMU -kernel boot path for the ARM port. The assembly _start (boot.rs) parks secondary CPUs, drops EL2->EL1 if needed, enables FP/AdvSIMD (CPACR_EL1.FPEN), sets up the boot stack, zeroes BSS, and calls into the Rust bring-up sequence. 
A minimal polled PL011 UART driver (uart.rs) at 0x09000000 is the serial debug lifeline. The bring-up (bringup.rs) prints CurrentEL, SCTLR_EL1, MPIDR_EL1 and CNTFRQ_EL0 over serial, confirming we land at EL1 with the MMU off and a 62.5MHz generic timer.

New aarch64.ld links the kernel for QEMU -M virt RAM at 0x40080000 with _start as the ELF entry. Boots cleanly under qemu-system-aarch64 -M virt -cpu cortex-a72.

* aarch64: enable the MMU with identity-mapped translation tables (milestone 2)

ARMv8-A MMU bring-up (mmu.rs): builds 512-entry L1 translation tables with 1 GiB block descriptors for both TTBR0 (low/user) and TTBR1 (kernel high half), programs MAIR_EL1 (Normal WB + Device-nGnRE), TCR_EL1 and enables SCTLR_EL1.M/C/I.

Two bring-up subtleties resolved on real QEMU:
- T0SZ/T1SZ = 25 (39-bit VA) so the 4 KiB-granule walk starts at L1, letting a single L1 table cover the whole space with 1 GiB blocks (T0SZ=16/48-bit would have required an L0 table -> level-0 fault).
- Kernel identity map uses AP=EL1-only; an EL0-writable kernel-code mapping implicitly forces PXN, which tripped a level-1 permission fault on the first translated instruction fetch.

Verified on qemu-system-aarch64: SCTLR_EL1=0xc5183d (M/C/I on), a RAM read/write probe round-trips, and execution continues translated past the enable.

* aarch64: EL1 exception vectors + trap frame round-trip (milestone 3)

Install a 16-entry, 2KB-aligned VBAR_EL1 vector table (vectors.rs) covering all four groups (Current-EL SP0/SPx, Lower-EL AArch64/32) x four kinds (Sync/IRQ/FIQ/SError). Each entry saves the full integer register file (x0-x30) plus SP_EL0, ELR_EL1, SPSR_EL1 and ESR_EL1 into a TrapFrame, calls a Rust dispatcher tagged with the exception kind, then restores everything and erets.

The dispatcher decodes ESR_EL1.EC and routes SVC64 to an installable handler, IRQs to an installable IRQ handler, and reports+parks on any unhandled exception (printing ESR/ELR/SPSR/FAR over serial).
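
As an illustration of the ESR_EL1.EC decode the dispatcher performs, here is a minimal host-testable sketch. The EC values are the architectural encodings; the `Route` enum and the function names are hypothetical, and the abort routing reflects the demand-pager wiring that only arrives in later commits of this log, not the milestone-3 code itself.

```rust
// Exception classes from ESR_EL1.EC (bits [31:26]).
pub const EC_SVC64: u32 = 0x15; // SVC executed in AArch64 state
pub const EC_IABT_LOWER: u32 = 0x20; // instruction abort from a lower EL
pub const EC_DABT_LOWER: u32 = 0x24; // data abort from a lower EL
pub const EC_DABT_CURR: u32 = 0x25; // data abort without a change in EL

/// Hypothetical routing decision for a synchronous exception.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    Syscall,
    DemandPage,
    Unhandled(u32),
}

/// Extract the 6-bit exception class from a raw ESR_EL1 value.
pub fn esr_ec(esr: u64) -> u32 {
    ((esr >> 26) & 0x3f) as u32
}

/// Decide where a synchronous exception should go, based only on EC.
pub fn route_sync(esr: u64) -> Route {
    match esr_ec(esr) {
        EC_SVC64 => Route::Syscall,
        EC_IABT_LOWER | EC_DABT_LOWER | EC_DABT_CURR => Route::DemandPage,
        ec => Route::Unhandled(ec),
    }
}
```

In a real vector stub the ESR value would come from `mrs x?, esr_el1`; here it is passed in so the decode can be exercised on any host.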

Verified on QEMU: a deliberate synchronous exception from EL1 round-trips through save/dispatch/restore/eret (SYNC_EL1 0->1, handler hit once) and a callee-saved sentinel register survives the trap intact.

* aarch64: GICv2 + generic-timer periodic tick (milestones 4+5)

GICv2 driver (gic.rs) for QEMU -M virt: enables the distributor and this core's CPU interface, with per-IRQ enable/priority/routing plus IAR acknowledge and EOIR end-of-interrupt.

Generic-timer driver (timer.rs) programs the EL1 physical timer (CNTP_CTL/CNTP_TVAL_EL0) for a periodic tick, routed via PPI 30 and re-armed each interrupt. CNTFRQ_EL0 (62.5 MHz on virt) gives the rate.

The bring-up installs an IRQ handler that acknowledges at the GIC, services the timer PPI, and EOIs. With IRQs unmasked the timer ticks: verified 5 ticks / 5 serviced IRQs in ~61ms at the requested 100 Hz. This is the same scheduler-tick source the x86 APIC timer drives.

* aarch64: SVC syscall entry + EL0 user process servicing (milestones 6+7)

The headline ARM milestone: the kernel drops to EL0, runs a userspace program, and services its svc #0 syscalls end to end.

enter_user.rs: real EL1->EL0 transition. Builds the full x0..x30 + SP_EL0 + ELR_EL1 + SPSR_EL1 image from the shared (x86-named) EnterUserRegs, mapping rdi/rsi/.. onto x0/x1/.. per the Linux aarch64 ABI, then erets into EL0. Both the IRET (timer-preempt) and SYSRET (syscall-yield) hooks the shared process layer calls route here. enter_el0_at() is the bring-up's direct launch path.

mmu.rs: map_user_page() installs an L1->L2->L3 walk for an EL0-RW, EL0-executable (PXN=1/UXN=0) 4 KiB page at a free VA, backing it with a kernel-writable physical page so the kernel can stage user code.

bringup_user.rs: assembles a tiny EL0 program (MOVZ/MOVK + svc #0 sequences) that issues write(64)/getpid(172)/exit(93) syscalls. The SVC handler reads x8 (nr) + x0..x2 (args) from the saved TrapFrame, services the call, writes the return into x0, and erets back to EL0.
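
The SVC servicing step just described can be modelled on the host roughly as follows. This is a sketch under assumptions: the `TrapFrame` layout, `SvcOutcome`, and `handle_svc` are illustrative names, `write` is reduced to "report every byte as written", and only the three syscall numbers quoted above are handled.

```rust
/// Hypothetical saved register image; the real frame also carries ESR_EL1.
#[derive(Default)]
pub struct TrapFrame {
    pub x: [u64; 31], // x0..x30
    pub sp_el0: u64,
    pub elr_el1: u64,
    pub spsr_el1: u64,
}

// Linux aarch64 syscall numbers used by the bring-up program.
pub const SYS_WRITE: u64 = 64;
pub const SYS_EXIT: u64 = 93;
pub const SYS_GETPID: u64 = 172;
const ENOSYS: i64 = 38;

#[derive(Debug, PartialEq, Eq)]
pub enum SvcOutcome {
    /// Result is in x0; the vector stub restores the frame and erets.
    Return,
    /// The program asked to exit with this status.
    Exit(i32),
}

/// Read nr from x8 and args from x0..x2, then write the result into x0.
pub fn handle_svc(frame: &mut TrapFrame, pid: u64) -> SvcOutcome {
    let nr = frame.x[8];
    let (a0, _buf, len) = (frame.x[0], frame.x[1], frame.x[2]);
    match nr {
        SYS_WRITE => {
            // Sketch only: pretend the whole buffer reached the UART.
            frame.x[0] = len;
            SvcOutcome::Return
        }
        SYS_GETPID => {
            frame.x[0] = pid;
            SvcOutcome::Return
        }
        SYS_EXIT => SvcOutcome::Exit(a0 as i32),
        _ => {
            frame.x[0] = (-ENOSYS) as u64;
            SvcOutcome::Return
        }
    }
}
```

Because AArch64 sets ELR_EL1 to the instruction after the `svc`, the sketch does not touch `elr_el1` on the normal return path.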

psci.rs: PSCI SYSTEM_OFF / CPU_ON helpers (conduit-gated on EL>=2).

Verified on qemu-system-aarch64 -M virt -cpu cortex-a72: the EL0 program prints over serial via syscalls, getpid returns a value, and exit(7) is observed (writes=3, exit_code=7). Full chain boots: EL1 -> MMU -> vectors -> GIC+timer -> EL0 -> syscalls -> exit.

* aarch64: wire SVC capture into the shared per-CPU user-GPR snapshot

The aarch64 SVC handler now calls syscall::capture_from_trap(), which stashes the trapping EL0 thread's registers into the per-CPU scratch that backs the architecture-neutral accessors arch::syscall::user_rip / user_rsp / user_gprs the shared dispatch_fast path reads. AArch64 registers map onto the x86-named slots per the Linux aarch64 ABI (x0..x5 -> arg regs, SP_EL0 -> user_rsp, ELR_EL1 -> user_rip, x19..x23/x29 -> callee-saved slots).

Verified on QEMU: after the first EL0 svc, the shared accessors report user_rip=0x400000018 (past the svc) and user_rsp=0x400000f00 (the EL0 stack), the exact contract dispatch_fast consumes. x86_64 build unaffected.

* aarch64: wake a secondary core via PSCI CPU_ON (milestone 8)

Best-effort SMP bring-up (bringup_smp.rs): issues PSCI CPU_ON over the QEMU -M virt HVC conduit to start CPU 1 at a physical secondary entry stub (__secondary_entry) which enables FP, sets up its own stack, and calls into Rust to record its MPIDR and report online over serial. psci.rs cpu_on now uses HVC (the active virt conduit).

The EL1 sync vector treats an Undefined-instruction trap (EC=0, e.g. a PSCI probe where no conduit is active) as a graceful no-op: it sets x0 to PSCI NOT_SUPPORTED and steps ELR past the faulting instruction instead of parking, so SMP probing never wedges the single-core path.

Verified on QEMU -smp 2: PSCI CPU_ON returns 0, CPU 1 comes online and reports MPIDR=1; with -smp 1 the probe returns gracefully and the single-core EL0 path still completes. Full per-CPU scheduling on the secondary is a follow-up; the BSP already runs userspace.
x86_64 build unaffected.

* aarch64: add run-aarch64.sh to build + boot the ARM bring-up under QEMU

Convenience wrapper that builds the aarch64 kernel and boots it with qemu-system-aarch64 -M virt -cpu cortex-a72 (single core, or smp2 to exercise the PSCI CPU_ON path), wiring PL011 serial to the terminal.

* aarch64: route -kernel boot into the shared production kernel_main

The ARM port previously ran only the self-contained bring-up demo. This wires the proven arch primitives into the SAME kernel_main the x86 path uses, so the real subsystem init runs on ARM.

  * FDT/DTB reader (arch/aarch64/fdt.rs): a small hand-rolled flattened device-tree parser (no external crate) that finds the /memory node(s) for RAM discovery. _start now preserves the x0 DTB pointer (saved in x20) and passes it through aarch64_start.
  * Neutral boot memory map (mm::BootMemMap): mm::init no longer reads Limine directly. Both arches translate their source (x86 from the Limine memmap, aarch64 from the device tree) into a fixed-capacity region list consumed by the shared mm::init_from_regions. x86 behaviour is byte-equivalent.
  * Production ARM boot path (arch/aarch64/boot_prod.rs): brings up PL011 serial, the MMU (identity map), EL1 exception vectors, and the GIC, parses the DTB, then calls the shared kernel_main_arch.
  * kernel_main split: the Limine-coupled prologue stays in the x86-only kernel_main; the arch-neutral subsystem init + init-process spawn + idle loop move to shared_init_and_run, called by both arches. Limine request statics are now x86-only so the ARM kernel carries no dead boot section.
  * PL011 wired into the shared logger so early_print + the log framework reach serial on aarch64.

Verified: production kernel_main runs end-to-end on qemu-system-aarch64 -M virt through MMU, frame allocator (2048 MiB discovered), heap, scheduler, security, IPC, WM, VFS, drivers, and the full Cortex AI runtime, reaching the init-process spawn step.
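
A fixed-capacity region list of the kind the neutral boot memory map bullet describes could look roughly like this. The sketch is not the kernel's actual `mm::BootMemMap`: the `Region` fields, the const-generic capacity, and the method names are all assumptions chosen for illustration.

```rust
/// One contiguous physical range reported by the boot source
/// (Limine memmap on x86, /memory nodes of the device tree on aarch64).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Region {
    pub base: u64,
    pub len: u64,
    pub usable: bool,
}

/// Heap-free region list with capacity N, usable before any allocator exists.
pub struct BootMemMap<const N: usize> {
    regions: [Region; N],
    count: usize,
}

impl<const N: usize> BootMemMap<N> {
    pub const fn new() -> Self {
        Self {
            regions: [Region { base: 0, len: 0, usable: false }; N],
            count: 0,
        }
    }

    /// Append a region; returns false (dropping it) when the list is full.
    pub fn push(&mut self, r: Region) -> bool {
        if self.count == N {
            return false;
        }
        self.regions[self.count] = r;
        self.count += 1;
        true
    }

    /// The regions recorded so far, in insertion order.
    pub fn regions(&self) -> &[Region] {
        &self.regions[..self.count]
    }

    /// Total bytes in regions marked usable.
    pub fn usable_bytes(&self) -> u64 {
        self.regions().iter().filter(|r| r.usable).map(|r| r.len).sum()
    }
}
```

Returning `false` from `push` makes a full list visible to the caller, which is the failure mode a later commit in this log addresses by widening the capacity.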
x86_64 kernel still builds (release ELF unchanged in behaviour).

* aarch64: spawn the init process at EL0 and service its syscalls

Completes the ARM port's headline goal: the PRODUCTION kernel_main now spawns a userspace process and services its first syscalls through the SHARED dispatcher.

  * aarch64 user page-table walker (mm/paging.rs): a real 4 KiB-granule L1→L3 TTBR0 walker replaces the non-x86 stub: alloc_user_pml4, map_page_in, translate/update/unmap, free. Each per-process root is seeded with a copy of the kernel's identity-map L1 block descriptors so the kernel stays mapped after write_cr3 loads the process root (the bring-up kernel runs from the TTBR0 low half). free_user_pml4 skips those shared block descriptors so it never frees kernel RAM.
  * SVC → shared dispatch (arch/aarch64/syscall.rs): the production SVC handler captures the EL0 trap frame, calls syscall::dispatch_fast(x8, x0..x4), writes the result back to x0, and on exit/exit_group hands off to the next runnable process (or parks). Installed in arch::early_init.
  * ELF loader (process/elf.rs): accept EM_AARCH64 on ARM via an arch-selected EM_NATIVE machine check.
  * Process spawn (process/mod.rs): arch-conditional USER_STACK_TOP placed inside the 39-bit aarch64 VA window; the x86 glibc POSIX trampoline mapping is skipped on ARM (bare EL0 program, no libc).
  * Scheduler (sched/mod.rs): spawn_kernel_task builds an AAPCS64 first-run frame on aarch64 (x19..x30 + fn-ptr slot matching context_switch/task_entry), instead of the x86 register layout.
  * Native /init (build.rs): for aarch64 builds, embed a tiny EL0 ELF (linked at 64 GiB to clear the kernel identity blocks) that does write + getpid + write + exit using the OSCortex syscall numbers.
  * main.rs: on aarch64 the init is entered directly via enter_user_by_pid_noreturn (the real write_cr3 → ERET path) since the ARM preemptive timer ISR is a follow-on; x86 keeps the schedule_user_launch hand-off.

Verified on qemu-system-aarch64 -M virt: init spawns (pid=1, EL0 entry), prints two lines via write(1,...), getpid returns, exit(0) reaps cleanly and the kernel parks, all through the shared syscall layer, no faults. x86_64 debug + release still build green; no /init is injected on x86.

* aarch64: robust DTB discovery: x0, RAM probe, then arch default

QEMU's -M virt passes the DTB pointer in x0 for a flat/Image boot, but for an ELF -kernel it may leave x0 = 0 and place the blob at an image-dependent address. boot_prod now: (1) uses x0 when it points at a valid FDT (the spec mechanism, also correct on real hardware), (2) probes the RAM base + a bounded low window for the FDT magic, (3) falls back to the QEMU virt default (2 GiB @ 0x40000000, matching -m) when neither yields a /memory node. Adds fdt::scan_for_dtb for the magic-scan path. The parser itself is unchanged and validated against the machine's real device tree.

* aarch64: document the production boot path in run-aarch64.sh

The ARM `-kernel` boot now routes into the shared kernel_main and spawns userspace; update the script header to describe that instead of only the self-contained bring-up demo.

* mm: widen BootMemMap capacity to 128 regions

Ensures a fragmented x86 Limine memory map never drops usable RAM when translating into the architecture-neutral region list.

* aarch64: ramfb framebuffer via fw_cfg, display-capable on -M virt

QEMU -M virt ships no Limine framebuffer (Limine is the x86 boot protocol), so the ARM port had no display. Add a fw_cfg DMA reader/writer and use it to configure QEMU's -device ramfb:

  * kernel/src/arch/aarch64/ramfb.rs: walk the fw_cfg file directory to find the etc/ramfb selector, allocate a 1280x800 XRGB8888 framebuffer in identity-mapped RAM, and DMA-write the big-endian RAMFBCfg into it.
    Two gotchas handled: SELECT and WRITE must be separate DMAs (a combined control word is rejected), and RamfbCfg must be repr(C, packed) so the write length is exactly the 28-byte file size QEMU expects.
  * drivers/fb.rs: add init_raw(addr,w,h,pitch), an arch-neutral entry point so the existing fb console/compositor (fed from Limine on x86) works unchanged on ARM from a raw buffer.
  * main.rs (kernel_main_arch): bring ramfb up after the frame allocator is online, publish it to the shared fb, and paint a boot fill.
  * scripts/run-aarch64.sh: add -device ramfb; DISPLAY_MODE=none|cocoa|gtk|sdl with a headless monitor socket for screendump capture.

Validated on qemu-system-aarch64: monitor screendump is now 1280x800 (was the 640x480 default), with the painted teal top band (20,184,166), navy body (26,26,46) and white boot-console glyphs all present; pixels the kernel wrote are scanned out by ramfb.

* aarch64: timer-driven preemption, shared cooperative scheduler hand-off

Wire the ARM generic-timer IRQ to the SAME shared cooperative-scheduler hand-off the x86 APIC-timer ISR uses, so EL0 threads (the Flutter engine's threads, ultimately) are timer-preempted and round-robined, collapsing the ARM 'enter init directly' path onto x86's timer-driven model.

  * arch/aarch64/apic.rs: production_irq_handler: on a generic-timer PPI taken from EL0, apply the same quantum + input/focus preempt policy as x86, map the live EL1 vector TrapFrame to the shared UserRegs, call the shared process::timer_preempt_switch_try (arch-neutral: PTABLE bookkeeping, CR3/TTBR0, fs_base, xstate, current_pid), then write the next thread's frame back into the TrapFrame so the vector stub's RESTORE_FRAME+eret enters it. init_bsp only installs the handler; start_scheduler_tick arms the timer.
  * Full-fidelity context: the shared UserRegs carries only the x86-named register subset, so each thread's complete ARM frame (x0..x30/SP_EL0/ELR/SPSR) round-trips through a new per-process arch_trapframe slot (process::arch_store/take_trapframe, try_lock, ISR-safe). Scratch registers (x6/x7, x11–x18, x24–x28, x30) survive arbitrary-instruction preemption.
  * IRQ enablement: IRQs are NOT unmasked at EL1; the eret into EL0 (SPSR=EL0t, I/F clear) enables them at the userspace boundary. This closes a sporadic boot hang where a timer fired mid enter_user_by_pid_noreturn (during the CR3 switch / image build); boots went from ~50% hanging to 5/5 clean.
  * mm: reserve the kernel image in the frame allocator on ARM. QEMU -kernel loads the image (0x40080000) into RAM the device tree reports usable from 0x40000000; the allocator marked it all free, so the first large contiguous alloc (the 64 MiB heap, then the compositor's 4 MiB double-buffer) overran live kernel code, a latent corruption that hung the first big alloc once a framebuffer made the compositor enable double-buffering. frame_allocator::reserve_range carves [RAM base..__kernel_end] before the heap is built.
  * build.rs + main.rs: the ARM /init program is now a compute-bound loop (busy-spins with NO syscall between ticks, prints a per-instance tag for a bounded budget, then exits). main.rs spawns TWO instances (A=x0:0, B=x0:1); since neither yields cooperatively, interleaved A/B output on serial can ONLY come from the timer preempting + switching between them.

Validated on qemu-system-aarch64 (5/5 boots): both EL0 threads make progress with fair, interleaved ticks (e.g. A=12 B=12, 7-8 A<->B switches) then exit cleanly. The longer compute-budget variant showed thousands of balanced ticks (A=3341/B=3350, 2148 switches). x86 builds clean (no regression).

* aarch64: port the Flutter embedder host to the ARM syscall ABI

Build oscortex-host for aarch64-unknown-none.
The syscall stubs (sys.rs) gain an SVC #0 backend (nr in x8, args x0..x5, return x0) alongside the existing x86 SYSCALL path; the typed wrappers stay architecture-neutral. _start gains an AArch64 naked entry: zero x29/x30, 16-byte-align SP via a scratch GPR, preserve the kernel's bootstrap args (x0=host_mode, x1=app_id, x2=aot_va) across the breadcrumb write, then call main_embedder(x0,x1,x2). main_embedder now takes those three as C arguments (dropping the brittle read-rdi/rsi/rdx asm), and the monotonic clock reads CNTVCT_EL0/CNTFRQ_EL0 on ARM instead of RDTSC. A user_aarch64.ld linker script + an aarch64-unknown-none .cargo target (static reloc) load it at 0x400000.

Produces a 0x400000-based static AArch64 ELF, mirroring the x86 host.

* aarch64: dynamic-loader relocations + EL0 demand-paging for the engine

dl.rs: accept the aarch64 engine .so. Add R_AARCH64_RELATIVE/ABS64/GLOB_DAT/JUMP_SLOT alongside the x86 relocation set (same RELA encoding, selected by type number), gate the ELF e_machine check on the build arch, and arch-gate the x86-only engine byte-patch block (x86 opcodes/offsets must never run against the ARM engine). One dlopen path now loads either the x86 or the ARM libflutter_engine.so.

vectors.rs: route EL0 data/instruction aborts (EC 0x24/0x20) through the shared demand pager. Reads FAR_EL1 for the faulting VA and decodes the ESR fault-status code into the x86-style present bit so demand_page behaves identically. The Flutter engine demand-pages its Dart heaps and AOT exec regions, so without this every such access parked the core.

paging.rs: flush the stale TLB entry on aarch64 in demand_page's already-mapped fast path (the x86 path uses invlpg).

* aarch64 shell: stage engine+host+snapshot into initramfs, spawn host as pid1

Deliver the arm64 Flutter stack through the initramfs (no Limine on -M virt): the arm64 engine .so, the aarch64 oscortex-host embedder (as /init), the arm64 AOT shell snapshot (libapp.so), icudtl.dat and flutter_assets.
build.rs keeps a staged real /init and ships libflutter_engine.so in the initramfs on aarch64. kernel_main spawns the host as pid 1 with HOST_MODE_SHELL. mm/paging splits the seeded kernel identity block when overlaying user pages; vectors report FAR_EL1; diagnostics trace pid1 syscalls.

* aarch64: fix user-page mapping over seeded identity blocks + serve engine from initramfs

Two bugs blocked the Flutter host on ARM:

1. ELF loader read 0xFF for the host's .rodata. translate_user_page/walk treated the kernel identity-map BLOCK descriptors (seeded into every per-process root) as table pointers, dereferencing their output PA as a table and returning bogus 'already mapped' identity PAs. The loader then reused the identity PA (device memory) instead of allocating a real frame, and splitting a block to overlay one page left the rest of the block as valid identity sub-pages, poisoning every later page too. Fix: walk() returns None at a block descriptor (no 4 KiB leaf there); the ELF loader tracks frames IT allocated this load (BTreeMap) rather than querying the identity-polluted page table.

2. dlopen of libflutter_engine.so hard-failed when no Limine module was present. x86 ships the engine as a Limine boot module; aarch64 has no Limine and ships it in the initramfs. Fix: fall back to the VFS lookup when the module is absent.

Host now prints its breadcrumbs and reaches the engine dlopen on ARM.

* aarch64: native syscall trampolines + fix EL0 data-abort EC so demand paging fires

The Flutter engine's DT_INIT_ARRAY jumps into the POSIX trampoline page to reach libc symbols; that page was (a) skipped entirely on aarch64 and (b) encoded with x86 machine code. Add an AArch64 encode_stub (movz x8,#nr; svc #0; ret and the RetU32/RetAddr/FloatZero/MathSyscallN/SyscallRetAsArg0 shapes), enable map_system_pages on both arches, fill stub padding with RET, and do I-cache maintenance (dc cvau / ic ivau / isb) on the freshly written code pages.
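
To make the basic trampoline shape concrete, the following sketch encodes the three-instruction `movz x8,#nr; svc #0; ret` stub as A64 machine words. The opcode constants are the standard A64 encodings; the function names and the slot-filling helper are hypothetical and cover only the plain syscall shape, not the RetU32/RetAddr/FloatZero variants.

```rust
/// `svc #0`
pub const SVC_0: u32 = 0xD400_0001;
/// `ret` (returns via x30)
pub const RET: u32 = 0xD65F_03C0;

/// Encode `movz x<rd>, #imm16` (64-bit form, no shift).
pub fn movz_x(rd: u8, imm16: u16) -> u32 {
    0xD280_0000 | ((imm16 as u32) << 5) | (rd as u32 & 0x1f)
}

/// Encode the plain syscall stub: load the number into x8, trap, return.
pub fn encode_stub(nr: u16) -> [u32; 3] {
    [movz_x(8, nr), SVC_0, RET]
}

/// Write a stub at the start of a slot and pad the rest with RET, so a
/// stray jump into the padding simply returns to the caller.
pub fn fill_slot(slot: &mut [u32], nr: u16) {
    let stub = encode_stub(nr);
    for (i, word) in slot.iter_mut().enumerate() {
        *word = if i < stub.len() { stub[i] } else { RET };
    }
}
```

On real hardware the words would be stored little-endian into the trampoline page and followed by the `dc cvau` / `ic ivau` / `isb` sequence mentioned above before the page is executed.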

Also fix EC_DABT_LOWER: a data abort from a lower EL (EL0) is EC=0x25, not 0x25's same-EL sibling 0x24. With the wrong constant every EL0 mmap/heap demand fault fell through to report_unhandled instead of the demand pager. The engine now runs its init array through dozens of pthread/locale syscalls and demand-faults its anonymous Dart heap correctly.

* aarch64: correct EL0 data-abort EC to 0x24 + route EL1 user-VA aborts (0x25) through pager

The ARM ARM (D17.2.37) ESR_EL1.EC encodings are:
  0x24 = Data Abort from a lower EL (EL0 user fault)
  0x25 = Data Abort without EL change (EL1 kernel touching a user VA)

efe6c25 set EC_DABT_LOWER=0x25, which is the same-EL (EL1) variant, so EL0 demand faults from the Flutter engine never matched and the pager never ran. Fix EC_DABT_LOWER to 0x24 and add EC_DABT_CURR=0x25 routed through the same demand pager so a syscall handler dereferencing a not-yet-paged user pointer also resolves.

* aarch64: implement R_AARCH64_TLSDESC (+TPREL/DTPREL/DTPMOD) relocations for engine thread_local

The arm64 Flutter engine compiles its C++ thread_local accesses with the TLSDESC model: 'ldr x1,[desc#0]; blr x1' where the descriptor's first word is a resolver and the second its argument. The loader applied ABS64/GLOB_DAT/JUMP_SLOT/RELATIVE but ignored the 14 R_AARCH64_TLSDESC relocations, so the descriptor slots stayed zero and the very first thread_local access in fml::MessageLoop::EnsureInitializedForCurrentThread branched to address 0 (EL0 sync abort, EC=0, ELR=0).

Resolve them statically for the single, fixed-load module:
- emit a tiny '_dl_tlsdesc_return' stub (ldr x0,[x0,#8]; ret) into the last slot of the executable trampoline page (TLSDESC_RESOLVER_VA)
- point every descriptor's word0 at it and write the variant-I TP-relative offset (TLS_TP_OFFSET=16 + module offset) into word1

Also handle TPREL64/DTPREL64/DTPMOD64 for completeness. x86 path untouched.
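
The static TLSDESC resolution described in this commit can be sketched as below. `TLS_TP_OFFSET` and the stub instruction sequence come from the text; the `TlsDesc` struct, the function names, and the reading of "module offset" as symbol value plus addend are assumptions made for the example.

```rust
/// Variant-I TLS: the module's block starts 16 bytes above the thread pointer.
pub const TLS_TP_OFFSET: u64 = 16;

/// A64 words for the return stub: `ldr x0, [x0, #8]` then `ret`.
pub const TLSDESC_RETURN_STUB: [u32; 2] = [0xF940_0400, 0xD65F_03C0];

/// The two-word descriptor the compiler-generated access sequence reads.
#[derive(Debug, PartialEq, Eq)]
pub struct TlsDesc {
    pub resolver: u64, // word0: address the access sequence calls
    pub arg: u64,      // word1: what the resolver returns in x0
}

/// Fill one descriptor for a single, fixed-load module.
/// ASSUMPTION: module offset = symbol value + relocation addend.
pub fn resolve_tlsdesc_static(resolver_va: u64, sym_value: u64, addend: i64) -> TlsDesc {
    TlsDesc {
        resolver: resolver_va,
        arg: TLS_TP_OFFSET
            .wrapping_add(sym_value)
            .wrapping_add(addend as u64),
    }
}

/// Host-side model of what the stub plus the caller compute:
/// the stub returns word1, and the caller adds the thread pointer.
pub fn tls_address(tp: u64, desc: &TlsDesc) -> u64 {
    tp.wrapping_add(desc.arg)
}
```

Because the offset is baked into word1 at load time, the resolver never has to allocate or look anything up, which is why a two-instruction stub is sufficient for a single statically placed module.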

* aarch64: preserve user x30 (LR) across cooperative syscall yields

AArch64 keeps return addresses in the link register, not on the stack, so a thread that yields inside a syscall (epoll_wait/futex/cond_wait) must resume with x30 intact. The cooperative re-entry path (build_image) hard-wired x30=0 ('zero on first entry'), so after FlutterEngineInitialize the engine's first thread resumed from an epoll-block and its next ret branched to address 0 (EL0 sync abort, EC=0, ELR=0, x30=0).

Capture x30 at the SVC boundary into a per-CPU/per-process slot and restore it on SYSRET re-entry, carried through the shared enter path in the otherwise-unused rflags slot (build_image maps rflags->x30 on aarch64; SPSR is constant). Also print x30/x1/x16/x17/pid in the unhandled-exception report for diagnosis. x86 path untouched (rflags stays real FLAGS there).

* aarch64: arch-correct syscall re-exec rewind, drop nested thread-enter, harden FB console

Four blockers past the LR fix:

1. Embedder passed --dart-flags=--old_gen_heap_size=512,... which the engine's switches.cc IsAllowedDartVMFlag denylist FML_LOG(FATAL)s. Drop the switch; the AOT VM uses its defaults (the heap sizing was a stale x86-JIT workaround).

2. The blocking syscalls rewound the saved user PC by a hardcoded 2 bytes to re-execute on resume. That is correct for x86's 2-byte syscall, but aarch64's svc is 4 bytes, so -2 left the PC mid-instruction → EC=0x22 PC-alignment fault. Add process::SYSCALL_INSN_LEN (2 on x86, 4 on arm) + a save_return_context_reexec() helper and route all 14 re-exec sites through it.

3. pthread_create gave the newborn an immediate slice by entering the child from inside the creator's syscall (never returning, delivering r=0 later). On aarch64 this nested enter-while-in-syscall corrupted the creator's resume so the 2nd fml::Thread's pthread_create returned non-zero (thread.cc:80). Gate that path to x86; on arm the child runs via normal cooperative sched.

4. Harden blit_char against col/row overflow so the serial-mirror text console can never panic the kernel (was: u32 overflow during heavy thread spawn).

x86 paths unchanged. Engine now spawns tid=2/3/4 and runs ~1100 serial lines before the FB-console panic that this also fixes.

* aarch64: harden FB text console against geometry overflow; keep nested thread-enter off

Make blit_char/scroll_up/write_byte bail cleanly on a degenerate or corrupted FB geometry (rows/cols == 0) instead of panicking the kernel with arithmetic overflow; the serial-mirror console is never worth a panic. With this the engine spawns its full UI/raster/IO thread set (tid=2/3/4) without the kernel dying. Document why the x86 immediate-child-enter slice stays disabled on arm (it corrupts the creator's cooperative resume → thread.cc:80 abort).

* wip(aarch64): timer-ISR timerfd wakes + pending-wake delivery; cpu-id robustness for single-core ISR; re-enable scheduler tick for engine host; preempt diagnostics

* wip(aarch64): walk EL0 FP chain in unhandled-exception report to symbolise null-jump crash

* wip(aarch64): expand scheduler diagnostics: heartbeat ticks with full thread states, scan tids 1-12, raise log caps

* wip(aarch64): dump raw PTABLE slot pid/state for idx 1-12 to find the vanished worker thread

* aarch64: unmask IRQs during SVC syscall handling so the generic-timer scheduler tick can fire mid-syscall (mirrors x86 sti-on-entry); fixes engine bring-up deadlock where a spinning/cooperatively-yielding syscall masked the timer and froze preemption

* aarch64: make timer-ISR cond-expiry use try-lock for FUTEX_WAITERS (futex_waiter_remove_try); fixes single-core IRQ-masked self-deadlock where a timer tick during spawn_thread/futex syscall spun forever on a held lock

* mm: make PAGE_TABLE_LOCK reentrant on a single core (lock_page_table); fixes demand-pager self-deadlock when a fault re-enters map_user_page while the page-table lock is held (exposed by aarch64 IRQs-on-during-syscall)

* dl: cfg-gate TLSDESC_RESOLVER_VA reference so the x86_64 kernel builds (the aarch64-only resolver trampoline; arm unchanged); keeps x86 green

* aarch64: eager user-GPR capture + IRQ-masked page-table lock; advance engine bring-up past the early ret-to-0 / pager deadlock

Two single-core register/lock-coherence fixes that move the ARM Flutter engine from crashing immediately after FlutterEngineInitialize to running through FlutterEngineRunInitialized into ICU init and worker-thread spawn.

1) Eager user-GPR capture at the SVC boundary (process::mod, vectors). `save_full_user_gprs` read the per-CPU user-GPR snapshot LAZILY at yield time. Syscalls run with IRQs unmasked, so a generic-timer tick can preempt a thread mid-handler, switch to a sibling whose own SVC overwrites the shared snapshot, then switch back, and the yielding thread then persists the sibling's callee-saved regs / x30 into its own context (resume → `ret` to a stale address, e.g. x30=0 → branch to 0). Port the proven x86 fix (GPRS_CAPTURED flag + capture_user_gprs_at_entry): snapshot ONCE, eagerly, while fresh, and make later yield-time saves no-ops. On aarch64 the eager capture runs from inside the IRQ-masked SVC window in the vector dispatch (x86 stays masked until the handler sti's), so the snapshot can't be clobbered before it is persisted.

2) IRQ-masked outer page-table critical section (mm::paging). The reentrant PAGE_TABLE_LOCK depth counter is only sound if the section can't be interleaved by another thread. With IRQs unmasked during syscalls, a timer tick could preempt a lock holder mid-section; a sibling then saw a non-zero depth and proceeded as a bogus "nested" writer, desynchronising the counter from the real lock and eventually stranding the lock held while depth read 0; a later outer acquire then spun forever with IRQs masked in the demand-abort handler (single-core deadlock, observed freezing right at worker-thread stack setup).
Mask IRQs for the whole outer section so it is genuinely uninterruptible; nesting then only ever means true same-stack re-entry (a demand fault during a page-table walk, already IRQs-masked). x86 behaviour is unchanged (cfg-gated to aarch64).

Also: the unhandled-exception reporter now scans the EL0 stack for engine return addresses (the FP chain is empty when x30/FP are 0), so a ret-to-0 can be symbolised against libflutter_engine.so.

Status: engine now reaches ICU init / worker spawn; a residual cooperative-yield corruption remains (nondeterministic ret-to-0 vs abort), the same class the x86 port hardened over several commits. Not yet rendering on ARM.

---------

Co-authored-by: Tahiru Agbanwa <tahiru@users.noreply.github.com>
Co-authored-by: Tahiru Agbanwa <tahiru@oscortex.dev>
Co-authored-by: Tahiru Agbanwa <tahiru@dotcorr.com>
What this brings
The aarch64 Flutter shell path: one source tree, the same Material shell that renders on x86, now builds and boots into the Flutter engine on ARM via direct `qemu-system-aarch64 -M virt -kernel` (RAMFB display, GICv2 + generic-timer preemption, EL0 syscalls). The arm64 engine + AOT shell snapshot are staged into the initramfs by `scripts/build-aarch64-shell.sh` (artifacts published as release `oscortex-engine-1` / prebuilt in the `oscx-engine` container).

Boot reaches: dlopen engine → `FlutterEngineInitialize` OK → AOT confirmed on ARM (`FlutterEngineRunsAOTCompiledDartCode=true`) → snapshot ptrs resolve → `RunInitialized` → ICU init (icudtl mmap) → worker-thread spawn.

Kernel fixes (final commit b98ff19): single-core register/lock coherence

- Eager user-GPR capture at the SVC boundary: ports the x86 `GPRS_CAPTURED` fix; on aarch64 it runs inside the IRQ-masked SVC window (ARM unmasks IRQs before the handler). Stops a timer-preempted sibling's syscall from clobbering the shared per-CPU snapshot and leaking stale callee-saved regs / x30 into a yielding thread's resume context.
- IRQ-masked page-table lock: the `lock_page_table` depth counter desynced from the spinlock when the timer preempted a lock holder, deadlocking the demand-pager. Now masks IRQs for the outer section (`cfg`-gated to aarch64; x86 untouched).

x86 safety

All shared-code changes are `cfg`-gated to aarch64. x86_64 kernel verified to compile (`cargo check --target x86_64-unknown-none`).

Status / not-done

Engine reaches ICU init + worker spawn; a residual cooperative-yield corruption remains (nondeterministic ret-to-0 vs abort), the same class the x86 port hardened over several commits. Next targets: `save_return_context` lazy rip/rsp, and `build_image` not preserving x24–x28 across cooperative-yield resume. Not yet rendering on ARM; merging to keep the integrated ARM path moving, with further hardening to follow.
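
For readers who want the eager-capture fix in miniature, here is a host-side model of the race and its remedy. The names `capture_user_gprs_at_entry` and `save_full_user_gprs` and the captured flag appear in the commit log; the struct layouts, the function bodies, and the `on_return_to_user` reset point are assumptions, not the kernel's code.

```rust
/// Subset of user registers that must survive a cooperative yield.
#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
pub struct UserGprs {
    pub x30: u64,
    pub callee_saved: [u64; 6],
}

/// Per-CPU scratch: overwritten by EVERY SVC taken on this core.
pub struct PerCpu {
    pub snapshot: UserGprs,
}

/// Per-thread saved context plus the "already captured" flag.
#[derive(Default)]
pub struct Thread {
    pub saved: UserGprs,
    pub gprs_captured: bool,
}

/// Called at SVC entry while IRQs are still masked: the per-CPU
/// snapshot is guaranteed to belong to this thread at this point.
pub fn capture_user_gprs_at_entry(cpu: &PerCpu, t: &mut Thread) {
    t.saved = cpu.snapshot;
    t.gprs_captured = true;
}

/// Called at yield time. After an eager capture this is a no-op, so a
/// snapshot clobbered by a sibling's SVC can no longer leak in.
pub fn save_full_user_gprs(cpu: &PerCpu, t: &mut Thread) {
    if t.gprs_captured {
        return;
    }
    t.saved = cpu.snapshot;
}

/// ASSUMPTION: the flag is cleared when the thread re-enters EL0.
pub fn on_return_to_user(t: &mut Thread) {
    t.gprs_captured = false;
}
```

The point of the flag is ordering: the copy happens before IRQs are unmasked, so whatever a preempting sibling writes into the per-CPU scratch afterwards is ignored by the yielding thread.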