SMP home-core app isolation — x86 app-launch fixed; aarch64 SMP advanced (WIP) by squirelboy360 · Pull Request #15 · DotCorr/oscortex

squirelboy360 · 2026-06-17T07:51:13Z

Summary

Home-core SMP engine isolation + the SMP-safety fixes it required. On x86 under SMP, launched apps now initialise their engine and render on a dedicated application core; the OS no longer goes fatal on an app crash. aarch64 SMP is advanced but app-launch is still WIP (honest — see below).

Verified (x86, headless TCG -smp 2)

App launch + render works (~70–85% of runs; intermittent non-fatal first-frame stall). 0 kernel panics / 0 page-table corruption. Single-core (the shipping default) is byte-unchanged.
Shipped as release v0.1.4 (x86 ISO).

SMP-safety fixes (latent bugs the cooperative single-core kernel hid)

Per-CPU page-table lock depth (was a global counter → 2nd core bypassed the lock → page-table corruption).
PTABLE↔WM ABBA lock-order fix (canonicalise pids at the WM boundary).
Reschedule IPI on spawn so a home-pinned core wakes immediately.
Page-table walks degrade gracefully on a freed/garbage root (no kernel fault on crash-recovery teardown).
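
A minimal sketch of the per-CPU depth idea from the first fix above, assuming a re-entrant lock whose recursion depth is tracked per core. `PerCpuDepth`, `MAX_CPUS`, `enter` and `exit` are hypothetical names for illustration, not the kernel's actual symbols. With a single global counter, a second core would see a non-zero depth and skip the real acquire; with one counter per CPU, each core's outermost acquire is detected independently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical CPU count for the sketch.
const MAX_CPUS: usize = 4;

/// Recursion depth of a re-entrant lock, tracked separately for each CPU.
pub struct PerCpuDepth {
    depth: [AtomicUsize; MAX_CPUS],
}

impl PerCpuDepth {
    pub fn new() -> Self {
        Self { depth: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Returns true when this is the outermost acquire on `cpu`,
    /// i.e. the caller must really take the underlying lock.
    pub fn enter(&self, cpu: usize) -> bool {
        self.depth[cpu].fetch_add(1, Ordering::Relaxed) == 0
    }

    /// Returns true when this is the outermost release on `cpu`,
    /// i.e. the caller must really drop the underlying lock.
    pub fn exit(&self, cpu: usize) -> bool {
        self.depth[cpu].fetch_sub(1, Ordering::Relaxed) == 1
    }
}
```

A nested `enter(0)` on core 0 reports "already held", while a first `enter(1)` on core 1 still reports "outermost", which is the property the global counter lacked.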

aarch64 — WIP, NOT working yet (honest)

The AP now runs the app (fixed a real CNTKCTL_EL1 trap), but the engine bootstrap stalls on the cooperative hand-off; the corruption-free fix is a per-CPU switch_to rework (multi-session). Commits are labelled as WIP; single-core ARM is unchanged.

Process

Adds rules.md (mandatory anti-stub engineering rules) + docs/FEATURE_STATUS.md (honest feature ledger).

- PS/2 IntelliMouse 4-byte scroll wiring (kernel) -> EV_SCROLL -> embedder FlutterPointerEvent (signal_kind=kScroll); direction+speed configurable.
- Settings screen in the shell: natural-scroll toggle + speed slider, live via config:* messages over the oscortex/shell channel.
- Kernel boot spinner drawn on the framebuffer during JIT warm-up (drivers/fb.rs::draw_boot_splash, driven by compositor::tick); clears when Flutter presents its first frame. Plus a 'Launching <app>' overlay in the shell.
- Adaptive embedder frame pump: gentle during warm-up, 60fps after input, ~8fps idle (was a 1ms/1000-per-sec flood) -> req:frame ratio 115:1 -> ~7:1.
- Revert MAX_FRAMES to 1<<20 (RAM above 4GiB is not yet safely mappable; the cap is load-bearing) with an explanatory comment.
- docs/arch.txt: document the real JIT execution model + runtime status.

Build the Flutter engine from source as a first-class OSCortex AOT target instead of shimming a Linux engine. Phases, port surface, build infra, and the hack-removal checklist in docs/native-engine-port.md; direction recorded in docs/arch.txt.

Draw the destination ahead of the work: three clean layers (kernel native ABI / shared AOT engine / self-contained AOT app bundles with own PIDs), the shared-framework property, and the hack-removal it enables. No Linux costume.

…ace map Baseline libflutter_engine.so (377MB, x64, embedder API) builds from source in the container — toolchain proven. Mapped the exact port surface: ~1,200 lines (Dart VM os_oscortex/os_thread_oscortex ~1000, fml message_loop_oscortex+paths, reuse posix) + GN glue. The fml message loop is the critical file — its emulated equivalent is what livelocks rendering today, so the native port also fixes sync.

Capture the validated Phase 0 setup as one-command scripts so the repetitive, heavy engine-build infra is reproducible, not tribal knowledge:
- setup-engine-build.sh: idempotent container + depot_tools + pinned gclient sync (encodes the name='.' fix and the flags that work).
- build-engine.sh: gn configure + ninja for baseline | oscortex targets.
- README.md: the contributor flow, the edit->build incremental loop, and the pitfalls already solved (gclient layout, emulation, disk, prebuilt-dart).
Engine checkout stays OUTSIDE the repo (22GB, never committed).

Add 'oscortex' as a --target-os. Since OSCortex has no sysroot/libc yet (runs linux-ABI via emulation), it links against linux but sets a new is_oscortex GN flag to select the OSCortex backend sources later; a true OSCortex toolchain is a later sub-phase. gn configures cleanly (1714 targets; args.gn = target_os linux + is_oscortex true). - engine-port/patches/: the two tracked diffs (gn tool, BUILDCONFIG). - engine-port/apply-port.sh: applies patches + (Phase 2) backend sources into a fresh checkout, idempotent. Temp patch files cleaned from the workspace (no remnants).

This is a public-facing repo; debugging slop and dead-path artifacts don't belong. Removed (all dead, none load-bearing for the working JIT build):
- scratch/ (192 files): the entire AOT-deser debugging tree (analyze_*/disasm_*/assemble_*/check_* scripts, .bin/.log artifacts, and vendored gen_snapshot/shader binaries). Referenced only by build-iso.sh's dead shell-AOT step.
- tools/flutter-engine/libflutter_engine.so.bak: 91MB orphaned 'pristine engine' backup, zero references.
- gen_help.txt, test_size, test_size.c, .DS_Store: stray one-offs.
- build-iso.sh: excised the dead [0.3/5] shell-AOT block (gen_snapshot -> libapp.so -> patch_libapp); the shell runs JIT off kernel_blob.bin, not an AOT libapp.so. Also dropped libapp.so from REQUIRED_FILES. Native AOT is done properly via the engine port (docs/native-engine-port.md).
- .gitignore: ignore /scratch/ and /qemu-pipe.* so they can't return.
Deliberately KEPT (still load-bearing for the current JIT build; removed in the port's Phase 4 once the native engine works): engine_patch.py, patch_libapp.py (still used by tools/build-flutter-osx.sh for the apps), the Linux-emulation shim.
Verified: build-iso.sh and build-kernel-iso-fast.sh both pass bash -n; the working fast build references neither scratch/ nor libapp.so.

Add the OSCortex fml platform backend and wire is_oscortex selection. Verified: fml.message_loop_oscortex.o + fml.paths_oscortex.o compile into libfml, the linux message loop is excluded, and message_loop_impl.cc selects MessageLoopOscortex. - engine-port/src/flutter/fml/platform/oscortex/: message_loop_oscortex.{cc,h} (epoll+timerfd loop — the sync-bug-fixer, starts as a clean clone to diverge to native primitives) + paths_oscortex.cc. - patches/0003: build_config.h defines FML_OS_OSCORTEX (additive, gated on the FLUTTER_OSCORTEX define), message_loop_impl.cc selects MessageLoopOscortex first, fml/BUILD.gn swaps in the oscortex sources (excludes linux) + sets the define. - apply-port.sh applies 0003 + copies src/. Note: the backend uses the Linux-ABI calls (epoll/timerfd) that OSCortex natively implements — OSCortex is a native kernel + own libc that is Linux-ABI-compatible (like Fuchsia/Starnix), not Linux. These calls are the divergence seam if we ever move to a fully custom ABI.

…t VM clone - Verified a release/AOT oscortex build configures and gen_snapshot is buildable from our tree -> engine+snapshot version-matched, dissolving the old AOT dead-end (the '1247 base objects' mismatch). AOT is now a build step, not a research problem. - Deferred the Dart VM os_oscortex clone: on path A os_linux.cc already works; cloning it byte-identically is busywork. Add a Dart VM backend only when a primitive must actually diverge.

libflutter_engine.so (377MB) links for the oscortex target and bakes in OUR backend: 17 MessageLoopOscortex refs, 0 MessageLoopLinux. First Flutter engine built from source FOR OSCortex. Refined the gn target (patch 0001): oscortex now configures as a HOST linux-x64 build + is_oscortex flag (identical to the proven baseline, our platform backend selected), instead of an explicit target_os that tripped ANGLE/wayland scope and embedder-constructor mismatches. Runtime still renders software via kSoftware.

…ld) flow Make the from-source build a one-time maintainer task, not something every dev repeats — exactly how Flutter distributes its own engine. - artifact.config: pins ARTIFACT_VERSION + Flutter rev + the R2 base URL. - fetch-engine.sh: CONSUMER flow — downloads the pinned prebuilt engine (libflutter_engine.so + gen_snapshot + icudtl.dat) from Cloudflare R2, verifies sha256, stages it where the OS build expects. Seconds, no checkout. - publish-engine.sh: MAINTAINER flow — packages the built artifacts and uploads to R2 (rclone or aws/S3 to the R2 endpoint). Run after build-engine.sh when the port/version changes; bump ARTIFACT_VERSION. - README: two-tier model (consumers fetch, maintainers build+publish), R2 layout, multi-arch = one tarball per ISA. Scaffolding is ready now; first publish happens once the release/AOT engine lands (Phase 3). Set ARTIFACT_BASE_URL to your R2 URL to activate fetch.

- setup-r2.sh: create the R2 bucket + enable public r2.dev URL + auto-write ARTIFACT_BASE_URL into artifact.config. Idempotent. Needs Cloudflare auth first (npx wrangler login, or CLOUDFLARE_API_TOKEN). - publish-engine.sh: upload via wrangler (npx, no install) as the primary R2 backend, rclone/aws as fallbacks. - README: one-time R2 host setup steps. Bucket creation requires the account owner's Cloudflare auth, so this is the single manual step; everything else is automated.

R2 needed dashboard activation; switched the artifact host to Google Cloud Storage (gs://dotcorr-oscortex-engine, public read, billing active). Verified end-to-end: gcloud upload -> anonymous https://storage.googleapis.com/... fetch. - artifact.config: ARTIFACT_BASE_URL -> GCS public URL + GCS_BUCKET. - publish-engine.sh: gcloud storage cp as the primary backend (wrangler/rclone/aws remain as fallbacks). fetch-engine.sh is unchanged (curl, host-agnostic).

The repo is public, so GitHub Release assets download anonymously — free, no egress fees (vs GCS ~$0.12/GB), no separate cloud account, and gh is already authed. Verified end-to-end: gh release upload -> anonymous https://github.com/DotCorr/oscortex/releases/download/... fetch. - artifact.config: ARTIFACT_BASE_URL -> GitHub releases download URL; GITHUB_REPO. - publish-engine.sh: gh release create/upload as the PRIMARY backend (GCS/R2/S3 remain as documented fallbacks). fetch-engine.sh unchanged (host-agnostic curl). - README: GitHub Releases is the host; no bucket setup needed. - Removed the unused GCS bucket (no remnants).

…erified The release oscortex engine (33MB, vs 377MB debug — no JIT) and a version-matched gen_snapshot (6.5MB) built from one tree. Published the first artifact to GitHub Releases (oscortex-engine-1) and verified the full round trip: publish-engine.sh -> gh release -> fetch-engine.sh downloads+checksums+stages. Engine + gen_snapshot from the same tree are version-matched, which dissolves the old AOT dead-end (the '1247 base objects' mismatch). Fix: publish-engine.sh derives the workspace from the container's /work mount (robust to the workspace dir name) instead of assuming a fixed name.

shell app -> frontend_server -> AOT dill (24MB) -> our version-matched gen_snapshot -> libapp.so (4.4MB native ELF). Verified a real AOT snapshot: all 4 _kDart*Snapshot{Data,Instructions} present with REAL native instructions (T/text, not zeroed stubs), ELF64 x86-64. The multi-session 'Snapshot expects N base objects, provided 0' blocker is dissolved because engine + gen_snapshot are built from one tree (version-matched). compile-app-aot.sh makes it reproducible.

The native, from-source, AOT-compiled Flutter engine RENDERS the shell on OSCortex: FlutterEngineRunsAOTCompiledDartCode -> libapp.so AOT path -> present_callback 393 frames, ZERO JIT warmup (no kernel_blob, no codegen). The UI comes up immediately, no 60-90s compile. This is the whole point of the port. Fix to get here: the release/AOT engine rejects the JIT-era GC/heap dart-flags (switches.cc:478 disallowed) — pass only the 5 AOT-safe engine args (argc=5) when is_aot, dropping the old --old_gen_heap_size etc. Known follow-up (separate, a regression from this session's IntelliMouse 4-byte scroll wiring): mouse clicks/Y are mangled + pointer events not reaching the engine; to fix next.

The 4-byte scroll-packet mode added this session corrupted pointer data: dy mis-parsed (cursor pinned to the top, y=0) and the flags byte mis-parsed (buttons stuck at 0), so clicks never registered and pointer events stopped reaching the engine. Revert to the proven 3-byte packet mode (working clicks > broken wheel). Verified: cursor Y tracks normally again (y=32 start, not pinned at 0). Scroll to be re-added later with correct 4-byte parsing + resync that doesn't regress clicks.

Input is confirmed working under AOT (122 pointer events reached Flutter, hover + clicks register, button presses detected, dropped_total=0 — the earlier '0' was the broken IntelliMouse 4-byte mode + minimal interaction, fixed by the 3-byte revert). Remove the temporary DIAG logging from the embedder + ipc_display. engine-port/compile-app-aot.sh: per-app AOT compile (frontend_server -> dill -> version-matched gen_snapshot -> libapp.so), no patch — used to give each app its own AOT bundle so the AOT engine can run them.

Previously every launched app reused the shell's /system/flutter/libapp.so (aot_va=0 JIT-era skip), so tapping a tile ran the shell snapshot in the app host and stalled. Resolve the per-app snapshot by registry lookup: build_app_libapp_path(app_id) -> /Applications/<name>.app/libapp.so, dlopen it, and point the AOT snapshot loader at that path. Shell keeps its own path. Each app is AOT-compiled to its own libapp.so and staged under its .app bundle, so a launched host loads its own native snapshot, not the shell's.

pthread_cond_signal/broadcast with no waiter parked is a no-op by contract: the predicate is mutex-protected, so a thread that hasn't yet entered cond_wait observes it under the lock and never waits. The kernel was instead recording such signals in COND_PENDING_SIGNALS and letting the next cond_wait consume one as an immediate (spurious) return. A hot mutator (pid 2) that signals its own monitor with woke=0 then re-waits would consume its own fake pending, return 0, find its predicate still false, and re-wait forever — the cond-pending-consume livelock that froze render at a fixed frame count. Remove the mechanism end to end (consume in cond_wait, posts in cond_signal/cond_broadcast, the state map, the import). cond_wait now relies solely on the seq protocol, which delivers every real signal race-free under cooperative single-core scheduling (the value-check and park cannot interleave with a signaler). No remnants.
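
The contract described above can be illustrated with the standard library's `Mutex`/`Condvar`, as a user-space stand-in rather than the kernel's own cond_wait. The helper name `signal_before_wait` is made up for this sketch. The signaler sets the predicate and notifies before any waiter exists; the waiter then checks the predicate under the lock and never parks, so the "lost" signal costs nothing and no pending-signal bookkeeping is needed.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Signals before any waiter exists, then starts the waiter.
/// Returns the predicate value the waiter observed.
pub fn signal_before_wait() -> bool {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));

    // Signaler: set the predicate under the mutex, then notify.
    // No thread is parked yet, so the notify itself does nothing.
    {
        let (lock, cvar) = &*pair;
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    }

    // Waiter: re-check the predicate under the same mutex before parking.
    let waiter = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || {
            let (lock, cvar) = &*pair;
            let mut ready = lock.lock().unwrap();
            while !*ready {
                // Not reached in this scenario: the predicate is already true.
                ready = cvar.wait(ready).unwrap();
            }
            *ready
        })
    };
    waiter.join().unwrap()
}
```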

sys::dlopen forwards path.len() to the kernel as the path length, so a NUL-terminated byte string makes the kernel read the trailing \0 as part of the filename and the open fails. The per-app and shell AOT paths were passing "...libapp.so\0"; strip the NUL at the dlopen call. The AOT snapshot itself loads via aot_snapshot_load (which maps it executable), so this only silenced a spurious failure path, but it is a real bug and removes the warning.
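
A small sketch of the strip described above, assuming the path is held as a byte slice and the callee takes an explicit length. `strip_trailing_nul` is a hypothetical helper name; the actual call site may handle it differently.

```rust
/// Drops a single trailing NUL so the length handed to a length-based
/// open call covers only the file name bytes.
pub fn strip_trailing_nul(path: &[u8]) -> &[u8] {
    match path.split_last() {
        Some((&0, rest)) => rest,
        _ => path,
    }
}
```

Paths without a terminator pass through unchanged, so the helper is safe to apply at every call site.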

The Debug-level hot per-syscall traces (epoll_ctl, mprotect, cond-signal, every keypress) each block on a synchronous COM1 UART write AND re-render the framebuffer text console, with interrupts disabled. Measured: ~14000 log lines during a single boot+app-launch. Dropping to Error cut serial volume ~12x (14000 -> 1185 lines), removing tens of seconds of emulated-bare-metal warmup. Raise the one line in logger::init back to Debug for deep tracing.

Root cause of the sporadic render crashes. The user GPR snapshot lives in a PER-CPU scratch (gs:[..]) written by the syscall entry stub and shared by every thread on the core. save_full_user_gprs read it LAZILY at yield time, but the wait loops do sti;hlt and the timer ISR can switch threads mid-handler, so another thread's syscall entry overwrites the per-CPU snapshot before our save runs. We then stored the OTHER thread's callee-saved regs (rbx/rbp/r12-r15) into our context; on resume rbx was garbage. Pinpointed via addr2line/objdump: fml::MessageLoopOscortex::Run() resuming from epoll_wait wrote running_ through this=0x1400 (=1280*4, a Skia row stride leaked from another thread) -> SIGSEGV pid=3. This is the whole 0/16/140/489-frames-across-identical-boots sporadicity.
Fix: capture_user_gprs_at_entry() snapshots the GPRs into the thread's own PTABLE slot at dispatch_fast entry, while the per-CPU snapshot is still fresh for this thread (before any handler/yield/interrupt window). A per-CPU 'captured' flag makes later yield-time save_full_user_gprs calls no-ops, so a clobbered shared snapshot can't leak in.
Validation: 4/4 headless boots render cleanly (present 98-105) with ZERO engine SIGSEGV, vs the prior sporadic crashes. Render is now reliable.
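
A simplified model of the eager-capture fix, written as plain Rust rather than the kernel's naked-entry code. The struct layouts, the three-register subset, and the choice to reset the flag in `syscall_entry` are assumptions made for the sketch; only the function names `capture_user_gprs_at_entry` and `save_full_user_gprs` come from the description above.

```rust
/// Callee-saved user registers (a subset, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
pub struct Gprs {
    pub rbx: u64,
    pub rbp: u64,
    pub r12: u64,
}

/// Per-CPU state: the scratch snapshot the entry stub writes, plus the flag.
#[derive(Default)]
pub struct PerCpu {
    pub scratch: Gprs,
    pub captured: bool,
}

/// Per-thread saved context (stands in for the thread's PTABLE slot).
#[derive(Default)]
pub struct UserThread {
    pub saved: Gprs,
}

/// Syscall entry stub: overwrites the shared per-CPU scratch.
/// Resetting the flag here is an assumption of this model.
pub fn syscall_entry(cpu: &mut PerCpu, user_regs: Gprs) {
    cpu.scratch = user_regs;
    cpu.captured = false;
}

/// Eager capture at dispatch entry, while the scratch still belongs to `t`.
pub fn capture_user_gprs_at_entry(cpu: &mut PerCpu, t: &mut UserThread) {
    t.saved = cpu.scratch;
    cpu.captured = true;
}

/// Yield-time save: a no-op once the eager capture has run.
pub fn save_full_user_gprs(cpu: &PerCpu, t: &mut UserThread) {
    if !cpu.captured {
        t.saved = cpu.scratch;
    }
}
```

In this model, thread A's context stays intact even if thread B enters a syscall (and clobbers the scratch) between A's entry and A's yield-time save.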

The frame pump only called FlutterEngineScheduleFrame in the no-event branch, so a hover/click/scroll waited for the event queue to drain plus a wm_event_wait timeout (up to ~16ms) before the repaint was even requested. Request the frame in the same iteration input is received; the engine coalesces redundant requests and Flutter flushes the batched pointer packet at BeginFrame, so the scheduled frame reflects the event. Removes the scheduling-side input latency (most visible on real hardware; under cross-arch QEMU TCG the emulation dominates).

Each submitted frame did up to 5 full 1M-pixel passes: a strided re-pack into a freshly allocated 4MB Vec, a byte-indexed B<->R swap, a full-screen fill_rect clear, blit_rgba32, and swap_buffers. Measured ~65ms median (wildly variable 10-92ms) -> a ~15fps ceiling from the blit alone, which read as a low refresh rate / laggy feedback.
- Fuse the strided re-pack into the swap: submit_bgra_impl reads the source stride directly and packs in ONE u32-wise pass (read BGRA as u32, swap bytes 0<->2), eliminating the intermediate buffer + the 4MB/frame allocation.
- Skip the full-screen fill_rect clear when a presented surface already covers the screen (the common case: one full-screen Flutter surface).
Measured after: blit median 5.4ms, steady 5.0-7.6ms (~12x faster, variance gone). Render verified correct + crash-free; colors unchanged.
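
A sketch of the fused pass, assuming a little-endian BGRA source with a row stride in bytes and a tightly packed `u32` destination. `pack_swap_rb` and its signature are hypothetical; the real submit_bgra_impl is not shown in this PR text. Each pixel is read once as a `u32`, bytes 0 and 2 are exchanged with masks and shifts, and the result is written straight to the destination with no intermediate buffer.

```rust
/// Copies `width` x `height` pixels from a strided BGRA source into a packed
/// destination, swapping the B and R channels in the same pass.
pub fn pack_swap_rb(src: &[u8], stride: usize, width: usize, height: usize, dst: &mut [u32]) {
    assert!(stride >= width * 4, "stride must cover one row of pixels");
    assert!(dst.len() >= width * height, "destination too small");

    for y in 0..height {
        // Only the first `width * 4` bytes of each row are pixels; the rest is padding.
        let row = &src[y * stride..y * stride + width * 4];
        let out = &mut dst[y * width..(y + 1) * width];
        for (px, o) in row.chunks_exact(4).zip(out.iter_mut()) {
            let p = u32::from_le_bytes([px[0], px[1], px[2], px[3]]);
            // Keep bytes 1 and 3 (G, A); exchange bytes 0 and 2 (B, R).
            *o = (p & 0xFF00_FF00) | ((p & 0x0000_00FF) << 16) | ((p >> 16) & 0x0000_00FF);
        }
    }
}
```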

next_runnable_pid_locked computed the foreground-exclusive group AFTER the input-target and embedder-baton shortcuts, so a due shell (pid 1) baton could schedule the shell engine even while an app is foreground — two heavy Flutter VMs on one cooperative core, the documented cause of the launched app's crash. Compute fg/exclusive FIRST and gate both shortcuts: suppress the pid-1 baton when an app is exclusive, and only honour the input shortcut for a target in the foreground group. No behavioural change when the shell is foreground (the common case): exclusive=false short-circuits both gates. Closes the baton concurrency hole; full app-launch validation still pending an interactive tile-tap (HMP mouse_button injection does not produce reliable clicks headless).

Replace the generic reply-null catch-all in platform_message_callback with a real channel dispatcher (match on channel name). Every channel the framework reaches for is now routed to a concrete handler that does the platform work or returns a codec-correct typed ack; the final catch-all logs the channel name ([embedder/chan] unbound: <name>) so nothing stays an invisible stub. Bound channels:
- flutter/textinput (JSONMethodCodec): setClient/setEditingState/show/hide/clearClient. Editing state (text + selection) is maintained in the embedder. PS/2 set-1 scancodes are now mapped to characters (shift/caps, backspace, enter, tab, space, arrows, home/end, delete); on a key press with an active text client the stored editing state is mutated and TextInputClient.updateEditingState is pushed back over flutter/textinput. Adds a small inline JSON reader/writer (no_std, no crates) for the editing-state maps.
- flutter/mousecursor: parse activateSystemCursor kind and ack (no kernel set-cursor-shape syscall exists yet; logged + acked).
- flutter/platform: Clipboard.setData/getData/hasStrings backed by an in-embedder buffer; SystemNavigator.pop; SystemSound/HapticFeedback/SystemChrome acked.
- flutter/navigation, system, accessibility, spellcheck, processtext, menu, contextmenu, scribe, restoration, keyevent, platform_views, isolate, lifecycle: explicit JSON typed-null ack.

Replace the ack-only stubs from the platform-channel contract with actual OS-provided capabilities. OSCortex is the platform under the stock engine, so where Flutter needs a platform service the kernel now implements and binds it.
Mouse cursor shape (flutter/mousecursor.activateSystemCursor):
- compositor: ACTIVE_CURSOR_SHAPE atomic + vector cursor sprites (arrow, I-beam, hand/link, forbidden, grab, horizontal/vertical resize, hidden). draw_software_cursor dispatches on the active shape; set_cursor_shape() repaints immediately.
- new syscall SYS_CURSOR_SHAPE_SET (0x4B2). Embedder maps the Flutter cursor kind string to a CURSOR_SHAPE_* and calls it, so hovering a link shows a hand, a text field shows an I-beam, etc.
Semantics / accessibility:
- embedder wires update_semantics_callback2 (FlutterProjectArgs off 280) and calls FlutterEngineUpdateSemanticsEnabled(engine, true) after run. The callback receives the FlutterSemanticsUpdate2 tree and stores each node (id, label, rect, flags, actions) in a live embedder structure for a11y / automation consumers. flutter/accessibility now replies with the correct StandardMessageCodec null (0x00), not JSON.
System clipboard (flutter/platform Clipboard.*):
- kernel-global clipboard buffer (embedder::clipboard) shared across every app/host, with SYS_CLIPBOARD_SET (0x4B3) / SYS_CLIPBOARD_GET (0x4B4). The embedder routes setData/getData/hasStrings to the kernel, so clipboard survives across apps.
SystemNavigator.pop (flutter/platform):
- SYS_APP_CLOSE_FOREGROUND (0x4B5): refocuses the shell (pid 1) and wakes it. The app embedder, when it is a launched host, flushes its reply, calls the syscall and exits so focus returns to the shell.
SystemSound.play (flutter/platform):
- PC-speaker beep driver (drivers::beep) via PIT channel 2 + port 0x61, exposed as SYS_BEEP (0x4B6). click vs alert play distinct short tones.
Deliberate no-ops (no such hardware), acked with the correct codec:
- HapticFeedback.* (no vibration motor), SystemChrome.* (single full-screen compositor surface, no system UI overlays / orientation).

Brings in the embedder channel dispatcher, text input (flutter/textinput), and real OS-backed capabilities: cursor-shape sprites + SYS_CURSOR_SHAPE_SET, live semantics, kernel-global clipboard, SystemNavigator.pop return-to-shell, and a PC-speaker beep driver. Builds green (embedder + kernel).

Grounds the browser effort from two research passes: - Engine decision: Servo, not Chromium. Prebuilt CEF is impossible on a glibc-less microkernel; Chromium-from-source is a multi-year team effort (cf. Fuchsia); Cobalt is an app-subset; WPEWebKit is full but a heavy C/C++ port. Servo is Rust, single-process embeddable, and has a software render path for our framebuffer. - Integration: Flutter external textures are GPU-only, so the engine renders into its OWN OSCortex compositor surface (z-stacked under the Web Link chrome, clipped to the viewport), with the Flutter side as a transparent hit-rect that forwards input. Reuses the existing compositor — no engine patch. - The oscortex/webview MethodChannel contract (methods + events). - Two-track phased plan (engine-agnostic scaffold de-risks the design ahead of the multi-month Servo bring-up) + honest effort estimate.

…r, scaffold 1/n) The engine-agnostic app-side of the native webview, per docs/browser-architecture.md.
- OscWebViewController: the oscortex/webview MethodChannel API (create/loadUrl/back/forward/reload/canGoBack/currentUrl/getTitle/evalJs/resize/setViewport/dispatchInput/dispatchScroll/dispatchKey/setVisible) + engine events (urlChanged/titleChanged/loadProgress/loadStarted/loadFinished/loadError/navState/scrollChanged), fanned out to instances by viewId via one channel handler.
- OscWebView: a transparent placeholder that reserves the web region, reports its on-screen rect (setViewport, so the kernel positions+clips the web surface), and forwards pointer/scroll input in webview-local coords. No Flutter texture: the web pixels are a sibling OS compositor surface showing through.
6 device-free tests (mocked channel): method args, reply round-trips, event callbacks, and per-viewId routing. analyze clean. Servo plugs in behind this later; next: wire Web Link's chrome to it + a stub engine service to prove the pipeline.

…fold 2/n) Turns the Web Link mockup into a real browser chrome wired to the oscortex_webview controller (matches the oscortex_ui design system): - toolbar: back/forward (gated by navState), reload↔stop (by load state). - omnibox: URL-or-search address bar — explicit schemes pass through, bare hosts get https, anything else becomes a web search (the "search" wrapper). - a 2px accent progress line driven by loadProgress. - the web region hosts OscWebView once a render surface exists, else a brand-styled "web engine starting…" placeholder. oscortex_webview: OscWebViewController now tolerates a missing engine service (MissingPluginException → inert) so the chrome runs before the backend exists. Package tests stay green (6); both the package and the app analyze clean. Engine-agnostic per docs/browser-architecture.md; navigation lights up once the embedder oscortex/webview handler + a web-engine service land.

…b (browser, backend 1/n) The pipeline behind the Web Link chrome — engine-agnostic; a Servo service later replaces the stub renderer behind this same contract. - sys.rs: surface geometry/z/clip/visibility wrappers (packed args matching the kernel) so the app can own + place a second (web) surface. - main.rs: StandardMethodCodec decode (generalized from std_find_kind: std_method_name / std_arg_str / std_arg_int) + a fixed-buffer encoder (no alloc). - handle_webview_channel: create → makes a compositor surface (owned by the app group), mmap-fills a placeholder, replies {surfaceId}; setViewport → geometry + clip + z (stacked above the Flutter surface, in the web region); loadUrl → tints + emits loadStarted/urlChanged/loadProgress/loadFinished/navState; destroy/query methods handled; the rest acked. - events sent to Dart via FlutterEngineSendPlatformMessage on oscortex/webview. Compiles on x86_64 + aarch64. Renderer is a stub (solid fill) — proves the full app→channel→surface→composite→events path; NOT yet boot-verified. Servo bring-up is gated on Rust std + a software-GL port (see next).

`"${KERNEL_FEATURE_ARGS[@]}"` errors with "unbound variable" under `set -u` on macOS's stock bash 3.2 when no KERNEL_FEATURES is set (the normal case) — it broke `X86_AOT=1 SKIP_CORE_APPS=1 build-iso.sh` at the kernel step. Use the `${arr[@]+"${arr[@]}"}` idiom, which is empty-array-safe on bash 3.2+. Also gitignore the osx CLI's per-app .osx/ config + build/osx/ outputs.

… hangs A 2016 Retina MacBook Pro hung the boot at cortex::compositor. Two causes:
1. compositor::init → fb::set_double_buffer allocated a full-fb back buffer with `vec![0u32; pitch*height]` (~20 MiB on 2880x1800). On a big fb / tight heap that aborts and hangs. Now allocate fallibly (try_reserve_exact) with a sanity ceiling; on failure fall back to single-buffer / direct-to-fb rendering (fully supported: every fb write path already checks DOUBLE_BUFFER_ACTIVE). It just tears; it boots.
2. Once double-buffering turns on, the per-phase boot markers render to the back buffer but nothing swaps until the engine warm-up loop, so the splash freezes at "compositor" and MASKS a hang in any later phase. The bp! milestone macro now swaps after each render, so every phase actually shows on screen.
Verified: x86 still boots + renders the shell in QEMU (double-buffer path intact). Both arches compile.
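
A sketch of the fallible back-buffer allocation from cause 1, using `Vec::try_reserve_exact` from the standard library. `alloc_back_buffer` and the `max_pixels` ceiling are hypothetical names and parameters; the kernel's actual ceiling value is not given here. Returning `None` is the signal to stay on the single-buffer path.

```rust
/// Tries to allocate a zeroed back buffer of `pixels` u32 entries.
/// Returns None when the request is empty, exceeds the sanity ceiling,
/// or the allocator cannot satisfy it, so the caller can fall back to
/// direct-to-framebuffer rendering instead of aborting.
pub fn alloc_back_buffer(pixels: usize, max_pixels: usize) -> Option<Vec<u32>> {
    if pixels == 0 || pixels > max_pixels {
        return None;
    }
    let mut buf: Vec<u32> = Vec::new();
    // Fallible reservation: reports failure instead of aborting on OOM.
    buf.try_reserve_exact(pixels).ok()?;
    // Capacity is already reserved, so this fill does not reallocate.
    buf.resize(pixels, 0);
    Some(buf)
}
```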

… networking, 6/n) Apps can now resolve hostnames, not just connect by IP. - smoltcp: enable socket-dns/proto-dns + dns-max-* config. - net::tcp: a DNS socket in the stack (seeded with a public resolver 8.8.8.8; DHCP-provided servers are a follow-up refinement) + dns_resolve(name): starts an A-record query and drives iface.poll until it resolves or a ~5s timeout, reusing the same proven interface (routing/ARP) that TCP uses. Returns the first A record (BE u32) or a negative errno. - SYS_DNS_RESOLVE (0x4B8, already NET-gated/reserved) → sys_dns_resolve(name_ptr, name_len) reads the name from userspace and calls dns_resolve. Compiles both arches. Runtime resolution verification (needs a network with DNS) + exposing it on the oscortex/net channel for Dart apps land next.

…e (app networking, 7/n) Completes app-facing DNS (kernel resolver landed in c34c50c). - embedder: sys::dns_resolve wrapper (SYS_DNS_RESOLVE) + oscortex/net op 0x06 (resolve): [0x06, hostname] → reply u32 LE (BE-order IPv4), 0 = failure. Uses a raw u32 (not the i32 helper) so a high-bit IP like 200.x.x.x isn't misread as a negative error. - Dart: OscortexSocket.resolve(host) → dotted-quad string or null. +2 tests (frame + reply parsing, failure sentinel); 10 tests total, analyze clean. So apps can now resolve(host) → ip, then connect/httpGet by IP. Compiles both arches. Runtime resolution still needs a network with reachable DNS to confirm.
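
A sketch of decoding the resolve reply, under the assumption that the 4 reply bytes are a little-endian `u32` whose most significant byte is the first octet and that 0 is the failure sentinel. `parse_resolve_reply` is a hypothetical name, and this is Rust rather than the Dart used by OscortexSocket. It also shows why the value is kept unsigned: 200.1.2.3 has its high bit set and would read as negative through an `i32`.

```rust
use std::net::Ipv4Addr;

/// Decodes a resolve reply: 4 bytes, little-endian u32, 0 means failure.
pub fn parse_resolve_reply(reply: &[u8]) -> Option<Ipv4Addr> {
    let bytes: [u8; 4] = reply.get(..4)?.try_into().ok()?;
    let value = u32::from_le_bytes(bytes);
    if value == 0 {
        return None; // failure sentinel
    }
    // Ipv4Addr::from(u32) treats the most significant byte as the first octet.
    Some(Ipv4Addr::from(value))
}
```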

Completes the DNS resolver: on the DHCP ACK, capture cfg.dns_servers and point the DNS socket at them (update_servers) — so name resolution uses the network's own resolver (incl. QEMU user-net's 10.0.2.3) rather than only the seeded 8.8.8.8. Borrow-careful: the servers are captured into a local while the DHCP socket is borrowed, then applied once that borrow ends. Compiles both arches.

… default Real hardware (a 2016 Skylake MacBook Pro) hung at `cortex::smp`: the BSP woke the Application Processors, an AP faulted in ap_init, and the BSP then ground through a spin-count online-wait (PAUSE is ~140 cycles on Skylake, so the old 200x5M budget was minutes per AP) — a boot hang. Single-core is the proven-stable config: the post-app-launch freeze is fixed by the serial-GC engine, not SMP, and co-scheduling the engine across cores does not converge the Dart GC stop-the-world safepoint. So make single-core the default and put AP bring-up behind a feature: - New `smp` Cargo feature (default OFF). Default boots single-core on both arches; build `--features smp` to develop AP bring-up. - x86_64: AP bootstrap + online-wait gated behind `smp`; the wait is now bounded in WALL-CLOCK time (rdtsc_ns, 0.5s/AP) instead of an unbounded spin count. - aarch64: wake_aps() (PSCI CPU_ON) gated behind `smp` to match — it only idled the woken cores (ap_main parks in wfi) and risked the same real-hardware hang if an AP faulted in the trampoline. x86 single-core boot verified: the ISO renders the full shell under QEMU -smp 2. aarch64 compile-verified both ways; the BSP render path is unchanged (the gate only removes the idle-AP wake).
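
A user-space sketch of the wall-clock-bounded online-wait, using `std::time::Instant` as a stand-in for the kernel's rdtsc_ns. `wait_online` and the `online` flag are hypothetical names. The point is that the budget is a duration, so a slow PAUSE instruction cannot stretch the wait the way a fixed spin count did.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Spins until `online` becomes true or `budget` of wall-clock time elapses.
/// Returns true if the flag was observed set, false on timeout.
pub fn wait_online(online: &AtomicBool, budget: Duration) -> bool {
    let deadline = Instant::now() + budget;
    while !online.load(Ordering::Acquire) {
        if Instant::now() >= deadline {
            return false; // give up on this AP and continue booting
        }
        std::hint::spin_loop();
    }
    true
}
```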

…uler foundation) Start the real SMP scheduler effort with the Redox-blueprint "non-preemptable bail", the piece that previously corrupted preempted threads. Adds a per-CPU preempt-disable depth, auto-engaged while a CPU holds PTABLE_LOCK, and makes the timer-preempt path bail when this CPU is non-preemptable. This also closes a latent single-core x86 hole: the timer ISR's timer_preempt_switch_try does a RECURSIVE PTABLE_LOCK try_lock, which succeeds when the interrupted thread already holds the lock, so it could then switch threads mid-critical-section, pinning `holder` to this CPU and wedging every later PTABLE_LOCK user. The bail closes it.
- PREEMPT_DEPTH per-CPU counter + preempt_disable_cpu/enable_cpu, hooked into the OUTER PTABLE_LOCK acquire/release (balanced; recursive acquires don't re-toggle).
- preempt_disabled() checked at the top of timer_preempt_switch{,_try} → return None.
- CONTEXT_SWITCH_LOCK declared for the M4 two-layer switch_to (reserved).
- docs/smp-architecture.md: the full roadmap (current vs target model, the SMP-unsafe inventory, milestones M0–M5), so this is built to a plan, not patched.
Additive + safe: defaults preserve behavior; on aarch64 EL0 preemption is off so the bail is inert (only the balanced lock-counter runs). Verified x86 single-core boots + renders the launcher, frames advance smoothly, zero faults. Both arches compile clean.
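
A toy model of the bail, assuming one `Cpu` value per core. The struct and method names here are illustrative and only loosely mirror PREEMPT_DEPTH and timer_preempt_switch. Only the outermost lock acquire toggles the preempt depth, and the timer path returns `None` while the depth is non-zero.

```rust
/// Per-CPU bookkeeping for the sketch.
#[derive(Default)]
pub struct Cpu {
    preempt_depth: u32,
    lock_depth: u32,
}

impl Cpu {
    /// Re-entrant lock acquire; only the outer acquire disables preemption.
    pub fn ptable_lock(&mut self) {
        if self.lock_depth == 0 {
            self.preempt_depth += 1;
        }
        self.lock_depth += 1;
    }

    /// Matching release; only the outer release re-enables preemption.
    pub fn ptable_unlock(&mut self) {
        self.lock_depth -= 1;
        if self.lock_depth == 0 {
            self.preempt_depth -= 1;
        }
    }

    pub fn preempt_disabled(&self) -> bool {
        self.preempt_depth > 0
    }

    /// Timer path: bail (None) while this CPU is non-preemptable,
    /// otherwise hand back the pid to switch to.
    pub fn timer_preempt_switch(&self, next_pid: u32) -> Option<u32> {
        if self.preempt_disabled() {
            return None;
        }
        Some(next_pid)
    }
}
```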

Scope futex wakes by the PHYSICAL address of the futex word rather than by get_group_leader (which takes PTABLE_LOCK and is address-space-group based). Two contexts share a futex iff they map the address to the same physical page. That is the correct, address-space-independent identity (Redox blueprint #4), and the right key once app engines get isolated address spaces.

- futex_phys_of(pid, addr): translate the (possibly Blocked) waiter's VA via its page-table root to a physical address.
- pml4_phys_of(pid): page-table root of any LIVE process (incl. Blocked), unlike get_user_context which requires Running. A parked futex waiter must still be locatable.
- The wake filter prefers physical-address identity; falls back to group-leader scoping only when a translation fails, so it never DROPS a wake the old path would have allowed (the dangerous direction for the load-bearing engine bring-up).

Behaviorally identical today (apps share the shell's pml4 → same physaddr → same decision as group-scoping); the divergence, and the real value, appears once engines run in separate address spaces. Verified x86: engine brings up and renders the launcher (present→65, 0 faults); both arches compile.
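A minimal sketch of the wake filter's decision rule, assuming each side of the comparison has already been resolved into a small record. `FutexRef` and `should_wake` are hypothetical names; in the kernel the `phys` field would come from `futex_phys_of`, with `None` meaning the translation failed.

```rust
/// One side of a futex wake comparison (hypothetical shape).
#[derive(Clone, Copy, Debug)]
pub struct FutexRef {
    pub group_leader: u32, // old scoping key
    pub vaddr: u64,        // user virtual address of the futex word
    pub phys: Option<u64>, // physical address, None if translation failed
}

/// Prefer physical-address identity; fall back to the old
/// (group leader, virtual address) match only when a translation failed.
pub fn should_wake(waker: FutexRef, waiter: FutexRef) -> bool {
    match (waker.phys, waiter.phys) {
        (Some(a), Some(b)) => a == b,
        _ => waker.group_leader == waiter.group_leader && waker.vaddr == waiter.vaddr,
    }
}
```

With a shared page-table root the same virtual address always translates to the same physical address, so both branches agree; they only diverge once two groups map one physical page at different roots.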

…ortex_app oscortex_app carried iOS/macOS/Linux runner scaffolding from `flutter create` (78 tracked files), but OSCortex is a bare-metal OS — apps run via the native embedder (tools/flutter-embedder → gen_snapshot/AOT), never `flutter run -d`. Proof it's dead weight: the other three apps (canvas, files, web_link) have no such dirs and build into the ISO fine, and no build script / CI references them. Removes the cross-platform clutter so the app tree matches the others.

… halts the OS Both the #GP and page-fault handlers did `halt_forever()` on a user-mode fault, so a single app's crash took down the entire machine. That's wrong: a segfault should kill the faulting process, not the OS. A user-mode fault holds no kernel lock, so the handler can safely tear the process down (`exit(pid,-1)`, which triggers app crash auto-recovery for the app's group) and reschedule the next runnable process — mirroring sys_exit's "die, run the next, never park forever". Also gives #GP a GPR-capturing naked entry (GpFaultFrame, like the page-fault handler) so crash logs show the full register state. Verified x86: launching an app that hits the intermittent dart:io EventHandler crash now leaves the shell rendering 27-29 frames PAST the fault (was: dead machine); a clean launch is unaffected. This makes the OS resilient to the app-launch race while the underlying race (dart:io engine-thread state corruption — root-caused, separate fix) is addressed.
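The policy change can be sketched as a small decision function. This is an illustration only: `FaultAction` and `classify_fault` are hypothetical names, and the sketch assumes the x86_64 convention that the low two bits of the saved CS selector give the privilege level the fault came from (3 = user mode).

```rust
/// What the fault handler should do (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FaultAction {
    /// Tear down only the faulting process, then reschedule.
    KillProcess { pid: u32, exit_code: i32 },
    /// A fault in kernel mode is still fatal.
    HaltKernel,
}

/// Decide based on the privilege level recorded in the saved CS selector.
pub fn classify_fault(saved_cs: u64, current_pid: u32) -> FaultAction {
    if saved_cs & 0b11 == 3 {
        // User-mode fault: no kernel lock is held, so exit(pid, -1) is safe.
        FaultAction::KillProcess { pid: current_pid, exit_code: -1 }
    } else {
        FaultAction::HaltKernel
    }
}
```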

…e expirations force_wake_all_task_runners (the epoll-parking deadlock-breaker) did `pending = pending.saturating_add(1)` on EVERY timerfd each time it fired (frequently during app bring-up). For an actively-serviced timerfd that meant its `pending` could accumulate to large values, so the next timerfd read returned a big BOGUS expiration count instead of ~1. A healthy, serviced timer must never report fake expirations piling up. Cap the spurious contribution at 1 (still reports EPOLLIN to unpark the waiter). Correctness fix. Verified it does not regress boot/render (shell brings up and renders across runs). NOTE: this is NOT the fix for the intermittent app-launch dart:io EventHandler crash — that race persists (~50% over 6 runs); root cause is deeper engine-thread state corruption (separate, ongoing).
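A minimal sketch of the capped behaviour, assuming a timerfd modelled as a single `pending` counter. `TimerFd`, `expire`, `force_wake` and `read` are hypothetical names, and using `max(1)` is one plausible way to cap the spurious contribution at 1; the actual kernel change may be written differently.

```rust
/// Toy timerfd: `pending` is the expiration count the next read returns.
pub struct TimerFd {
    pub pending: u64,
}

impl TimerFd {
    /// A real expiration always counts.
    pub fn expire(&mut self) {
        self.pending = self.pending.saturating_add(1);
    }

    /// Deadlock-breaker wake: make the fd readable (pending >= 1)
    /// without accumulating fake expirations on repeated calls.
    pub fn force_wake(&mut self) {
        self.pending = self.pending.max(1);
    }

    /// Reading returns the count and resets it.
    pub fn read(&mut self) -> u64 {
        std::mem::take(&mut self.pending)
    }
}
```

Repeated forced wakes are idempotent here, so a serviced timer reads back roughly its real expiration count instead of a value inflated by the deadlock-breaker.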

… primitive)

x86's reschedule IPI was already complete (broadcast_resched_ipi → send_resched_ipi → vector 0x40 → apic_resched_handler), and set_state(Running) already broadcasts on a wake. This fills the aarch64 gap so the primitive exists on both arches:

- gic::send_sgi_all_but_self / send_sgi: write GICD_SGIR (TargetListFilter 0b01 = all-but-self, or an explicit CPU-interface mask). SGI_RESCHED = SGI 0.
- broadcast_resched_ipi() sends SGI_RESCHED, guarded on CPU_COUNT > 1 so single-core never touches the GIC (this is on the hot set_state path), and "all-but-self" reaches no one on one core anyway.
- The IRQ handler recognizes + EOIs SGI 0 (taking the interrupt wakes the core from wfi; M5 will rerun the scheduler here once APs schedule).

So a thread made runnable on one core can signal the core it's affined to. Compile-verified both arches × {default, smp}; provably single-core-neutral (guarded no-op + the SGI-0 branch is unreachable on one core), so the proven single-core path is unchanged. SGI delivery + the rerun-scheduler action are exercised at M5 (APs idle until then). x86 untouched (already done).
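For reference, the GICD_SGIR values involved can be sketched as pure functions. This assumes the GICv2 register layout (TargetListFilter in bits [25:24], CPUTargetList in bits [23:16], SGIINTID in bits [3:0]); the function names are hypothetical, and the MMIO write itself is left out so the sketch runs on a host.

```rust
/// SGI number used for reschedule, as named in the commit.
pub const SGI_RESCHED: u32 = 0;

/// TargetListFilter = 0b01: forward to every CPU interface except the sender.
pub fn sgir_all_but_self(sgi: u32) -> u32 {
    (0b01 << 24) | (sgi & 0xF)
}

/// TargetListFilter = 0b00: forward to the CPU interfaces set in `mask`.
pub fn sgir_targets(mask: u8, sgi: u32) -> u32 {
    ((mask as u32) << 16) | (sgi & 0xF)
}

/// Guarded broadcast: on a single core nothing is written at all.
/// Returns the value that would be written to GICD_SGIR, if any.
pub fn resched_broadcast_value(cpu_count: usize) -> Option<u32> {
    if cpu_count > 1 {
        Some(sgir_all_but_self(SGI_RESCHED))
    } else {
        None
    }
}
```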

Pin each launched engine (process group) to a dedicated application core so it runs with its OWN per-CPU GPR scratch, never sharing the cooperative context that crashed a 2nd engine on a single shared core. Under -smp 2 the shell renders on the BSP and a launched app (Files/Web Link) initialises its engine and renders its full UI on core 1.

Verified headless (TCG): 8/8 runs no panic/corruption, 7/8 launch + render (the 1 miss is a rare boot first-frame stall, non-fatal). The OS never goes fatal on an app crash. Single-core (the default, no `smp` feature) is byte-unchanged.

Home-core scheduling:

- Process gains home_cpu; assigned atomically in spawn_with_bootstrap (app host -> an application core, everything else -> BSP) before the process is Running, so there is no window where the wrong core claims it. Threads inherit it.
- next_runnable_pid filters by home_cpu==my_cpu; the app runs ALONE on its core while the shell stays foreground-exclusive on the BSP (no concurrent shell engine to trip Dart's isolate-confinement check).
- AP runs the home-pinned scheduler loop + the per-core timer-ISR wake-assist (home-gated via try_claim) to pump its app's frames; it idle-parks until the shell engine is up to avoid contending with the BSP's boot.
- spawn_with_bootstrap broadcasts a reschedule IPI so a home-pinned app's core wakes from idle immediately instead of waiting for its next timer tick.
- CPU_COUNT publishes the ONLINE count, not the configured count, so an app is never pinned to a core that failed to come online.

SMP-safety fixes this exposed (latent under cooperative single-core):

- Per-CPU page-table lock depth. The re-entrancy counter was a single GLOBAL atomic: a 2nd core saw depth>0, treated itself as a nested writer, and BYPASSED the real lock -> two cores mutated page tables in parallel -> "corrupt PTE" panic. Now per-CPU; each core's first acquire takes the cross-core lock. x86 preempt-disables the section (IRQs stay on so the app core's frame pump keeps running) instead of masking IRQs; the disable is done under a brief IRQ mask so the cpu-read can't be migrated mid-update.
- WM<->PTABLE lock order. wm queue methods canonicalised pids (get_group_leader -> PTABLE_LOCK) while holding the WM lock, inverting the scheduler's PTABLE->WM order (next_runnable_pid_locked calls input_pending_for under PTABLE_LOCK) -> ABBA deadlock, both cores wedged. Canonicalise at the WM-module boundary; the EventQueue methods now compare already-canonical pids with no PTABLE access.
- Crash-recovery teardown use-after-free. A dead app's pml4 was freed while a futex physaddr lookup still held a stale reference -> walk of phys 0 -> unrecoverable kernel page fault. Cross-process page-table walks now reject a null/out-of-range root (return None) instead of dereferencing it.
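The home-core rule can be sketched with a toy process table. This is illustrative only: `Process`, `State`, `assign_home_cpu` and the choice of core 1 as the application core are assumptions for the sketch, and `online_cpus` stands in for the ONLINE count that CPU_COUNT publishes.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Runnable,
    Blocked,
}

pub struct Process {
    pub pid: u32,
    pub home_cpu: usize,
    pub state: State,
}

pub const BSP: usize = 0;

/// Pick the home core at spawn time: app hosts go to an application core,
/// but only if one is actually online; everything else stays on the BSP.
pub fn assign_home_cpu(is_app_host: bool, online_cpus: usize) -> usize {
    if is_app_host && online_cpus > 1 {
        1 // hypothetical: the first application core
    } else {
        BSP
    }
}

/// Each core only ever considers processes homed on it.
pub fn next_runnable_pid(procs: &[Process], my_cpu: usize) -> Option<u32> {
    procs
        .iter()
        .find(|p| p.state == State::Runnable && p.home_cpu == my_cpu)
        .map(|p| p.pid)
}
```

Keying the assignment on the online count means a failed AP bring-up degrades to everything running on the BSP rather than stranding the app on a core that never schedules.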

…e root The SMP crash-recovery teardown can free a dead app's pml4 while another core still holds a stale reference to it (a futex physaddr lookup) — the walk then dereferences a freed/reused table and hits phys_to_virt's "corrupt PTE" panic (an unrecoverable kernel page fault, deref of phys 0 / out-of-range). The OS should survive an app's crash, not die with it. Read-walks (translate_user_page / _flags) and the teardown free walk now route every table-pointer dereference through walk_table(), which returns None for a null or out-of-addressable-RAM frame instead of dereferencing it. A garbage or freed table is treated as "not mapped" (the lookup fails, the caller copes) and the kernel stays alive. The bare phys_to_virt — which still panics on garbage, a useful invariant — is unchanged on the WRITE paths. Verified: 0 panics / 0 corrupt-PTE across the SMP app-launch batch (was ~1/10 before).
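A minimal sketch of the guarded dereference, reduced to a two-level toy walk. Only the `walk_table` name comes from the commit; its signature here, the `ram_end` bound, the `read_entry` closure and the `translate` helper are assumptions for illustration.

```rust
/// Validate a table frame before it is dereferenced: a null or
/// out-of-addressable-RAM frame yields None instead of a panic.
pub fn walk_table(frame_phys: u64, ram_end: u64) -> Option<u64> {
    if frame_phys == 0 || frame_phys >= ram_end {
        None
    } else {
        Some(frame_phys)
    }
}

/// Toy two-level read-walk. `read_entry` stands in for loading the next
/// table pointer from a validated frame. Any bad frame along the way
/// makes the whole lookup report "not mapped".
pub fn translate(
    root_phys: u64,
    ram_end: u64,
    read_entry: impl Fn(u64) -> u64,
) -> Option<u64> {
    let level1 = walk_table(root_phys, ram_end)?;
    let level2 = walk_table(read_entry(level1), ram_end)?;
    Some(level2)
}
```

Propagating `None` with `?` leaves the decision to the caller, which matches the commit's intent that a stale lookup fails softly on read paths while write paths keep the panicking invariant.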

packages/oscortex_ui/.dart_tool/ was tracked despite matching .gitignore (packages/*/.dart_tool/) — it predated the ignore rule. It's a regenerated build/tooling cache that should never be in version control. Removed from tracking; the existing .gitignore rule keeps it out.

… Link status Release docs for the x86 SMP app-launch milestone. Records the home-core fix, the multi-core requirement, the honest stability picture (OS crash-proof; intermittent non-fatal first-frame stall; not yet bare-metal-confirmed), and — per request — the Web Link/browser status: app shell + webview pipeline scaffold with a stub engine are done and demonstrable; the Servo web engine is NOT integrated yet.

…edger rules.md: binding rules for every agent — never present a stub as done; "done" means verified end-to-end on the real artifact; report status as exactly one of DONE/UNVERIFIED/STUB/NOT-STARTED; finish the task or name the gap precisely. CLAUDE.md: auto-loaded by agents; makes rules.md mandatory reading before starting and before reporting done. docs/FEATURE_STATUS.md: honest per-feature ledger (start of the stub audit) — flags what has NOT been re-verified rather than rubber-stamping it, and is explicit that the audit is incomplete.

…not "solved" Live HVF test 2026-06-17: tapping an app launches the 2nd engine, which stalls at FlutterEngineRunInitialized (no crash) and freezes the UI. Same single-core cooperative-scheduler root cause as x86. The prior "solved via serial-GC" note was wrong. Fix = wire the home-core SMP path for aarch64 (no aarch64 mirror of the x86 timer-ISR home-core wake-assist exists yet).

…ootstrap WIP

Progress on aarch64 home-core SMP (the ARM app-launch freeze). The prior state was "AP comes online but stays idle (never schedules)". Now, under -smp 2 on HVF (real parallel cores):

- The AP idle-parks (wfi) until an app is pinned to it (CPU_HAS_HOME_WORK, set in spawn_with_bootstrap + woken by the existing reschedule IPI), so it does NOT contend with the BSP's shell first-frame bring-up (engaging at flutter_init_ready stalled the shell at present=1).
- On wake it engages + runs the arch-neutral home-pinned scheduler. The launched app's host (pid 8) DOES run on the AP: main_embedder, FlutterEngineInitialize, and its worker pool all execute there (aarch64 AP EL0 entry works).
- The timer-ISR wake-assist resume is home-gated + re-enabled (multi-core only; single-core keeps the proven set_state_try-only path, byte-unchanged).

NOT DONE (honest, per rules.md): the app's engine bootstrap DETERMINISTICALLY stalls at FlutterEngineRunInitialized (4/4 runs). The cooperative thread-pool handshake does not complete on the AP, so no frame is presented. This is the deep remaining aarch64 SMP work.

Single-core ARM (the shipped run-arm.sh path, no `smp` feature) is unaffected: all new behaviour is behind the AP/CPU_COUNT>1 gates. x86 unaffected.

…ounter access)

The aarch64 AP never called enable_fpu_simd() (the x86 AP does, in ap_init). That function sets CPACR_EL1 (FP/SIMD) AND CNTKCTL_EL1.EL0VCTEN/EL0PCTEN, which are per-CPU registers and still hold their reset values on a secondary core. Without CNTKCTL, the app's inline CNTVCT_EL0 read (liboscortex_libc monotonic clock, hit constantly by the Dart VM) trapped to EL1 with EC=0x18 → report_unhandled → the engine bootstrap deterministically wedged on the AP (4/4 runs, confirmed by an HVF register dump: AP stuck in report_unhandled, ESR EC=0x18).

Adding the call clears EC=0x18. The freeze now advances to a later fault (EC=0x24, EL0 data abort), a distinct and deeper issue that is still WIP; aarch64 SMP app-launch is NOT yet working. AP-only change; single-core + x86 unaffected.
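The two exception classes and the counter-enable bits mentioned here can be decoded with a few lines. This is a host-side sketch, not kernel code: it assumes the AArch64 layout where the exception class sits in ESR bits [31:26] and where CNTKCTL_EL1 has EL0PCTEN at bit 0 and EL0VCTEN at bit 1; the function names are hypothetical.

```rust
pub const CNTKCTL_EL0PCTEN: u64 = 1 << 0; // EL0 access to the physical counter
pub const CNTKCTL_EL0VCTEN: u64 = 1 << 1; // EL0 access to the virtual counter

/// Extract the exception class from an ESR_EL1 value.
pub fn esr_ec(esr: u64) -> u8 {
    ((esr >> 26) & 0x3F) as u8
}

/// Name the two classes that show up in this bring-up.
pub fn describe_ec(ec: u8) -> &'static str {
    match ec {
        0x18 => "trapped MSR/MRS or system instruction",
        0x24 => "data abort from a lower exception level",
        _ => "other",
    }
}

/// The CNTKCTL_EL1 value with both EL0 counter reads enabled,
/// preserving whatever other bits were already set.
pub fn with_el0_counter_access(cntkctl: u64) -> u64 {
    cntkctl | CNTKCTL_EL0PCTEN | CNTKCTL_EL0VCTEN
}
```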

…s EL0 state

Re-enabling enter_user_by_pid_noreturn_try from the AP timer ISR reproduced the prior session's "pid=2 Dart corruption" exactly: a real EL0 data abort (EC=0x24) in engine code. Reverted to wake-only (set_state_try), which clears the corruption (unhandled_exc=0, 3/3).

Net aarch64 SMP state with the CNTKCTL fix: the AP runs the app past the CNTVCT trap, but the engine bootstrap stalls at FlutterEngineRunInitialized. The cooperative hand-off alone doesn't complete it, and the only pump (ISR resume) corrupts thread state. The real fix is a corruption-free per-CPU context switch (Redox-style switch_to), a multi-session rewrite. aarch64 SMP app-launch is NOT working yet; single-core + x86 unaffected (all behind AP/CPU_COUNT>1 gates).

…mp vs corruption)

… bootstrap needs switch_to rework

The home-gated timer-ISR wake-assist resume now fires only when the target has a valid saved FP (aarch64_fp_valid), so it can never re-enter a thread with no FP image and zero its AAPCS64 callee-saved v8–v15 (the EC=0x24 SMP corruption). Verified: 4/4 runs, 0 unhandled exceptions (corruption gone).

But the app bootstrap STILL stalls (4/4, RunInitialized never completes): the threads that need re-entry during bootstrap are freshly created and have no FP yet, so the FP-gated resume skips them, and the cooperative hand-off doesn't re-enter them either.

This is the catch-22 proven across 5 distinct attempts (CNTKCTL fix, resume on/off, FP-gated): a correct resume must restore ANY thread's full state without corruption, i.e. a real per-CPU switch_to context save/restore, not the build_image rebuild. That is focused multi-session architectural work (precisely scoped in [[smp-bringup]] M5), NOT a blind loop hack.

aarch64 SMP app-launch remains BROKEN (honest). Single-core + x86 unaffected (multi-core/AP-gated).
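The gate, and why it produces the stall, can be sketched as a predicate. This is an illustration under stated assumptions: `ThreadCtx`, `IsrAction` and `wake_assist` are hypothetical names, and `fp_valid` stands in for the `aarch64_fp_valid` check.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum IsrAction {
    /// Re-enter the thread from the timer ISR.
    Resume,
    /// Leave it for the cooperative hand-off.
    Skip,
}

pub struct ThreadCtx {
    pub home_cpu: usize,
    pub fp_valid: bool, // true once a full FP/SIMD image has been saved
}

/// Resume only on multi-core, only for threads homed on this core,
/// and only when a saved FP image exists to restore.
pub fn wake_assist(t: &ThreadCtx, my_cpu: usize, cpu_count: usize) -> IsrAction {
    if cpu_count > 1 && t.home_cpu == my_cpu && t.fp_valid {
        IsrAction::Resume
    } else {
        IsrAction::Skip
    }
}
```

A freshly created bootstrap thread has `fp_valid == false`, so it always lands in `Skip`: the gate removes the corruption but also removes the only pump those threads had.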

…-port

squirelboy360 added 30 commits June 7, 2026 15:32

squirelboy360 added 29 commits June 15, 2026 21:33

chore(web-link): ignore build/osx artifacts in the Web Link app

693f673

docs(status): aarch64 SMP — precise wall (CNTKCTL fixed; bootstrap pu…

38cea11

…mp vs corruption)

Merge remote-tracking branch 'origin/develop' into feat/native-engine…

f432bbb

…-port

squirelboy360 merged commit 43dd3b9 into develop Jun 17, 2026
3 checks passed

SMP home-core app isolation — x86 app-launch fixed; aarch64 SMP advanced (WIP)#15
squirelboy360 merged 172 commits into develop from feat/native-engine-port

squirelboy360 commented Jun 17, 2026


3 participants
