SMP home-core app isolation — x86 app-launch fixed; aarch64 SMP advanced (WIP) #15

Merged: squirelboy360 merged 172 commits into develop from feat/native-engine-port on Jun 17, 2026
Conversation

@squirelboy360

Contributor

Summary

Home-core SMP engine isolation + the SMP-safety fixes it required. On x86 under SMP, launched apps now initialise their engine and render on a dedicated application core; the OS no longer goes fatal on an app crash. aarch64 SMP is advanced but app-launch is still WIP (honest — see below).

Verified (x86, headless TCG -smp 2)

  • App launch + render works (~70–85% of runs; intermittent non-fatal first-frame stall). 0 kernel panics / 0 page-table corruption. Single-core (the shipping default) is byte-unchanged.
  • Shipped as release v0.1.4 (x86 ISO).

SMP-safety fixes (latent bugs the cooperative single-core kernel hid)

  • Per-CPU page-table lock depth (was a global counter → 2nd core bypassed the lock → page-table corruption).
  • PTABLE↔WM ABBA lock-order fix (canonicalise pids at the WM boundary).
  • Reschedule IPI on spawn so a home-pinned core wakes immediately.
  • Page-table walks degrade gracefully on a freed/garbage root (no kernel fault on crash-recovery teardown).

aarch64 — WIP, NOT working yet (honest)

The AP now runs the app (fixed a real CNTKCTL_EL1 trap), but the engine bootstrap stalls on the cooperative hand-off; the corruption-free fix is a per-CPU switch_to rework (multi-session). Commits are labelled as WIP; single-core ARM is unchanged.

Process

Adds rules.md (mandatory anti-stub engineering rules) + docs/FEATURE_STATUS.md (honest feature ledger).

- PS/2 IntelliMouse 4-byte scroll wiring (kernel) -> EV_SCROLL -> embedder
  FlutterPointerEvent (signal_kind=kScroll); direction+speed configurable.
- Settings screen in the shell: natural-scroll toggle + speed slider, live via
  config:* messages over the oscortex/shell channel.
- Kernel boot spinner drawn on the framebuffer during JIT warm-up
  (drivers/fb.rs::draw_boot_splash, driven by compositor::tick); clears when
  Flutter presents its first frame. Plus a 'Launching <app>' overlay in the shell.
- Adaptive embedder frame pump: gentle during warm-up, 60fps after input, ~8fps
  idle (was a 1ms/1000-per-sec flood) -> req:frame ratio 115:1 -> ~7:1.
- Revert MAX_FRAMES to 1<<20 (RAM above 4GiB is not yet safely mappable; the cap
  is load-bearing) with an explanatory comment.
- docs/arch.txt: document the real JIT execution model + runtime status.

Build the Flutter engine from source as a first-class OSCortex AOT target
instead of shimming a Linux engine. Phases, port surface, build infra, and the
hack-removal checklist in docs/native-engine-port.md; direction recorded in
docs/arch.txt.

Draw the destination ahead of the work: three clean layers (kernel native ABI /
shared AOT engine / self-contained AOT app bundles with own PIDs), the shared-
framework property, and the hack-removal it enables. No Linux costume.

…ace map

Baseline libflutter_engine.so (377MB, x64, embedder API) builds from source in
the container — toolchain proven. Mapped the exact port surface: ~1,200 lines
(Dart VM os_oscortex/os_thread_oscortex ~1000, fml message_loop_oscortex+paths,
reuse posix) + GN glue. The fml message loop is the critical file — its emulated
equivalent is what livelocks rendering today, so the native port also fixes sync.

Capture the validated Phase 0 setup as one-command scripts so the repetitive,
heavy engine-build infra is reproducible, not tribal knowledge:
- setup-engine-build.sh: idempotent container + depot_tools + pinned gclient sync
  (encodes the name='.' fix and the flags that work).
- build-engine.sh: gn configure + ninja for baseline | oscortex targets.
- README.md: the contributor flow, the edit->build incremental loop, and the
  pitfalls already solved (gclient layout, emulation, disk, prebuilt-dart).
Engine checkout stays OUTSIDE the repo (22GB, never committed).

Add 'oscortex' as a --target-os. Since OSCortex has no sysroot/libc yet (runs
linux-ABI via emulation), it links against linux but sets a new is_oscortex GN
flag to select the OSCortex backend sources later; a true OSCortex toolchain is
a later sub-phase. gn configures cleanly (1714 targets; args.gn = target_os
linux + is_oscortex true).

- engine-port/patches/: the two tracked diffs (gn tool, BUILDCONFIG).
- engine-port/apply-port.sh: applies patches + (Phase 2) backend sources into a
  fresh checkout, idempotent.
Temp patch files cleaned from the workspace (no remnants).

This is a public-facing repo; debugging slop and dead-path artifacts don't belong.

Removed (all dead, none load-bearing for the working JIT build):
- scratch/ (192 files): the entire AOT-deser debugging tree — analyze_*/disasm_*/
  assemble_*/check_* scripts, .bin/.log artifacts, and vendored gen_snapshot/
  shader binaries. Referenced only by build-iso.sh's dead shell-AOT step.
- tools/flutter-engine/libflutter_engine.so.bak: 91MB orphaned 'pristine engine'
  backup, zero references.
- gen_help.txt, test_size, test_size.c, .DS_Store: stray one-offs.
- build-iso.sh: excised the dead [0.3/5] shell-AOT block (gen_snapshot -> libapp.so
  -> patch_libapp) — the shell runs JIT off kernel_blob.bin, not an AOT libapp.so;
  also dropped libapp.so from REQUIRED_FILES. Native AOT is done properly via the
  engine port (docs/native-engine-port.md).
- .gitignore: ignore /scratch/ and /qemu-pipe.* so they can't return.

Deliberately KEPT (still load-bearing for the current JIT build; removed in the
port's Phase 4 once the native engine works): engine_patch.py, patch_libapp.py
(still used by tools/build-flutter-osx.sh for the apps), the Linux-emulation shim.

Verified: build-iso.sh and build-kernel-iso-fast.sh both pass bash -n; the working
fast build references neither scratch/ nor libapp.so.

Add the OSCortex fml platform backend and wire is_oscortex selection. Verified:
fml.message_loop_oscortex.o + fml.paths_oscortex.o compile into libfml, the linux
message loop is excluded, and message_loop_impl.cc selects MessageLoopOscortex.

- engine-port/src/flutter/fml/platform/oscortex/: message_loop_oscortex.{cc,h}
  (epoll+timerfd loop — the sync-bug-fixer, starts as a clean clone to diverge to
  native primitives) + paths_oscortex.cc.
- patches/0003: build_config.h defines FML_OS_OSCORTEX (additive, gated on the
  FLUTTER_OSCORTEX define), message_loop_impl.cc selects MessageLoopOscortex first,
  fml/BUILD.gn swaps in the oscortex sources (excludes linux) + sets the define.
- apply-port.sh applies 0003 + copies src/.

Note: the backend uses the Linux-ABI calls (epoll/timerfd) that OSCortex natively
implements — OSCortex is a native kernel + own libc that is Linux-ABI-compatible
(like Fuchsia/Starnix), not Linux. These calls are the divergence seam if we ever
move to a fully custom ABI.

…t VM clone

- Verified a release/AOT oscortex build configures and gen_snapshot is buildable
  from our tree -> engine+snapshot version-matched, dissolving the old AOT
  dead-end (the '1247 base objects' mismatch). AOT is now a build step, not a
  research problem.
- Deferred the Dart VM os_oscortex clone: on path A os_linux.cc already works;
  cloning it byte-identically is busywork. Add a Dart VM backend only when a
  primitive must actually diverge.

libflutter_engine.so (377MB) links for the oscortex target and bakes in OUR
backend: 17 MessageLoopOscortex refs, 0 MessageLoopLinux. First Flutter engine
built from source FOR OSCortex.

Refined the gn target (patch 0001): oscortex now configures as a HOST linux-x64
build + is_oscortex flag (identical to the proven baseline, our platform backend
selected), instead of an explicit target_os that tripped ANGLE/wayland scope and
embedder-constructor mismatches. Runtime still renders software via kSoftware.

…ld) flow

Make the from-source build a one-time maintainer task, not something every dev
repeats — exactly how Flutter distributes its own engine.

- artifact.config: pins ARTIFACT_VERSION + Flutter rev + the R2 base URL.
- fetch-engine.sh: CONSUMER flow — downloads the pinned prebuilt engine
  (libflutter_engine.so + gen_snapshot + icudtl.dat) from Cloudflare R2, verifies
  sha256, stages it where the OS build expects. Seconds, no checkout.
- publish-engine.sh: MAINTAINER flow — packages the built artifacts and uploads
  to R2 (rclone or aws/S3 to the R2 endpoint). Run after build-engine.sh when the
  port/version changes; bump ARTIFACT_VERSION.
- README: two-tier model (consumers fetch, maintainers build+publish), R2 layout,
  multi-arch = one tarball per ISA.

Scaffolding is ready now; first publish happens once the release/AOT engine lands
(Phase 3). Set ARTIFACT_BASE_URL to your R2 URL to activate fetch.

- setup-r2.sh: create the R2 bucket + enable public r2.dev URL + auto-write
  ARTIFACT_BASE_URL into artifact.config. Idempotent. Needs Cloudflare auth first
  (npx wrangler login, or CLOUDFLARE_API_TOKEN).
- publish-engine.sh: upload via wrangler (npx, no install) as the primary R2
  backend, rclone/aws as fallbacks.
- README: one-time R2 host setup steps.

Bucket creation requires the account owner's Cloudflare auth, so this is the
single manual step; everything else is automated.

R2 needed dashboard activation; switched the artifact host to Google Cloud
Storage (gs://dotcorr-oscortex-engine, public read, billing active). Verified
end-to-end: gcloud upload -> anonymous https://storage.googleapis.com/... fetch.
- artifact.config: ARTIFACT_BASE_URL -> GCS public URL + GCS_BUCKET.
- publish-engine.sh: gcloud storage cp as the primary backend (wrangler/rclone/aws
  remain as fallbacks). fetch-engine.sh is unchanged (curl, host-agnostic).

The repo is public, so GitHub Release assets download anonymously: free, no
egress fees (vs GCS ~$0.12/GB), no separate cloud account, and gh is already
authed. Verified end-to-end: gh release upload -> anonymous
https://github.com/DotCorr/oscortex/releases/download/... fetch.

- artifact.config: ARTIFACT_BASE_URL -> GitHub releases download URL; GITHUB_REPO.
- publish-engine.sh: gh release create/upload as the PRIMARY backend (GCS/R2/S3
  remain as documented fallbacks). fetch-engine.sh unchanged (host-agnostic curl).
- README: GitHub Releases is the host; no bucket setup needed.
- Removed the unused GCS bucket (no remnants).

…erified

The release oscortex engine (33MB, vs 377MB debug — no JIT) and a version-matched
gen_snapshot (6.5MB) built from one tree. Published the first artifact to GitHub
Releases (oscortex-engine-1) and verified the full round trip:
  publish-engine.sh -> gh release  ->  fetch-engine.sh downloads+checksums+stages.
Engine + gen_snapshot from the same tree are version-matched, which dissolves the
old AOT dead-end (the '1247 base objects' mismatch).

Fix: publish-engine.sh derives the workspace from the container's /work mount
(robust to the workspace dir name) instead of assuming a fixed name.

shell app -> frontend_server -> AOT dill (24MB) -> our version-matched
gen_snapshot -> libapp.so (4.4MB native ELF). Verified a real AOT snapshot: all 4
_kDart*Snapshot{Data,Instructions} present with REAL native instructions (T/text,
not zeroed stubs), ELF64 x86-64. The multi-session 'Snapshot expects N base
objects, provided 0' blocker is dissolved because engine + gen_snapshot are built
from one tree (version-matched). compile-app-aot.sh makes it reproducible.

The native, from-source, AOT-compiled Flutter engine RENDERS the shell on
OSCortex: FlutterEngineRunsAOTCompiledDartCode -> libapp.so AOT path ->
present_callback 393 frames, ZERO JIT warmup (no kernel_blob, no codegen). The
UI comes up immediately, no 60-90s compile. This is the whole point of the port.

Fix to get here: the release/AOT engine rejects the JIT-era GC/heap dart-flags
(switches.cc:478 disallowed) — pass only the 5 AOT-safe engine args (argc=5)
when is_aot, dropping the old --old_gen_heap_size etc.

Known follow-up (separate, a regression from this session's IntelliMouse 4-byte
scroll wiring): mouse clicks/Y are mangled + pointer events not reaching the
engine; to fix next.

The 4-byte scroll-packet mode added this session corrupted pointer data: dy
mis-parsed (cursor pinned to the top, y=0) and the flags byte mis-parsed (buttons
stuck at 0), so clicks never registered and pointer events stopped reaching the
engine. Revert to the proven 3-byte packet mode (working clicks > broken wheel).
Verified: cursor Y tracks normally again (y=32 start, not pinned at 0). Scroll to
be re-added later with correct 4-byte parsing + resync that doesn't regress clicks.
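To illustrate what correct parsing of the 3-byte mode involves, here is a minimal sketch of decoding one standard PS/2 mouse packet. It is not the kernel's parser: `MousePacket`, `sign_extend9` and `decode_ps2_packet` are hypothetical names, and the layout used (buttons in flags bits 0-2, always-one sync bit 3, X/Y sign in bits 4/5, overflow in bits 6/7) is the conventional PS/2 one, assumed to apply here.

```rust
/// Decoded standard 3-byte PS/2 mouse packet (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct MousePacket {
    pub left: bool,
    pub right: bool,
    pub middle: bool,
    pub dx: i16,
    pub dy: i16, // PS/2 convention: positive means "up"
}

/// Movement is a 9-bit two's-complement value: 8 low bits in the data byte,
/// the sign bit in the flags byte.
fn sign_extend9(low: u8, negative: bool) -> i16 {
    if negative { low as i16 - 256 } else { low as i16 }
}

/// Decode one packet. Returns None when the stream looks out of sync
/// (bit 3 of the flags byte must be 1) or an overflow bit is set.
pub fn decode_ps2_packet(p: [u8; 3]) -> Option<MousePacket> {
    let flags = p[0];
    if flags & 0x08 == 0 {
        return None; // sync bit missing: this byte is not a flags byte
    }
    if flags & 0xC0 != 0 {
        return None; // X or Y overflow: the deltas are not trustworthy
    }
    Some(MousePacket {
        left: flags & 0x01 != 0,
        right: flags & 0x02 != 0,
        middle: flags & 0x04 != 0,
        dx: sign_extend9(p[1], flags & 0x10 != 0),
        dy: sign_extend9(p[2], flags & 0x20 != 0),
    })
}
```

Checking the sync bit before trusting the flags byte is one cheap way to notice that a packet boundary has slipped, which is the kind of misalignment that would produce stuck buttons and a pinned Y coordinate.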
Input is confirmed working under AOT (122 pointer events reached Flutter, hover +
clicks register, button presses detected, dropped_total=0 — the earlier '0' was
the broken IntelliMouse 4-byte mode + minimal interaction, fixed by the 3-byte
revert). Remove the temporary DIAG logging from the embedder + ipc_display.

engine-port/compile-app-aot.sh: per-app AOT compile (frontend_server -> dill ->
version-matched gen_snapshot -> libapp.so), no patch — used to give each app its
own AOT bundle so the AOT engine can run them.

Previously every launched app reused the shell's /system/flutter/libapp.so
(aot_va=0 JIT-era skip), so tapping a tile ran the shell snapshot in the
app host and stalled. Resolve the per-app snapshot by registry lookup:
build_app_libapp_path(app_id) -> /Applications/<name>.app/libapp.so, dlopen
it, and point the AOT snapshot loader at that path. Shell keeps its own path.

Each app is AOT-compiled to its own libapp.so and staged under its .app
bundle, so a launched host loads its own native snapshot, not the shell's.

pthread_cond_signal/broadcast with no waiter parked is a no-op by contract:
the predicate is mutex-protected, so a thread that hasn't yet entered
cond_wait observes it under the lock and never waits. The kernel was instead
recording such signals in COND_PENDING_SIGNALS and letting the next cond_wait
consume one as an immediate (spurious) return. A hot mutator (pid 2) that
signals its own monitor with woke=0 then re-waits would consume its own fake
pending, return 0, find its predicate still false, and re-wait forever — the
cond-pending-consume livelock that froze render at a fixed frame count.

Remove the mechanism end to end (consume in cond_wait, posts in
cond_signal/cond_broadcast, the state map, the import). cond_wait now relies
solely on the seq protocol, which delivers every real signal race-free under
cooperative single-core scheduling (the value-check and park cannot interleave
with a signaler). No remnants.
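The commit does not spell out the seq protocol, so the following is only a toy, single-threaded model of the general idea, under the assumption that a waiter snapshots a sequence number before parking and wakes only if the number has moved since. `SeqCond` and its methods are hypothetical names, not the kernel's API.

```rust
/// Toy model of a sequence-number condition variable (illustrative only).
pub struct SeqCond {
    seq: u32,
    waiters: u32,
}

impl SeqCond {
    pub fn new() -> Self {
        SeqCond { seq: 0, waiters: 0 }
    }

    /// Called by a waiter while it still holds the mutex: register and
    /// remember the sequence number that was current at that moment.
    pub fn begin_wait(&mut self) -> u32 {
        self.waiters += 1;
        self.seq
    }

    /// A parked waiter may return only if the sequence moved after its snapshot.
    pub fn should_wake(&self, ticket: u32) -> bool {
        self.seq != ticket
    }

    pub fn end_wait(&mut self) {
        self.waiters -= 1;
    }

    /// Signal: advance the sequence. Nothing is stored for future waiters,
    /// so a signal with no waiter parked has no later effect.
    /// Returns whether any waiter was registered at the time.
    pub fn signal(&mut self) -> bool {
        self.seq = self.seq.wrapping_add(1);
        self.waiters > 0
    }
}
```

In this model a signal sent before `begin_wait` cannot satisfy a later wait, because the waiter's ticket is taken after the bump; there is no pending count to consume, which is the property the removed COND_PENDING_SIGNALS mechanism violated.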
sys::dlopen forwards path.len() to the kernel as the path length, so a
NUL-terminated byte string makes the kernel read the trailing \0 as part of
the filename and the open fails. The per-app and shell AOT paths were passing
"...libapp.so\0"; strip the NUL at the dlopen call. The AOT snapshot itself
loads via aot_snapshot_load (which maps it executable), so this only silenced
a spurious failure path, but it is a real bug and removes the warning.
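A minimal sketch of the fix at the call site, assuming the path is held as a byte slice: drop one trailing NUL before the length is taken. `strip_trailing_nul` is a hypothetical helper name; the actual change may simply slice the literal.

```rust
/// Return `path` without a single trailing NUL byte, if one is present.
/// The kernel is handed `path.len()`, so a trailing `\0` would otherwise
/// be treated as part of the filename.
pub fn strip_trailing_nul(path: &[u8]) -> &[u8] {
    match path.split_last() {
        Some((&0, rest)) => rest,
        _ => path,
    }
}
```

The helper leaves paths without a terminator untouched, so it is safe to apply to both the shell and per-app paths regardless of how they were built.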
The Debug-level hot per-syscall traces (epoll_ctl, mprotect, cond-signal, every
keypress) each block on a synchronous COM1 UART write AND re-render the
framebuffer text console, with interrupts disabled. Measured: ~14000 log lines
during a single boot+app-launch. Dropping to Error cut serial volume ~12x
(14000 -> 1185 lines), removing tens of seconds of emulated-bare-metal warmup.
Raise the one line in logger::init back to Debug for deep tracing.

Root cause of the sporadic render crashes. The user GPR snapshot lives in a
PER-CPU scratch (gs:[..]) written by the syscall entry stub and shared by every
thread on the core. save_full_user_gprs read it LAZILY at yield time — but the
wait loops do sti;hlt and the timer ISR can switch threads mid-handler, so
another thread's syscall entry overwrites the per-CPU snapshot before our save
runs. We then stored the OTHER thread's callee-saved regs (rbx/rbp/r12-r15) into
our context; on resume rbx was garbage.

Pinpointed via addr2line/objdump: fml::MessageLoopOscortex::Run() resuming from
epoll_wait wrote running_ through this=0x1400 (=1280*4, a Skia row stride leaked
from another thread) -> SIGSEGV pid=3. This is the whole 0/16/140/489-frames-
across-identical-boots sporadicity.

Fix: capture_user_gprs_at_entry() snapshots the GPRs into the thread's own
PTABLE slot at dispatch_fast entry, while the per-CPU snapshot is still fresh
for this thread (before any handler/yield/interrupt window). A per-CPU
'captured' flag makes later yield-time save_full_user_gprs calls no-ops, so a
clobbered shared snapshot can't leak in.

Validation: 4/4 headless boots render cleanly (present 98-105) with ZERO engine
SIGSEGV, vs the prior sporadic crashes. Render is now reliable.

The frame pump only called FlutterEngineScheduleFrame in the no-event branch,
so a hover/click/scroll waited for the event queue to drain plus a wm_event_wait
timeout (up to ~16ms) before the repaint was even requested. Request the frame
in the same iteration input is received; the engine coalesces redundant requests
and Flutter flushes the batched pointer packet at BeginFrame, so the scheduled
frame reflects the event. Removes the scheduling-side input latency (most
visible on real hardware; under cross-arch QEMU TCG the emulation dominates).

Each submitted frame did up to 5 full 1M-pixel passes: a strided re-pack into a
freshly allocated 4MB Vec, a byte-indexed B<->R swap, a full-screen fill_rect
clear, blit_rgba32, and swap_buffers. Measured ~65ms median (wildly variable
10-92ms) -> a ~15fps ceiling from the blit alone, which read as a low refresh
rate / laggy feedback.

- Fuse the strided re-pack into the swap: submit_bgra_impl reads the source
  stride directly and packs in ONE u32-wise pass (read BGRA as u32, swap bytes
  0<->2), eliminating the intermediate buffer + the 4MB/frame allocation.
- Skip the full-screen fill_rect clear when a presented surface already covers
  the screen (the common case: one full-screen Flutter surface).

Measured after: blit median 5.4ms, steady 5.0-7.6ms (~12x faster, variance
gone). Render verified correct + crash-free; colors unchanged.
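The fused pass described above can be sketched as follows. This is an illustration rather than the body of submit_bgra_impl: `swap_b_r` and `pack_bgra_rows` are hypothetical names, and the destination is assumed to be a tightly packed u32 buffer of `width * height` pixels.

```rust
/// Swap bytes 0 and 2 of a pixel read as a little-endian u32 (BGRA -> RGBA).
#[inline]
pub fn swap_b_r(p: u32) -> u32 {
    (p & 0xFF00_FF00) | ((p & 0x00FF_0000) >> 16) | ((p & 0x0000_00FF) << 16)
}

/// Read a strided BGRA source and write packed, channel-swapped pixels in one pass.
/// `stride` is the source row pitch in bytes and may exceed `width * 4`.
pub fn pack_bgra_rows(src: &[u8], stride: usize, width: usize, height: usize, dst: &mut [u32]) {
    assert!(stride >= width * 4);
    assert!(dst.len() >= width * height);
    for y in 0..height {
        // Only the first width*4 bytes of each source row carry pixels;
        // the rest of the stride is padding and is skipped.
        let row = &src[y * stride..y * stride + width * 4];
        let out = &mut dst[y * width..(y + 1) * width];
        for (px, o) in row.chunks_exact(4).zip(out.iter_mut()) {
            let bgra = u32::from_le_bytes([px[0], px[1], px[2], px[3]]);
            *o = swap_b_r(bgra);
        }
    }
}
```

Because the stride is honoured while reading, no intermediate re-packed buffer is needed, which is where the per-frame 4MB allocation goes away.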
next_runnable_pid_locked computed the foreground-exclusive group AFTER the
input-target and embedder-baton shortcuts, so a due shell (pid 1) baton could
schedule the shell engine even while an app is foreground — two heavy Flutter
VMs on one cooperative core, the documented cause of the launched app's crash.
Compute fg/exclusive FIRST and gate both shortcuts: suppress the pid-1 baton
when an app is exclusive, and only honour the input shortcut for a target in the
foreground group. No behavioural change when the shell is foreground (the common
case): exclusive=false short-circuits both gates.

Closes the baton concurrency hole; full app-launch validation still pending an
interactive tile-tap (HMP mouse_button injection does not produce reliable
clicks headless).

Replace the generic reply-null catch-all in platform_message_callback with
a real channel dispatcher (match on channel name). Every channel the
framework reaches for is now routed to a concrete handler that does the
platform work or returns a codec-correct typed ack; the final catch-all
logs the channel name ([embedder/chan] unbound: <name>) so nothing stays an
invisible stub.

Bound channels:
- flutter/textinput (JSONMethodCodec): setClient/setEditingState/show/hide/
  clearClient. Editing state (text + selection) is maintained in the embedder.
  PS/2 set-1 scancodes are now mapped to characters (shift/caps, backspace,
  enter, tab, space, arrows, home/end, delete); on a key press with an active
  text client the stored editing state is mutated and TextInputClient.
  updateEditingState is pushed back over flutter/textinput. Adds a small
  inline JSON reader/writer (no_std, no crates) for the editing-state maps.
- flutter/mousecursor: parse activateSystemCursor kind and ack (no kernel
  set-cursor-shape syscall exists yet; logged + acked).
- flutter/platform: Clipboard.setData/getData/hasStrings backed by an
  in-embedder buffer; SystemNavigator.pop; SystemSound/HapticFeedback/
  SystemChrome acked.
- flutter/navigation, system, accessibility, spellcheck, processtext, menu,
  contextmenu, scribe, restoration, keyevent, platform_views, isolate,
  lifecycle: explicit JSON typed-null ack.
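The shape of such a dispatcher, as a rough sketch: match on the channel name, route to a handler, and make the fallthrough visible. `Route`, `route_channel` and `unbound_log_line` are hypothetical names, only a few of the acked channels are listed, and `String`/`format!` are used for brevity even though the embedder itself is described as no_std.

```rust
/// Where a platform message is routed (illustrative enum, not the embedder's types).
#[derive(Debug, PartialEq)]
pub enum Route {
    TextInput,
    MouseCursor,
    Platform,
    TypedNullAck,
    Unbound,
}

/// Match on the channel name; the final arm is the logged catch-all.
pub fn route_channel(channel: &str) -> Route {
    match channel {
        "flutter/textinput" => Route::TextInput,
        "flutter/mousecursor" => Route::MouseCursor,
        "flutter/platform" => Route::Platform,
        // A subset of the channels that only need a codec-correct ack.
        "flutter/navigation" | "flutter/system" | "flutter/accessibility"
        | "flutter/keyevent" | "flutter/lifecycle" => Route::TypedNullAck,
        _ => Route::Unbound,
    }
}

/// Log line for an unbound channel, in the format quoted in the commit message.
pub fn unbound_log_line(channel: &str) -> String {
    format!("[embedder/chan] unbound: {}", channel)
}
```

Keeping the catch-all as an explicit, logged arm means a newly introduced framework channel shows up in the log by name instead of silently receiving a null reply.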
Replace the ack-only stubs from the platform-channel contract with actual
OS-provided capabilities. OSCortex is the platform under the stock engine, so
where Flutter needs a platform service the kernel now implements and binds it.

Mouse cursor shape (flutter/mousecursor.activateSystemCursor):
- compositor: ACTIVE_CURSOR_SHAPE atomic + vector cursor sprites (arrow, I-beam,
  hand/link, forbidden, grab, horizontal/vertical resize, hidden). draw_software_cursor
  dispatches on the active shape; set_cursor_shape() repaints immediately.
- new syscall SYS_CURSOR_SHAPE_SET (0x4B2). Embedder maps the Flutter cursor kind
  string to a CURSOR_SHAPE_* and calls it, so hovering a link shows a hand, a text
  field shows an I-beam, etc.

Semantics / accessibility:
- embedder wires update_semantics_callback2 (FlutterProjectArgs offset 280) and calls
  FlutterEngineUpdateSemanticsEnabled(engine, true) after run. The callback receives
  the FlutterSemanticsUpdate2 tree and stores each node (id, label, rect, flags,
  actions) in a live embedder structure for a11y / automation consumers. flutter/
  accessibility now replies with the correct StandardMessageCodec null (0x00), not JSON.

System clipboard (flutter/platform Clipboard.*):
- kernel-global clipboard buffer (embedder::clipboard) shared across every app/host,
  with SYS_CLIPBOARD_SET (0x4B3) / SYS_CLIPBOARD_GET (0x4B4). The embedder routes
  setData/getData/hasStrings to the kernel, so clipboard survives across apps.

SystemNavigator.pop (flutter/platform):
- SYS_APP_CLOSE_FOREGROUND (0x4B5): refocuses the shell (pid 1) and wakes it. The
  app embedder, when it is a launched host, flushes its reply, calls the syscall and
  exits so focus returns to the shell.

SystemSound.play (flutter/platform):
- PC-speaker beep driver (drivers::beep) via PIT channel 2 + port 0x61, exposed as
  SYS_BEEP (0x4B6). click vs alert play distinct short tones.

Deliberate no-ops (no such hardware), acked with the correct codec:
- HapticFeedback.* (no vibration motor), SystemChrome.* (single full-screen
  compositor surface, no system UI overlays / orientation).

Brings in the embedder channel dispatcher, text input (flutter/textinput),
and real OS-backed capabilities: cursor-shape sprites + SYS_CURSOR_SHAPE_SET,
live semantics, kernel-global clipboard, SystemNavigator.pop return-to-shell,
and a PC-speaker beep driver. Builds green (embedder + kernel).

Grounds the browser effort from two research passes:
- Engine decision: Servo, not Chromium. Prebuilt CEF is impossible on a glibc-less
  microkernel; Chromium-from-source is a multi-year team effort (cf. Fuchsia);
  Cobalt is an app-subset; WPEWebKit is full but a heavy C/C++ port. Servo is Rust,
  single-process embeddable, and has a software render path for our framebuffer.
- Integration: Flutter external textures are GPU-only, so the engine renders into
  its OWN OSCortex compositor surface (z-stacked under the Web Link chrome, clipped
  to the viewport), with the Flutter side as a transparent hit-rect that forwards
  input. Reuses the existing compositor — no engine patch.
- The oscortex/webview MethodChannel contract (methods + events).
- Two-track phased plan (engine-agnostic scaffold de-risks the design ahead of the
  multi-month Servo bring-up) + honest effort estimate.

…r, scaffold 1/n)

The engine-agnostic app-side of the native webview, per docs/browser-architecture.md.
- OscWebViewController: the oscortex/webview MethodChannel API (create/loadUrl/
  back/forward/reload/canGoBack/currentUrl/getTitle/evalJs/resize/setViewport/
  dispatchInput/dispatchScroll/dispatchKey/setVisible) + engine events
  (urlChanged/titleChanged/loadProgress/loadStarted/loadFinished/loadError/
  navState/scrollChanged), fanned out to instances by viewId via one channel handler.
- OscWebView: a transparent placeholder that reserves the web region, reports its
  on-screen rect (setViewport, so the kernel positions+clips the web surface), and
  forwards pointer/scroll input in webview-local coords. No Flutter texture — the
  web pixels are a sibling OS compositor surface showing through.

6 device-free tests (mocked channel): method args, reply round-trips, event
callbacks, and per-viewId routing. analyze clean. Servo plugs in behind this later;
next: wire Web Link's chrome to it + a stub engine service to prove the pipeline.

…fold 2/n)

Turns the Web Link mockup into a real browser chrome wired to the oscortex_webview
controller (matches the oscortex_ui design system):
- toolbar: back/forward (gated by navState), reload↔stop (by load state).
- omnibox: URL-or-search address bar — explicit schemes pass through, bare hosts
  get https, anything else becomes a web search (the "search" wrapper).
- a 2px accent progress line driven by loadProgress.
- the web region hosts OscWebView once a render surface exists, else a brand-styled
  "web engine starting…" placeholder.

oscortex_webview: OscWebViewController now tolerates a missing engine service
(MissingPluginException → inert) so the chrome runs before the backend exists.
Package tests stay green (6); both the package and the app analyze clean.

Engine-agnostic per docs/browser-architecture.md; navigation lights up once the
embedder oscortex/webview handler + a web-engine service land.

…b (browser, backend 1/n)

The pipeline behind the Web Link chrome — engine-agnostic; a Servo service later
replaces the stub renderer behind this same contract.

- sys.rs: surface geometry/z/clip/visibility wrappers (packed args matching the
  kernel) so the app can own + place a second (web) surface.
- main.rs: StandardMethodCodec decode (generalized from std_find_kind:
  std_method_name / std_arg_str / std_arg_int) + a fixed-buffer encoder (no alloc).
- handle_webview_channel: create → makes a compositor surface (owned by the app
  group), mmap-fills a placeholder, replies {surfaceId}; setViewport → geometry +
  clip + z (stacked above the Flutter surface, in the web region); loadUrl → tints
  + emits loadStarted/urlChanged/loadProgress/loadFinished/navState; destroy/query
  methods handled; the rest acked.
- events sent to Dart via FlutterEngineSendPlatformMessage on oscortex/webview.

Compiles on x86_64 + aarch64. Renderer is a stub (solid fill) — proves the full
app→channel→surface→composite→events path; NOT yet boot-verified. Servo bring-up
is gated on Rust std + a software-GL port (see next).

`"${KERNEL_FEATURE_ARGS[@]}"` errors with "unbound variable" under `set -u` on
macOS's stock bash 3.2 when no KERNEL_FEATURES is set (the normal case) — it broke
`X86_AOT=1 SKIP_CORE_APPS=1 build-iso.sh` at the kernel step. Use the
`${arr[@]+"${arr[@]}"}` idiom, which is empty-array-safe on bash 3.2+.

Also gitignore the osx CLI's per-app .osx/ config + build/osx/ outputs.

… hangs

A 2016 Retina MacBook Pro hung the boot at cortex::compositor. Two causes:

1. compositor::init → fb::set_double_buffer allocated a full-fb back buffer with
   `vec![0u32; pitch*height]` (~20 MiB on 2880x1800). On a big fb / tight heap
   that aborts and hangs. Now allocate fallibly (try_reserve_exact) with a sanity
   ceiling; on failure fall back to single-buffer / direct-to-fb rendering (fully
   supported — every fb write path already checks DOUBLE_BUFFER_ACTIVE). It just
   tears; it boots.

2. Once double-buffering turns on, the per-phase boot markers render to the back
   buffer but nothing swaps until the engine warm-up loop — so the splash freezes
   at "compositor" and MASKS a hang in any later phase. The bp! milestone macro
   now swaps after each render, so every phase actually shows on screen.

Verified: x86 still boots + renders the shell in QEMU (double-buffer path intact).
Both arches compile.
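The fallible back-buffer allocation from cause 1 can be sketched like this. `try_alloc_back_buffer` and the `max_bytes` ceiling parameter are hypothetical; the commit only states that try_reserve_exact and a sanity ceiling are used, with a single-buffer fallback on failure.

```rust
/// Try to allocate a zeroed back buffer of `pixels` u32 entries.
/// Returns None if the request exceeds `max_bytes` or the allocator cannot
/// satisfy it, so the caller can fall back to direct-to-framebuffer rendering.
pub fn try_alloc_back_buffer(pixels: usize, max_bytes: usize) -> Option<Vec<u32>> {
    // Reject sizes that overflow or exceed the sanity ceiling before touching the heap.
    let bytes = pixels.checked_mul(std::mem::size_of::<u32>())?;
    if bytes > max_bytes {
        return None;
    }
    let mut buf: Vec<u32> = Vec::new();
    // Fallible reservation: reports an error instead of aborting on out-of-memory.
    buf.try_reserve_exact(pixels).ok()?;
    // Capacity is already reserved, so this fill does not reallocate.
    buf.resize(pixels, 0);
    Some(buf)
}
```

A caller would treat `None` as "stay single-buffered", which matches the described behaviour of tearing but still booting.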
… networking, 6/n)

Apps can now resolve hostnames, not just connect by IP.
- smoltcp: enable socket-dns/proto-dns + dns-max-* config.
- net::tcp: a DNS socket in the stack (seeded with a public resolver 8.8.8.8;
  DHCP-provided servers are a follow-up refinement) + dns_resolve(name): starts an
  A-record query and drives iface.poll until it resolves or a ~5s timeout, reusing
  the same proven interface (routing/ARP) that TCP uses. Returns the first A record
  (BE u32) or a negative errno.
- SYS_DNS_RESOLVE (0x4B8, already NET-gated/reserved) → sys_dns_resolve(name_ptr,
  name_len) reads the name from userspace and calls dns_resolve.

Compiles both arches. Runtime resolution verification (needs a network with DNS)
+ exposing it on the oscortex/net channel for Dart apps land next.

…e (app networking, 7/n)

Completes app-facing DNS (kernel resolver landed in c34c50c).
- embedder: sys::dns_resolve wrapper (SYS_DNS_RESOLVE) + oscortex/net op 0x06
  (resolve): [0x06, hostname] → reply u32 LE (BE-order IPv4), 0 = failure. Uses a
  raw u32 (not the i32 helper) so a high-bit IP like 200.x.x.x isn't misread as a
  negative error.
- Dart: OscortexSocket.resolve(host) → dotted-quad string or null. +2 tests
  (frame + reply parsing, failure sentinel); 10 tests total, analyze clean.

So apps can now resolve(host) → ip, then connect/httpGet by IP. Compiles both
arches. Runtime resolution still needs a network with reachable DNS to confirm.
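A sketch of decoding the resolve reply on the receiving side. It assumes one reading of "u32 LE (BE-order IPv4)": the four reply bytes are a little-endian u32 whose numeric value has the first octet in the most significant byte. `decode_resolve_reply` is a hypothetical name, not the Dart or embedder API.

```rust
use std::net::Ipv4Addr;

/// Decode the 4-byte reply of the resolve op. 0 is the failure sentinel.
/// The value is kept as u32 throughout: as an i32, any address with the top
/// bit set (first octet >= 128) would look like a negative error code.
pub fn decode_resolve_reply(reply: [u8; 4]) -> Option<Ipv4Addr> {
    let raw = u32::from_le_bytes(reply);
    if raw == 0 {
        return None;
    }
    // Ipv4Addr::from(u32) places the most significant byte first.
    Some(Ipv4Addr::from(raw))
}
```

For example, 200.1.2.3 corresponds to the value 0xC8010203, which is positive as a u32 but negative when reinterpreted as an i32.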
Completes the DNS resolver: on the DHCP ACK, capture cfg.dns_servers and point the
DNS socket at them (update_servers) — so name resolution uses the network's own
resolver (incl. QEMU user-net's 10.0.2.3) rather than only the seeded 8.8.8.8.
Borrow-careful: the servers are captured into a local while the DHCP socket is
borrowed, then applied once that borrow ends. Compiles both arches.

… default

Real hardware (a 2016 Skylake MacBook Pro) hung at `cortex::smp`: the BSP woke
the Application Processors, an AP faulted in ap_init, and the BSP then ground
through a spin-count online-wait (PAUSE is ~140 cycles on Skylake, so the old
200x5M budget was minutes per AP) — a boot hang.

Single-core is the proven-stable config: the post-app-launch freeze is fixed by
the serial-GC engine, not SMP, and co-scheduling the engine across cores does
not converge the Dart GC stop-the-world safepoint. So make single-core the
default and put AP bring-up behind a feature:

- New `smp` Cargo feature (default OFF). Default boots single-core on both
  arches; build `--features smp` to develop AP bring-up.
- x86_64: AP bootstrap + online-wait gated behind `smp`; the wait is now bounded
  in WALL-CLOCK time (rdtsc_ns, 0.5s/AP) instead of an unbounded spin count.
- aarch64: wake_aps() (PSCI CPU_ON) gated behind `smp` to match — it only idled
  the woken cores (ap_main parks in wfi) and risked the same real-hardware hang
  if an AP faulted in the trampoline.

x86 single-core boot verified: the ISO renders the full shell under QEMU -smp 2.
aarch64 compile-verified both ways; the BSP render path is unchanged (the gate
only removes the idle-AP wake).
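The difference between a spin-count budget and a wall-clock budget can be sketched in hosted Rust. This is not the kernel code: `std::time::Instant` stands in for the kernel's rdtsc_ns clock, and `wait_online` is a hypothetical name.

```rust
use std::time::{Duration, Instant};

/// Poll `is_online` until it returns true or `budget` of wall-clock time has passed.
/// Returns true if the condition was observed within the budget.
pub fn wait_online(budget: Duration, mut is_online: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if is_online() {
            return true;
        }
        if Instant::now() >= deadline {
            return false; // give up on this AP; the wait cost is bounded
        }
        // Spin hint; its cost varies by CPU, which is why the loop
        // is bounded by time rather than by iteration count.
        std::hint::spin_loop();
    }
}
```

With a time-based deadline, a slower spin-hint instruction only reduces the number of polls; it does not stretch the total wait, unlike a fixed iteration count.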
…uler foundation)

Start the real SMP scheduler effort with the Redox-blueprint "non-preemptable
bail" — the piece that previously corrupted preempted threads. Adds a per-CPU
preempt-disable depth, auto-engaged while a CPU holds PTABLE_LOCK, and makes the
timer-preempt path bail when this CPU is non-preemptable.

This also closes a latent single-core x86 hole: the timer ISR's
timer_preempt_switch_try does a RECURSIVE PTABLE_LOCK try_lock, which succeeds
when the interrupted thread already holds the lock — it could then switch threads
mid-critical-section, pinning `holder` to this CPU and wedging every later
PTABLE_LOCK user. The bail closes it.

- PREEMPT_DEPTH per-CPU counter + preempt_disable_cpu/enable_cpu, hooked into the
  OUTER PTABLE_LOCK acquire/release (balanced; recursive acquires don't re-toggle).
- preempt_disabled() checked at the top of timer_preempt_switch{,_try} → return None.
- CONTEXT_SWITCH_LOCK declared for the M4 two-layer switch_to (reserved).
- docs/smp-architecture.md: the full roadmap (current vs target model, the
  SMP-unsafe inventory, milestones M0–M5), so this is built to a plan, not patched.
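
A minimal sketch of the bail, assuming one state record per CPU. The struct and field names are illustrative, not the kernel's; only the behaviour (outer acquire toggles the depth, timer path returns None while it is non-zero) follows the description above.

```rust
/// State owned by a single CPU (one instance per core). Names are illustrative.
#[derive(Default)]
struct CpuState {
    preempt_depth: u32, // > 0 means this CPU must not be preempted
    ptable_depth: u32,  // recursive PTABLE_LOCK nesting on this CPU
}

impl CpuState {
    fn ptable_acquire(&mut self) {
        if self.ptable_depth == 0 {
            self.preempt_depth += 1; // only the OUTER acquire disables preemption
        }
        self.ptable_depth += 1;
    }

    fn ptable_release(&mut self) {
        self.ptable_depth -= 1;
        if self.ptable_depth == 0 {
            self.preempt_depth -= 1; // balanced with the outer acquire
        }
    }

    fn preempt_disabled(&self) -> bool {
        self.preempt_depth > 0
    }

    /// Timer-preempt path: bail with None while non-preemptable,
    /// otherwise hand back the pid the caller wants to switch to.
    fn timer_preempt_switch(&self, next_pid: u32) -> Option<u32> {
        if self.preempt_disabled() {
            return None;
        }
        Some(next_pid)
    }
}

fn main() {
    let mut cpu = CpuState::default();
    assert_eq!(cpu.timer_preempt_switch(7), Some(7));
    cpu.ptable_acquire();
    cpu.ptable_acquire(); // nested acquire does not re-toggle
    assert_eq!(cpu.timer_preempt_switch(7), None);
    cpu.ptable_release();
    assert_eq!(cpu.timer_preempt_switch(7), None); // still inside the outer section
    cpu.ptable_release();
    assert_eq!(cpu.timer_preempt_switch(7), Some(7));
}
```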

Additive + safe: defaults preserve behavior; on aarch64, EL0 preemption is off, so
the bail is inert (only the balanced lock-counter runs). Verified x86 single-core
boots + renders the launcher, frames advance smoothly, zero faults. Both arches
compile clean.

Scope futex wakes by the PHYSICAL address of the futex word rather than by
get_group_leader (which takes PTABLE_LOCK and is address-space-group based). Two
contexts share a futex iff they map the address to the same physical page — the
correct, address-space-independent identity (Redox blueprint #4), and the right
key once app engines get isolated address spaces.

- futex_phys_of(pid, addr): translate the (possibly Blocked) waiter's VA via its
  page-table root to a physical address.
- pml4_phys_of(pid): page-table root of any LIVE process (incl. Blocked), unlike
  get_user_context which requires Running — a parked futex waiter must still be
  locatable.
- The wake filter prefers physical-address identity; falls back to group-leader
  scoping only when a translation fails, so it never DROPS a wake the old path
  would have allowed (the dangerous direction for the load-bearing engine bring-up).
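
The wake filter's decision can be sketched as a pure function. `futex_should_wake` is a hypothetical name; the `Option<u64>` arguments model the result of `futex_phys_of` (None when the translation fails), and `same_group` models the old group-leader comparison.

```rust
/// Decide whether a waker on some futex word should wake a given waiter.
/// Physical-address identity wins when both translations succeed; the
/// group-leader answer is used only as a fallback, so a wake the old
/// path would have allowed is never dropped.
fn futex_should_wake(
    waker_phys: Option<u64>,
    waiter_phys: Option<u64>,
    same_group: bool,
) -> bool {
    match (waker_phys, waiter_phys) {
        (Some(a), Some(b)) => a == b,
        _ => same_group, // a translation failed: fall back to group scoping
    }
}

fn main() {
    // Same physical word, different groups: wake (the future isolated-engine case).
    assert!(futex_should_wake(Some(0x1000), Some(0x1000), false));
    // Same group, different physical words: no wake.
    assert!(!futex_should_wake(Some(0x1000), Some(0x2000), true));
    // Translation failed: the old group-scoped decision stands.
    assert!(futex_should_wake(None, Some(0x1000), true));
}
```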

Behaviorally identical today (apps share the shell's pml4 → same physaddr → same
decision as group-scoping); the divergence — and the real value — appears once
engines run in separate address spaces. Verified x86: engine brings up and renders
the launcher (present→65, 0 faults); both arches compile.
…ortex_app

oscortex_app carried iOS/macOS/Linux runner scaffolding from `flutter create`
(78 tracked files), but OSCortex is a bare-metal OS — apps run via the native
embedder (tools/flutter-embedder → gen_snapshot/AOT), never `flutter run -d`.
Proof it's dead weight: the other three apps (canvas, files, web_link) have no
such dirs and build into the ISO fine, and no build script / CI references them.
Removes the cross-platform clutter so the app tree matches the others.
… halts the OS

Both the #GP and page-fault handlers did `halt_forever()` on a user-mode fault,
so a single app's crash took down the entire machine. That's wrong: a segfault
should kill the faulting process, not the OS. A user-mode fault holds no kernel
lock, so the handler can safely tear the process down (`exit(pid,-1)`, which
triggers app crash auto-recovery for the app's group) and reschedule the next
runnable process — mirroring sys_exit's "die, run the next, never park forever".
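
A rough sketch of the new decision. It assumes the handler tells user from kernel faults by the privilege bits of the saved CS selector (the usual x86 approach), and that a kernel-mode fault still halts; neither detail is stated above. The names and the selector values in `main` are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum FaultAction {
    /// exit(pid, -1), then run the next runnable process.
    KillProcessAndReschedule,
    /// Assumed behaviour for a fault taken in kernel mode.
    HaltKernel,
}

/// The low two bits of the saved CS selector hold the privilege level
/// the CPU was running at when it faulted; 3 means user mode.
fn fault_action(saved_cs: u64) -> FaultAction {
    if saved_cs & 0b11 == 3 {
        FaultAction::KillProcessAndReschedule
    } else {
        FaultAction::HaltKernel
    }
}

fn main() {
    // Typical selector values, used here only as examples.
    println!("{:?}", fault_action(0x23)); // ring 3
    println!("{:?}", fault_action(0x08)); // ring 0
}
```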

Also gives #GP a GPR-capturing naked entry (GpFaultFrame, like the page-fault
handler) so crash logs show the full register state.

Verified x86: launching an app that hits the intermittent dart:io EventHandler
crash now leaves the shell rendering 27-29 frames PAST the fault (was: dead
machine); a clean launch is unaffected. This makes the OS resilient to the
app-launch race while the underlying race (dart:io engine-thread state
corruption — root-caused, separate fix) is addressed.
…e expirations

force_wake_all_task_runners (the epoll-parking deadlock-breaker) did
`pending = pending.saturating_add(1)` on EVERY timerfd each time it fired
(frequently during app bring-up). For an actively-serviced timerfd that meant
its `pending` could accumulate to large values, so the next timerfd read
returned a big BOGUS expiration count instead of ~1. A healthy, serviced timer
must never report fake expirations piling up. Cap the spurious contribution at 1
(still reports EPOLLIN to unpark the waiter).
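
One reading of the cap, sketched with a toy timerfd. The struct and method names are hypothetical; `max(1)` is an assumption about how "cap the spurious contribution at 1" is implemented, chosen because it keeps the fd readable without ever growing the count.

```rust
#[derive(Default)]
struct TimerFd {
    pending: u64, // expirations not yet consumed by a read
}

impl TimerFd {
    /// Old behaviour: every force-wake looked like a real expiration.
    fn force_wake_old(&mut self) {
        self.pending = self.pending.saturating_add(1);
    }

    /// Capped behaviour: guarantee the fd is readable (pending >= 1)
    /// without accumulating fake expirations.
    fn force_wake_capped(&mut self) {
        self.pending = self.pending.max(1);
    }

    /// A read returns the count and resets it.
    fn read(&mut self) -> u64 {
        std::mem::take(&mut self.pending)
    }
}

fn main() {
    let (mut old, mut capped) = (TimerFd::default(), TimerFd::default());
    for _ in 0..100 {
        old.force_wake_old();
        capped.force_wake_capped();
    }
    println!("old read: {}, capped read: {}", old.read(), capped.read());
    // prints "old read: 100, capped read: 1"
}
```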

Correctness fix. Verified it does not regress boot/render (shell brings up and
renders across runs). NOTE: this is NOT the fix for the intermittent app-launch
dart:io EventHandler crash — that race persists (~50% over 6 runs); root cause is
deeper engine-thread state corruption (separate, ongoing).
… primitive)

x86's reschedule IPI was already complete (broadcast_resched_ipi → send_resched_ipi
→ vector 0x40 → apic_resched_handler), and set_state(Running) already broadcasts on
a wake. This fills the aarch64 gap so the primitive exists on both arches:

- gic::send_sgi_all_but_self / send_sgi: write GICD_SGIR (TargetListFilter 0b01 =
  all-but-self, or an explicit CPU-interface mask). SGI_RESCHED = SGI 0.
- broadcast_resched_ipi() sends SGI_RESCHED, guarded on CPU_COUNT > 1 so single-core
  never touches the GIC (this is on the hot set_state path) — and "all-but-self"
  reaches no one on one core anyway.
- the IRQ handler recognizes + EOIs SGI 0 (taking the interrupt wakes the core from
  wfi; M5 will rerun the scheduler here once APs schedule).
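
The GICD_SGIR values involved can be sketched as plain bit arithmetic, assuming the GICv2 register layout (TargetListFilter in bits [25:24], CPUTargetList in bits [23:16], SGI id in bits [3:0]). The function names below are illustrative and only compute the value; the real code would write it to the distributor's MMIO register.

```rust
const SGI_RESCHED: u32 = 0;
const FILTER_ALL_BUT_SELF: u32 = 0b01;

/// GICD_SGIR value that sends `sgi` to every CPU interface except the sender.
fn sgir_all_but_self(sgi: u32) -> u32 {
    (FILTER_ALL_BUT_SELF << 24) | (sgi & 0xF)
}

/// GICD_SGIR value that sends `sgi` to the CPU interfaces in `cpu_mask`
/// (TargetListFilter = 0b00, so the explicit target list is used).
fn sgir_targets(cpu_mask: u8, sgi: u32) -> u32 {
    (u32::from(cpu_mask) << 16) | (sgi & 0xF)
}

/// The single-core guard: with one CPU there is nothing to write.
fn broadcast_resched_value(cpu_count: usize) -> Option<u32> {
    (cpu_count > 1).then(|| sgir_all_but_self(SGI_RESCHED))
}

fn main() {
    println!("{:#010x}", sgir_all_but_self(SGI_RESCHED)); // 0x01000000
    println!("{:#010x}", sgir_targets(0b10, SGI_RESCHED)); // 0x00020000 (CPU 1 only)
    println!("{:?}", broadcast_resched_value(1)); // None
}
```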

So a thread made runnable on one core can signal the core it's affined to. Compile-
verified both arches × {default, smp}; provably single-core-neutral (guarded no-op +
the SGI-0 branch is unreachable on one core), so the proven single-core path is
unchanged. SGI delivery + the rerun-scheduler action are exercised at M5 (APs idle
until then). x86 untouched (already done).

Pin each launched engine (process group) to a dedicated application core so it
runs with its OWN per-CPU GPR scratch, never sharing the cooperative context that
crashed a 2nd engine on a single shared core. Under -smp 2 the shell renders on the
BSP and a launched app (Files/Web Link) initialises its engine and renders its full
UI on core 1. Verified headless (TCG): 8/8 runs no panic/corruption, 7/8 launch +
render (the 1 miss is a rare boot first-frame stall, non-fatal). The OS never goes
fatal on an app crash. Single-core (the default, no `smp` feature) is byte-unchanged.

Home-core scheduling:
- Process gains home_cpu; assigned atomically in spawn_with_bootstrap (app host ->
  an application core, everything else -> BSP) before the process is Running, so
  there is no window where the wrong core claims it. Threads inherit it.
- next_runnable_pid filters by home_cpu==my_cpu; the app runs ALONE on its core
  while the shell stays foreground-exclusive on the BSP (no concurrent shell
  engine to trip Dart's isolate-confinement check).
- AP runs the home-pinned scheduler loop + the per-core timer-ISR wake-assist
  (home-gated via try_claim) to pump its app's frames; it idle-parks until the
  shell engine is up to avoid contending with the BSP's boot.
- spawn_with_bootstrap broadcasts a reschedule IPI so a home-pinned app's core
  wakes from idle immediately instead of waiting for its next timer tick.
- CPU_COUNT publishes the ONLINE count, not the configured count, so an app is
  never pinned to a core that failed to come online.
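
The home-core rule can be sketched in a few lines. Types and signatures are simplified stand-ins for the kernel's; in particular, picking core 1 as "an application core" assumes the two-core configuration used in the verification runs.

```rust
#[derive(Clone, Copy, PartialEq)]
enum State {
    Runnable,
    Blocked,
}

struct Process {
    pid: u32,
    home_cpu: usize,
    state: State,
}

/// Assign the home core at spawn time: an app host goes to an application
/// core, everything else stays on the BSP (core 0). Using the ONLINE count
/// means an app is never pinned to a core that failed to come up.
fn assign_home_cpu(is_app_host: bool, online_cpus: usize) -> usize {
    if is_app_host && online_cpus > 1 { 1 } else { 0 }
}

/// Each core only ever considers processes homed on it.
fn next_runnable_pid(procs: &[Process], my_cpu: usize) -> Option<u32> {
    procs
        .iter()
        .find(|p| p.home_cpu == my_cpu && p.state == State::Runnable)
        .map(|p| p.pid)
}

fn main() {
    let procs = [
        Process { pid: 2, home_cpu: assign_home_cpu(false, 2), state: State::Runnable },
        Process { pid: 8, home_cpu: assign_home_cpu(true, 2), state: State::Runnable },
        Process { pid: 9, home_cpu: assign_home_cpu(true, 2), state: State::Blocked },
    ];
    assert_eq!(next_runnable_pid(&procs, 0), Some(2)); // BSP runs the shell
    assert_eq!(next_runnable_pid(&procs, 1), Some(8)); // core 1 runs the app
    assert_eq!(assign_home_cpu(true, 1), 0); // AP offline: fall back to BSP
}
```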

SMP-safety fixes this exposed (latent under cooperative single-core):
- Per-CPU page-table lock depth. The re-entrancy counter was a single GLOBAL
  atomic: a 2nd core saw depth>0, treated itself as a nested writer, and BYPASSED
  the real lock -> two cores mutated page tables in parallel -> "corrupt PTE"
  panic. Now per-CPU; each core's first acquire takes the cross-core lock. x86
  preempt-disables the section (IRQs stay on so the app core's frame pump keeps
  running) instead of masking IRQs; the disable is done under a brief IRQ mask so
  the cpu-read can't be migrated mid-update.
- WM<->PTABLE lock order. wm queue methods canonicalised pids (get_group_leader ->
  PTABLE_LOCK) while holding the WM lock, inverting the scheduler's PTABLE->WM
  order (next_runnable_pid_locked calls input_pending_for under PTABLE_LOCK) ->
  ABBA deadlock, both cores wedged. Canonicalise at the WM-module boundary; the
  EventQueue methods now compare already-canonical pids with no PTABLE access.
- Crash-recovery teardown use-after-free. A dead app's pml4 was freed while a
  futex physaddr lookup still held a stale reference -> walk of phys 0 ->
  unrecoverable kernel page fault. Cross-process page-table walks now reject a
  null/out-of-range root (return None) instead of dereferencing it.
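
The lock-order fix can be illustrated with two ordinary mutexes. This is a user-space sketch with hypothetical types: `ptable` maps a pid to its group leader, `wm` holds canonical pids with pending input. The point is only the ordering: the WM-side entry takes and releases PTABLE before it touches the WM lock, so PTABLE->WM is the only nesting that ever occurs.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Kernel {
    ptable: Mutex<HashMap<u32, u32>>, // pid -> group leader
    wm: Mutex<Vec<u32>>,              // canonical pids with pending input
}

impl Kernel {
    fn get_group_leader(&self, pid: u32) -> u32 {
        let table = self.ptable.lock().unwrap();
        let leader = table.get(&pid).copied().unwrap_or(pid);
        leader // PTABLE guard is dropped on return
    }

    /// WM boundary: canonicalise FIRST, then take the WM lock.
    /// PTABLE is never acquired while WM is held.
    fn wm_push_input(&self, pid: u32) {
        let leader = self.get_group_leader(pid);
        self.wm.lock().unwrap().push(leader);
    }

    /// Scheduler side keeps its existing PTABLE -> WM order.
    fn input_pending_for(&self, pid: u32) -> bool {
        let table = self.ptable.lock().unwrap();
        let leader = table.get(&pid).copied().unwrap_or(pid);
        let pending = self.wm.lock().unwrap().contains(&leader);
        pending
    }
}

fn main() {
    let k = Kernel {
        ptable: Mutex::new(HashMap::from([(8, 5)])), // pid 8 belongs to group 5
        wm: Mutex::new(Vec::new()),
    };
    k.wm_push_input(8);
    assert!(k.input_pending_for(8));
    assert!(k.input_pending_for(5));
    assert!(!k.input_pending_for(9));
}
```
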
…e root

The SMP crash-recovery teardown can free a dead app's pml4 while another core
still holds a stale reference to it (a futex physaddr lookup) — the walk then
dereferences a freed/reused table and hits phys_to_virt's "corrupt PTE" panic
(an unrecoverable kernel page fault, deref of phys 0 / out-of-range). The OS
should survive an app's crash, not die with it.

Read-walks (translate_user_page / _flags) and the teardown free walk now route
every table-pointer dereference through walk_table(), which returns None for a
null or out-of-addressable-RAM frame instead of dereferencing it. A garbage or
freed table is treated as "not mapped" (the lookup fails, the caller copes) and
the kernel stays alive. The bare phys_to_virt — which still panics on garbage,
a useful invariant — is unchanged on the WRITE paths. Verified: 0 panics /
0 corrupt-PTE across the SMP app-launch batch (was ~1/10 before).
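
A toy model of the guarded read-walk. Physical memory is simulated with a map from frame address to table entries, and `translate` is a simplified stand-in for translate_user_page; only the `walk_table` check (None for a null or out-of-RAM frame) mirrors the description above.

```rust
use std::collections::HashMap;

/// Simulated physical memory: frame address -> table entries.
type PhysMem = HashMap<u64, Vec<u64>>;

/// Return the table at `table_phys`, or None when the frame is null or
/// outside addressable RAM, instead of dereferencing it.
fn walk_table(mem: &PhysMem, table_phys: u64, ram_end: u64) -> Option<&Vec<u64>> {
    if table_phys == 0 || table_phys >= ram_end {
        return None;
    }
    mem.get(&table_phys)
}

/// Walk one index per level; any bad table makes the lookup "not mapped".
fn translate(mem: &PhysMem, root: u64, indices: &[usize], ram_end: u64) -> Option<u64> {
    let mut frame = root;
    for &i in indices {
        let table = walk_table(mem, frame, ram_end)?;
        frame = *table.get(i)?;
    }
    (frame != 0).then_some(frame)
}

fn main() {
    let ram_end = 0x10_0000;
    let mut mem = PhysMem::new();
    mem.insert(0x1000, vec![0, 0x2000]);
    mem.insert(0x2000, vec![0x3000]);
    mem.insert(0x3000, vec![0x5000]);
    assert_eq!(translate(&mem, 0x1000, &[1, 0, 0], ram_end), Some(0x5000));

    // The middle table is freed and reused: its entry is now garbage.
    mem.insert(0x2000, vec![0xDEAD_0000_0000]);
    assert_eq!(translate(&mem, 0x1000, &[1, 0, 0], ram_end), None);

    // A null root (the "walk of phys 0" case) is rejected up front.
    assert_eq!(translate(&mem, 0, &[1, 0, 0], ram_end), None);
}
```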

packages/oscortex_ui/.dart_tool/ was tracked despite matching .gitignore
(packages/*/.dart_tool/) — it predated the ignore rule. It's a regenerated
build/tooling cache that should never be in version control. Removed from
tracking; the existing .gitignore rule keeps it out.
… Link status

Release docs for the x86 SMP app-launch milestone. Records the home-core fix, the
multi-core requirement, the honest stability picture (OS crash-proof; intermittent
non-fatal first-frame stall; not yet bare-metal-confirmed), and — per request — the
Web Link/browser status: app shell + webview pipeline scaffold with a stub engine
are done and demonstrable; the Servo web engine is NOT integrated yet.
…edger

rules.md: binding rules for every agent — never present a stub as done; "done"
means verified end-to-end on the real artifact; report status as exactly one of
DONE/UNVERIFIED/STUB/NOT-STARTED; finish the task or name the gap precisely.

CLAUDE.md: auto-loaded by agents; makes rules.md mandatory reading before starting
and before reporting done.

docs/FEATURE_STATUS.md: honest per-feature ledger (start of the stub audit) — flags
what has NOT been re-verified rather than rubber-stamping it, and is explicit that
the audit is incomplete.
…not "solved"

Live HVF test 2026-06-17: tapping an app launches the 2nd engine, which stalls at
FlutterEngineRunInitialized (no crash) and freezes the UI. Same single-core
cooperative-scheduler root cause as x86. The prior "solved via serial-GC" note was
wrong. Fix = wire the home-core SMP path for aarch64 (no aarch64 mirror of the x86
timer-ISR home-core wake-assist exists yet).
…ootstrap WIP

Progress on aarch64 home-core SMP (the ARM app-launch freeze). The prior state was
"AP comes online but stays idle (never schedules)". Now, under -smp 2 on HVF (real
parallel cores):
  - The AP idle-parks (wfi) until an app is pinned to it (CPU_HAS_HOME_WORK, set in
    spawn_with_bootstrap + woken by the existing reschedule IPI), so it does NOT
    contend with the BSP's shell first-frame bring-up (engaging at flutter_init_ready
    stalled the shell at present=1).
  - On wake it engages + runs the arch-neutral home-pinned scheduler. The launched
    app's host (pid 8) DOES run on the AP — main_embedder, FlutterEngineInitialize,
    its worker pool all execute there (aarch64 AP EL0 entry works).
  - The timer-ISR wake-assist resume is home-gated + re-enabled (multi-core only;
    single-core keeps the proven set_state_try-only path, byte-unchanged).

NOT DONE (honest, per rules.md): the app's engine bootstrap DETERMINISTICALLY stalls
at FlutterEngineRunInitialized (4/4 runs) — the cooperative thread-pool handshake
does not complete on the AP, so no frame is presented. This is the deep remaining
aarch64 SMP work. Single-core ARM (the shipped run-arm.sh path, no `smp` feature) is
unaffected — all new behaviour is behind the AP/CPU_COUNT>1 gates. x86 unaffected.
…ounter access)

The aarch64 AP never called enable_fpu_simd() (the x86 AP does, in ap_init). That
function sets CPACR_EL1 (FP/SIMD) AND CNTKCTL_EL1.EL0VCTEN/EL0PCTEN, both per-CPU
registers that are still at their reset values on a secondary core. Without CNTKCTL, the app's inline CNTVCT_EL0
read (liboscortex_libc monotonic clock, hit constantly by the Dart VM) trapped to EL1
with EC=0x18 → report_unhandled → the engine bootstrap deterministically wedged on the
AP (4/4 runs, confirmed by an HVF register dump: AP stuck in report_unhandled,
ESR EC=0x18). Adding the call clears EC=0x18. The freeze now advances to a later
fault (EC=0x24 EL0 data abort) — distinct, deeper issue, still WIP; aarch64 SMP
app-launch is NOT yet working. AP-only change; single-core + x86 unaffected.
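
For reference, the register bits and exception classes named above can be sketched as constants plus a decoder. Bit positions follow the Arm architecture (ESR_ELx.EC is bits [31:26]; CNTKCTL_EL1.EL0PCTEN is bit 0 and EL0VCTEN is bit 1; CPACR_EL1.FPEN is bits [21:20]). The function names are illustrative, and the sketch only computes values; it does not touch system registers.

```rust
const CNTKCTL_EL0PCTEN: u64 = 1 << 0; // EL0 may read the physical counter
const CNTKCTL_EL0VCTEN: u64 = 1 << 1; // EL0 may read CNTVCT_EL0
const CPACR_FPEN_NO_TRAP: u64 = 0b11 << 20; // FP/SIMD untrapped at EL0 and EL1

/// Extract the exception class from an ESR_ELx value.
fn esr_ec(esr: u64) -> u64 {
    (esr >> 26) & 0x3F
}

fn describe_ec(ec: u64) -> &'static str {
    match ec {
        0x18 => "trapped MSR/MRS/system instruction",
        0x24 => "data abort from a lower exception level",
        _ => "other",
    }
}

fn main() {
    // Bits every core, including each AP, has to set for itself.
    let cntkctl = CNTKCTL_EL0PCTEN | CNTKCTL_EL0VCTEN;
    println!("CNTKCTL_EL1 |= {:#x}, CPACR_EL1 |= {:#x}", cntkctl, CPACR_FPEN_NO_TRAP);

    // Example ESR values carrying the two exception classes from the log.
    for esr in [0x6000_0000u64, 0x9000_0000] {
        let ec = esr_ec(esr);
        println!("EC={:#x}: {}", ec, describe_ec(ec));
    }
}
```
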
…s EL0 state

Re-enabling enter_user_by_pid_noreturn_try from the AP timer ISR reproduced the
prior session's "pid=2 Dart corruption" exactly: a real EL0 data abort (EC=0x24) in
engine code. Reverted to wake-only (set_state_try), which clears the corruption
(unhandled_exc=0, 3/3). Net aarch64 SMP state with the CNTKCTL fix: the AP runs the
app past the CNTVCT trap, but the engine bootstrap stalls at FlutterEngineRunInitialized:
the cooperative hand-off alone doesn't complete it and the only pump (ISR resume)
corrupts thread state. The real fix is a corruption-free per-CPU context switch
(Redox-style switch_to), a multi-session rewrite. aarch64 SMP app-launch NOT working
yet; single-core + x86 unaffected (all behind AP/CPU_COUNT>1 gates).
… bootstrap needs switch_to rework

The home-gated timer-ISR wake-assist resume now fires only when the target has a
valid saved FP (aarch64_fp_valid), so it can never re-enter a thread with no FP image
and zero its AAPCS64 callee-saved v8–v15 (the EC=0x24 SMP corruption). Verified: 4/4
runs, 0 unhandled exceptions (corruption gone).

But the app bootstrap STILL stalls (4/4, RunInitialized never completes): the threads
that need re-entry during bootstrap are freshly created and have no FP yet, so the
FP-gated resume skips them, and the cooperative hand-off doesn't re-enter them either.
This is the catch-22 proven across 5 distinct attempts (CNTKCTL fix, resume on/off,
FP-gated): a correct resume must restore ANY thread's full state without corruption —
i.e. a real per-CPU switch_to context save/restore, not the build_image rebuild. That
is focused multi-session architectural work (precisely scoped in [[smp-bringup]] M5),
NOT a blind loop hack. aarch64 SMP app-launch remains BROKEN (honest). Single-core +
x86 unaffected (multi-core/AP-gated).
@squirelboy360 squirelboy360 merged commit 43dd3b9 into develop Jun 17, 2026
3 checks passed