feat: distributed FaaS coordination heuristics (FaaS-MADiG / MAPoD / MABR) + coordination rework#2

Open
miciav wants to merge 108 commits into main from feat/uv-migration-and-extended-tests

Conversation

miciav commented May 19, 2026

Member

Summary

This branch adds a family of price-free distributed heuristics for DiFRALB/DeFRALB and reworks the coordination logic shared by the three runners (diffusion, power-of-d, best response) for correctness, determinism, and consistency. The heuristics form a spectrum of coordination styles, all built as controlled ablations of the FaaS-MADeA auction:

  • FaaS-MADiG — greedy diffusion (removes the price signal; buyer scans its whole one-hop neighbourhood greedily by score).
  • FaaS-MAPoD — power-of-d-choices (removes full visibility; buyer probes only d sampled neighbours per step, serves best of sample).
  • FaaS-MABR-S / -R / -O — Gauss-Seidel best response (the sequential counterpart to the simultaneous diffusion methods): fixed-order, randomized-order, and capped-reoptimization variants.

All keep the local planning stack and shared helpers, reuse the same seller-side clearing, and isolate exactly one mechanism each (pricing → visibility → sequential-vs-simultaneous coordination).
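
To make the difference between full-neighbourhood scanning and power-of-d sampling concrete, here is a minimal sketch. It is an illustration only, not the code on this branch: the function names, the dict of scores, and the "higher score is better" convention are assumptions made for the example.

```python
import random


def rank_candidates(scores, candidates):
    """Order candidates by descending score; ties go to the lower index."""
    return sorted(candidates, key=lambda j: (-scores[j], j))


def greedy_diffusion_choice(scores, neighbours):
    """Greedy-diffusion style: scan the whole one-hop neighbourhood."""
    ranked = rank_candidates(scores, neighbours)
    return ranked[0] if ranked else None


def power_of_d_choice(scores, neighbours, d, rng):
    """Power-of-d style: probe only d sampled neighbours, keep the best."""
    sample = rng.sample(sorted(neighbours), min(d, len(neighbours)))
    ranked = rank_candidates(scores, sample)
    return ranked[0] if ranked else None


# Hypothetical scores for four neighbours of one buyer.
scores = {1: 0.4, 2: 0.9, 3: 0.7, 4: 0.9}
neighbours = [1, 2, 3, 4]

best_full = greedy_diffusion_choice(scores, neighbours)
best_sampled = power_of_d_choice(scores, neighbours, d=2, rng=random.Random(0))
```

With full visibility the buyer always finds the top-scored neighbour; with d smaller than the neighbourhood size it may settle for a worse one, which is the visibility ablation the list above describes.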

Validation: uv run pytest -q reports 270 passed. FaaS-MABR e2e runs under real Gurobi (smoke + same-seed reproducibility for all three variants).

FaaS-MABR (Gauss-Seidel best response) — decentralized_bestresponse.py

  • True best-response sweep: each node releases its current buyer row back to the shared residual-capacity ledger, recomputes its placement greedily by score, and commits the coordinate delta (new_row − previous_row). Later nodes observe earlier nodes' updates through the live ledger — the defining Gauss-Seidel property — and the loop converges to a fixed point.
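  • Fixed-point termination on allocation_changed: the loop stops after a sweep in which no node revised its row. This is the correct signal under release-and-recompute, because every node releases and re-places its row in each sweep, so raw placement volume never settles.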
  • Three variants: FaaS-MABR-S (fixed order), FaaS-MABR-R (seeded random order; variance via --n_experiments), FaaS-MABR-O (capped local re-optimization via LSP_capped/LSP_capped_fixedr and a per-node re-solve).
  • Runtime amortization via compute_sweep_runtime (re-optimization time excluded before amortizing bookkeeping over active nodes); input validation for order/response.
  • Additive wiring: CLI keys faas-br-s/-r/-o, method names FaaS-MABR-S/-R/-O (mkey LSPc), compare_results.py palette + default set, planar_comparison.json br_* blocks. Paper-ready LaTeX note under faas-bestresponse-note/ positioning it honestly as the textbook Gauss-Seidel relaxation (not claimed novel).
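
The release-and-recompute sweep and its fixed-point test can be sketched as follows. This is a toy model under stated assumptions, not the contents of decentralized_bestresponse.py: the function names, the dict-based ledger, integer quantities, and the max_sweeps cap are all hypothetical.

```python
def best_response_row(demand, scores_i, residual):
    """Greedy placement by score against the live ledger (index tie-break)."""
    row = {}
    remaining = demand
    for j in sorted(residual, key=lambda j: (-scores_i.get(j, 0.0), j)):
        if remaining <= 0:
            break
        take = min(remaining, residual[j])
        if take > 0:
            row[j] = take
            remaining -= take
    return row


def gauss_seidel_sweeps(demands, scores, capacity, order, max_sweeps=50):
    """Sweep until no node revises its row, or until max_sweeps is reached."""
    residual = dict(capacity)          # shared residual-capacity ledger
    rows = {i: {} for i in demands}    # current buyer row per node
    for sweep in range(1, max_sweeps + 1):
        allocation_changed = False
        for i in order:
            # Release node i's current row back to the ledger.
            for j, q in rows[i].items():
                residual[j] += q
            new_row = best_response_row(demands[i], scores[i], residual)
            # Commit the new row; net effect on the ledger is new_row - old_row.
            for j, q in new_row.items():
                residual[j] -= q
            if new_row != rows[i]:
                allocation_changed = True
                rows[i] = new_row
        if not allocation_changed:
            return rows, residual, sweep
    return rows, residual, max_sweeps


rows, residual, sweeps = gauss_seidel_sweeps(
    demands={"a": 2, "b": 2},
    scores={"a": {0: 1.0, 1: 0.5}, "b": {0: 1.0, 1: 0.2}},
    capacity={0: 3, 1: 3},
    order=["a", "b"],
)
```

On this toy instance node "b" sees the capacity already taken by "a" within the same sweep, and the loop detects the fixed point in the second sweep. Passing a shuffled order with a seeded generator would correspond to the randomized-order variant.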

Cross-method coordination rework (diffusion / powerd / bestresponse)

Landed in lockstep across the three runners so they stay mutually consistent:

  • Memory-aware seller eligibility: a node can host a replica only if rho[j] >= memory_requirement[f] (was rho[j] > 0).
  • Deterministic seller clearing: explicit (score, index) tie-breaks replace unstable np.argsort; evaluate_assignments reworked — seller_pairs now includes current hosts (so saturated sellers can still be re-evaluated for incumbent replacement), leftover-aware replica start, and lowest-score-incumbent-first reassignment.
  • LSPr_fixedr under --fix_r (social-welfare re-solve now fixes replicas consistently with the subproblem); best_centralized_cost initialized to -inf (fixes a latent bug where a negative centralized objective could never set the initial best).
  • coordination_rho zeroes memory/replica expansion under --fix_r (no new replicas when replicas are pinned).
  • force_memory_bids parity in the block-A memory-bid emission; vectorized rmp_omega/omega/fairness updates.
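
The eligibility and tie-break changes can be illustrated with a small sketch. The list-based rho, the scores, and the helper names are assumptions for the example; only the rho[j] >= memory_requirement[f] condition and the (score, index) ordering come from the description above.

```python
def eligible_sellers(rho, memory_requirement_f):
    """Nodes whose residual memory fits at least one replica of function f."""
    return [j for j, free in enumerate(rho) if free >= memory_requirement_f]


def rank_by_score(scores, candidates):
    """Descending score with an explicit index tie-break, so equal scores
    always come out in the same order."""
    return sorted(candidates, key=lambda j: (-scores[j], j))


rho = [0, 128, 512, 256]            # hypothetical residual memory per node
memory_requirement_f = 256          # hypothetical footprint of one replica
scores = [0.9, 0.9, 0.5, 0.5]       # nodes 2 and 3 tie on purpose

old_rule = [j for j, free in enumerate(rho) if free > 0]
new_rule = eligible_sellers(rho, memory_requirement_f)
ranked = rank_by_score(scores, new_rule)
```

Under the old rule node 1 counts as a seller even though its 128 units cannot hold a 256-unit replica. The explicit key also removes any dependence on the sort algorithm: np.argsort defaults to a quicksort variant that does not guarantee a stable order among equal scores.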

Shared run_faasmadea.start_additional_replicas: deterministic proportional allocation plus a leftover-memory packing pass (the old per-function floor division left memory unused).
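
A minimal sketch of proportional allocation followed by a leftover-packing pass is shown below. It is not the actual start_additional_replicas: the signature, the integer weights, and the packing order (highest weight first, index tie-break) are assumptions chosen to show why floor division alone strands memory.

```python
def allocate_replicas(free_memory, memory_requirement, weights):
    """Return (replicas per function, unused memory).

    All quantities are assumed to be non-negative integers.
    """
    replicas = [0] * len(weights)
    total_weight = sum(weights)
    if total_weight <= 0:
        return replicas, free_memory

    # Pass 1: proportional share per function, floored to whole replicas.
    used = 0
    for f, (mem, w) in enumerate(zip(memory_requirement, weights)):
        replicas[f] = (free_memory * w) // (total_weight * mem)
        used += replicas[f] * mem
    leftover = free_memory - used

    # Pass 2: pack whatever the floors left behind, in a deterministic order.
    for f in sorted(range(len(weights)), key=lambda f: (-weights[f], f)):
        if weights[f] <= 0:
            continue
        extra = leftover // memory_requirement[f]
        replicas[f] += extra
        leftover -= extra * memory_requirement[f]
    return replicas, leftover


replicas, unused = allocate_replicas(
    free_memory=1000, memory_requirement=[300, 200], weights=[1, 1]
)
```

In this example the first pass alone yields one replica of the 300-unit function and two of the 200-unit function, leaving 300 units idle; the packing pass converts that remainder into one more replica.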

Also on this branch

  • FaaS-MADiG and FaaS-MAPoD (greedy diffusion + power-of-d), each with a design spec, plan, and citation-audited LaTeX note (faas-madig-note/, faas-mapod-note/ — 5/5 cited works verified against CrossRef, PDFs committed).
  • run.py: method→(mkey, name) mapping flattened into a single METHOD_RESULT_MODELS dict.
  • Hierarchical auction, uv migration, and extended test coverage.
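
For readers unfamiliar with the flattened mapping in run.py, its shape might look like the sketch below. Only the three best-response entries are taken from this description (CLI keys faas-br-s/-r/-o, names FaaS-MABR-S/-R/-O, mkey LSPc); keying the dict by CLI key and the resolve_method helper are assumptions.

```python
# Hypothetical excerpt: method key -> (mkey, display name).
METHOD_RESULT_MODELS = {
    "faas-br-s": ("LSPc", "FaaS-MABR-S"),
    "faas-br-r": ("LSPc", "FaaS-MABR-R"),
    "faas-br-o": ("LSPc", "FaaS-MABR-O"),
}


def resolve_method(method):
    """Look up (mkey, name) for a method key; fail loudly on unknown keys."""
    try:
        return METHOD_RESULT_MODELS[method]
    except KeyError:
        known = ", ".join(sorted(METHOD_RESULT_MODELS))
        raise KeyError(f"unknown method {method!r}; known: {known}") from None


mkey, name = resolve_method("faas-br-r")
```

A single dict keeps the key, the model key, and the display name in one place, so adding a method is a one-line change.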

Heads-up for reviewers / reproducibility

The determinism rework of define_assignments/evaluate_assignments and the leftover packing in start_additional_replicas change the numeric outputs of the existing baselines (FaaS-MADeA, FaaS-MADiG, FaaS-MAPoD), not just those of the new method. The new outputs are more correct and deterministic, but any benchmark CSVs or figures produced before this rework are now stale and should be regenerated for an apples-to-apples three-way comparison.

🤖 Generated with Claude Code

miciav and others added 30 commits February 16, 2026 16:03
All quality gates pass:
- 83 tests pass (18 hierarchical-specific)
- ruff: clean
- mypy: clean
- coverage: 51% (hierarchical_auction core: 87-96%)
…are after hierarchical levels

- engine: broadcast service_quantum per-function, skip seller==buyer,
  sort candidates by effective bid, compute quantity = min(want, tokens*quantum)
- runner: extract compute_offloaded_demand(), initialize rmp_omega,
  recompute compute_social_welfare after hierarchical allocations,
  pass rmp_omega to check_stopping_criteria
- token_manager: preserve quantity ratio on partial token acceptance
- tests: +5 tests for service quantum, no self-allocation, seller
  preference, offloaded demand, partial acceptance ratio
…unction

Removed the fragile zero-sentinel guard (`if np.allclose(structure_price, 0.0)`)
that prevented recomputation when the legitimate price is zero (eta=0, zero node
prices). The call to compute_structure_price is now made once per structure,
immediately before the inner per-function loop, making intent explicit and safe.
Added regression test for a two-function zero-price network.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… production

Add _extract_latency helper using nx_adjacency_matrix with network_latency weight,
call it once before the time loop, and pass the result to both define_bids and
run_higher_levels (replacing the previous np.zeros placeholders).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…blic exports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ze loops, cache available tokens, move engine out of loop
…LI parsing

Brings coverage from 52% to 60% (+406 covered statements). Highlights:
- models/sp.py 44→86%, models/auction_models.py 45→84%
- generate_data.py 49→76%, run_centralized_model.py 37→63%
- what_if_analysis.py 33→53%, run_faasmacro.py 26→35%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add config_files/planar_hierarchical.json for running the hierarchical
auction model on Sage-generated planar degree-3 graphs (Nn 10-50, 3
repetitions). Document the workflow and conda install requirement in README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- fix: omega_bar and y_bar params use PYO_PARAM_TYPE (NonNegativeReals)
  instead of PYO_VAR_TYPE — solver outputs can be fractional
- fix: import PYO_PARAM_TYPE in models/sp.py
- fix: use nx. prefix for circular_ladder_graph and adjacency_matrix
  in generators/generate_data.py after merge removed explicit imports
- fix: add hierarchical termination condition format to postprocessing
  parser in run.py (missing obj. deviation / best it fields)
- fix: remove undefined title_key references in rlagents/postprocessing.py
- feat: add pre-commit ruff hook (pre-push stage)
- test: regression tests for omega_bar/y_bar float domain and missing import
- config: update planar_hierarchical.json load to sinusoidal trace type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- hierarchical runner now saves runtime.csv with 'tot' column so that
  results_postprocessing can read it without falling back to FaaS-MACrO
  log parsing
- fix deviation append to handle None (not just the string "None") in
  load_termination_condition for hierarchical TC format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace planar_hierarchical.json with planar_comparison.json covering
centralized, faas-macro, and hierarchical on planar degree-3 graphs.
Update README accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
miciav and others added 7 commits June 30, 2026 15:52
…quisites

cmd_run (Inventory -> Project -> jobs -> Dispatcher -> run_batch -> Manifest)
had no automated coverage; only cmd_define and arg parsing were tested. Adds
a fake Dispatcher (context-manager variant of the existing FakeDispatcher
test double from test_remote_experiments_runner.py) to exercise the full
wiring without real Ray/SSH, and asserts the resulting Manifest reflects a
completed run.

README was missing the ../ray-dispatcher sibling-checkout requirement (a
uv path dependency), VM prerequisites (SSH + licensed Gurobi), and
--project-path defaults/excludes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
miciav commented Jun 30, 2026

Member Author

Added remote_experiments/: dispatches DFaaSOptimizer algorithm comparisons to remote Gurobi VMs via ray-dispatcher, with a two-phase define/run workflow, a pluggable experiment-suite registry, manifest-based stop/resume, and a rich TUI for live per-experiment/per-VM progress.

13 tasks, TDD throughout, individually reviewed + a final whole-branch review (one fix-wave addressed: added cmd_run integration test coverage and README prerequisites). 320/320 tests passing.

Depends on a small prerequisite addition to ray-dispatcher itself (Dispatcher.running_hosts(), already merged on that repo's main) for live per-VM job attribution.

See docs/superpowers/specs/2026-06-30-remote-experiments-design.md and docs/superpowers/plans/2026-06-30-remote-experiments-implementation.md.

miciav and others added 22 commits June 30, 2026 17:11
The centralized-feasibility series restricted the hierarchical auction's
offloading to neighbours (no ping-pong), but left the replica-acquisition
path (start_additional_replicas) greedily filling residual memory to
receive offload that the restriction now keeps from arriving. The leftover
replicas made the combined solution violate utilization_equilibrium2, and
since combine_solutions validates every auction iteration, the run aborted
on the first such intermediate state. This was a regression: the pre-fix
code produced 0 utilization_equilibrium2 violations on the affected
instances.

combine_solutions now sets r = ceil(served_utilization / max_utilization)
— the centralized model's own replica equilibrium — for the realized
served load (local + received offload). r does not enter the welfare
objective (alpha*x + beta*y - gamma*z), so this is an objective-neutral
feasibility repair: it frees the wasted replicas, never increases memory
use, and satisfies utilization_equilibrium and utilization_equilibrium2 by
construction. For FaaS-MACrO the MILP already pins r to this value, so the
recomputation is a no-op there.
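
The replica repair above can be sketched in a few lines. The function name and the example numbers are hypothetical; only the formula r = ceil(served_utilization / max_utilization) is taken from this commit message.

```python
import math


def repaired_replicas(served_utilization, max_utilization):
    """Smallest replica count that can carry the realized served load."""
    if max_utilization <= 0:
        raise ValueError("max_utilization must be positive")
    return math.ceil(served_utilization / max_utilization)


# Hypothetical node: local load plus offload actually received.
local_utilization = 1.2
received_utilization = 0.9
max_utilization = 0.8
replicas_before = 5   # greedily started, partly idle

replicas_after = repaired_replicas(
    local_utilization + received_utilization, max_utilization
)
freed = replicas_before - replicas_after
```

Because r does not appear in alpha*x + beta*y - gamma*z, shrinking it this way leaves the welfare objective unchanged.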

Verified: the three previously-crashing planar instances now run to
completion with centrally-feasible solutions and hierarchical objective
<= centralized at every step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>