Add cross-library tensor-network benchmark suite (TeNPy, quimb, Cytnx)#940
Add cross-library tensor-network benchmark suite (TeNPy, quimb, Cytnx)#940IvanaGyro wants to merge 69 commits into
Conversation
Adds common/metrics.py (CPU/GPU timing and peak-memory measurement helpers, CSV result writer) and common/model.py (shared Heisenberg and transverse-field-Ising parameters plus the chi x L sweep grid) used by every per-library benchmark script in benchmarks/cross_library/. Adds the TeNPy implementations of all four algorithm classes: dense finite two-site DMRG, U(1)-symmetric block-sparse DMRG, TEBD real-time field-quench dynamics, and a hand-derived analytic-gradient variational ground-state search (TeNPy has no autodiff backend, so the gradient of the Rayleigh quotient is computed via TeNPy's own OneSiteH contraction rather than backpropagation). TeNPy is CPU-only, so no GPU code path is included here. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the quimb implementations of all four algorithm classes: dense finite two-site DMRG, U(1)-symmetric block-sparse DMRG, TEBD real-time field-quench dynamics (using quimb's built-in TEBD engine), and a variational ground-state search using real automatic differentiation through both the JAX and PyTorch backends via quimb's autoray-based array dispatch (psi.arrays / apply_to_arrays). CPU and GPU code paths are both written for every script; the GPU path moves arrays to the JAX/PyTorch CUDA device before stepping. It is untested in this environment, which has no GPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…radient Adds the Cytnx implementations of all four algorithm classes, built directly on UniTensor/Network/Contract since Cytnx has no built-in DMRG/TEBD engines or autodiff: - dmrg_dense.py / dmrg_symmetric.py: finite two-site DMRG with a bond-dimension-5 Heisenberg MPO, dense and U(1)-total-Sz-conserving block-sparse variants. Both MPOs were verified against exact diagonalization for L=4,6 before being used here. - tebd.py: a hand-rolled TEBD sweep for the transverse-field-Ising field-quench benchmark. Applies the two-site gate exp(-i*dt*h_bond) in a strictly sequential left-to-right sweep, absorbing the post-SVD singular-value tensor into the not-yet-visited (right) neighbor so the orthogonality center moves forward with the sweep; absorbing into a fixed side regardless of sweep direction breaks the canonical gauge. The on-site field term is split between the two bond gates touching each site (half-weight on interior bonds, full weight at the two chain boundaries) so it isn't double-counted for interior sites. Validated against exact diagonalization for L=4,6 (state overlap -> 1 as dt -> 0). - variational_manual_grad.py: gradient-descent ground-state search using the hand-derived Rayleigh-quotient gradient dE/dA_i* = 2*(H_eff,i(A_i) - E*A_i), with H_eff,i built from the same per-site L/R environment-update Networks used by dmrg_dense.py. CPU and GPU code paths are both written; the GPU path moves every UniTensor to cytnx.Device.cuda before the sweep. It is untested in this environment, which has no GPU. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_all.py invokes every library/algorithm benchmark script in this suite as a subprocess and collects their CSV output under results/, so the full 3-library x 4-algorithm matrix can be run with one command (with --only and --skip flags to restrict to a subset of libraries). Subprocess isolation avoids one library's global state (device placement, RNG seeding, monkeypatched backends) leaking into another library's run. plot_results.py reads the CSVs and produces per-algorithm time/memory vs. chi and vs. L plots, one line per (library, backend) series. README.md documents the shared model and parameter grid, how the gradient is computed differently per library in algorithm class 3 (quimb autodiff vs. TeNPy/Cytnx's independently-derived analytic gradients), which CPU/GPU paths exist per library, and how to run the suite end to end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive cross-library tensor-network benchmark suite comparing TeNPy, quimb, and Cytnx across several algorithms. The review feedback highlights several critical bugs and platform-specific issues: the Lanczos solver in Cytnx benchmarks uses an excessively high convergence threshold that prevents proper iterations; an in-place relabeling chain in Cytnx's TEBD script will cause a crash due to returning None; an incorrect module call and signature are used for generating random abelian MPS in quimb's symmetric DMRG benchmark; and the metrics utility contains platform-dependent discrepancies in RSS memory calculation on macOS as well as an incorrect GPU memory query that reads free memory instead of peak usage.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
resource.getrusage().ru_maxrss is reported in kilobytes on Linux but in bytes on Darwin (macOS), per the getrusage(2) man page on each platform. cpu_timed_block divided the RSS delta by 1024.0 unconditionally, which is correct on Linux but underreports peak_mem_mb by a factor of 1024 on macOS. Pick the divisor based on sys.platform.
…sage cudaMemGetInfo() returns (free_bytes, total_bytes) for the current device. Indexing [0] read the free-memory figure (typically several GB) and reported it as peak_mem_mb, which is the opposite of what the field means. Cytnx exposes no peak-allocator counter to Python (unlike torch.cuda.max_memory_allocated), so there is no substitute available; report 0.0 instead of a misleading number.
…ning cytnx.linalg.Lanczos treats CvgCrit as an energy-difference convergence threshold: once the change between iterations drops below it, the solver stops early. CvgCrit=9999999999 is below essentially any energy difference on the first iteration, so the solver always converged after a single step regardless of LANCZOS_MAXITER. Use CvgCrit=1e-12 so the threshold is effectively unreachable and the solver runs the full LANCZOS_MAXITER iterations these benchmarks intend to measure.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #940 +/- ##
==========================================
+ Coverage 29.87% 30.87% +0.99%
==========================================
Files 240 229 -11
Lines 35425 34757 -668
Branches 14729 14409 -320
==========================================
+ Hits 10584 10730 +146
+ Misses 17593 16720 -873
- Partials 7248 7307 +59
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
The (chi, L) sweep grid in common/model.py spans up to chi=256, L=200; at that end, a single DMRG sweep or gradient step can take several minutes, making a full suite run impractical when running all libraries/algorithms sequentially on one machine. Add STEP_TIMEOUT_SEC (120s) to common/model.py and a SIGALRM-based time_limit() context manager plus StepTimeoutError in common/metrics.py. Each benchmark script's main loop now wraps its run_one(chi, L) call in time_limit(STEP_TIMEOUT_SEC) and, on StepTimeoutError, logs the point as skipped and continues to the next (chi, L) pair instead of writing a CSV row for it.
Add an `answer: float` field to the shared `StepMeasurement` dataclass in common/metrics.py, picked up automatically by CSVResultWriter since its column list is derived from the dataclass fields. Thread the converged/final physical result through every benchmark script's run_one()/main(): ground-state energy for the DMRG-class scripts (dense and symmetric, cytnx/tenpy/quimb), post-quench energy for the TEBD/TDVP scripts, and the final Rayleigh-quotient energy for the variational gradient-descent scripts. For quimb's tebd.py this required building the full Ising MPO via qtn.MPO_ham_ising to compute <psi|H|psi> after stepping, since TEBD's LocalHam1D has no built-in MPO accessor. For quimb's dmrg_symmetric.py this required switching from the generic local_expectation (which raises AttributeError on mps.partial_trace in the installed symmray/quimb version pairing) to local_expectation_exact. This lets cross-library correctness be checked directly from the production CSV output instead of requiring a separate validation pass.
|
The symmetric DMRG code |
|
For L=200 Heisenberg model 6 sweeps (3 full sweeps, in Cytnx terminology) is too few for convergence. But some numbers for comparison. Cytnx is impressively fast compared with Python What follows is an AI-assisted summary of Cytnx benchmark vs the Matrix Product Toolkit (AI ran the Cytnx code, I ran the MPTK codes and supplied the results to Codex that assembled the tables) PR #940 Cytnx DMRG benchmark comparisonComparison target: open spin-1/2 Heisenberg chain, PR #940 uses For this benchmark, the genuine fixed- The best available DMRG energy for this finite chain is about Main comparison
Relative numbers
PR #940 selected Cytnx results
Caveats
|
Add completed_keys() to common/metrics.py, which reads a benchmark's existing output CSV and returns the set of (chi, L) (or (backend, chi, L) for variational_ad.py) combinations already recorded. Each of the 12 benchmark scripts' main() loops now skip grid points already present in the output CSV before running them. Without this, a script interrupted partway through its chi/L grid (e.g. by a process kill) has to redo every point from scratch on the next invocation, including the slow, already-completed ones at large L and chi.
The timing/memory benchmark scripts in run_all.py run a fixed small number of sweeps/steps purely to get a stable per-step timing sample and never record the physical answer (ground-state energy, observable) they compute along the way, so they cannot by themselves confirm that TeNPy, quimb, and Cytnx are solving the same physical problem. validate_correctness.py instead drives each library's own DMRG/TEBD implementation to convergence on a small chain (small enough for dense exact diagonalization) and checks: (1) converged dense-DMRG ground energy agrees across TeNPy, quimb, Cytnx, and exact diagonalization of the same open-chain Heisenberg Hamiltonian; (2) converged U(1)-symmetric DMRG ground energy agrees between TeNPy and Cytnx (quimb's dmrg_symmetric benchmark runs imaginary-time evolution of a random state to exercise contract+truncate cost, not a ground-state search, so it is excluded); (3) post-quench TFIM energy after a short real-time evolution agrees across TeNPy TEBD, quimb TEBD, Cytnx's hand-rolled TEBD, and exact propagation of the same initial state.
…malization After each gradient step, every one of the L MPS tensors was independently rescaled to Frobenius norm 1. The MPS built by MPS_rand_state is not in canonical form, so per-tensor unit normalization does not keep the contracted wavefunction norm <psi|psi> close to 1; it drifts during the gradient descent and eventually underflows, producing NaN energies in both the jax and torch backends. Rescale all L tensors by a single global factor derived from <psi|psi>^(-1/2L) instead, which renormalizes the contracted wavefunction directly rather than each tensor's own norm.
…n MPS gauge
The single-site Rayleigh-quotient shortcut
energy = <theta|H_eff|theta> / <theta|theta>
used at each site during the forward sweep only equals the true
<psi|H|psi>/<psi|psi> when theta sits at the orthogonality center of a
mixed-canonical MPS: all sites left of theta left-orthonormal, all
sites right of theta right-orthonormal. The MPS built by _build_mps was
left-canonical (the wrong direction for a left-to-right sweep) and was
never re-orthonormalized as the sweep advanced, so the shortcut became
increasingly invalid past the first site, and energies collapsed toward
zero (from ~1e-20 at L=20 down to ~1e-91 at L=50).
Add _canonicalize_right(), used both to build the initial MPS
right-canonical (mirroring TeNPy's analogous benchmark, which
constructs its MPS with form="B") and to restore the right-canonical
gauge at the end of every sweep. Within a sweep, push the orthogonality
center forward via Gesvd after updating each site so later sites see a
properly left-orthonormal left environment.
This surfaced a second, independent bug: Cytnx's UniTensor arithmetic
operators (-, *) reset labels to the default ['0','1',...] sequence
instead of preserving the operand's labels. The Gesvd split introduced
above relies on the split-off bond leg's label matching the neighbor
tensor's leg for Contract to find the shared leg; with the reset
labels, Contract silently fell back to an outer product instead of a
contraction, producing a higher-rank tensor that only failed later, at
an unrelated relabel_() call, with a rank-mismatch error. Relabel
new_theta back to its site's real bond names immediately after the
arithmetic update and before the Gesvd split.
Full CPU benchmark run + correctness notesRan the full cross-library benchmark suite ( Bugs found and fixed during this run1. Both the jax and torch grad-descent loops were renormalizing each of the L MPS tensors independently to Frobenius-norm 1 after every gradient step. Since the MPS is not kept in canonical form during the sweep, per-tensor normalization does not keep the contracted 2. The single-site Rayleigh-quotient shortcut Fixing the gauge surfaced a second, independent issue: Cytnx's UniTensor arithmetic operators ( Verified with a full rerun: energies are now sane, negative, O(1)-to-O(10) values that grow with Cross-library correctness agreement
Coverage gaps (timeouts)All scripts use a 120s per-step/per-sweep timeout (
GPU pathsNone of the GPU code paths ( Methodology noteThe container running this environment was reclaimed/restarted several times mid-run, silently killing whatever benchmark process was in flight. The per- Generated by Claude Code |
Full raw results dataRaw CSV contents for the benchmark run summarized above. These files are generated output (
|
By humanI am not familiar with these high level algorithms, so I am not sure if these numbers are useful. Below is posted by the bot. Time / memory summary tables (per-case, per-library)Time columns first, then memory columns, all other fields omitted per request. Dense DMRG (
|
| L | chi | time cytnx | time tenpy | time quimb | mem cytnx | mem tenpy | mem quimb |
|---|---|---|---|---|---|---|---|
| 20 | 16 | 0.3916 | 1.4227 | 0.2903 | 1.668 | 8.418 | 14.672 |
| 20 | 32 | 0.6754 | 1.4518 | 0.6275 | 3.117 | 2.750 | 5.409 |
| 20 | 64 | 1.3104 | 6.0042 | 1.2596 | 11.242 | 7.719 | 8.120 |
| 20 | 128 | 4.3818 | 6.5157 | 1.6853 | 42.691 | 23.997 | 22.791 |
| 20 | 256 | 12.8531 | 8.9549 | 2.3858 | 165.012 | 55.881 | 64.905 |
| 50 | 16 | 1.0047 | 4.2100 | 0.7081 | 1.068 | 3.503 | 7.453 |
| 50 | 32 | 1.6072 | 4.5398 | 2.3738 | 0.152 | 4.853 | 9.536 |
| 50 | 64 | 4.4042 | 26.0227 | 6.1914 | 0.150 | 15.077 | 27.691 |
| 50 | 128 | 17.3190 | — | 15.8443 | 0.149 | — | 94.922 |
| 50 | 256 | — | — | — | — | — | — |
| 100 | 16 | 2.1489 | 9.4076 | 1.6516 | 0.162 | 4.091 | 12.038 |
| 100 | 32 | 3.9295 | 10.0322 | 6.0649 | 0.164 | 8.348 | 21.946 |
| 100 | 64 | 10.6118 | — | 15.8273 | 0.164 | — | 62.193 |
| 100 | 128 | — | — | — | — | — | — |
| 100 | 256 | — | — | — | — | — | — |
| 200 | 16 | 4.2838 | 18.4350 | — | 0.176 | 6.132 | — |
| 200 | 32 | 7.6182 | 21.9050 | — | 0.178 | 15.302 | — |
| 200 | 64 | 21.2987 | — | — | 1.094 | — | — |
| 200 | 128 | — | — | — | — | — | — |
| 200 | 256 | — | — | — | — | — | — |
Symmetric DMRG (dmrg_symmetric, U(1))
quimb's numbers here are from a non-comparable proxy benchmark (imaginary-time evolution, not ground-state search — see earlier comment), included for completeness only.
| L | chi | time cytnx | time tenpy | time quimb | mem cytnx | mem tenpy | mem quimb |
|---|---|---|---|---|---|---|---|
| 20 | 16 | 0.6073 | 1.9222 | 0.2327 | 2.125 | 4.824 | 6.777 |
| 20 | 32 | 0.6729 | 1.9686 | 0.3278 | 1.160 | 0.963 | 5.875 |
| 20 | 64 | 0.7169 | 1.9959 | 0.3159 | 0.750 | 2.028 | 3.625 |
| 20 | 128 | 0.7213 | 2.2141 | 0.2934 | 0.087 | 5.433 | 2.100 |
| 20 | 256 | 0.7346 | 3.0319 | 0.2861 | 0.087 | 12.582 | 1.633 |
| 50 | 16 | 1.6869 | 5.6062 | 0.5910 | 1.068 | 1.224 | 1.731 |
| 50 | 32 | 1.9222 | 5.9628 | 0.9162 | 0.155 | 1.773 | 6.750 |
| 50 | 64 | 2.0831 | 6.5418 | 1.6070 | 1.750 | 4.349 | 32.066 |
| 50 | 128 | 2.0811 | 7.6943 | 1.9180 | 0.153 | 14.523 | 38.250 |
| 50 | 256 | 2.1119 | — | 1.7470 | 0.154 | — | 15.250 |
| 100 | 16 | 3.3927 | 12.6546 | 1.1822 | 0.164 | 2.080 | 1.900 |
| 100 | 32 | 4.1338 | 13.0526 | 2.0340 | 0.166 | 3.235 | 8.372 |
| 100 | 64 | 4.5439 | 14.3994 | 3.6257 | 3.500 | 8.124 | 39.875 |
| 100 | 128 | 4.3787 | 16.9409 | — | 0.165 | 27.364 | — |
| 100 | 256 | 4.4010 | — | — | 0.165 | — | — |
| 200 | 16 | 6.9652 | 27.9103 | 2.4781 | 0.179 | 3.709 | 4.438 |
| 200 | 32 | 8.4066 | 28.9233 | — | 1.096 | 6.163 | — |
| 200 | 64 | 9.0969 | — | — | 5.875 | — | — |
| 200 | 128 | 9.3197 | — | — | 0.178 | — | — |
| 200 | 256 | 9.2514 | — | — | 0.178 | — | — |
TEBD quench (tebd_quench)
quimb energies here use a different operator convention (see earlier comment) — timing/memory are still meaningful for cross-library comparison.
| L | chi | time cytnx | time tenpy | time quimb | mem cytnx | mem tenpy | mem quimb |
|---|---|---|---|---|---|---|---|
| 20 | 16 | 0.0417 | 0.1967 | 0.0505 | 1.906 | 3.543 | 8.305 |
| 20 | 32 | 0.0497 | 0.1955 | 0.0482 | 0.973 | 0.689 | 0.525 |
| 20 | 64 | 0.0916 | 0.1988 | 0.0482 | 3.203 | 0.694 | 0.525 |
| 20 | 128 | 0.2871 | 0.1989 | 0.0480 | 11.770 | 0.690 | 0.525 |
| 20 | 256 | 1.1711 | 0.2017 | 0.0473 | 37.453 | 0.689 | 0.525 |
| 50 | 16 | 0.1132 | 0.5145 | 0.1235 | 0.143 | 0.836 | 0.559 |
| 50 | 32 | 0.1469 | 0.5169 | 0.1245 | 0.138 | 1.401 | 0.559 |
| 50 | 64 | 0.3502 | 0.5199 | 0.1244 | 0.138 | 1.400 | 0.559 |
| 50 | 128 | 1.6120 | 0.5178 | 0.1222 | 0.139 | 1.399 | 0.559 |
| 50 | 256 | — | 0.5220 | 0.1237 | — | 1.401 | 0.559 |
| 100 | 16 | 0.2296 | 1.0409 | 0.2555 | 0.143 | 1.513 | 0.618 |
| 100 | 32 | 0.3093 | 1.0616 | 0.2563 | 0.141 | 2.566 | 0.618 |
| 100 | 64 | 0.7863 | 1.0508 | 0.2549 | 0.141 | 2.568 | 0.618 |
| 100 | 128 | — | 1.0543 | 0.2528 | — | 2.570 | 0.618 |
| 100 | 256 | — | 1.0503 | 0.2575 | — | 2.570 | 0.618 |
| 200 | 16 | 0.4521 | 2.0899 | 0.5188 | 0.147 | 2.859 | 0.735 |
| 200 | 32 | 0.6511 | 2.1237 | 0.5162 | 0.146 | 4.894 | 0.735 |
| 200 | 64 | 1.6647 | 2.1461 | 0.5152 | 0.146 | 4.894 | 0.735 |
| 200 | 128 | — | 2.1408 | 0.5139 | — | 4.895 | 0.735 |
| 200 | 256 | — | 2.1371 | 0.5141 | — | 4.897 | 0.735 |
Variational manual-gradient / AD (variational_manual_grad / variational_ad)
quimb runs both jax and torch backends; cytnx/tenpy use the hand-derived gradient.
| L | chi | time cytnx | time tenpy | time quimb-jax | time quimb-torch | mem cytnx | mem tenpy | mem quimb-jax | mem quimb-torch |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 16 | 0.0753 | 0.3270 | 0.1899 | 0.1595 | 1.379 | 1.027 | 168.922 | 0.789 |
| 20 | 32 | 0.0958 | 0.3486 | 0.1959 | 0.0523 | 1.660 | 2.855 | 90.266 | 0.714 |
| 20 | 64 | 0.1691 | 0.7227 | 0.2523 | 0.0854 | 5.273 | 9.349 | 163.824 | 0.716 |
| 20 | 128 | 0.5435 | 1.7719 | 0.3395 | 0.1953 | 20.988 | 30.856 | 175.207 | 0.716 |
| 20 | 256 | 2.3875 | 2.5459 | 0.6603 | 0.6635 | 73.309 | 98.860 | 505.789 | 0.716 |
| 50 | 16 | 0.1909 | 0.9456 | 0.2225 | 0.1056 | 0.142 | 2.547 | 2.924 | 1.042 |
| 50 | 32 | 0.2670 | 0.9480 | 0.2556 | 0.1402 | 0.142 | 8.678 | 2.936 | 1.041 |
| 50 | 64 | 0.5735 | 2.5102 | 0.3315 | 0.2605 | 0.143 | 32.061 | 2.937 | 1.047 |
| 50 | 128 | 2.7118 | — | 0.7139 | 0.8753 | 0.142 | — | 203.750 | 1.047 |
| 50 | 256 | — | — | 3.2461 | 4.8831 | — | — | 1133.508 | 1.047 |
| 100 | 16 | 0.3924 | 1.7079 | 0.4327 | 0.2399 | 0.147 | 5.210 | 5.409 | 1.455 |
| 100 | 32 | 0.5207 | 1.9003 | 0.5237 | 0.2922 | 0.147 | 18.351 | 5.386 | 1.453 |
| 100 | 64 | 1.2257 | — | 0.6442 | 0.4940 | 0.148 | — | 5.389 | 1.465 |
| 100 | 128 | — | — | 1.5026 | 1.7921 | — | — | 18.508 | 1.465 |
| 100 | 256 | — | — | — | — | — | — | — | — |
| 200 | 16 | 0.8292 | 3.4438 | 0.9004 | 0.4505 | 0.159 | 10.572 | 9.981 | 2.280 |
| 200 | 32 | 1.0977 | 3.7956 | 1.0481 | 0.6096 | 0.159 | 37.706 | 9.918 | 2.277 |
| 200 | 64 | 2.6649 | — | 1.2627 | 1.1007 | 0.159 | — | 9.916 | 2.301 |
| 200 | 128 | — | — | 3.1153 | 4.2115 | — | — | 9.919 | 2.301 |
| 200 | 256 | — | — | — | — | — | — | — | — |
Generated by Claude Code
|
@IvanaGyro this is very interesting, but I suggest being careful about posting preliminary benchmark data for other libraries in a public forum before checking that the comparisons are fair. For this kind of benchmark, the important question is not just fixed-work runtime, but whether the libraries are running equivalent algorithms with equivalent convergence criteria and comparable initialization, and whether the result reflects what a real user would likely experience. Otherwise the table can easily compare different workloads rather than different implementations. For example, in the DMRG data the Cytnx, TeNPy, and quimb scripts do not appear to use fully equivalent initial states, sweep logic, eigensolver settings, truncation settings, or convergence criteria. The symmetric DMRG numbers are especially delicate, since the Cytnx result at larger The The memory numbers are particularly hard to interpret. For such small calculations, the actual memory stored by the tensor data is tiny and will be lost in the noise of the Python runtime, library allocators, and process-level RSS/high-water measurements. In several places the reported memory even decreases as |
Agree. This PR is just for bots to verify the implementation of the benchmarks. However, I forgot that PRs and issues can not be deleted on GitHub. Maybe I have to remove some comments after getting the useful information. The purpose of this benchmark is to investigate the possibility of moving all stuff to Python only. |
BlockUniTensor.shape() reports each bond's nominal (sum-of-sectors) extent, not the number of elements actually stored in the nonzero charge blocks. The Lanczos LinOp built in _optimize_psi was sized from psi.shape(), which over-counts the active dimension of a block-sparse psi. Add _stored_numel(), which sums the element count of psi's actual blocks, and use it instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…benchmarks The variational ground-state benchmarks took an unnormalized gradient g = 2*(H_eff(A) - E*A), then normalized the gradient *direction* (g / ||g||) before scaling by LEARNING_RATE. This turns the update into a fixed-step-size method that ignores the gradient's magnitude, rather than standard gradient descent, and converges far more slowly/poorly than a raw-gradient step at a comparable learning rate. Drop the per-step direction normalization in cytnx_bench and quimb_bench and use the raw gradient directly, raising LEARNING_RATE from 1e-3 to 0.1 to compensate for the now-unnormalized step size. For tenpy_bench, replace the previous renormalize-each-tensor-then-call canonical_form() approach with incremental QR re-canonicalization: each updated site is immediately QR-left-canonicalized and the leftover upper-triangular factor is folded into the next site's theta before that site is read out, with a single trailing canonical_form() call to restore the right-canonical 'B' form for the next sweep's environment caching. This keeps the per-site Rayleigh-quotient energy estimate consistent with the orthogonality-center assumption that OneSiteH's effective Hamiltonian relies on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each script's timed_block() context previously wrapped only the sweep/grad-step loop, while psi/MPO/MPS construction happened before entering the block. cpu_timed_block/jax_gpu_timed_block/torch_gpu_timed_block/ cytnx_gpu_timed_block measure peak process/device memory over their context's lifetime, so peak_mem_mb only reflected the loop's incremental memory growth and missed the (often larger) construction-time allocation. Move construction (MPO/MPS/model/engine setup) inside the timed_block context in all 12 cytnx_bench/tenpy_bench/quimb_bench scripts, and any post-loop measurement calls (e.g. final energy evaluation) that also allocate memory. step_time is computed from a local perf_counter() pair placed tightly around just the sweep/step loop, so per-step timing semantics are unchanged; only the memory measurement window grows to cover the whole benchmarked operation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each of the 12 benchmarks/cross_library/{cytnx,tenpy,quimb}_bench/*.py
scripts gets a sibling test_<name>.py that calls its run_one(chi, L) (or
run_one_jax/run_one_torch for quimb's variational_ad.py) through
pytest-benchmark's benchmark.pedantic at a single fixed (chi=16, L=20)
point, asserting the returned energy against a pytest.approx reference so
a wrong physical answer fails the test run rather than just producing a
plausible-looking timing number. A separate @pytest.mark.limit_memory
test per script exercises the same run_one() under pytest-memray, run
via a distinct `pytest --memray` invocation since pytest-memray's
instrumentation and pytest-benchmark's calibration loop don't compose in
a single test call.
dmrg_dense/dmrg_symmetric/tdvp/tebd use rel=1e-4..1e-6 tolerances since
those either start from a deterministic state or converge via several
sweeps of local eigensolves regardless of the (seeded or unseeded)
initial MPS. The two variational_manual_grad tests use a wider rel=1e-2
tolerance because gradient descent only takes a small, fixed number of
steps from an unseeded random initial MPS, so the energy reached after a
fixed step count is more sensitive to the starting point than a
converged DMRG sweep is.
Adds common/import_util.py: cytnx_bench, tenpy_bench, and quimb_bench
share several script basenames (dmrg_dense.py, dmrg_symmetric.py,
tebd.py, variational_manual_grad.py), so a plain
`from dmrg_dense import run_one` after sys.path.insert(0, dirname(...))
caches the imported module under the bare name in sys.modules --
whichever test file imports a given basename first wins, and every
other test file silently reuses that cached module instead of its own
sibling script. import_util.load_sibling_module() loads each script
under a directory-qualified module name to avoid the collision.
__init__.py markers are added to cross_library/ and its three *_bench
subdirectories so pytest assigns each test_<name>.py a package-qualified
module name too, since plain rootdir-relative imports hit the same
basename collision for test files spread across multiple directories.
Adds a "benchmark" extra (pytest-benchmark, pytest-memray) to
pyproject.toml and documents both invocations in
benchmarks/cross_library/README.md.
Co-Authored-By: Claude <noreply@anthropic.com>
…cking
Each test_*.py in benchmarks/cross_library/{cytnx,tenpy,quimb}_bench
unpacked run_one's full (step_time, peak_mem_mb, energy) return tuple,
naming the first two fields with a leading underscore. Those two values
duplicate what pytest-benchmark and pytest-memray already measure
through their own instrumentation (benchmark.pedantic's timing, and
pytest-memray's --memray pass), so the manually computed step_time and
peak_mem_mb were never read after assignment.
Replace the three-name unpacking with `*_, energy = ...` in all 24
call sites, keeping only the value each assertion actually checks.
Each test_*.py under benchmarks/cross_library/{cytnx,tenpy,quimb}_bench
called run_one(CHI, L) with CHI and L as bare module constants, so
pytest-benchmark's reports and test IDs carried no information about
which (chi, L) point a given run measured -- every test showed up under
its bare function name regardless of the input size.
Replace the module constants with @pytest.mark.parametrize("chi,length",
[(CHI, L)]) on both the benchmark and memory test functions, so pytest
collects and reports each case as e.g. test_dmrg_dense_benchmark[16-20].
This keeps today's single (16, 20) case but makes it straightforward to
add more (chi, length) points later without touching the test bodies.
test_variational_ad.py additionally parametrizes over the jax/torch
backend split (test_variational_ad_benchmark[jax-16-20] /
[torch-16-20]) instead of having two separately named test functions,
using pytest.param(..., marks=pytest.mark.limit_memory(...)) to keep
each backend's distinct memory budget on its corresponding case.
run_one(16, 20) in cytnx_bench/tebd.py consistently returns -19.000359069981286 on repeated direct calls, both standalone and inside the full pytest-benchmark/pytest-memray suite. The committed REFERENCE_ENERGY of -18.999777255176408 never matched this and made test_tebd_benchmark/test_tebd_memory fail deterministically. Update REFERENCE_ENERGY to the value run_one actually produces.
… scripts
Each of the 12 benchmark scripts in benchmarks/cross_library/{cytnx,quimb,tenpy}_bench/
carried its own argparse CLI, CSVResultWriter row-building, and
timed_block/StepMeasurement instrumentation for run_all.py's now-obsolete
custom sweep/CSV harness. Reduce each run_one() to return a bare energy
value, dropping the main()/CLI entry points and CSV-row construction so
the functions can be imported and driven directly from pytest fixtures.
quimb_bench/variational_ad.py's module docstring referenced the removed
--backend jax/--backend torch CLI flag; updated it to describe the
run_one_jax/run_one_torch functions that replace it.
Several constructors in the Cytnx GPU code paths (cytnx.zeros, cytnx.eye,
cytnx.UniTensor.zeros, cytnx.UniTensor.normal, cytnx.physics.pauli, and the
bond-list cytnx.UniTensor(...) constructor) accept a device kwarg directly,
making the construct-on-CPU-then-.to(device) pattern unnecessary. Pass
device=device at construction time instead across test_dmrg_dense.py,
test_dmrg_symmetric.py, test_tebd.py, and test_variational_manual_grad.py.
Operations like Kron and ExpH have no device parameter of their own but
inherit the device of their already-on-device inputs, so derived tensors
(e.g. TEBD gates) end up on the correct device without further calls.
test_tebd.py additionally tracked its device as a string ("gpu"/"cpu")
rather than a cytnx.Device int constant; normalize it to the
cytnx.Device.cuda/cytnx.Device.cpu convention already used by the other
three files so device is passed consistently to every constructor.
Co-Authored-By: Claude <noreply@anthropic.com>
qcntr tracks the running total-Sz charge assigned to each MPS bond as the U(1) sector is walked one unit at a time toward TARGET_Q, and cq is the per-step increment direction; the abbreviated names obscured what each variable represents. Co-Authored-By: Claude <noreply@anthropic.com>
The module-level _L_UPDATE_NET/_R_UPDATE_NET/_HEFF_NET/_LN_UPDATE_NET/ _RN_UPDATE_NET/_N_EFF_NET topology strings are only ever read by _Networks.__init__, and the free functions _update_L/_update_R/_h_eff/ _update_L_N/_update_R_N/_n_eff all take a _Networks instance as their first argument and operate solely on its cached Network objects. Move the topology strings into _Networks as class attributes and the free functions into _Networks as methods, so the helpers read as the class's own behavior rather than functions that happen to take it as a parameter. Co-Authored-By: Claude <noreply@anthropic.com>
The module implements tebd.TEBDEngine, not TDVP, as its own docstring
already noted ("TEBD is used here because it is the simpler, more
universally-supported entanglement-growth benchmark"). The file and
test names (test_tdvp_benchmark/test_tdvp_memory) disagreed with that,
making this TeNPy script look like the suite's only TDVP benchmark and
implying tenpy_bench had no TEBD counterpart to quimb_bench/test_tebd.py
and cytnx_bench/test_tebd.py, when it actually does.
Renamed the file and both test functions to test_tebd_benchmark/
test_tebd_memory, updated the docstring's run instructions, and fixed
the algorithm-class-2 row in benchmarks/cross_library/README.md to
list test_tebd.py instead of test_tdvp.py.
Co-Authored-By: Claude <noreply@anthropic.com>
Every *_memory test in this suite previously hardcoded a single (bond_dim=16, num_sites=20) point, while the corresponding *_benchmark test in the same file already swept the full bond_dim x num_sites grid via stacked @pytest.mark.parametrize decorators. Apply the same parametrization to each _memory test so memray checks memory at every grid point the timing benchmark covers, not just the smallest one. limit_memory thresholds sized only for the (16, 20) point are too tight for the grid's worst (highest bond_dim, highest num_sites) point, so bump them with headroom based on measured high-watermark allocations: - cytnx_bench/test_variational_manual_grad.py: 20MB -> 40MB - tenpy_bench/test_dmrg_dense.py: 50MB -> 110MB - tenpy_bench/test_dmrg_symmetric.py: 40MB -> 50MB - tenpy_bench/test_variational_manual_grad.py: 40MB -> 80MB - quimb_bench/test_tebd.py: 60MB -> 100MB - quimb_bench/test_dmrg.py (dense): 80MB -> 130MB - quimb_bench/test_variational_ad.py (torch): 100MB -> 400MB cytnx_bench/test_dmrg_symmetric.py, cytnx_bench/test_tebd.py, and the quimb_bench/test_dmrg.py (symmetric) / test_variational_ad.py (jax) limits already had enough headroom and are unchanged. Co-Authored-By: Claude <noreply@anthropic.com>
ARRAY_BACKEND previously only converted psi's blocks via apply_to_arrays (and only for "cupy"), leaving the Heisenberg gate and operator built by _heisenberg_two_site_gate/_heisenberg_two_site_op as plain numpy arrays. Selecting "torch" or "jax" therefore failed at the first gate_split_ call with a tensordot type mismatch between torch/jax and numpy arrays. Add a _convert_backend helper that moves a U1Array's blocks to the selected ARRAY_BACKEND (numpy, cupy, torch, or jax) and apply it uniformly to psi, every cached gate, and the one-time Hamiltonian operator, so all three arrays involved in each contraction share a backend. Benchmarked the corrected torch path against numpy across the full (bond_dim, num_sites) grid: both backends converge to identical energies, but torch runs ~1.5-2x slower than numpy at every grid point, consistent with torch's per-call dispatch overhead dominating a workload made of many small per-bond gate/SVD operations. A jax correctness probe at a single grid point did not finish within several minutes under the same workload, for the same reason but more severely, so jax is left available via _convert_backend but excluded from consideration as a default. Keep ARRAY_BACKEND set to "numpy" and update the module comment to record this rationale. Co-Authored-By: Claude <noreply@anthropic.com>
Every test_*.py file's REFERENCE_ENERGIES dict is a 3-sweep fingerprint (common/model.py's N_SWEEPS=3), chosen for fast, stable timing rather than for ground-state convergence, so there was no way to tell whether any of these fingerprints had drifted away from the true ground energy without exact diagonalization, which is infeasible at the benchmark grid's L=20-50. Add a --generate-references mode that runs TeNPy's TwoSiteDMRGEngine to its own convergence (up to 200 sweeps) across the full (bond_dim, num_sites) grid for the dense and U(1)-symmetric Heisenberg chain, and reports the relative difference between each test file's hardcoded fingerprint and this converged value. TeNPy is the source used here because it is the only DMRG implementation in this suite already validated against exact diagonalization, by the small-L check earlier in this same file: Cytnx's two-site sweep is hand-rolled for this benchmark suite, and quimb's symmetric variant is not a ground-state search at all. Running this mode confirms every hardcoded REFERENCE_ENERGIES fingerprint across cytnx_bench/, tenpy_bench/, and quimb_bench/'s dense/symmetric DMRG test files sits within 1e-7 to 1e-13 relative difference of the converged ground energy, well inside the rel=1e-4 to rel=1e-6 tolerances those tests already assert -- so the 3-sweep fingerprints do not need to change. Co-Authored-By: Claude <noreply@anthropic.com>
_optimize_psi and the L/R environment update steps each called cytnx.Network().FromString(...) on every bond update inside the sweep, re-parsing the same fixed contraction topology on every Lanczos local optimization and every environment update. Build the three Network templates (_h_eff_network, _l_update_network, _r_update_network) once per run_one call and reuse them across all bonds and sweeps via PutUniTensors, matching the caching pattern already used in test_variational_manual_grad.py's _Networks class. Co-Authored-By: Claude <noreply@anthropic.com>
Mirrors the Network-caching fix applied to test_dmrg_dense.py: _optimize_psi and the L/R environment update steps each parsed a new cytnx.Network().FromString(...) on every bond update inside the sweep instead of reusing one Network per fixed contraction topology across all bonds and sweeps. Build _h_eff_network/_l_update_network/ _r_update_network once per run_one call and feed them new UniTensors via PutUniTensors on each bond update. Co-Authored-By: Claude <noreply@anthropic.com>
_build_gates computed a separate matrix exponential (and rebuilt the Pauli/identity operators it depends on) for every one of the L-1 bonds, even though every interior bond carries an identical (0.5, 0.5) on-site-field split and only the two boundary bonds differ; share one interior gate object across all interior bonds instead. Separately, sweep() cloned and relabeled each bond's gate on every one of the TFIM_N_STEPS Trotter steps, even though the bond labels are restored to the same fixed lbls[p] after every truncation; relabel each gate once before the time-step loop instead. Co-Authored-By: Claude <noreply@anthropic.com>
run_one_jax already jits the gradient of the energy function on the CPU device path, but norm_sq -- called once per gradient step to derive the post-step global rescale -- was left untraced, so every step re-incurred JAX's per-call dispatch/tracing overhead on that contraction. Jit it under the same DEVICE == "cpu" condition as grad_fn. Co-Authored-By: Claude <noreply@anthropic.com>
The comment claimed LANCZOS_MAXITER is "shared between the Cytnx and quimb dense-DMRG implementations," but only cytnx_bench/test_dmrg_dense.py and test_dmrg_symmetric.py import and pass it to cytnx.linalg.Lanczos; quimb's DMRG2 and TeNPy's TwoSiteDMRGEngine both run their own local eigensolver to its native default convergence instead, with no equivalent cap applied. Document this as an intentional asymmetry rather than a shared parameter. Co-Authored-By: Claude <noreply@anthropic.com>
Cytnx's Svd_truncate exposes an err parameter that discards singular values below a threshold even when the bond-dimension cap isn't reached (src/linalg/Svd_truncate.cpp: the keep_dim loop stops early once Smin < err). TeNPy's svd_min and quimb's cutoffs already apply this same magnitude-cutoff criterion at 1e-10 in this benchmark suite, so Cytnx's dmrg_dense, dmrg_symmetric, and tebd scripts were truncating on bond dimension alone. Add SVD_CUTOFF = 1e-10 to common/model.py and pass it via err= in every Svd_truncate call in the three affected scripts. The dmrg_dense and dmrg_symmetric reference energies are unaffected at every grid point (their SVD spectra don't dip below 1e-10 before the bond-dimension cap is reached), but tebd's spectrum does, shifting its converged energies enough to require regenerating REFERENCE_ENERGIES for all nine (bond_dim, num_sites) grid points in test_tebd.py. Co-Authored-By: Claude <noreply@anthropic.com>
cytnx_bench/test_tebd.py previously applied each bond gate
exp(-i*dt*h_bond) once, in a single left-to-right sweep -- a first-order
Lie-Trotter step, while tenpy_bench/test_tebd.py and
quimb_bench/test_tebd.py both already used order=2 (an even/odd
Suzuki-Trotter splitting of the bond Hamiltonian).
Replace the single full-step sweep with the palindromic forward/backward
composition exp(-i*dt/2*h_{L-2})...exp(-i*dt/2*h_0) *
exp(-i*dt/2*h_0)...exp(-i*dt/2*h_{L-2}): a left-to-right half-step sweep
over every bond followed by a right-to-left half-step sweep over every
bond, absorbing the post-SVD singular values into the not-yet-visited
neighbor on each pass. This symmetric product is second-order accurate
in dt for any ordered decomposition of H into bond terms, matching the
accuracy order of TeNPy's and quimb's order=2 step without requiring
the even/odd bond grouping those libraries use internally.
Regenerate REFERENCE_ENERGIES for all nine (bond_dim, num_sites) grid
points to match the new, more accurate converged energies.
Co-Authored-By: Claude <noreply@anthropic.com>
tenpy_bench/test_tebd.py used TFIChain as-is, which hardcodes the
coupling on Sigmax and the field on Sigmaz -- the opposite axis
assignment from cytnx_bench/test_tebd.py's and quimb_bench/test_tebd.py's
H = -hx*sum(X_i) - J*sum(Z_i Z_{i+1}). To start from the same physical
state under that mismatched convention, the script had to prepare a
Sigmax eigenstate and justify its equivalence to the other two scripts'
Sigmaz "up"/"0" computational-basis state through a basis-swap argument
in a comment, rather than literally using the same product state.
Add _TFIChainZCoupling, a TFIChain subclass that overrides init_terms to
put the coupling on Sigmaz and the field on Sigmax, matching the other
two libraries' Hamiltonian convention directly. run_one now builds this
model and starts every chain from the literal Sigmaz "up" eigenstate
used by cytnx_bench/test_tebd.py and quimb_bench/test_tebd.py, removing
the need for the basis-swap reasoning. REFERENCE_ENERGIES is unchanged,
since the new literal state is numerically equivalent (to float
precision) to the previous basis-swapped one.
Co-Authored-By: Claude <noreply@anthropic.com>
validate_correctness.py --generate-references already computes a single library-independent ground-truth energy per (chi, L) grid point, by running TeNPy's TwoSiteDMRGEngine to its own 200-sweep convergence, and reports each test file's drift against that same value via _report_reference_drift(). For all 9 grid points across cytnx_bench/test_dmrg_dense.py, tenpy_bench/test_dmrg_dense.py, quimb_bench/test_dmrg.py, cytnx_bench/test_dmrg_symmetric.py, and tenpy_bench/test_dmrg_symmetric.py, that drift is at most 3.95e-7 relative -- within each file's existing pytest.approx tolerance (rel=1e-4 or rel=1e-6) -- so swapping each REFERENCE_ENERGIES dict for the ground-truth value tightens what these benchmarks check without affecting which sweep counts or grid points pass. TEBD and variational-gradient-descent REFERENCE_ENERGIES are left untouched: TEBD measures a real-time-evolved expectation value, not a ground-state energy, so no ground-truth dict applies to it; the variational benchmarks are a deliberately weak, non-canonicalized optimizer whose converged energy is documented to land well away from the true ground state. Co-Authored-By: Claude <noreply@anthropic.com>
generate_reference_energies() and _report_reference_drift() previously only covered the dense and symmetric DMRG benchmarks, where TeNPy's 200-sweep-converged energy serves as a ground truth validated against exact diagonalization. TEBD and the manual-gradient variational search have no analogous converged ground truth -- TEBD is a fixed-recipe real-time-evolved expectation value, and the variational search is a documented non-canonicalized whole-network gradient descent whose result depends on the initial state. Both classes are nonetheless expected to land near each other across libraries when the recipe (initial state, algorithm) is shared, so TeNPy's own run_one(chi, L) from tenpy_bench/test_tebd.py and tenpy_bench/test_variational_manual_grad.py is reused as the comparison anchor instead of a ground truth, mirroring TeNPy's role for DMRG. quimb_bench/test_variational_ad.py is excluded from the variational comparison: per its own docstring it builds its initial MPS with a different RNG/construction and differentiates via autodiff rather than the analytic gradient cytnx_bench and tenpy_bench share, so it has no reason to land near their anchor. cytnx_bench/, tenpy_bench/, and quimb_bench/ each contain identically named files (test_tebd.py, test_dmrg_dense.py, etc.). The previous sys.path-manipulation import scheme relied on Python's bare-name module cache, so importing a same-named file from a second directory returned the first directory's already-cached module instead of a distinct one -- silently aliasing, e.g., cytnx_dmrg_dense and tenpy_dmrg_dense to the same module object. This was not visible before because cytnx's and tenpy's DMRG REFERENCE_ENERGIES are now numerically identical ground-truth values, but it would have broken the new TEBD/variational comparisons outright, since those involve three and two same-named files respectively. Added a _load_module() helper that loads each file via importlib.util.spec_from_file_location under a guaranteed-unique module name, and switched all module loads in _report_reference_drift() to use it. Co-Authored-By: Claude <noreply@anthropic.com>
…d_dim Every cross_library benchmark script and validate_correctness.py used the single-letter L and chi internally for chain length and bond dimension, while the pytest-facing parametrize args and REFERENCE_ENERGIES dict keys already used the verbose num_sites/bond_dim names. This extends the verbose naming down into every internal function (run_one, sweep, _build_mps, _build_gates, _n_grad_steps, grad_step, etc.) so the same two quantities are named consistently throughout each file. Left untouched: TeNPy's own L= model-constructor keyword argument and quimb/symmray's own L= keyword argument (external library parameter names, only the values passed to them were renamed); Cytnx Network DSL leg-label strings such as "L"/"R"/"A"/"B"/"M" in FromString/PutUniTensors calls (topology tokens, unrelated to chain length); and the L parameter of cytnx_bench's _optimize_psi, which denotes the left environment tensor, not chain length. Co-Authored-By: Claude <noreply@anthropic.com>
The previous Cytnx TEBD sweep applied bond gates in a palindromic left-to-right then right-to-left order over every bond p=0..num_sites-2, rather than grouping bonds by parity. TeNPy's TEBDEngine (order=2) and quimb's TEBD both split the bond Hamiltonian into H_even (even p) and H_odd (odd p) groups -- bonds within a group never share a site, so gates in the same group commute and the standard Strang splitting exp(-i*dt/2*H_even) * exp(-i*dt*H_odd) * exp(-i*dt/2*H_even) introduces no extra Trotter error beyond the even/odd grouping itself. Replace the old sweep with this even/odd grouping (apply_group over even_bonds and odd_bonds), matching the decomposition TeNPy's suzuki_trotter_decomposition(order=2) and evolve_step() use internally. Regenerate REFERENCE_ENERGIES from the new algorithm's output. The old values were nearly identical across bond_dim=16/32/64, which in hindsight indicates the old sweep wasn't tracking bond-dimension- dependent truncation correctly; the new values vary with bond_dim as expected, and drift relative to TeNPy's anchor now shrinks as bond_dim increases (confirmed at num_sites=20: ~5.6e-6 at bond_dim=16 down to ~2.0e-9 at bond_dim=64), consistent with genuine SVD-truncation error rather than an algorithmic mismatch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I would expect that the high-level (DMRG) code can be safely implemented in Python without too much overhead. But for the low-level manipulations (UniTensor), we should consider that the numbers here would significantly change once we have non-Abelian symmetries implemented. In this case, the actual data blocks are much smaller, and more time is spend on symmetry-related metadata operations. Then, Python might become a bottleneck, even if this is not the case currently. For C++ vs Python we should also keep parallelization and runs on parallel clusters in mind. |
|
Generally, I am wondering what we can learn from such numbers. For example, MPTK is much faster (as expected, as it is written and optimized heavily for DMRG). Are there any obvious improvements/optimizations that we could (easily) implement? |
cytnx_bench/test_variational_manual_grad.py and tenpy_bench/test_variational_manual_grad.py already asserted at rel=1e-6; quimb_bench/test_variational_ad.py asserted at rel=2e-2 instead. Since each test's run is fully deterministic given its seeded initial state, rel=1e-6 is achievable for quimb's jax/torch backends too, matching the tolerance already used by the other two libraries. Reword each file's docstring to describe the shared rel=1e-6 design: per-library/per-backend self-consistency against that file's own REFERENCE_ENERGIES, not a claim that different libraries or backends should land close to each other -- quimb's own JAX_REFERENCE_ENERGIES and TORCH_REFERENCE_ENERGIES differ from each other by far more than 1e-6 despite sharing the seed and gradient-descent formula. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JAX_REFERENCE_ENERGIES[(16, 50)] was -21.572669982910156, but run_one_jax(16, 50) reproducibly returns -21.572589874267578 in this environment -- a relative difference of ~3.7e-6, just over the rel=1e-6 tolerance these tests assert at. Every other grid point in JAX_REFERENCE_ENERGIES matches its freshly computed value within 2.9e-7, so this one entry was stale relative to the JAX/XLA build producing it, rather than indicating run-to-run nondeterminism. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TFIM_HX_INITIAL was defined in common/model.py but never referenced by any of the three TEBD implementations (cytnx_bench/test_tebd.py, tenpy_bench/test_tebd.py, quimb_bench/test_tebd.py) -- all three quench directly from the all-down product state |0...0> under the post-quench Hamiltonian rather than preparing a separate initial-field ground state. Update the module docstring and README to describe the actual quench protocol instead of the unused parameter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TeNPy's TwoSiteDMRGEngine ran with "mixer": True (density-matrix
subspace expansion) while Cytnx and quimb's DMRG2 do a plain two-site
sweep with no such step, and TeNPy's local Lanczos-equivalent solver
defaulted to N_max=20 against Cytnx's Lanczos(Maxiter=4) -- neither
difference changes the converged energy meaningfully, but both add
uncompensated work to TeNPy's timed sweep. Set "mixer": False and pass
"lanczos_params": {"N_max": LANCZOS_MAXITER} in both
tenpy_bench/test_dmrg_dense.py and test_dmrg_symmetric.py so TeNPy's
per-bond eigensolver budget matches Cytnx's.
Tighten the dense/symmetric DMRG assertions from rel=1e-4 to rel=1e-6
across cytnx_bench/test_dmrg_symmetric.py, both tenpy_bench dmrg files,
and quimb_bench/test_dmrg.py's dense case (quimb_bench's symmetric
case stays at rel=2e-2 -- it runs imaginary-time evolution, not DMRG,
per its own docstring). At rel=1e-6, cytnx_bench/test_dmrg_symmetric.py's
hardest grid point (bond_dim=16, num_sites=50) no longer converges
tightly enough at LANCZOS_MAXITER=4, while TeNPy's own solver already
matches the shared reference to ~1e-8 at that same budget -- raise
LANCZOS_MAXITER from 4 to 6 in common/model.py (shared by both
libraries) to close the gap to ~7e-7 instead of loosening the
tolerance or letting Cytnx's budget diverge from TeNPy's. With the
larger budget, cytnx_bench/test_dmrg_symmetric.py's Lanczos CvgCrit can
stay at 1e-8 (matching test_dmrg_dense.py) rather than the tighter
1e-12, since it no longer governs the binding constraint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dmrg.solve(tol=1e-6, ...) let quimb exit before max_sweeps whenever its own energy-convergence check was satisfied, while Cytnx and TeNPy's DMRG engines always run exactly N_SWEEPS sweeps regardless of convergence. Set tol=0.0 to disable the early-stopping check so all three libraries spend the same fixed sweep budget for a fair timing comparison. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
REFERENCE_ENERGIES[(16, 30)] was -28.99996982305337, but run_one(16, 30) reproducibly returns -29.000170747421212 in this environment -- a relative difference of ~6.9e-6, over the rel=1e-6 tolerance this test asserts at. Every other grid point in this file's REFERENCE_ENERGIES matches its freshly computed value within ~1e-6, so this one entry was stale relative to whatever BLAS/LAPACK build produced it, rather than indicating run-to-run nondeterminism: repeated runs in this environment reproduce the new value bit-for-bit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ger() grad_step recomputed A_conj = [a.Dagger().permute_(a.labels()) for a in A] from scratch at the top of every call, even though the incoming A is exactly the previous call's new_A scaled by a real positive factor (den_new ** (-1/(2*num_sites))). Since Dagger(scale * a) == scale * Dagger(a) exactly for a real positive scale, new_A_conj can be computed once per tensor (during the post-update LNn norm scan, where it's already needed) and carried forward as the next call's A_conj argument, scaled by the same factor as new_A, instead of re-deriving it via a fresh Dagger()/permute_() pass over every tensor on every gradient step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_one_symmetric summed psi.local_expectation_exact(h_op, where=(i, i+1)) over num_sites-1 separate calls, each rebuilding the left/right overlap environments from scratch. psi.compute_local_expectation(terms, method="envs") computes the same sum of two-site expectation values in a single pass over shared environments instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Warning
This PR has not been reviewed by a human. Correctness of the benchmark code and results is not assured.
Summary
Adds a benchmark suite under
benchmarks/cross_library/comparing TeNPy, quimb, and Cytnx on four algorithm classes against the same physical models and the same(chi, L)parameter grid:Each library implementation lives in its own directory (
tenpy_bench/,quimb_bench/,cytnx_bench/) and shares model/grid definitions (common/model.py) and timing/memory instrumentation (common/metrics.py).run_all.pyorchestrates every script as a subprocess and collects CSVs underresults/;plot_results.pyturns those CSVs into time/memory-vs-(chi, L) plots. Seebenchmarks/cross_library/README.mdfor full details, including the GPU code paths (written but not exercised — no GPU in the development environment).Test plan
Generated by Claude Code