|
1 | | -<div align="center"> |
2 | | -<picture> |
3 | | - <img alt="TrainCheck logo" width="55%" src="assets/images/traincheck_logo.png"> |
4 | | -</picture> |
5 | | -</div> |
6 | | - |
7 | | -# TrainCheck: Invariant Checking & Observability for AI Training |
8 | | - |
9 | | -[](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml) |
10 | | -[](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml) |
11 | | -[](https://discord.gg/ZvYewjsQ9D) |
12 | | -[](https://deepwiki.com/OrderLab/TrainCheck) |
13 | | - |
14 | | -**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail. |
15 | | - |
| 1 | +--- |
| 2 | +hide: |
| 3 | + - navigation |
| 4 | + - toc |
16 | 5 | --- |
17 | 6 |
|
18 | | -### Why TrainCheck? |
19 | | - |
20 | | -✅ **Continuous Invariant Checking** |
21 | | - |
22 | | -TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours. |
| 7 | +<div class="hero" markdown="1"> |
| 8 | + <img alt="TrainCheck logo" width="180" src="assets/images/traincheck_logo.png"> |
| 9 | + <h1>TrainCheck</h1> |
| 10 | + <p><strong>Invariant Checking & Observability for AI Training</strong></p> |
| 11 | + <p>Stop flying blind. Automatically validate training dynamics, catch silent errors, and debug with confidence.</p> |
| 12 | + |
| 13 | + [Get Started](installation-guide.md){ .md-button .md-button--primary } |
| 14 | + [5-Min Tutorial](5-min-tutorial.md){ .md-button } |
| 15 | + [View on GitHub](https://github.com/OrderLab/traincheck){ .md-button } |
| 16 | +</div> |
23 | 17 |
|
24 | | -🚀 **Holistic Observability** |
| 18 | +<div class="feature-grid"> |
25 | 19 |
|
26 | | -Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss. |
| 20 | +<div class="feature-item"> |
| 21 | +<h3>✅ Continuous Invariant Checking</h3> |
| 22 | +<p>TrainCheck validates the "physics" of your training process in real time. It checks that your model adheres to learned invariants over quantities such as gradient norms, tensor shapes, and update magnitudes, catching silent corruption before it wastes GPU hours.</p> |
| 23 | +</div> |
27 | 24 |
|
28 | | -🧠 **Zero-Config Validation** |
| 25 | +<div class="feature-item"> |
| 26 | +<h3>🚀 Holistic Observability</h3> |
| 27 | +<p>Traditional tools only show you <em>if</em> your model crashed. TrainCheck shows you <em>why</em> it's degrading, analyzing internal state dynamics that loss curves miss.</p> |
| 28 | +</div> |
29 | 29 |
|
30 | | -No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly. |
| 30 | +<div class="feature-item"> |
| 31 | +<h3>🧠 Zero-Config Validation</h3> |
| 32 | +<p>No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.</p> |
| 33 | +</div> |
31 | 34 |
|
32 | | -⚡ **Universal Compatibility** |
| 35 | +<div class="feature-item"> |
| 36 | +<h3>⚡ Universal Compatibility</h3> |
| 37 | +<p>Drop-in support for PyTorch, Hugging Face, and industry-scale workloads using DeepSpeed/Megatron and more.</p> |
| 38 | +</div> |
33 | 39 |
|
34 | | -Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more. |
| 40 | +</div> |
35 | 41 |
|
36 | 42 | --- |
37 | 43 |
|
38 | 44 | ### How It Works |
39 | 45 |
|
40 | | -1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed. |
| 46 | +1. **Instrument**: We wrap your training loop with lightweight probes. No code changes needed. |
41 | 47 | 2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training). |
42 | 48 | 3. **Check**: We monitor new runs in real time, verifying every step against learned invariants to catch silent logic bugs and hardware faults. |
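
The Learn and Check steps above can be illustrated with a minimal sketch. This is not TrainCheck's API: `RangeInvariant`, `learn_range_invariant`, `check_run`, the `slack` parameter, and the sample gradient norms are all hypothetical names and values chosen for illustration, and the Instrument step is replaced by plain lists of per-step values. It assumes the simplest possible invariant, a value range inferred from healthy runs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RangeInvariant:
    """Hypothetical invariant: a named metric must stay within [low, high]."""
    metric: str
    low: float
    high: float

    def holds(self, value: float) -> bool:
        # NaN fails both comparisons, so it is reported as a violation.
        return self.low <= value <= self.high


def learn_range_invariant(
    metric: str, healthy_values: Iterable[float], slack: float = 0.1
) -> RangeInvariant:
    """Learn: infer a range from healthy observations, widened by a slack margin."""
    values = list(healthy_values)
    if not values:
        raise ValueError("need at least one healthy observation")
    lo, hi = min(values), max(values)
    margin = slack * (hi - lo)
    return RangeInvariant(metric, lo - margin, hi + margin)


def check_run(invariant: RangeInvariant, values: Iterable[float]) -> List[int]:
    """Check: return the step indices at which the invariant is violated."""
    return [step for step, v in enumerate(values) if not invariant.holds(v)]


# Made-up gradient norms standing in for values recorded from a healthy run.
healthy_grad_norms = [0.9, 1.1, 1.0, 1.2, 0.8]
inv = learn_range_invariant("grad_norm", healthy_grad_norms)

# A new run with a vanished gradient at step 2 and a spike at step 4.
new_run = [1.0, 1.1, 0.0, 1.05, 7.5]
violations = check_run(inv, new_run)
print(violations)  # [2, 4]
```

A real checker would infer many kinds of invariants across many signals; the range check here only shows the shape of the learn-then-verify loop.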
43 | 49 |
|
|