
Commit e0babea

[Docs] Clean up architecture docs: remove duplicates, fix stale content (#19399)
- Remove duplicate `tvm/s_tir/meta_schedule` and `tvm/s_tir/dlight` sections from `arch/index.rst` (already covered in the `tvm/s_tir` section with cross-reference to TensorIR Deep Dive)
- Remove duplicate `device_target_interactions` toctree entry (was listed under both `tvm/runtime` and `tvm/target`; keep only under `tvm/target`)
- Remove duplicate CUDA pipeline listing in `arch/fusion.rst` "How Backends Use Fusion" section (already shown in Overview); add cross-reference to BYOC doc
- Remove duplicated intro sentences in `arch/relax_vm.rst` that were identical to `arch/index.rst`
- Fix `R.call_dps` → `R.call_dps_packed` (the former does not exist)
- Replace outdated GraphExecutor example (`set_input`/`run`/`get_output`) with Relax VM example (GraphExecutor has been removed from the codebase)
- Replace broken external `mlc.ai` image link (returns 404) with local image in `deep_dive/relax/learning.rst`
- Fix stale `use pass instrument` link in `arch/pass_infra.rst` that pointed to an unrelated page
1 parent b465646 commit e0babea

6 files changed

Lines changed: 18 additions & 43 deletions

docs/_static/img/e2e_fashionmnist_mlp_model.png

104 KB (binary image added; referenced from docs/deep_dive/relax/learning.rst below)

docs/arch/fusion.rst

Lines changed: 2 additions & 5 deletions
@@ -345,10 +345,7 @@ How Backends Use Fusion
 -----------------------

 The default backend pipelines (CUDA, ROCm, CPU, etc.) all include ``FuseOps`` + ``FuseTIR``
-in their ``legalize_passes`` phase for automatic fusion. For example, the CUDA pipeline
-(``python/tvm/relax/backend/cuda/pipeline.py``) runs::
-
-    LegalizeOps → AnnotateTIROpPattern → FoldConstant → FuseOps → FuseTIR → DLight
+in their ``legalize_passes`` phase for automatic fusion, as shown in the `Overview`_ above.

 For external library dispatch (cuBLAS, CUTLASS, cuDNN, DNNL), ``FuseOpsByPattern`` is used
 separately. These are **not** included in the default pipeline — users add them explicitly
@@ -358,7 +355,7 @@ when building a custom compilation flow. The typical sequence is:
    offloaded to external libraries. For example, CUTLASS patterns match
    matmul+bias+activation combinations (``python/tvm/relax/backend/cuda/cutlass.py``).
    Functions marked by patterns are annotated with ``Composite`` and optionally ``Codegen``
-   attributes.
+   attributes. See :ref:`external-library-dispatch` for the full BYOC pipeline.

 2. **Automatic fusion** (``FuseOps`` + ``FuseTIR``): remaining operators that were not
    matched by backend patterns are fused automatically based on their pattern kinds.
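
Note: as a point of reference for the fusion changes above, the automatic-fusion sequence that the default pipelines include can be written out with the public pass API roughly as follows. This is an illustrative sketch only (not part of the commit); the real pipelines live under ``python/tvm/relax/backend/`` and may differ in detail.

    import tvm
    from tvm import relax

    # The FuseOps + FuseTIR portion of a default backend pipeline.
    fusion_passes = tvm.transform.Sequential(
        [
            relax.transform.LegalizeOps(),           # lower relax ops into TIR PrimFuncs
            relax.transform.AnnotateTIROpPattern(),  # tag each PrimFunc with its pattern kind
            relax.transform.FoldConstant(),
            relax.transform.FuseOps(),               # group compatible calls at the graph level
            relax.transform.FuseTIR(),               # merge the grouped PrimFuncs into single kernels
        ]
    )
    # mod = fusion_passes(mod)  # apply to any relax IRModule; FuseOpsByPattern for
    # library offload would run before FuseOps in a custom BYOC flow.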

docs/arch/index.rst

Lines changed: 11 additions & 30 deletions
@@ -68,7 +68,7 @@ contains a collection of functions. Currently, we support two primary variants o
 threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model.

 During the compilation and transformation, all relax operators are lowered to ``tir::PrimFunc`` or ``TVM PackedFunc``, which can be executed directly
-on the target device, while the calls to relax operators are lowered to calls to low-level functions (e.g. ``R.call_tir`` or ``R.call_dps``).
+on the target device, while the calls to relax operators are lowered to calls to low-level functions (e.g. ``R.call_tir`` or ``R.call_dps_packed``).

 Transformations
 ~~~~~~~~~~~~~~~
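
Note: to illustrate the ``R.call_tir`` / ``R.call_dps_packed`` lowering mentioned in the hunk above, here is a minimal TVMScript sketch (illustrative only, not part of the commit) of a Relax function calling a TIR PrimFunc in destination-passing style; the shapes and the ``add_one`` kernel are made up for the example.

    import tvm
    from tvm.script import ir as I, relax as R, tir as T

    @I.ir_module
    class Module:
        @T.prim_func
        def add_one(A: T.Buffer((8,), "float32"), B: T.Buffer((8,), "float32")):
            for i in range(8):
                with T.block("add"):
                    vi = T.axis.spatial(8, i)
                    B[vi] = A[vi] + T.float32(1)

        @R.function
        def main(x: R.Tensor((8,), "float32")) -> R.Tensor((8,), "float32"):
            cls = Module
            # Destination-passing style: add_one writes into an output tensor
            # that the VM allocates according to out_sinfo.
            y = R.call_tir(cls.add_one, (x,), out_sinfo=R.Tensor((8,), "float32"))
            return y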
@@ -160,22 +160,19 @@ following types: POD types(int, float), string, runtime.PackedFunc, runtime.Modu

 :py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code to compute the launching parameters(e.g. size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels.

-The above example only deals with a simple `addone` function. The code snippet below gives an example of an end-to-end model execution using the same interface:
+The above example only deals with a simple `addone` function. The code snippet below gives an example of an end-to-end model execution using the Relax Virtual Machine, which is built on the same runtime.Module and runtime.PackedFunc interface:

 .. code-block:: python

    import tvm
-   # Example runtime execution program in python, with types annotated
-   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
-   # Create a stateful graph execution module for resnet18 on cuda(0)
-   gmod: tvm.runtime.Module = factory["resnet18"](tvm.cuda(0))
+   from tvm import relax
+   # Load the compiled artifact
+   mod: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a VM instance on cuda(0)
+   vm = relax.VirtualMachine(mod, tvm.cuda(0))
    data: tvm.runtime.Tensor = get_input_data()
-   # set input
-   gmod["set_input"](0, data)
-   # execute the model
-   gmod["run"]()
-   # get the output
-   result = gmod["get_output"](0).numpy()
+   # Run the model — vm["main"] returns a PackedFunc
+   result = vm["main"](data).numpy()

 The main take away is that runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator level programs (such as addone), as well as the end-to-end models.

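Note: a hedged sketch of how an artifact like ``resnet18.so`` is typically produced before being loaded as in the new example above; the exact build entry point may differ across TVM versions, and ``mod`` stands for an already-optimized relax IRModule.

    import tvm
    from tvm import relax

    ex = relax.build(mod, target="cuda")  # compile the IRModule into a VM executable
    ex.export_library("resnet18.so")      # export it as a loadable runtime.Module
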
@@ -236,10 +233,9 @@ for learning-based optimizations.
    :maxdepth: 1

    introduction_to_module_serialization
-   device_target_interactions

 Relax Virtual Machine
-^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~

 Relax defines *what* to compute — it is a graph-level IR that describes the operators and dataflow
 of a model. The Relax Virtual Machine (VM) handles *how* to run it — it is the runtime component
@@ -257,7 +253,7 @@ pipeline, instruction set details, execution model, and Python interface.
    relax_vm

 Disco: Distributed Runtime
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~

 Disco is TVM's distributed runtime for executing models across multiple devices. When a model is
 too large to fit on a single GPU, the ``relax.distributed`` module annotates how tensors should be
@@ -416,18 +412,3 @@ and then integrate it into the IRModule.
 While possible to construct operators directly via TensorIR or tensor expressions (TE) for each use case, it is tedious to do so.
 `topi` (Tensor operator inventory) provides a set of pre-defined operators defined by numpy and found in common deep learning workloads.

-tvm/s_tir/meta_schedule
------------------------
-
-MetaSchedule is a system for automated search-based program optimization,
-and can be used to optimize TensorIR schedules. Note that MetaSchedule only works with static-shape workloads.
-
-tvm/s_tir/dlight
-----------------
-
-DLight is a set of pre-defined, easy-to-use, and performant s_tir schedules. DLight aims:
-
-- Fully support **dynamic shape workloads**.
-- **Light weight**. DLight schedules provides tuning-free schedule with reasonable performance.
-- **Robust**. DLight schedules are designed to be robust and general-purpose for a single rule. And if the rule is not applicable,
-  DLight not raise any error and switch to the next rule automatically.

docs/arch/pass_infra.rst

Lines changed: 1 addition & 2 deletions
@@ -617,7 +617,7 @@ Note that it is recommended to use the ``pass_instrument`` decorator to implemen
 ``PassInstrument`` instances can be registered through ``instruments`` argument in
 :py:class:`tvm.transform.PassContext`.

-`use pass instrument`_ tutorial provides examples for how to implement ``PassInstrument`` with Python APIs.
+See ``python/tvm/ir/instrument.py`` for examples of how to implement ``PassInstrument`` with Python APIs.

 .. _pass_instrument_overriden:

@@ -668,4 +668,3 @@ new ``PassInstrument`` are called.

 .. _use pass infra: https://github.com/apache/tvm/blob/main/docs/how_to/tutorials/customize_opt.py

-.. _use pass instrument: https://github.com/apache/tvm/blob/main/docs/how_to/dev/index.rst
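
Note: a minimal sketch of the decorator-based ``PassInstrument`` usage that the surrounding text describes, registered through ``PassContext(instruments=...)``; the empty IRModule and the pass applied here are only for demonstration.

    import tvm
    from tvm import relax

    @tvm.instrument.pass_instrument
    class PrintPassNames:
        """Report each pass as the PassContext runs it."""

        def run_before_pass(self, mod, info):
            print("running:", info.name)

        def run_after_pass(self, mod, info):
            print("finished:", info.name)

    # Every pass applied inside the context triggers the callbacks above.
    mod = tvm.IRModule()  # an empty module is enough to demonstrate instrumentation
    with tvm.transform.PassContext(instruments=[PrintPassNames()]):
        mod = relax.transform.LegalizeOps()(mod)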

docs/arch/relax_vm.rst

Lines changed: 3 additions & 5 deletions
@@ -20,11 +20,9 @@
 Relax Virtual Machine
 =====================

-Relax defines *what* to compute — it is a graph-level IR that describes the operators and dataflow
-of a model. The Relax Virtual Machine (VM) handles *how* to run it — it is the runtime component
-that executes the compiled result. This document explains the VM architecture in detail, covering
-the compilation pipeline from Relax IR to bytecode, the instruction set, the execution model, and
-the Python-level user interface.
+This document explains the Relax VM architecture in detail, covering the compilation pipeline
+from Relax IR to bytecode, the instruction set, the execution model, and the Python-level user
+interface.

 Overview
 --------

docs/deep_dive/relax/learning.rst

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ In this chapter, we will use the following model as an example. This is
 a two-layer neural network that consists of two linear operations with
 relu activation.

-.. image:: https://mlc.ai/_images/e2e_fashionmnist_mlp_model.png
+.. image:: /_static/img/e2e_fashionmnist_mlp_model.png
    :width: 85%
    :align: center
