
Commit fb6db87

Merge pull request #4 from cachevector/example-notebooks

Add example notebooks

2 parents: f153d66 + 91ac882

4 files changed: 451 additions & 0 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -13,6 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   p50/p90/p99, min/max, and throughput. `cx.compare_benchmarks()` returns a
   before/after comparison with speedup and latency/throughput deltas. Quantized
   models are automatically run on CPU. New `comprexx bench` CLI command.
+- Example notebooks: ResNet18 edge deployment (prune + quantize + ONNX export)
+  and BERT-tiny quantization (low-rank decomposition + INT4 weight quant).
 - GitHub Actions CI workflow running `pytest` on Python 3.10, 3.11, and 3.12,
   plus a `ruff check` lint job.
 - `CHANGELOG.md` with history for v0.1.0 and v0.2.0.
```
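For context on the stats the changelog entry lists, the headline numbers (p50/p90/p99, min/max, throughput) reduce to a few lines of stdlib Python. This is an illustrative sketch only — `summarize_latencies` is a hypothetical helper, not comprexx's actual implementation:

```python
import statistics


def summarize_latencies(latencies_s):
    """Derive the stats the changelog describes from raw per-iteration
    timings (in seconds): p50/p90/p99, min/max, and throughput."""
    n = len(latencies_s)
    # statistics.quantiles with n=100 yields 99 cut points; index them
    # for the 50th, 90th, and 99th percentiles.
    qs = statistics.quantiles(latencies_s, n=100)
    return {
        "p50_ms": qs[49] * 1e3,
        "p90_ms": qs[89] * 1e3,
        "p99_ms": qs[98] * 1e3,
        "min_ms": min(latencies_s) * 1e3,
        "max_ms": max(latencies_s) * 1e3,
        "throughput_per_s": n / sum(latencies_s),
    }


# Synthetic example: 200 iterations at exactly 1 ms each.
stats = summarize_latencies([0.001] * 200)
```

With identical 1 ms timings every percentile is 1 ms and throughput is 1000 iterations/s, which makes the mapping between raw timings and the reported fields easy to sanity-check.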

README.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -229,6 +229,13 @@ And for picking what to compress:
 |------|-------------|
 | Sensitivity analysis | `cx.analyze_sensitivity()` probes each Conv2d/Linear layer with a prune or noise perturbation, re-runs your `eval_fn`, and ranks layers by metric drop. Can also suggest `exclude_layers` above a chosen threshold. |
 
+## Examples
+
+Check out the example notebooks in [`examples/`](./examples/):
+
+- [ResNet18 edge deployment](./examples/resnet18_edge_deploy.ipynb): profile, fuse, prune, quantize, benchmark, and export a ResNet18 to ONNX.
+- [BERT-tiny quantization](./examples/bert_tiny_quantize.ipynb): low-rank decomposition + INT4 weight quantization on a small transformer, with latency benchmarks.
+
 ## License
 
 Apache 2.0
```
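The ResNet18 notebook's diff is not included in this view, but the "prune" step its README bullet advertises maps onto PyTorch's built-in pruning utilities. A minimal sketch on a stand-in Conv2d — illustrative only, not the notebook's actual code:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in conv layer; the notebook itself targets ResNet18's convs.
conv = nn.Conv2d(16, 32, kernel_size=3)

# L1 unstructured pruning: zero the 50% of weights with smallest magnitude.
prune.l1_unstructured(conv, name="weight", amount=0.5)
prune.remove(conv, "weight")  # bake the mask into the weight tensor

sparsity = (conv.weight == 0).float().mean().item()
```

After `prune.remove`, the layer carries an ordinary dense weight tensor with half its entries zeroed, which is the form a downstream quantize/export step would consume.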

examples/bert_tiny_quantize.ipynb

Lines changed: 233 additions & 0 deletions
```diff
@@ -0,0 +1,233 @@
```

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# BERT-tiny Quantization with Comprexx\n",
        "\n",
        "This notebook shows how to compress a small transformer model:\n",
        "\n",
        "1. Profile the model\n",
        "2. Apply low-rank decomposition to shrink Linear layers\n",
        "3. Apply weight-only INT4 quantization\n",
        "4. Benchmark before/after\n",
        "\n",
        "We use a minimal 2-layer transformer so this runs in seconds on CPU.\n",
        "\n",
        "Install: `pip install comprexx`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import torch.nn as nn\n",
        "\n",
        "import comprexx as cx"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Define a small transformer\n",
        "\n",
        "A 2-layer encoder with d_model=128, 4 heads, and feedforward dim 512: small enough for a notebook, large enough to show compression working."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "class TinyBERT(nn.Module):\n",
        "    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2, num_classes=4):\n",
        "        super().__init__()\n",
        "        self.embedding = nn.Embedding(vocab_size, d_model)\n",
        "        encoder_layer = nn.TransformerEncoderLayer(\n",
        "            d_model=d_model, nhead=nhead, dim_feedforward=512, batch_first=True,\n",
        "        )\n",
        "        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)\n",
        "        self.classifier = nn.Linear(d_model, num_classes)\n",
        "\n",
        "    def forward(self, x):\n",
        "        x = self.embedding(x)\n",
        "        x = self.encoder(x)\n",
        "        # Mean pooling over sequence dim\n",
        "        x = x.mean(dim=1)\n",
        "        return self.classifier(x)\n",
        "\n",
        "model = TinyBERT()\n",
        "model.eval()\n",
        "print(f\"Model: {sum(p.numel() for p in model.parameters()):,} parameters\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2. Profile the model\n",
        "\n",
        "The full model takes integer token IDs, but the profiler needs a float tensor, so we profile the encoder + classifier on embedding-shaped inputs. Note that the embedding table's parameters are therefore not included in this profile."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# The full model takes integer token IDs, so we profile the\n",
        "# encoder + classifier separately on float inputs.\n",
        "class EncoderClassifier(nn.Module):\n",
        "    \"\"\"Wraps encoder + classifier with float input for profiling.\"\"\"\n",
        "    def __init__(self, encoder, classifier):\n",
        "        super().__init__()\n",
        "        self.encoder = encoder\n",
        "        self.classifier = classifier\n",
        "\n",
        "    def forward(self, x):\n",
        "        x = self.encoder(x)\n",
        "        return self.classifier(x.mean(dim=1))\n",
        "\n",
        "profiling_model = EncoderClassifier(model.encoder, model.classifier)\n",
        "profile = cx.analyze(profiling_model, input_shape=(1, 32, 128))  # (batch, seq_len, d_model)\n",
        "print(profile.summary())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 3. Low-rank decomposition\n",
        "\n",
        "The feedforward layers inside the transformer encoder are 128x512 and 512x128. SVD can factorize these into pairs of smaller layers. We keep enough singular values to retain 90% of the spectral energy (`energy_threshold=0.9`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from comprexx.stages.base import StageContext\n",
        "\n",
        "stage_lr = cx.stages.LowRankDecomposition(mode=\"energy\", energy_threshold=0.9)\n",
        "ctx = StageContext(input_shape=(1, 32, 128), device=\"cpu\")\n",
        "\n",
        "model_lr, report_lr = stage_lr.apply(profiling_model, ctx)\n",
        "print(report_lr.summary())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4. Weight-only INT4 quantization\n",
        "\n",
        "After SVD, we quantize the remaining Linear weights to INT4 (group size 64, symmetric). Activations stay in float32."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "stage_wq = cx.stages.WeightOnlyQuant(bits=4, group_size=64, symmetric=True)\n",
        "\n",
        "model_quant, report_wq = stage_wq.apply(model_lr, ctx)\n",
        "print(report_wq.summary())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5. Or use a Pipeline for the same thing"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline = cx.Pipeline([\n",
        "    cx.stages.LowRankDecomposition(mode=\"energy\", energy_threshold=0.9),\n",
        "    cx.stages.WeightOnlyQuant(bits=4, group_size=64),\n",
        "])\n",
        "\n",
        "result = pipeline.run(profiling_model, input_shape=(1, 32, 128))\n",
        "print(result.report.summary())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 6. Benchmark latency"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cmp = cx.compare_benchmarks(\n",
        "    profiling_model, result.model,\n",
        "    input_shape=(1, 32, 128),\n",
        "    warmup=10,\n",
        "    iters=50,\n",
        ")\n",
        "print(cmp.summary())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 7. Dynamic PTQ as an alternative\n",
        "\n",
        "If weight-only quantization isn't giving you enough speedup, dynamic PTQ quantizes both weights and activations to INT8 at runtime. It's a simpler path that works well for inference on CPU."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline_ptq = cx.Pipeline([\n",
        "    cx.stages.LowRankDecomposition(mode=\"energy\", energy_threshold=0.9),\n",
        "    cx.stages.PTQDynamic(),\n",
        "])\n",
        "\n",
        "result_ptq = pipeline_ptq.run(profiling_model, input_shape=(1, 32, 128))\n",
        "print(result_ptq.report.summary())\n",
        "\n",
        "cmp_ptq = cx.compare_benchmarks(\n",
        "    profiling_model, result_ptq.model,\n",
        "    input_shape=(1, 32, 128),\n",
        "    warmup=10,\n",
        "    iters=50,\n",
        ")\n",
        "print(\"\\n\" + cmp_ptq.summary())"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.10.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}
```
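The low-rank step in section 3 can be reproduced without comprexx: the "energy" criterion is a truncated SVD of each Linear weight, keeping the smallest rank whose squared singular values cover the threshold. A self-contained sketch in plain PyTorch — `decompose_linear` is illustrative, not the library's implementation:

```python
import torch
import torch.nn as nn


def decompose_linear(linear: nn.Linear, energy: float = 0.9) -> nn.Sequential:
    """Replace a Linear layer with two smaller ones via truncated SVD,
    keeping the smallest rank that retains `energy` of the squared
    singular-value mass (the notebook's 'energy' criterion)."""
    W = linear.weight.data  # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    rank = int((cum < energy).sum().item()) + 1
    # W ≈ (U[:, :r] * S[:r]) @ Vh[:r, :], so x @ W.T factors into two matmuls.
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = Vh[:rank, :].clone()
    second.weight.data = (U[:, :rank] * S[:rank]).clone()
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)


# Same shape as one of the encoder's feedforward projections (512 -> 128).
lin = nn.Linear(512, 128)
low_rank = decompose_linear(lin, energy=0.9)
x = torch.randn(4, 512)
err = (lin(x) - low_rank(x)).norm() / lin(x).norm()
```

Retaining 90% of the squared singular-value mass bounds the Frobenius-norm approximation error, so the relative output error on random inputs stays modest; how much rank (and hence parameter count) this saves depends entirely on the weight's spectrum.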

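Likewise, the weight-only setting in section 4 (`bits=4, group_size=64, symmetric=True`) can be sketched in a few lines of plain PyTorch. This is fake quantization for illustration (quantize then immediately dequantize, so the error is easy to inspect), not the library's packed-INT4 kernel:

```python
import torch


def int4_symmetric_quantize(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Weight-only symmetric INT4 quantization with per-group scales.
    Assumes in_features is divisible by group_size. Returns the
    dequantized weights."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric: one scale per group maps the max |w| onto the INT4 limit 7.
    scales = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7)  # INT4 range [-8, 7]
    return (q * scales).reshape(out_features, in_features)


w = torch.randn(128, 512)
w_dq = int4_symmetric_quantize(w)
rel_err = (w - w_dq).norm() / w.norm()
```

Grouping trades metadata for accuracy: smaller groups store more scales but let each scale track its group's local magnitude, which is why group size is exposed as a knob in the notebook's `WeightOnlyQuant` stage.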