README.md: 7 additions & 0 deletions
@@ -229,6 +229,13 @@ And for picking what to compress:
 |------|-------------|
 | Sensitivity analysis | `cx.analyze_sensitivity()` probes each Conv2d/Linear layer with a prune or noise perturbation, re-runs your `eval_fn`, and ranks layers by metric drop. Can also suggest `exclude_layers` above a chosen threshold. |
+
+## Examples
+
+Check out the example notebooks in [`examples/`](./examples/):
+
+- [ResNet18 edge deployment](./examples/resnet18_edge_deploy.ipynb): profile, fuse, prune, quantize, benchmark, and export a ResNet18 to ONNX.
+- [BERT-tiny quantization](./examples/bert_tiny_quantize.ipynb): low-rank decomposition + INT4 weight quantization on a small transformer, with latency benchmarks.
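The sensitivity-analysis row in the hunk above is the hook for choosing `exclude_layers`. A minimal usage sketch follows; only `cx.analyze_sensitivity()`, `eval_fn`, and `exclude_layers` come from the README, while the argument order, the printed report, and the `evaluate`/`val_loader`/`model` helpers are illustrative assumptions.

```python
import torch

# Hypothetical import; the README only ever shows the `cx` alias.
import cx

def eval_fn(m: torch.nn.Module) -> float:
    # Assumed helper: run your validation loop and return the metric
    # (e.g. top-1 accuracy) that the sensitivity ranking is based on.
    return evaluate(m, val_loader)

# Probe each Conv2d/Linear layer with a prune-or-noise perturbation,
# re-run eval_fn after each probe, and rank layers by metric drop.
report = cx.analyze_sensitivity(model, eval_fn)
print(report)

# Layers whose metric drop exceeds a chosen threshold can then be
# passed as `exclude_layers` to the compression passes.
```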
"print(f\"Model: {sum(p.numel() for p in model.parameters()):,} parameters\")"
63
+
]
64
+
},
65
+
{
66
+
"cell_type": "markdown",
67
+
"metadata": {},
68
+
"source": [
69
+
"## 2. Profile the model\n",
70
+
"\n",
71
+
"We pass token IDs as input, but the profiler needs a float tensor. We'll profile using the embedding output shape and note that the embedding table is counted in params."
72
+
]
73
+
},
74
+
{
75
+
"cell_type": "code",
76
+
"execution_count": null,
77
+
"metadata": {},
78
+
"outputs": [],
79
+
"source": [
80
+
"# For profiling, we use a float input that skips the embedding.\n",
81
+
"# The full model takes integer token IDs, so we profile the encoder+classifier separately.\n",
82
+
"class EncoderClassifier(nn.Module):\n",
83
+
"\"\"\"Wraps encoder + classifier with float input for profiling.\"\"\"\n",
"The feedforward layers inside the transformer encoder are 128x512 and 512x128. SVD can factorize these into pairs of smaller layers. We keep 50% of the singular values (by energy)."
"If weight-only quantization isn't giving you enough speedup, dynamic PTQ quantizes both weights and activations to INT8 at runtime. It's a simpler path that works well for inference on CPU."