Add RF-DETR implementation by markaren · Pull Request #350 · markaren/threepp

markaren · 2026-05-27T22:43:50Z

RF-DETR joins RT-DETR and YOLO as a state-of-the-art deep vision model implementation.

Why, you ask? These are all examples showcasing how threepp's Vulkan backend isn't just for drawing triangles — the same GPU device that renders your scene can run a modern neural network end-to-end, with nothing but hand-written GLSL compute shaders. No PyTorch. No CUDA. No ONNX Runtime. No TensorRT. No 2 GB of Python dependencies. Just Vulkan — the same one threepp already uses to render.

That's the whole point of these examples:

One device, one process. Perception and rendering live on the same Vulkan device and share the same memory. You can render a scene and detect objects in it without ever leaving the GPU or shelling out to a separate inference framework.
Runs anywhere Vulkan runs. NVIDIA, AMD, Intel, mobile Mali/Adreno — no vendor lock-in. (TensorRT is NVIDIA-only; this isn't.)
It's the real model, not a toy. Each one is the actual pretrained network, reimplemented op-by-op as compute shaders.

What's in here

Three object detectors spanning both major families:

YOLOv8n — the classic real-time CNN detector.
RT-DETR — a transformer-based real-time detector (HGNetv2 backbone + AIFI encoder + deformable decoder).
RF-DETR-Nano (new) — Roboflow's recent DINOv2-windowed-ViT detector (windowed attention backbone + C2f projector + two-stage deformable decoder).

Each ships as a threepp example: load weights, run on an image, and visualize the detections through the renderer's ortho overlay.

Is it correct?

Yes, and verifiably so. Each port is validated layer-by-layer against the reference PyTorch model: per-layer activations are captured from PyTorch and diffed element-wise against the port's output (the backbones match to ~1e-5). On real images the detections match the reference model (e.g. bus.jpg → bus + 4 people, same boxes). There's a --validate mode in each example so you can check it yourself.
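A minimal sketch of what an element-wise activation diff can look like, assuming both tensors have already been flattened to `std::vector<float>`. The names `DiffStats`, `diffActivations` and `layerMatches` are hypothetical and are not taken from threepp or its `--validate` mode. Treating the ~1e-5 figure as a maximum absolute error is also an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not part of threepp: summarises the element-wise
// difference between a reference activation tensor (e.g. dumped from PyTorch)
// and the tensor produced by the port, both flattened to 1-D.
struct DiffStats {
    float maxAbsErr = 0.f;       // largest |ref[i] - got[i]| seen
    std::size_t worstIndex = 0;  // index at which that error occurs
};

inline DiffStats diffActivations(const std::vector<float>& ref,
                                 const std::vector<float>& got) {
    assert(ref.size() == got.size() && "tensors must have the same element count");
    DiffStats s;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float err = std::fabs(ref[i] - got[i]);
        if (std::isnan(err)) {
            // A NaN on either side is always a mismatch; report it immediately.
            s.maxAbsErr = std::numeric_limits<float>::infinity();
            s.worstIndex = i;
            return s;
        }
        if (err > s.maxAbsErr) {
            s.maxAbsErr = err;
            s.worstIndex = i;
        }
    }
    return s;
}

// Assumed criterion: a layer "matches" if its max absolute error is within tol.
inline bool layerMatches(const std::vector<float>& ref,
                         const std::vector<float>& got,
                         float tol = 1e-5f) {
    return diffActivations(ref, got).maxAbsErr <= tol;
}
```

Reporting the worst index alongside the error makes it easier to trace a mismatch back to a specific channel or spatial position when a layer fails the check.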

Is it fast?

Competitive. On an RTX 4070, end-to-end (preprocess → forward → postprocess), benchmarked honestly against optimized PyTorch (TorchScript, lean pre/post):

YOLOv8n: ~127 FPS — on par with optimized PyTorch.
RF-DETR-Nano: ~42 FPS — within ~10% of optimized PyTorch.

To be clear about what this is and isn't: hand-written shaders don't beat vendor-tuned cuBLAS/cuDNN on raw compute, and that's fine. The win here is deployment: framework-free, portable, and co-resident with the renderer. PyTorch baselines are included (scripts/bench_*) so the comparison is reproducible.
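For readers who want to reproduce an end-to-end FPS figure on their own pipeline, here is a rough sketch of a timing harness. `measureFps` and `fpsToMillis` are hypothetical names, not part of threepp or the scripts/bench_* baselines, and the warm-up count is an arbitrary choice. It assumes the callable passed in performs one full preprocess → forward → postprocess pass and blocks until the GPU work has finished; otherwise only submission time is measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical harness: runs `frame()` `warmup` times untimed (to let caches,
// pipelines and clocks settle), then times `iterations` calls and returns
// the average frames per second.
template <typename Fn>
double measureFps(Fn&& frame, std::size_t iterations, std::size_t warmup = 10) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    for (std::size_t i = 0; i < warmup; ++i) frame();

    const auto t0 = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) frame();
    const auto t1 = clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Guard against a zero-length interval for trivially cheap callables.
    return seconds > 0.0 ? static_cast<double>(iterations) / seconds : 0.0;
}

// Converts a frame rate to the average per-frame latency in milliseconds.
inline double fpsToMillis(double fps) {
    assert(fps > 0.0 && "fps must be positive");
    return 1000.0 / fps;
}
```

By this conversion, ~127 FPS corresponds to roughly 7.9 ms per frame and ~42 FPS to roughly 23.8 ms per frame.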

markaren added 4 commits May 27, 2026 21:40

rf_detr wip [skip ci]

8d2e086

rf_detr working [skip ci]

6de54ba

rf_detr optimization [skip ci]

c5e7591

rf_detr optimization [skip ci]

6d08d99

markaren merged commit 9d4f8c6 into dev May 28, 2026

markaren merged 4 commits into dev from rf_detr
