Add RF-DETR implementation #350

Merged: markaren merged 4 commits into dev from rf_detr on May 28, 2026

@markaren (Owner) commented May 27, 2026

RF-DETR joins RT-DETR and YOLO as a state-of-the-art deep vision model implementation.

Why, you ask? These are all examples showcasing how threepp's Vulkan backend isn't just for drawing triangles — the same GPU device that renders your scene can run a modern neural network end-to-end, with nothing but hand-written GLSL compute shaders. No PyTorch. No CUDA. No ONNX Runtime. No TensorRT. No 2 GB of Python dependencies. Just Vulkan — the same one threepp already uses to render.

That's the whole point of these examples:

  • One device, one process. Perception and rendering live on the same Vulkan device and share the same memory. You can render a scene and detect objects in it without ever leaving the GPU or shelling out to a separate inference framework.
  • Runs anywhere Vulkan runs. NVIDIA, AMD, Intel, mobile Mali/Adreno: no vendor lock-in. (TensorRT is NVIDIA-only; this isn't.)
  • It's the real model, not a toy. Each one is the actual pretrained network, reimplemented op-by-op as compute shaders.

What's in here

Three object detectors spanning both major families:

  • YOLOv8n — the classic real-time CNN detector.
  • RT-DETR — a transformer-based real-time detector (HGNetv2 backbone + AIFI encoder + deformable decoder).
  • RF-DETR-Nano (new) — Roboflow's recent DINOv2-windowed-ViT detector (windowed attention backbone + C2f projector + two-stage deformable decoder).

Each ships as a threepp example: load weights, run on an image, and visualize the detections through the renderer's ortho overlay.
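
As a rough illustration of the last step (turning detections into something an overlay can draw), here is a minimal sketch. It assumes boxes arrive as normalized center-format (cx, cy, w, h) values, as DETR-style heads commonly emit, and that the image was resized without letterboxing. `NormBox`, `PixelRect` and `toPixelRect` are hypothetical names for illustration, not threepp API, and the examples in this PR may handle this differently.

```cpp
#include <algorithm>
#include <cassert>

// Assumed layout: box center and size, each normalized to [0, 1].
struct NormBox { float cx, cy, w, h; };

// Axis-aligned rectangle in image pixels (top-left and bottom-right corners).
struct PixelRect { float x0, y0, x1, y1; };

// Hypothetical helper: converts a normalized center-format box into pixel
// corners and clamps the result to the image bounds.
PixelRect toPixelRect(const NormBox& b, float imageWidth, float imageHeight) {
    assert(imageWidth > 0.0f && imageHeight > 0.0f);
    const float x0 = (b.cx - 0.5f * b.w) * imageWidth;
    const float y0 = (b.cy - 0.5f * b.h) * imageHeight;
    const float x1 = (b.cx + 0.5f * b.w) * imageWidth;
    const float y1 = (b.cy + 0.5f * b.h) * imageHeight;
    return PixelRect{
        std::clamp(x0, 0.0f, imageWidth),
        std::clamp(y0, 0.0f, imageHeight),
        std::clamp(x1, 0.0f, imageWidth),
        std::clamp(y1, 0.0f, imageHeight)};
}
```

The resulting corners are in image pixels with the origin at the top-left; an orthographic overlay whose frustum matches the image size could draw them directly. `std::clamp` requires C++17.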

Is it correct?

Yes, and verifiably so. Each port is validated layer by layer against the reference PyTorch model: per-layer activations are captured from PyTorch and diffed element-wise (the backbones match to ~1e-5). On real images the detections match the reference model (e.g. bus.jpg → bus + 4 people, same boxes). Each example has a --validate mode so you can check it yourself.
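
The element-wise diff can be pictured with a small sketch like the one below. It is not the code behind --validate; `DiffStats`, `compareActivations` and `withinTolerance` are hypothetical names, and it assumes both the GPU readback and the PyTorch dump have already been flattened into float vectors of equal length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of an element-wise comparison between two activation tensors.
struct DiffStats {
    float maxAbs = 0.0f;     // largest |actual[i] - reference[i]| (NaN if any element is NaN)
    std::size_t argMax = 0;  // index at which that value was recorded
};

// Hypothetical helper: compares one layer's output against the reference dump.
DiffStats compareActivations(const std::vector<float>& actual,
                             const std::vector<float>& reference) {
    assert(actual.size() == reference.size());
    DiffStats stats;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float d = std::fabs(actual[i] - reference[i]);
        // A NaN difference is recorded and then sticks, because no later
        // comparison "d > NaN" can succeed; the layer will fail the check.
        if (std::isnan(d) || d > stats.maxAbs) {
            stats.maxAbs = d;
            stats.argMax = i;
        }
    }
    return stats;
}

// True when the worst-case error is at or below the tolerance (false for NaN).
bool withinTolerance(const DiffStats& stats, float tolerance) {
    return stats.maxAbs <= tolerance;
}
```

Running such a check after every layer, rather than only on the final output, points at the first op that diverges, which is what makes a layer-by-layer comparison useful when porting.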

Is it fast?

Competitive. On an RTX 4070, end-to-end (preprocess → forward → postprocess), benchmarked honestly against optimized PyTorch (TorchScript, lean pre/post):

  • YOLOv8n: ~127 FPS, on par with optimized PyTorch.
  • RF-DETR-Nano: ~42 FPS, within ~10% of optimized PyTorch.

To be clear about what this is and isn't: hand-written shaders don't beat vendor-tuned cuBLAS/cuDNN on raw compute, and that's fine — the win here is deployment: framework-free, portable, and co-resident with the renderer. PyTorch baselines are included (scripts/bench_*) so the comparison is reproducible.
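
For readers who want to reproduce an end-to-end number on their own hardware, a timing loop might look roughly like this sketch. It is not the benchmark used for the figures above; `measureFps` is a hypothetical helper, and the warmup and iteration counts are left to the caller. The `frame` callable is assumed to run preprocess, forward and postprocess and to block until the GPU has finished (for example by waiting on a fence); otherwise only command submission is measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `frame` for `warmup` untimed iterations, then
// times `iterations` more and returns the average frames per second.
double measureFps(const std::function<void()>& frame, int warmup, int iterations) {
    assert(warmup >= 0 && iterations > 0);

    // Warmup lets pipeline creation, caches and clocks settle before timing.
    for (int i = 0; i < warmup; ++i) {
        frame();
    }

    // steady_clock is monotonic, so it is safe for measuring intervals.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        frame();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return static_cast<double>(iterations) / elapsed.count();
}
```

Averaging over many iterations smooths out per-frame jitter; comparing against the PyTorch scripts is only meaningful if both sides time the same stages.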

@markaren markaren merged commit 9d4f8c6 into dev May 28, 2026