RF-DETR → Core AI (.aimodel)
RF-DETR (Roboflow's real-time detection transformer, COCO-pretrained) converted to Apple Core AI for iOS 27 / macOS 27, in answer to apple/coreai-models#14. The DETR family needs no NMS: post-processing is one sigmoid plus top-k.

Use it
▶️ Run it (source): the DetectCamera runner (real-time object detection on the zero-copy camera path):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/DetectCamera/DetectCamera.xcodeproj
# Run, then pick "Nano" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/DetectCamera
swift run detect-cli --model rf-detr --image Resources/gate_image.jpg
💻 Build with it: a complete snippet that runs as pasted; the glue code is kit API:
import CoreAIKitVision
let detector = try await ObjectDetector(catalog: "rf-detr")
let image = try ImageFile.load(imageURL) // any image file → CGImage + EXIF orientation
let detections = try await detector.detect(in: image.cgImage)
// detections: [Detection], each with label, score, normalized box (top-left origin)
The take-home is Examples/DetectCamera/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and the GUI runs the same detector per camera frame on a zero-copy pixel-buffer fast path.
Real time? Use detect(in: CVPixelBuffer): vImage scales the frame with no CGImage round-trip, and CameraFeed (kit API) streams the buffers.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit, product CoreAIKitVision
- Info.plist: NSCameraUsageDescription, only for the live camera; the snippet needs none
- Entitlements: none needed
- First run downloads the model (0.1 GB on Mac, 0.1 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Files
| file | input | params | M4 Max GPU | iPhone 17 Pro GPU |
|---|---|---|---|---|
| rfdetr-nano_float32.aimodel | 384×384 | 30.5M | 8.6 ms (~116 FPS) | ~25 ms (33-39 FPS live) |
| rfdetr-small_float32.aimodel | 512×512 | 32.1M | 12.0 ms (~83 FPS) | - |
| rfdetr-medium_float32.aimodel | 576×576 | 33.7M | 14.8 ms (~68 FPS) | 56-63 ms (15-17 FPS live) |
| rfdetr-large_float32.aimodel | 704×704 | 33.9M | 19.1 ms (~52 FPS) | - |
iPhone numbers are end-to-end live-camera measurements from the CoreAIKit DetectCamera example (Release; zero-copy capture pipeline: AVCaptureVideoPreviewLayer display, hardware-scaled 32BGRA buffers, vImage preprocessing overlapped with GPU inference). Peak measured 39.6 FPS, the nano model ceiling; sustained max-load throughput drops on a hot chassis (thermal).
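The M4 Max FPS figures in the table appear to be the reciprocal of the per-frame latency. A quick check of that arithmetic, assuming one inference at a time with no overlap between frames (the live iPhone figures are end-to-end and are not expected to follow this formula exactly):

```python
# Latencies copied from the "M4 Max GPU" column of the table above.
M4_MAX_LATENCY_MS = {"nano": 8.6, "small": 12.0, "medium": 14.8, "large": 19.1}


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second if each frame takes latency_ms and frames do not overlap."""
    return 1000.0 / latency_ms


for variant, ms in M4_MAX_LATENCY_MS.items():
    print(f"{variant}: {ms} ms -> ~{fps_from_latency(ms):.0f} FPS")
```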
fp32 is the ship dtype: it gates detection-set exact against the PyTorch fp32 reference on CPU and GPU (per confident detection: same class, measured IoU ≥ 0.999, score within 2e-3), whereas fp16 bought only ~7% latency on M4 Max while adding near-tie ranking noise.
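A minimal sketch of what such a per-detection gate could look like. This is not the project's actual gate script: the function names, the 0.5 confidence cut-off, the use of the measured figures as thresholds, and the best-match pairing are all assumptions made for illustration.

```python
def cxcywh_to_xyxy(box):
    """(cx, cy, w, h) -> (x0, y0, x1, y1), same units as the input."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two cxcywh boxes."""
    ax0, ay0, ax1, ay1 = cxcywh_to_xyxy(a)
    bx0, by0, bx1, by1 = cxcywh_to_xyxy(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def _has_match(det, others, min_iou, score_tol):
    """True if some detection in `others` agrees on class, box, and score."""
    cls, score, box = det
    return any(
        o_cls == cls and iou(box, o_box) >= min_iou and abs(o_score - score) <= score_tol
        for o_cls, o_score, o_box in others
    )


def gates(reference, candidate, conf=0.5, min_iou=0.999, score_tol=2e-3):
    """Each detection is (class_id, score, (cx, cy, w, h)).

    Every confident detection on either side must have a counterpart on the
    other side. Thresholds default to the figures quoted in the text (assumed).
    """
    ref_conf = [d for d in reference if d[1] >= conf]
    cand_conf = [d for d in candidate if d[1] >= conf]
    return all(_has_match(d, candidate, min_iou, score_tol) for d in ref_conf) and all(
        _has_match(d, reference, min_iou, score_tol) for d in cand_conf
    )
```

Checking both directions means a missing detection and an extra detection both fail the gate, which is one reading of "detection-set exact".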
Graph contract
input "image" [1, 3, R, R] float32, RGB in [0, 1] (ImageNet mean/std folded in-graph)
output "dets" [1, 300, 4] boxes, cxcywh normalized to [0, 1]
output "labels" [1, 300, 91] raw class logits; column index = ORIGINAL COCO id (0 unused, 1=person … 17=cat … 90)
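To make the input side of the contract concrete, here is a dependency-free sketch that turns an already-resized R×R RGB image into the [1, 3, R, R] layout with values in [0, 1]. The function name and the nested-list representation are illustrative assumptions; resizing is left out, and no mean/std normalization is applied because the contract states it is folded into the graph.

```python
def to_model_input(pixels_hwc):
    """pixels_hwc: R x R x 3 nested lists of 0..255 RGB ints.

    Returns nested lists shaped [1][3][R][R] with floats in [0, 1].
    """
    height = len(pixels_hwc)
    width = len(pixels_hwc[0])
    if height != width:
        raise ValueError("the graph expects a square R x R input")
    # Channel-first planes: planes[c][y][x] is channel c of pixel (x, y).
    planes = [
        [[pixels_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(3)
    ]
    return [planes]  # leading batch dimension of 1


# Tiny 2x2 example: red, green / blue, white.
tiny = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
batch = to_model_input(tiny)
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

With NumPy the same layout change is usually a one-liner along the lines of `img.astype(np.float32).transpose(2, 0, 1)[None] / 255`.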
Python decode sketch (Swift is the same three steps):
import numpy as np, coreai.runtime as rt
model = await rt.AIModel.load(path, rt.SpecializationOptions.default())
fn = model.load_function("main")
out = await fn({"image": rt.NDArray(rgb01)}) # rgb01: [1,3,R,R] in [0,1]
prob = 1 / (1 + np.exp(-out["labels"].numpy()[0])) # [300, 91]
scores, classes = prob.max(-1), prob.argmax(-1) # column index IS the COCO id
boxes = out["dets"].numpy()[0] # cxcywh, multiply by image W/H
keep = scores > 0.5 # done: no NMS
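The sketch above thresholds each query's best class. The "sigmoid + top-k" variant mentioned at the top can be sketched as follows, assuming top-k is taken over all (query, class) pairs, as is common in DETR-family post-processing; that detail, the function names, and the defaults are assumptions, not the kit's confirmed behavior. The sketch also converts the normalized cxcywh boxes to pixel corner coordinates.

```python
import heapq
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_top_k(logits, boxes, image_w, image_h, k=100, min_score=0.5):
    """logits: [Q][C] raw class logits; boxes: [Q][4] normalized cxcywh.

    Returns up to k tuples (coco_id, score, (x0, y0, x1, y1)) in pixels,
    highest score first. The column index is used directly as the COCO id.
    """
    candidates = (
        (logit, q, c) for q, row in enumerate(logits) for c, logit in enumerate(row)
    )
    # Sigmoid is monotonic, so ranking raw logits ranks the scores identically.
    top = heapq.nlargest(k, candidates)
    results = []
    for logit, q, c in top:
        score = sigmoid(logit)
        if score < min_score:
            continue
        cx, cy, w, h = boxes[q]
        results.append((
            c,
            score,
            ((cx - w / 2) * image_w, (cy - h / 2) * image_h,
             (cx + w / 2) * image_w, (cy + h / 2) * image_h),
        ))
    return results


# Two queries, three classes, on a 640x480 image.
demo_logits = [[-5.0, 2.0, -1.0], [-5.0, -3.0, 4.0]]
demo_boxes = [[0.5, 0.5, 0.2, 0.4], [0.25, 0.25, 0.5, 0.5]]
for coco_id, score, box in decode_top_k(demo_logits, demo_boxes, 640, 480, k=2):
    print(coco_id, round(score, 3), [round(v, 1) for v in box])
```

Note that with flattened top-k the same query can appear more than once under different classes; the thresholded argmax sketch above never does that.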
RF-DETR-Seg (instance segmentation)
rfdetr-seg-{nano,small,medium,large,xlarge,2xlarge}_float32.aimodel: same
contract plus masks [1, Q, R/4, R/4], per-query FULL-FRAME logit planes at
stride 4 (host: sigmoid > 0.5; no ROI plumbing, no NMS). All six gate on CPU
and GPU with binary-mask IoU 1.000 on stable scenes. M4 Max GPU:
seg-nano 312² 10.7 ms to seg-2xlarge 768² 59.1 ms.
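The host-side mask step can be sketched as below for a single query's logit plane. Thresholding follows the text; the nearest-neighbour upsampling back to R×R is an assumption added for illustration, since the card does not say how the stride-4 plane is scaled, and the function names are hypothetical.

```python
def binarize(logit_plane):
    """sigmoid(x) > 0.5 is equivalent to x > 0, so no exp() is needed."""
    return [[1 if value > 0.0 else 0 for value in row] for row in logit_plane]


def upsample_nearest(mask, factor=4):
    """Repeat every cell factor x factor times (stride-4 plane -> full frame)."""
    out = []
    for row in mask:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# One query's 2x2 logit plane, upsampled by 2 to keep the printout small.
plane = [[-1.0, 2.0], [0.5, -0.1]]
for row in upsample_nearest(binarize(plane), factor=2):
    print(row)
```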

Split deployment (split/)
split/rfdetr-{nano,medium}_{backbone,head}.aimodel separate the pure-ViT
backbone (image → features) from the deformable head (features → dets/labels;
position encodings baked in). The chain is bit-exact vs the monolith. Purpose:
per-stage compute-unit preferences, e.g. backbone on the Neural Engine.
Measured honestly: on iOS 27 beta the runtime still executes the backbone on
the GPU delegate even under .neuralEngine preference (identical detection
fingerprint, no ANE-compile pause), so today the monolith on GPU is the
fastest config; the split exists so ANE placement can be adopted the moment
the runtime honors it. Regenerate with export_rf_detr.py --variant <v> --split.
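"Bit-exact" is a stricter claim than "numerically close": the float32 bit patterns must match, not just compare equal within a tolerance. A small sketch of such a comparison, using toy stand-in functions in place of the real backbone, head, and monolith graphs (all three stand-ins are hypothetical and exist only to make the example runnable):

```python
import struct


def float32_bits(values):
    """Pack floats as little-endian IEEE-754 float32 and return the raw bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def bit_exact(a, b):
    """True only if both sequences have identical float32 bit patterns."""
    return len(a) == len(b) and float32_bits(a) == float32_bits(b)


# Hypothetical stand-ins for the exported graphs.
def backbone(image):
    return [2.0 * v for v in image]


def head(features):
    return [v + 0.5 for v in features]


def monolith(image):
    return [2.0 * v + 0.5 for v in image]


image = [0.0, 0.25, 1.0]
print(bit_exact(head(backbone(image)), monolith(image)))
```

Comparing bytes rather than values also distinguishes 0.0 from -0.0, which an ordinary equality check would treat as the same.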
Conversion
Exported with conversion/export_rf_detr.py from rfdetr==1.7.1 weights. The port surfaced four Core AI converter/runtime bugs
(float-arg arange abort, int64-comparison buffer clobber, GPU-delegate
floor/trunc/ceil = identity, cast-pair cancellation), each worked around with numerically
identical results; details and minimal repros are in zoo/rf-detr.md.
License: Apache-2.0 (upstream RF-DETR code and COCO-pretrained weights are Apache-2.0).