TripoSplat: Core AI (zoo's first 3D)
VAST-AI/TripoSplat: single image → 3D Gaussian
splats (.ply/.splat), MIT. The zoo's first 3D model: outputs drop straight into a Gaussian-splat
viewer (e.g. Apple RealityKit on visionOS, or MetalSplatter on iOS/macOS).
Pure-PyTorch pipeline (no diffusers/CUDA kernels): bg-removal → DINOv3 ViT-H encode + Flux2-VAE encode → 20-step flow-matching DiT denoiser → octree probability sampler → Gaussian decoder → splats.
This repo holds the Core AI .aimodel bundles (each is a directory). Conversion + runner scripts
live in the coreai-models-community zoo (conversion/triposplat/).
What runs on Core AI
5 neural nets converted (each gated converted-vs-eager cos = 1.000000):
| net | shape | bundle | dtype |
|---|---|---|---|
| DINOv3 ViT-H encoder | (1,3,1024,1024) → (1,4101,1280) | dinov3_fp16.aimodel | fp16 |
| Flux2-VAE encoder | (1,3,1024,1024) → (1,4096,128) | vae_fp16.aimodel | fp16 |
| DiT denoiser (one step) | latent(1,8192,16) + cam(1,1,5) + t + feat1(1,4101,1280) + feat2(1,4101,128) → latent, cam | dit_fp16.aimodel | fp16 |
| Octree probability decoder | x(1,8192,3) + l(1,) + cond(1,8192,16) → logits(1,8192,8) | octree_fp32.aimodel | fp32 |
| Decode (gs + build_gaussians + .ply activations, baked) | points(1,8192,3) + cond(1,8192,16) → (262144,14) | decode_fp32.aimodel | fp32 |
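The converted-vs-eager gate above compares the two outputs by cosine similarity. A minimal sketch of such a gate, assuming both outputs have been flattened to plain lists of floats, could look like the following; the function names and the threshold are illustrative and are not taken from the `_conv_*.py` scripts.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero output")
    return dot / (norm_a * norm_b)


def passes_gate(eager, converted, threshold=0.999999):
    """True when the converted output matches the eager one closely enough.

    The threshold is an illustrative assumption, not a value from the repo.
    """
    return cosine_similarity(eager, converted) >= threshold
```

Cosine similarity ignores overall scale (`[1, 2, 3]` and `[2, 4, 6]` score 1.0), so a gate like this can pass while magnitudes are wrong. That is one reason to pair it with a visual or true-scale check.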
The flow-matching sampler (FlowEulerCfgSampler) and the octree sample_probs systematic resampling
stay host-side (data-dependent control flow). Scripts: _conv_*.py convert+gate each net;
_conv_fp16.py makes the half-size fp16 bundles; _conv_decode.py bakes build_gaussians + the
Gaussian .ply-activation math into one net so the runner just writes raw floats.
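To make the host-side part concrete, here is a rough pure-Python sketch of the two pieces that stay outside the converted nets: a fixed-step Euler loop with classifier-free guidance, and systematic resampling of a probability vector into integer counts. It is not the code of `FlowEulerCfgSampler` or `sample_probs`. The `velocity_fn` callback, the `cfg_scale` default, the t = 1 → 0 time direction, and the use of resampling to turn probabilities into per-bin counts are all assumptions made for illustration.

```python
import random


def euler_cfg_sample(velocity_fn, x, steps=20, cfg_scale=3.0):
    """Integrate from t=1 (noise) down to t=0 with fixed Euler steps.

    velocity_fn(x, t, conditioned) is a hypothetical stand-in for one
    denoiser call; it returns a list of floats shaped like x.
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_cond = velocity_fn(x, t, True)
        v_uncond = velocity_fn(x, t, False)
        # Classifier-free guidance: push the unconditional velocity
        # toward the conditional one by cfg_scale.
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def systematic_resample(probs, n, rng=random):
    """Turn probabilities into integer counts that sum to n.

    One random offset u is shared by n evenly spaced pointers
    (u, u+1, ..., u+n-1) laid over the cumulative distribution scaled to n.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    u = rng.random()
    counts = [0] * len(probs)
    cumulative = 0.0
    j = 0  # index of the next pointer to place
    for i, p in enumerate(probs):
        cumulative += p / total * n
        while j < n and u + j < cumulative:
            counts[i] += 1
            j += 1
    # Guard against floating-point rounding leaving a pointer unplaced.
    counts[max(range(len(probs)), key=probs.__getitem__)] += n - j
    return counts
```

The `while` loop in `systematic_resample` runs a data-dependent number of times per bin, which is the kind of control flow that is awkward to trace into a static graph.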
model.py patches (the reusable contribution; see the zoo's conversion guide)
coreai-torch 0.4.0 needed six edits to VAST's model.py; all are general gotchas:
- float-arg `aten.arange` → `bad_optional_access` C++ abort. Use int-arg `arange` (DINOv3 RoPE).
- fx `got multiple values for 'mod'` → submodule called with `mod=` kwarg. Pass positionally.
- No complex ops → rewrote the DiT's complex RoPE (`torch.polar`/`view_as_complex`) as real cos/sin math (`apply_rotary_emb`, `RePo3DRotaryEmbedding.forward`).
- Constant-folded `sin`/`cos` of huge args is low-precision (cos ≈ 0.5) → the DiT positional embed computed from the fixed Sobol constant was folded wrong; precompute it into a `register_buffer`.
- `F.normalize` drops the eps clamp → near-zero vectors blow up ~1e13; rewrote `MultiHeadRMSNorm` as explicit `x*rsqrt(mean(x²)+eps)`. (Emergent only at large seq len → gate by VISUAL/true-scale.)
- `prog.optimize()` hangs on the 24-block/12k-token DiT graph (>90 min) → skip it (`convert(optimize=False)`), AOT `coreai-build` optimizes for the device anyway.
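Two of the rewrites in the list above are easy to show in isolation. The sketch below uses plain Python rather than PyTorch, so it is only an illustration of the math, not the patched `model.py`. It checks that a complex-multiply RoPE and its real cos/sin form agree, and it writes RMS normalization with `eps` inside the square root. The function names, the pairing of channels as `(a, b)` tuples, the `eps` default, and the omission of any learnable gain are assumptions.

```python
import cmath
import math


def rope_complex(pairs, angles):
    """Reference form: treat (a, b) as a + bj and multiply by e^{j*angle}."""
    out = []
    for (a, b), theta in zip(pairs, angles):
        z = complex(a, b) * cmath.rect(1.0, theta)  # unit phasor, like torch.polar(1, theta)
        out.append((z.real, z.imag))
    return out


def rope_real(pairs, angles):
    """Same rotation using only real multiplies and adds."""
    out = []
    for (a, b), theta in zip(pairs, angles):
        c, s = math.cos(theta), math.sin(theta)
        out.append((a * c - b * s, a * s + b * c))
    return out


def rms_norm(x, eps=1e-6):
    """x * rsqrt(mean(x^2) + eps).

    eps is added before the square root, so the scale factor is bounded
    by 1/sqrt(eps) even when x is all zeros.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]
```

With `eps` inside the root, a vector of magnitude 1e-12 is scaled by at most 1000 instead of being divided by its own near-zero norm.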
Plus: int8 desaturates this model (per-net cos is 0.9998, but colors collapse). Use fp16, which is
GPU-identical to fp32. Gate fp16 on GPU or visually: its CPU cos looks bad, but that is a CPU-compute
artifact. Octree decoder: an int64 l (resolution) input → CoreAIError 3 at runtime; pass it as float32.
Running it
- Mac: `_run_coreai.py` (or `app_backend.py --input <img>`) loads the bundles via coreai.runtime (`SpecializationOptions.default()` = GPU; ~2 min/gen at 20 steps on Apple silicon, full quality). End-to-end latent gate vs torch-DiT: cos 0.999999.
- Mac app / iPhone client: `TripoSplatMac` (standalone) and `TripoSplatPhone` (capture on iPhone → Mac server `server.py` → view splats in MetalSplatter / RealityKit).
On-device note
Full on-device (iPhone) was verified infeasible with this model: DINOv3 ViT-H AOT .aimodelc
is ~3.1 GB and the DiT's 12294-token full-attention score matrix alone is ~4.8 GB, both over the
~3.3 GB iOS app memory budget (weight precision doesn't fix the attention working set). Needs
flash-attention conversion / weight streaming. The Mac-link client is the shipped path.
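The attention figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes 16 attention heads and 2-byte (fp16) scores. Neither is stated in this README; they are chosen only because they reproduce the quoted ~4.8 GB, so treat them as guesses.

```python
def attention_score_bytes(tokens, heads, bytes_per_element):
    """Size of a full (tokens x tokens) score matrix across all heads."""
    return tokens * tokens * heads * bytes_per_element


TOKENS = 12294     # token count quoted in the note above
HEADS = 16         # assumption: head count is not given in this README
BYTES_FP16 = 2     # assumption: scores held in fp16
BUDGET_GB = 3.3    # approximate iOS app memory budget quoted above

score_gb = attention_score_bytes(TOKENS, HEADS, BYTES_FP16) / 1e9
print(f"score matrix: {score_gb:.2f} GB, over budget: {score_gb > BUDGET_GB}")
# prints "score matrix: 4.84 GB, over budget: True"
```

The size grows with the square of the token count and does not depend on weight precision, so shrinking the weights leaves this working set unchanged.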