Instructions to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus", filename="Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus # Run inference directly in the terminal: llama cli -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus # Run inference directly in the terminal: llama cli -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus # Run inference directly in the terminal: ./llama-cli -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus # Run inference directly in the terminal: ./build/bin/llama-cli -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Use Docker
docker model run hf.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
- LM Studio
- Jan
- vLLM
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
- Ollama
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Ollama:
ollama run hf.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
- Unsloth Studio
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus to start chatting
- Pi
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Run Hermes
hermes
- Atomic Chat new
- OpenClaw new
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with OpenClaw:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus" \ --custom-provider-id llama-cpp \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Docker Model Runner:
docker model run hf.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
- Lemonade
How to use jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Run and chat with the model
lemonade run user.Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-{{QUANT_TAG}}List all available models
lemonade list
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
This is an extremely high quality FPX3 / ROCmFPX Q3 GGUF build of stepfun-ai/Step-3.7-Flash, tuned for AMD Strix Halo local serving with Step MTP.
The goal is simple: keep Step 3.7 Flash useful at 256K context, keep the quality as high as possible, and keep it as small as possible. This release is a true tight Q3-weight build: 3.57 BPW, 81.77 GiB of language-model shards, and strong agent/tool behavior in local evals.
Use this if you want the Step 3.7 behavior profile, MTP support, and a much smaller local footprint than the stock GGUF Q3_K_L or ROCmFP4 STRIX_LEAN builds.
Required runtime: these GGUFs do not run on stock upstream llama.cpp. They use ROCmFPX tensor types such as
q3_0_rocmfpxplus Chadrock/ROCmFPX serving support for Step MTP. Build the pinned Ciru ROCmFPX runner below before trying to load the model.
Why This One
Step 3.7 is huge. The practical local problem is not only speed; it is fitting enough context, KV, and agent workload into memory.
This FPX3/Q3 QualityPlus recipe was built for that constraint:
3.57 BPWeffective language-model size81.77 GiBtotal language GGUF shards16.31%smaller than the local ROCmFP4 STRIX_LEAN build14.35%smaller than StepFun's originalQ3_K_LGGUF split- up to 256K one-slot serving profile with q8_0 target KV and q8_0 draft KV
- Step MTP Q8 draft support through
draft-mtp - downloadable fixed Step tool/chat template using native
tool_responseobservations and protocol-boundary escaping
In practice, the original StepFun Q3_K_L local split was not a compact 3-bit-feeling model: it measured about 95.46 GiB, or roughly 4.17 BPW by effective size. This QualityPlus build is the one I would publish/use as the FPX3 lane.
Size Comparison
Measured from local GGUF shards:
| Build | Effective BPW | Shard total | Difference vs this release |
|---|---|---|---|
| ROCmFPX Q3 QualityPlus | 3.57 BPW |
81.77 GiB |
baseline |
StepFun original Q3_K_L |
~4.17 BPW |
95.46 GiB |
+13.70 GiB larger |
| ROCmFP4 STRIX_LEAN | ~4.27 BPW |
97.70 GiB |
+15.93 GiB larger |
That size gap matters because Step 3.7 needs memory for long context, q8 KV, and MTP draft state. On the tested Strix Halo host, the Q3 QualityPlus 64K MTP profile used about 96.3 GiB peak pooled GPU memory during long tool/Hermes runs, leaving enough RAM headroom to run the evals cleanly.
Quality Highlights
This is not a throwaway low-bit build. The recipe protects the tensors that were most important for behavior while pushing the giant expert FFN tensors into q3_0_rocmfpx.
Local quality results on AMD Ryzen AI Max+ 395 / Strix Halo:
| Benchmark | Result | Notes |
|---|---|---|
| Tool-Eval full, 69 scenarios | 88/100, 122/138 raw points |
Same headline score as the recorded Step ROCmFP4 tool-eval row |
| HermesAgent-20, best Q3 run | 85/100 |
13.40 min, 35.31 tok/s decode, 96.37 GiB peak pooled GPU |
The best recorded Q3 HermesAgent-20 run was very close to the local BF16 Qwen3.6 27B MTP reference row:
| Model / row | HermesAgent-20 score | Wall time |
|---|---|---|
| BF16 Qwen3.6 27B MTP GGUF | 87/100 |
42.4 min |
| Step 3.7 ROCmFPX Q3 QualityPlus | 85/100 |
13.4 min |
That is within two points of the BF16 Qwen3.6 27B row on the local HermesAgent-20 suite, while running in a much more compact Step 3.7 Q3 package.
Exact Q3 QualityPlus tool-eval score summary: evals/tool-eval-q3-qualityplus.json. Public reference page for the Step 3.7 tool-calling work: StepFun Step 3.7 Tool Eval on llm.ciru.ai. The Q3 QualityPlus full run used the same 69-scenario tool-eval harness and scored 88/100 locally.
Speed
Q3 QualityPlus speed was effectively tied with the local ROCmFP4 Step build while using much less disk space.
Short-context MTP speed, Vulkan0, q8_0/q8_0 target KV, q8_0/q8_0 draft KV, one slot, n_max=2, p_min=0.75, b8192/u2048, 128 generated tokens:
| Prompt | PP tok/s | TG tok/s |
|---|---|---|
2k |
309.44 |
29.97 |
4k |
325.18 |
29.39 |
8k |
311.15 |
28.58 |
16k |
306.37 |
26.26 |
Compared with the local ROCmFP4 Step build:
| Prompt | Q3 QualityPlus TG | ROCmFP4 TG | Takeaway |
|---|---|---|---|
2k |
29.97 |
26.52 |
Q3 faster |
4k |
29.39 |
29.37 |
tied |
8k |
28.58 |
28.02 |
tied/slightly Q3 |
16k |
26.26 |
26.42 |
tied |
128K stress row:
| Context | PP tok/s | TG tok/s | Peak pooled GPU |
|---|---|---|---|
~130k prompt |
146.67 |
14.52 |
~95.36 GiB |
At 128K, MTP initialized but produced no accepted drafts in that particular row, so treat the 128K decode number as an effective no-draft long-context decode reference.
256K load proof:
| Context | Proof | Memory state |
|---|---|---|
262144 |
target + Q8 MTP draft loaded, one slot, draft-mtp, /v1/models reports n_ctx=262144 and n_ctx_train=262144 |
~99.04 GiB pooled GPU used, ~16 GiB system RAM available |
The 256K row is a load/allocation proof, not a 256K prompt prefill benchmark.
Files
Published shard names intentionally match the model name:
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00002-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00003-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00004-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00005-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00006-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00007-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00008-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00009-of-00009.gguf
The Step MTP draft model is not duplicated here. If you enable draft-mtp, you
must also download and pass the separate Q8 draft from
notSnix/Step-3.7-Flash-MTP-Draft-GGUF,
for example Step-3.7-Flash-MTP-Q8_0.gguf. The main Q3 target GGUF does not
contain the MTP draft layers.
This repo also includes the tested chat/tool template:
step37-native-tool-response-template.jinja
Download the target shards and template:
huggingface-cli download jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus \
--include "Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-*.gguf" \
--include "step37-native-tool-response-template.jinja" \
--local-dir /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Download the required Q8 MTP draft:
huggingface-cli download notSnix/Step-3.7-Flash-MTP-Draft-GGUF \
Step-3.7-Flash-MTP-Q8_0.gguf \
--local-dir /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF
Direct template URL:
https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja
Direct Q8 draft URL:
https://huggingface.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF/resolve/main/Step-3.7-Flash-MTP-Q8_0.gguf
Required ROCmFPX Runner
This model is tied to the Charlie/Ciru ROCmFPX llama.cpp runner family. A stock llama-server will not understand the ROCmFPX tensor types in these shards and will not reproduce the MTP serving behavior used for the benchmark rows.
Use the pinned Ciru runner:
repo: https://github.com/ciru-ai/ROCmFPX
current recommended pin: 221402af8574faf652b101b6afe225a3f329561f
branch at time of pin: main
upstream lineage: charlie12345/ROCmFPX
The earlier Chadrock v2 speed-runner tag remains useful for historical comparison:
tag: chadrockv2-runner-20260622
commit: 7aa484a2f0a504dc612a3d74a068024f3e6d6353
The Q3 QualityPlus Step 3.7 rows on this card were validated with the Chadrock/ROCmFPX runner path on AMD Ryzen AI Max+ 395 / Strix Halo. For fresh installs, use the current Ciru pin above unless you are reproducing an older benchmark exactly.
Build the runner on a Linux system with a working ROCm/HIP toolchain, Vulkan development headers, CMake, and a C++ compiler. This is the pinned Strix Halo reference build used by Ciru; it is not a universal distro installer, so package names and ROCm paths may differ on Ubuntu, Arch, Fedora, NixOS, and other distros.
git clone https://github.com/ciru-ai/ROCmFPX.git
cd ROCmFPX
git checkout 221402af8574faf652b101b6afe225a3f329561f
env JOBS="$(nproc)" \
CMAKE_HIP_ARCHITECTURES=gfx1151 \
ROCMFPX_DECODE_TUNE=stable \
scripts/build-strix-rocmfp4-mtp.sh llama-server llama-bench
If your ROCm or rocWMMA headers live outside the script defaults, set the relevant environment variables before running the build, for example ROCM_WMMA_INCLUDE=/path/to/rocWMMA/library/include. If your GPU is not Strix Halo / gfx1151, change CMAKE_HIP_ARCHITECTURES for your target.
The script and build directory still use the historical rocmfp4 name, but this is the ROCmFPX/Chadrock runner. For this model, the required support is ROCmFPX Q3 tensor support, not a ROCmFP4-only runtime.
The server binary should be:
./build-strix-rocmfp4/bin/llama-server
Again, build-strix-rocmfp4 is the historical build-directory name used by the ROCmFPX runner script.
If the model load fails with an unknown GGUF tensor type, you are using the wrong runner.
Recommended Serving Profile
The locally tested long-context profile:
context: up to 262144
slots: 1
backend: Vulkan0 target + Vulkan0 draft
MTP: --spec-type draft-mtp
draft model: Step-3.7-Flash-MTP-Q8_0.gguf from notSnix/Step-3.7-Flash-MTP-Draft-GGUF
speculative.n_max: 2
speculative.n_min: 0
speculative.p_min: 0.75
speculative.p_split: 0.10
batch / ubatch: 8192 / 2048
target KV: q8_0 / q8_0
draft KV: q8_0 / q8_0
prompt cache: disabled for 256K fit runs
sampler: temperature 1.0, top_p 0.95, min_p 0.0, repeat_penalty 1.0
reasoning: on, DeepSeek format
chat template: Step native tool_response template with protocol-boundary escaping
Serving backend note: on the tested AMD Ryzen AI Max+ 395 / Strix Halo system, this Step 3.7 Q3 build worked best through the ROCmFPX/Chadrock runner serving on Vulkan0 for both target and draft. In the command below, ROCmFPX is the required tensor/runtime support; -dev Vulkan0 and --spec-draft-device Vulkan0 are the recommended serving backend.
For models.ini-style launchers, make sure the draft path is present. Setting
spec-type = draft-mtp without spec-draft-model makes the runner try to build
an MTP draft context from the main target GGUF, which fails because the target
does not contain MTP draft layers.
model = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
chat-template-file = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/step37-native-tool-response-template.jinja
spec-type = draft-mtp
spec-draft-model = /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF/Step-3.7-Flash-MTP-Q8_0.gguf
spec-draft-device = Vulkan0
spec-draft-ngl = all
spec-draft-type-k = q8_0
spec-draft-type-v = q8_0
spec-draft-n-max = 2
spec-draft-n-min = 0
spec-draft-p-min = 0.75
spec-draft-p-split = 0.10
If you see context type MTP requested but model doesn't contain MTP layers,
the draft model is missing or the path is wrong.
Example shape:
./build-strix-rocmfp4/bin/llama-server \
-m Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf \
--alias step-3.7-flash-rocmfpx-q3-qualityplus \
--host 127.0.0.1 \
--port 8080 \
--jinja \
-c 262144 \
--reasoning on \
--reasoning-format deepseek \
--reasoning-budget -1 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 8192 \
-ub 2048 \
--parallel 1 \
--no-mmap \
--cache-ram 0 \
-ctk q8_0 \
-ctv q8_0 \
--spec-draft-model Step-3.7-Flash-MTP-Q8_0.gguf \
--spec-draft-device Vulkan0 \
--spec-type draft-mtp \
--spec-draft-ngl all \
--spec-draft-type-k q8_0 \
--spec-draft-type-v q8_0 \
--spec-draft-n-max 2 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.75 \
--spec-draft-p-split 0.10 \
--chat-template-file /path/to/step37-native-tool-response-template.jinja \
--metrics
Template Note
The best local Step setup uses the included step37-native-tool-response-template.jinja template. It renders tool outputs as tool_response turns and escapes protocol-boundary tokens inside tool output. This is a general protocol-adapter fix: tool/file/search results stay observations instead of being flattened into user text.
Download:
curl -L -o step37-native-tool-response-template.jinja \
https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja
That matters for real agents because Step 3.7 can otherwise confuse tool output with conversation authority, especially in file/search-result injection cases.
Build Notes
These are model-build notes, not runner-build instructions. Build the pinned ROCmFPX runner in the section above before serving the GGUFs.
The QualityPlus policy used here:
- huge
ffn_*_expstensors:q3_0_rocmfpx - attention q/output protected at
q5_K - attention k/v protected at
q4_K - shared/dense FFN protected at
q5_K - output/token embeddings at
q4_0_rocmfp4_fast
Converter-reported size: 83726.08 MiB / 3.57 BPW, 9 shards.
Credits
- Base model:
stepfun-ai/Step-3.7-Flash - MTP draft GGUF source:
notSnix/Step-3.7-Flash-MTP-Draft-GGUF - ROCmFPX creator: Charlie,
charlie12345/@italianclownz,charlie12345/ROCmFPX - Pinned public runner fork and build recipe:
ciru-ai/ROCmFPX, current recommended pin221402af8574faf652b101b6afe225a3f329561f - Quantization, the ROCmFPX Step 3.7 Q3 QualityPlus recipe, Strix Halo profile, and local benchmark work: Crown / Ciru
Caveats
- This is a custom ROCmFPX GGUF release. It requires the compatible ROCmFPX/Chadrock llama.cpp runner; stock llama.cpp is not expected to load it.
- Quality numbers are local Strix Halo measurements and depend on runtime, chat template, KV type, and MTP settings.
- The model is strong but not perfect at autonomous email/message side effects; it can be cautious and ask for subject/body/recipient details instead of sending with inferred defaults.
- Downloads last month
- -
We're not able to determine the quantization variants.
Model tree for jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus
Base model
stepfun-ai/Step-3.7-Flash