AI & ML interests

Deep experimentation on ablation and quantization. Goal: trim the fat on models. Question: is all data good data, or is some of it just noise?

Organization Card

Cerebellum

The goal: smaller without being dumber, for people running 8 to 24 GB cards.

Most quantization picks one recipe and applies it to every tensor in the model. Cerebellum doesn't guess. Each tensor gets crushed to 2-bit on its own, the damage gets measured, and then a fixed size budget gets spent where the measurements say it matters. Fragile tensors keep their precision. Tensors that don't care get cut hard.
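The allocation step can be sketched roughly as follows, assuming the per-tensor damage has already been measured. The tensor names, damage numbers, candidate bit widths, and the greedy "damage removed per byte" rule are all hypothetical illustrations, not the actual Cerebellum recipe.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    n_params: int
    damage: dict  # hypothetical: bit width -> measured damage (e.g. PPL increase)


def allocate_bits(tensors, budget_bytes, levels=(2, 3, 4, 6, 8)):
    """Start every tensor at the lowest level, then greedily buy upgrades.

    Each step picks the single upgrade that removes the most measured
    damage per extra byte, and stops when nothing affordable helps.
    """
    bits = {t.name: levels[0] for t in tensors}
    used = sum(t.n_params * levels[0] / 8 for t in tensors)
    while True:
        best = None
        for t in tensors:
            idx = levels.index(bits[t.name])
            if idx + 1 >= len(levels):
                continue  # already at the highest level
            cur, nxt = levels[idx], levels[idx + 1]
            cost = t.n_params * (nxt - cur) / 8  # extra bytes for this upgrade
            gain = t.damage[cur] - t.damage[nxt]  # damage removed by upgrading
            if gain <= 0 or used + cost > budget_bytes:
                continue
            score = gain / cost
            if best is None or score > best[0]:
                best = (score, t.name, nxt, cost)
        if best is None:
            return bits, used
        _, name, nxt, cost = best
        bits[name] = nxt
        used += cost


# Made-up measurements: one fragile tensor, one that barely reacts.
tensors = [
    Tensor("fragile", 1000, {2: 5.0, 3: 2.0, 4: 0.5, 6: 0.1, 8: 0.0}),
    Tensor("robust", 1000, {2: 0.1, 3: 0.08, 4: 0.05, 6: 0.02, 8: 0.0}),
]
bits, used = allocate_bits(tensors, budget_bytes=1000)
```

With these invented numbers the budget goes to the fragile tensor while the robust one stays at 2-bit. A real recipe would also have to respect the block sizes and quant types llama.cpp actually supports.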

The name is the method: dissect the brain, cut one of the five senses so the other four come back sharper.

Every release is a normal GGUF that runs on stock llama.cpp. No fork, no custom runtime, no special loader. Everything here is built and audited on a single RTX 3090.
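A quick sanity check that a download really is a plain GGUF file is to read its fixed-size header. This is a minimal sketch, assuming a little-endian GGUF version 2 or later, where the 4-byte magic is followed by a 32-bit version and two 64-bit counts. The function name is made up for illustration.

```python
import struct


def parse_gguf_header(head: bytes):
    """Parse the first 24 bytes of a GGUF file.

    Returns (version, tensor_count, metadata_kv_count).
    Assumes little-endian GGUF v2+; v1 used 32-bit counts.
    """
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic or truncated header)")
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return version, tensor_count, kv_count
```

Usage is a matter of reading the first 24 bytes, for example `parse_gguf_header(open(path, "rb").read(24))`, where `path` points at any downloaded model file.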

A head-to-head

Qwen3.6-35B-A3B, same weights, same harness, only the bit allocation differs:

Build            Size      PPL     MMLU-Redux  HumanEval+  ARC-C  HellaSwag
Uniform Q3_K_M   16.87 GB  7.2204  74.88       57.93       95.56  91.92
Cerebellum       11.96 GB  7.1573  75.42       64.63       95.48  91.78

The Cerebellum build is 4.91 GB smaller, about 29%. PPL, MMLU-Redux, and HumanEval+ favor the Cerebellum build; ARC-C and HellaSwag differ by less than 0.15 points and are inside noise. Per-question results and the audit trail ship with every model card.
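The deltas behind that summary can be recomputed from the two table rows. The dictionary keys are only an illustrative layout; the numbers are the ones published above.

```python
uniform = {"size_gb": 16.87, "ppl": 7.2204, "mmlu_redux": 74.88,
           "humaneval_plus": 57.93, "arc_c": 95.56, "hellaswag": 91.92}
cerebellum = {"size_gb": 11.96, "ppl": 7.1573, "mmlu_redux": 75.42,
              "humaneval_plus": 64.63, "arc_c": 95.48, "hellaswag": 91.78}

# Cerebellum minus uniform. Negative is better for size and PPL,
# positive is better for the four accuracy benchmarks.
deltas = {k: round(cerebellum[k] - uniform[k], 4) for k in uniform}

# Size saved as a percentage of the uniform build.
size_saving_pct = round(100 * (uniform["size_gb"] - cerebellum["size_gb"])
                        / uniform["size_gb"], 1)

for key, value in deltas.items():
    print(f"{key:>15}: {value:+.4f}")
print(f"size saving: {size_saving_pct}%")
```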

What the ablations keep showing

Sensitivity is architectural, and intuition from one model family is actively wrong in the next.

  • Hybrid SSM layers fail hard below 4-bit. NaN, not gradual degradation.
  • In MoE models the expert weights are the fragile part, not the routers. That is the opposite of what intuition from dense models suggests.
  • Gemma's per-layer embeddings collapse one quant level before anything else. Protecting them took one build from perplexity 104 to 55.
  • In some dense models, lowering attention precision improves perplexity.

None of this is guessed. It is measured per tensor, per model, and the recipe follows the data.
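As a toy illustration of "measured per tensor", the sketch below fake-quantizes each tensor on its own and ranks tensors by how much they change. It is a stand-in under loud assumptions: it uses plain min/max round-to-nearest and weight reconstruction error, whereas the measurements described here are about damage to model quality, and llama.cpp's K-quants work differently. All names are hypothetical.

```python
import math


def fake_quantize(weights, bits):
    """Uniform round-to-nearest quantization over the tensor's min/max range."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # constant tensor: nothing to lose
    steps = (1 << bits) - 1
    scale = (hi - lo) / steps
    return [lo + round((w - lo) / scale) * scale for w in weights]


def relative_error(weights, bits):
    """Root of (squared reconstruction error / squared weight norm)."""
    quantized = fake_quantize(weights, bits)
    num = sum((w - q) ** 2 for w, q in zip(weights, quantized))
    den = sum(w * w for w in weights)
    return math.sqrt(num / den) if den else 0.0


def rank_by_sensitivity(tensors, bits=2):
    """Return (name, error) pairs, most damaged first."""
    scores = {name: relative_error(w, bits) for name, w in tensors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Swapping `relative_error` for a perplexity or task-score delta measured on the whole model turns the same loop into a quality-based ranking, at a much higher compute cost.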

Failures, kept on the record

  • A perplexity-only hill-climber improved wiki PPL by 35% while HumanEval+ fell 14 points. Deprecated. PPL alone is not a quality gate, and every decision is now gated on task benchmarks too.
  • A finished Gemma E2B abliterated build was killed before release: the damage traced to the source weights, not the quantization. The finding is written up instead of the model being shipped.
  • Early published HumanEval rows lost 6 to 8 points to harness bugs, not model quality. Those scores were corrected, the bugs documented, and every run since gets its wrong answers audited by hand before a score is published.
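The lesson from the hill-climber failure can be written as an acceptance rule for a search step: a candidate recipe is kept only if perplexity improves and no task score regresses. This is a hypothetical sketch; the function name, metric keys, and the 0.5-point tolerance are assumptions, not the project's actual gate.

```python
def accept_candidate(current, candidate,
                     task_keys=("humaneval_plus",), tolerance=0.5):
    """Accept a recipe change only if PPL drops and no task score
    falls by more than `tolerance` points (assumed noise allowance)."""
    if candidate["ppl"] >= current["ppl"]:
        return False
    return all(candidate[k] >= current[k] - tolerance for k in task_keys)


baseline = {"ppl": 10.0, "humaneval_plus": 60.0}

# Big PPL win, big coding regression: rejected.
ppl_only_win = {"ppl": 6.5, "humaneval_plus": 46.0}

# Small PPL win, task score holds: accepted.
balanced_win = {"ppl": 9.8, "humaneval_plus": 60.2}
```

A perplexity-only climber corresponds to calling this with `task_keys=()`, which is exactly the configuration that would have accepted the first candidate.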

Publishing rules

  • Every build is benchmarked against a same-size uniform-quant baseline: perplexity, MMLU, ARC, HellaSwag, HumanEval+.
  • Wrong answers get pulled and audited before a score goes in a model card. Benchmark scripts lie more often than models do.
  • A build that can't beat its baseline doesn't ship.
  • Every model card links its full benchmark evidence, down to per-question results.
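The audit rule can be pictured as a simple publish gate over per-question results: a score is held back until every wrong answer has been looked at. The record fields (`id`, `expected`, `predicted`) and function names below are assumptions for illustration, not the format used in the model cards.

```python
def wrong_answers(results):
    """Per-question records where the prediction missed the expected answer."""
    return [r for r in results if r["predicted"] != r["expected"]]


def publish_check(results, audited_ids):
    """Return (ready, pending_ids, accuracy_percent).

    `ready` is True only when every wrong answer has been audited by hand.
    """
    pending = [r["id"] for r in wrong_answers(results)
               if r["id"] not in audited_ids]
    correct = sum(r["predicted"] == r["expected"] for r in results)
    accuracy = 100.0 * correct / len(results)
    return (not pending), pending, accuracy


results = [
    {"id": "q1", "expected": "B", "predicted": "B"},
    {"id": "q2", "expected": "C", "predicted": "A"},
    {"id": "q3", "expected": "D", "predicted": "D"},
    {"id": "q4", "expected": "A", "predicted": ""},  # e.g. a parsing failure
]
ready, pending, accuracy = publish_check(results, audited_ids={"q2"})
```

Here `q4` is still unaudited, so the score stays unpublished. An empty prediction like that one is the kind of miss that can come from the harness rather than the model.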

Around here

Stock and Heretic (abliterated) variants live in separate collections below. The Heretic builds reuse the exact recipe proven on the stock weights, transferred verbatim, then re-benchmarked from scratch.

Tooling and method docs: github.com/deucebucket/cerebellum
