TinyStories Mixtral 2M Top-2 MoE (tinymoe2m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Mixtral variant (a Mixture-of-Experts architecture built on the Llama 2 layer design) scaled down to about 1.95M total parameters, of which about 1.14M are active per token. It is trained on the TinyStories dataset and intended as a precise validation asset.

It is designed specifically for debugging custom inference engines and native tensor compilers against MoE-specific runtime features: gating-network weight computation, token distribution and gathering (scatter/gather loops), and the weighted sum that combines the outputs of the selected experts.
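The three runtime features named above can be sketched in plain Python. This is a minimal illustration, not the code used by llama.cpp or transformers: it assumes the usual Mixtral-style recipe (softmax over the router logits, keep the top 2, renormalize their weights), and the function names, toy experts, and input values are all hypothetical.

```python
import math
from collections import defaultdict

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)  # renormalize so the two weights sum to 1
    return [(i, probs[i] / norm) for i in top]

def moe_forward(tokens, router_logits, experts):
    """tokens: list of vectors; router_logits: one list of logits per token;
    experts: list of callables mapping a vector to a vector."""
    # Dispatch: group (token position, weight) pairs by the expert serving them.
    assignments = defaultdict(list)
    for pos, logits in enumerate(router_logits):
        for expert_idx, weight in top2_route(logits):
            assignments[expert_idx].append((pos, weight))

    # Combine: run each expert on its tokens and add the weighted result back.
    outputs = [[0.0] * len(tok) for tok in tokens]
    for expert_idx, items in assignments.items():
        for pos, weight in items:
            y = experts[expert_idx](tokens[pos])
            for d, value in enumerate(y):
                outputs[pos][d] += weight * value
    return outputs

# Toy experts (hypothetical): expert k scales its input by k + 1.
# The real experts are SwiGLU feed-forward networks.
experts = [lambda v, k=k: [(k + 1) * x for x in v] for k in range(4)]
tokens = [[1.0, 2.0], [0.5, -1.0]]
router_logits = [[0.0, 2.0, 1.0, -1.0], [3.0, 0.0, 0.0, 3.0]]

routes = [top2_route(logits) for logits in router_logits]
mixed = moe_forward(tokens, router_logits, experts)
print(routes)
print(mixed)
```

Comparing an engine's per-token expert indices and weights against a reference like `top2_route` is usually the quickest way to tell a routing bug from an expert-FFN bug.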


📊 Comparison: tinymoe2m vs Other 1M Variants

The table below outlines the core structural layouts across the 1M/2M verification suite, to help track feature coverage:

| Feature / Metric | tiny1m (Standard) | tinybpe1m (BPE Variant) | tinygemma1m (Gemma 2 Variant) | tinymoe2m (This Repository) |
|---|---|---|---|---|
| Base Architecture | Llama 2 | Llama 2 | Gemma 2 | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) |
| Total Experts | 1 (Non-MoE) | 1 (Non-MoE) | 1 (Non-MoE) | 4 Experts |
| Selected Experts | - | - | - | Top-2 Experts |
| Expert FFN Dim (intermediate_size) | 564 | 352 | 352 | 352 (Shared across all experts) |
| Total Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.95M (1.95M Total) |
| Active Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.14M (1.14M Active) |
| Primary Debug Target | Core matrix mult & layout | byte_fallback decode | Gemma 2 advanced graph | Dynamic Routing & Scatter/Gather |

💡 Compute Cost vs Capacity Optimization

With approximately 1.95M total parameters, this model has roughly twice the capacity of the standard 1M dense variants, which lets it keep a stable command of grammar and coherent phrasing from the TinyStories corpus. Because only the top-2 of the 4 experts run for each token, the active parameter count is capped at ~1.14M. The layout reproduces the basic benefit of MoE architectures: total capacity grows by about 2x, while per-token floating-point operations (FLOPs) rise to only about 1.1x–1.2x those of a 1M dense counterpart.


📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory ./)

Binary files for execution with llama.cpp or compatible lower-level inference engines. Upstream parsers automatically recognize the architecture as a Mixtral-style MoE model.
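When a custom parser rejects one of these files, checking the fixed-size header first is a cheap sanity test. The sketch below assumes the GGUF v2/v3 header layout (4-byte magic, little-endian uint32 version, then uint64 tensor and metadata counts). The function name `read_gguf_header` is hypothetical, and the sample bytes are synthetic with made-up counts.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (v2/v3 layout, little-endian)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration only; these counts do not describe
# any file in this repository.
sample = b"GGUF" + struct.pack("<IQQ", 3, 10, 20)
header = read_gguf_header(sample)
print(header)
```

For a real file, pass the first 24 bytes, for example `open("tinymoe2m.F32.gguf", "rb").read(24)`.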

| Filename | Type | Size | Target / Validation Focus |
|---|---|---|---|
| tinymoe2m.F32.gguf | F32 | ~8.0 MB | Baseline Test. Eliminates quantization noise to isolate and verify the raw probability mathematics of the Gating network and expert tensor synthesis. |
| tinymoe2m.F16.gguf, tinymoe2m.BF16.gguf | F16, BF16 | ~4.0 MB | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and stability under parallelized accumulation layers. |
| tinymoe2m.Q8_0.gguf | Q8_0 | ~2.2 MB | Standard Quantization. Verifies block-based uniform scaling (32-element blocks) across decentralized MoE structures. |
| tinymoe2m.Q4_0.gguf, tinymoe2m.Q4_1.gguf | Q4_0, Q4_1 | ~1.4 MB | Classic Quantization. Tests 4-bit linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| tinymoe2m.Q2_K.gguf | Q2_K | ~1.1 MB | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| tinymoe2m.Q3_K_M.gguf | Q3_K_M | ~1.2 MB | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| tinymoe2m.Q4_K_M.gguf | Q4_K_M | ~1.4 MB | Standard K-Quant (4-bit). The baseline testing target for modern 4-bit super-block logic coupled with MoE paths. |
| tinymoe2m.Q5_K_M.gguf | Q5_K_M | ~1.5 MB | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| tinymoe2m.Q6_K.gguf | Q6_K | ~1.7 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
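The "block-based uniform scaling (32-element blocks)" checked by the Q8_0 file can be illustrated with a small round trip. This is a simplified sketch, not ggml's implementation: it keeps the per-block scale as a Python float (ggml stores it as a 16-bit float), and it uses Python's `round`, whose tie-breaking differs from C's `roundf`. The function names are hypothetical.

```python
import math

BLOCK = 32  # Q8_0 groups weights into blocks of 32 values

def quantize_q8_0(values):
    """Return a list of (scale, int8 values) blocks; len(values) must be a multiple of 32."""
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0                     # one scale per block
        inv = 1.0 / scale if scale else 0.0      # all-zero block: quantize to zeros
        quants = [max(-127, min(127, round(v * inv))) for v in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Rebuild approximate float values: value = scale * quant."""
    return [scale * q for scale, quants in blocks for q in quants]

# Demo data (made up): 64 values, so two blocks with different scales.
values = [math.sin(0.37 * i) * (1 + i / 64) for i in range(64)]
blocks = quantize_q8_0(values)
restored = dequantize_q8_0(blocks)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(len(blocks), max_error)
```

Under these assumptions the reconstruction error of each value is at most half of its block's scale, which gives a tolerance to use when comparing an engine's Q8_0 dequantization against the F32 baseline.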

2. Hugging Face Native Format (./hf/)

Unquantized files that load directly with the Hugging Face transformers library (PyTorch):

  • hf/model.safetensors: Raw unquantized weights, containing all 4 expert sub-networks alongside the router tensor.
  • hf/config.json: Architecture settings following MixtralConfig (layer depth, head counts, number of experts, and top-k selection).
  • hf/generation_config.json: Standard generation defaults.
  • hf/tokenizer.model: The custom SentencePiece BPE model (vocabulary size 512).
  • hf/tokenizer_config.json: Metadata that selects LlamaTokenizer so that prefix spacing and automatic <s> (BOS) insertion are handled correctly on the Hugging Face backend.
  • hf/special_tokens_map.json: Maps the special token strings (<s> = 1, </s> = 2) to their token IDs.

🚀 Usage Examples

A. Running GGUF via llama.cpp

To run the MoE execution graph and exercise dynamic expert routing from the shell:

./llama-cli -m tinymoe2m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0

B. Loading Hugging Face Formats via Python

Because the configuration matches the custom vocabulary, the standard Auto* loaders work as-is, with no custom wrapper code.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinymoe2m"

print("Loading MoE configuration and tokenizer layers...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Running inference loop (Validating Top-2 sparse routing matrices)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_length=64, 
        do_sample=False
    )
    
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

Model Specifications

  • Architecture: Mixtral (MixtralForCausalLM)
  • Dataset: TinyStories
  • Total Parameters (num_local_experts = 4): ~1.95M
  • Active Parameters (num_experts_per_tok = 2): ~1.14M
  • Vocabulary Size (vocab_size): 512 (Custom SentencePiece BPE with byte_fallback enabled)
  • Hidden Size (hidden_size): 128
  • Number of Hidden Layers (num_hidden_layers): 3
  • Number of Attention Heads (num_attention_heads / num_key_value_heads): 2 / 2 (MHA layout)
  • Individual Expert Internal Dimension (intermediate_size): 352 (SwiGLU structure)
  • Max Position Embeddings (max_position_embeddings): 256
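The total and active parameter figures can be reproduced from the values listed above. The sketch below is a back-of-the-envelope count under assumptions that the card does not state: bias-free linear layers, RMSNorm weights only, and an output head that is not tied to the input embedding. Under those assumptions the result lines up with the ~1.95M and ~1.14M figures. The function name is hypothetical.

```python
def mixtral_param_counts(vocab, hidden, layers, inter, experts, top_k):
    """Return (total, active) parameter counts for a Mixtral-style decoder."""
    embed = vocab * hidden            # token embedding
    lm_head = vocab * hidden          # output head (assumed untied)
    final_norm = hidden               # final RMSNorm weight
    attn = 4 * hidden * hidden        # q, k, v, o projections (MHA: kv heads == heads)
    expert = 3 * hidden * inter       # SwiGLU expert: gate, up, down
    router = hidden * experts         # gating network
    norms = 2 * hidden                # two RMSNorm weights per layer

    def count(experts_used):
        per_layer = attn + experts_used * expert + router + norms
        return embed + lm_head + final_norm + layers * per_layer

    return count(experts), count(top_k)

total, active = mixtral_param_counts(
    vocab=512, hidden=128, layers=3, inter=352, experts=4, top_k=2
)
print(f"total={total:,} active={active:,}")  # total=1,952,128 active=1,141,120
```

The gap between the two numbers is exactly the two unselected experts in each of the 3 layers: 3 × 2 × 135,168 = 811,008 parameters.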

📜 License

  • License: MIT License. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings.