Instructions to use shibatch/tinymoe2m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use shibatch/tinymoe2m with Transformers:

# Load model directly (the Hugging Face files live in the hf/ subfolder)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("shibatch/tinymoe2m", subfolder="hf", device_map="auto")

llama-cpp-python

How to use shibatch/tinymoe2m with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="shibatch/tinymoe2m",
	filename="tinymoe2m.BF16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps

llama.cpp

How to use shibatch/tinymoe2m with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tinymoe2m:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tinymoe2m:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Use Docker

docker model run hf.co/shibatch/tinymoe2m:Q4_K_M

Ollama
How to use shibatch/tinymoe2m with Ollama:
```
ollama run hf.co/shibatch/tinymoe2m:Q4_K_M
```

Unsloth Studio

How to use shibatch/tinymoe2m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoe2m to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoe2m to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for shibatch/tinymoe2m to start chatting

Docker Model Runner
How to use shibatch/tinymoe2m with Docker Model Runner:
```
docker model run hf.co/shibatch/tinymoe2m:Q4_K_M
```

Lemonade

How to use shibatch/tinymoe2m with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull shibatch/tinymoe2m:Q4_K_M

Run and chat with the model

lemonade run user.tinymoe2m-Q4_K_M

List all available models

lemonade list

TinyStories Mixtral 2M Top-2 MoE (tinymoe2m) GGUF & HF Validation Suite (4k Context)

This repository provides an ultra-lightweight Mixtral variant (a Mixture-of-Experts model built on the Llama 2 architecture) scaled down to about 1.95M total parameters, of which about 1.14M are active per token. It is trained on the TinyStories dataset and intended as a precise validation asset.

The model is configured for a 4,096-token context window (4k) with an adjusted RoPE base frequency (rope_theta) of 15,000.0, chosen to keep attention sharply localized at this context length.
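As a rough illustration of what the rope_theta setting controls, the sketch below computes the standard RoPE inverse frequencies and their wavelengths. The head dimension of 64 is an assumption derived from hidden_size 128 divided by 2 attention heads, and the helper names are hypothetical; a real runtime may organize this differently.

```python
import math

ROPE_THETA = 15000.0   # rope_theta from the model card
HEAD_DIM = 128 // 2    # assumption: hidden_size / number of attention heads
MAX_POSITIONS = 4096   # max_position_embeddings from the model card


def rope_inv_freq(head_dim, theta):
    """Inverse frequency of each rotated pair: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angles(position, inv_freq):
    """Rotation angle (radians) applied to each pair at a given position."""
    return [position * f for f in inv_freq]


inv_freq = rope_inv_freq(HEAD_DIM, ROPE_THETA)
wavelengths = [2 * math.pi / f for f in inv_freq]

print(f"pairs per head      : {len(inv_freq)}")
print(f"shortest wavelength : {wavelengths[0]:.2f} positions")
print(f"longest wavelength  : {wavelengths[-1]:.0f} positions")
print(f"longest exceeds the window: {wavelengths[-1] > MAX_POSITIONS}")
```

A custom engine can compare its own per-pair angles against `rope_angles` at a few positions before moving on to full logit comparison.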

It is designed specifically for debugging custom inference engines and native tensor compilers against MoE-specific runtime features: gating-network weight computation, token dispatch and collection (scatter/gather loops), and the weighted sum that combines the outputs of the selected experts.
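The three routing steps named above can be sketched for a single token in plain Python. This is a minimal illustration, not the reference implementation: the function names and the toy experts are hypothetical, and the order used here (softmax over all experts, keep the top k, renormalize) is an assumption that is mathematically equivalent to applying softmax to the selected logits only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Gating: return [(expert_index, weight), ...] for one token."""
    probs = softmax(router_logits)
    # Keep the k most probable experts; ties go to the lower index.
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_combine(token, router_logits, experts, k=2):
    """Dispatch the token to its experts and sum their weighted outputs."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy stand-ins for the expert FFNs: expert e scales its input by (e + 1).
experts = [lambda x, s=e + 1: [s * v for v in x] for e in range(4)]

logits = [0.0, 2.0, 1.0, -1.0]
routes = top_k_route(logits)
mixed = moe_combine([1.0, -1.0], logits, experts)
print("selected experts and weights:", routes)
print("combined output:", mixed)
```

A batched engine does the same work per token but groups tokens by expert first (scatter), runs each expert once on its group, and writes the weighted results back to the original token positions (gather).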

📊 Comparison: `tinymoe2m` vs Other 1M Variants

To help track feature coverage across the 1M/2M verification suite, the core structural layouts are outlined below:

| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinygemma1m` (Gemma 2 Variant) | `tinymoe2m` (This Repository) |
| --- | --- | --- | --- | --- |
| Base Architecture | Llama 2 | Llama 2 | Gemma 2 | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) |
| Total Experts | 1 (Non-MoE) | 1 (Non-MoE) | 1 (Non-MoE) | 4 Experts |
| Selected Experts | - | - | - | Top-2 Experts |
| Expert FFN Dim (`intermediate_size`) | 564 | 352 | 352 | 352 (Shared across all experts) |
| Max Position Embeddings | - | - | - | 4,096 |
| RoPE Base (`rope_theta`) | - | - | - | 15,000.0 |
| Total Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.95M (1.95M Total) |
| Active Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.14M (1.14M Active) |
| Primary Debug Target | Core matrix mult & layout | `byte_fallback` decode | Gemma 2 advanced graph | Dynamic Routing & Scatter/Gather |

💡 Compute Cost vs Capacity Optimization

With approximately 1.95M total parameters, this model has roughly twice the capacity of the standard 1M dense variants, which helps it keep a stable command of grammar and coherent phrasing from the TinyStories corpus. Because only the top-2 experts fire per token, the active parameter count is capped at ~1.14M. This reproduces the basic benefit of MoE architectures at small scale: total capacity grows by about 2x while the added floating-point operations (FLOPs) stay within roughly 1.1x–1.2x of a 1M dense counterpart.
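The total and active figures can be reproduced from the values listed under Model Specifications. The sketch below is a back-of-the-envelope count under stated assumptions: input embeddings and the output head are untied, no layer has bias terms, each RMSNorm holds one weight vector, and each expert is a three-matrix SwiGLU block. The function name is hypothetical.

```python
HIDDEN = 128         # hidden_size
LAYERS = 3           # num_hidden_layers
VOCAB = 512          # vocab_size
INTERMEDIATE = 352   # intermediate_size of each expert
EXPERTS = 4          # num_local_experts
TOP_K = 2            # num_experts_per_tok


def mixtral_param_counts(hidden, layers, vocab, inter, experts, top_k):
    """Return (total, active_per_token) parameter counts."""
    attention = 4 * hidden * hidden   # q, k, v, o projections (MHA, no bias)
    router = hidden * experts         # gating matrix
    expert = 3 * hidden * inter       # gate, up, and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    shared = (
        layers * (attention + router + norms)
        + 2 * vocab * hidden          # assumption: untied embedding and lm_head
        + hidden                      # final RMSNorm
    )
    total = shared + layers * experts * expert
    active = shared + layers * top_k * expert
    return total, active


total, active = mixtral_param_counts(
    HIDDEN, LAYERS, VOCAB, INTERMEDIATE, EXPERTS, TOP_K
)
print(f"total  parameters: {total:,}")    # 1,952,128
print(f"active parameters: {active:,}")   # 1,141,120
print(f"active / total   : {active / total:.2f}")
```

Under these assumptions the counts land on the stated ~1.95M and ~1.14M, and the expert matrices account for most of the gap between them.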

📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory `./`)

Binary files for execution via llama.cpp or compatible lower-level inference engines. Upstream parsers should automatically recognize the files as a Mixtral-style MoE model.

| Filename | Type | Size | Target / Validation Focus |
| --- | --- | --- | --- |
| `tinymoe2m.F32.gguf` | `F32` | ~8.0 MB | Baseline Test. Eliminates quantization noise to isolate and verify the raw probability mathematics of the Gating network and expert tensor synthesis. |
| `tinymoe2m.F16.gguf`, `tinymoe2m.BF16.gguf` | `F16`, `BF16` | ~4.0 MB | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and stability under parallelized accumulation layers. |
| `tinymoe2m.Q8_0.gguf` | `Q8_0` | ~2.2 MB | Standard Quantization. Verifies block-based uniform scaling (32-element blocks) across decentralized MoE structures. |
| `tinymoe2m.Q4_0.gguf`, `tinymoe2m.Q4_1.gguf` | `Q4_0`, `Q4_1` | ~1.4 MB | Classic Quantization. Tests 4-bit linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| `tinymoe2m.Q2_K.gguf` | `Q2_K` | ~1.1 MB | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| `tinymoe2m.Q3_K_M.gguf` | `Q3_K_M` | ~1.2 MB | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| `tinymoe2m.Q4_K_M.gguf` | `Q4_K_M` | ~1.4 MB | Standard K-Quant (4-bit). The baseline testing target for modern 4-bit super-block logic coupled with MoE paths. |
| `tinymoe2m.Q5_K_M.gguf` | `Q5_K_M` | ~1.5 MB | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| `tinymoe2m.Q6_K.gguf` | `Q6_K` | ~1.7 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
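To make the "32-element blocks" of the `Q8_0` row concrete, here is a simplified round trip for one block. It assumes the commonly described layout of one 16-bit float scale plus 32 signed 8-bit values per block; the function names are hypothetical, and Python's `round` breaks ties differently from C's `roundf`, so this is a sketch for intuition rather than a bit-exact reference.

```python
import struct

BLOCK = 32  # elements per Q8_0 block


def to_f16(x):
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(values):
    """Quantize one block to (scale, 32 int8-range integers)."""
    if len(values) != BLOCK:
        raise ValueError("Q8_0 works on blocks of 32 values")
    amax = max(abs(v) for v in values)
    d = amax / 127.0                   # one uniform scale per block
    inv = 1.0 / d if d else 0.0
    qs = [round(v * inv) for v in values]
    return to_f16(d), qs               # the scale is stored as fp16


def dequantize_q8_0(d, qs):
    """Reconstruct approximate floats: x = d * q."""
    return [d * q for q in qs]


block = [(i - 16) / 16 for i in range(BLOCK)]   # -1.0 ... 0.9375
scale, quants = quantize_q8_0(block)
restored = dequantize_q8_0(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}  max abs error={max_err:.6f}")
```

With this layout each block occupies 2 + 32 = 34 bytes, or 8.5 bits per weight.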

2. Hugging Face Native Format (`./hf/`)

Unquantized files for loading directly with the Hugging Face transformers library (PyTorch):

hf/model.safetensors: Unquantized weights for all 4 expert sub-networks plus the router (gating) weights.
hf/config.json: Architecture settings in MixtralConfig form (layer depth, attention heads, expert count, and top-k selection). Sets max_position_embeddings: 4096 and rope_theta: 15000.0.
hf/generation_config.json: Standard generation defaults.
hf/tokenizer.model: The custom SentencePiece BPE model with a 512-token vocabulary.
hf/tokenizer_config.json: Tokenizer metadata that selects LlamaTokenizer so that prefix spacing and automatic <s> (BOS) insertion are handled correctly on the Hugging Face backend. Sets model_max_length: 4096.
hf/special_tokens_map.json: Maps the special token strings (<s>=1, </s>=2) to their token IDs.

🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is specifically engineered as a deterministic validation test asset for computing platforms and is not designed for long-context semantic extraction tasks (such as Needle-in-a-Haystack password retrieval).

Because of the extreme capacity limit (~1.95M total parameters) and the ultra-compact vocabulary (512 tokens), the network spends its expressiveness almost entirely on English syntax and high-frequency phrases. It lacks the multi-layer induction (copy) circuits needed to retrieve strings or individual characters injected out of context across a large window.

Expected Token Output Behavior

When given a template prompt containing a temporary password identifier, such as: "The magic password of the giant was key X. I remember that the magic password of the giant was"

The network will not copy the literal character X. Instead it continues with learned, frequency-biased text such as "about to go home. Every day...". This is expected behavior. Validation relies strictly on bit-exact logit verification across runtime backends, which confirms matching compute kernels, KV cache indexing, causal attention, and RoPE phase calculation.
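A bit-exact check of this kind can be sketched as follows. The helper compares two logit vectors by their float32 bit patterns and also reports the largest absolute difference and whether the greedy token agrees. The function names and the returned dictionary are hypothetical, and how each backend exposes its logits is left out.

```python
import struct


def f32_bits(x):
    """Bit pattern of x after rounding to IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def compare_logits(ref, test):
    """Compare reference and test logits for a single position."""
    if len(ref) != len(test):
        raise ValueError("logit vectors differ in length")
    mismatches = [
        i for i, (a, b) in enumerate(zip(ref, test))
        if f32_bits(a) != f32_bits(b)
    ]
    return {
        "bit_exact": not mismatches,
        "first_mismatch": mismatches[0] if mismatches else None,
        "max_abs_diff": max((abs(a - b) for a, b in zip(ref, test)), default=0.0),
        # Greedy decoding picks the argmax, so this flags visible divergence.
        "same_argmax": max(range(len(ref)), key=ref.__getitem__)
        == max(range(len(test)), key=test.__getitem__),
    }


reference = [1.0, 2.5, -3.0]
print(compare_logits(reference, [1.0, 2.5, -3.0]))
print(compare_logits(reference, [1.0, 2.5001, -3.0]))
```

Comparing bit patterns rather than using a tolerance treats 0.0 and -0.0 as different, which is usually the desired strictness when checking that two kernels perform the same arithmetic in the same order.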

🚀 Usage Examples

A. Running GGUF via llama.cpp

To run the MoE execution graph and evaluate dynamic expert routing directly from your shell:

./llama-cli -m tinymoe2m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0

B. Loading Hugging Face Formats via Python

Because the configuration matches the custom vocabulary, you can load the model with the standard Auto* classes and do not need any custom wrapper code.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinymoe2m"

print("Loading MoE configuration and tokenizer layers...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Running inference loop (Validating Top-2 sparse routing matrices)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,    # total sequence length, prompt tokens included
        do_sample=False   # greedy decoding, so the output is reproducible
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

📝 Model Specifications

Architecture: Mixtral (MixtralForCausalLM)
Dataset: TinyStories
Total Parameters (num_local_experts = 4): ~1.95M
Active Parameters (num_experts_per_tok = 2): ~1.14M
Vocabulary Size (vocab_size): 512 (Custom SentencePiece BPE with byte_fallback enabled)
Hidden Size (hidden_size): 128
Number of Hidden Layers (num_hidden_layers): 3
Number of Attention Heads (num_attention_heads / num_key_value_heads): 2 / 2 (MHA layout)
Individual Expert Internal Dimension (intermediate_size): 352 (SwiGLU structure)
Max Position Embeddings (max_position_embeddings): 4,096
RoPE Base Frequency (rope_theta): 15,000.0

📜 License

License: MIT License. You are completely free to duplicate, modify, distribute, and utilize these assets across any commercial, personal, or educational environments.
