TemporalMesh Transformer (TMT v3)

Author: Vigneshwar LK
Paper: DOI 10.5281/zenodo.20287197
Code: github.com/vignesh2027/TemporalMesh-Transformer
Live Demo: HuggingFace Space
Benchmarks: TMT-Benchmarks Dataset


What is TMT?

TMT is a novel PyTorch transformer architecture that addresses five inefficiencies of standard transformers in a single model:

| Problem | Standard Transformer | TMT Solution |
|---|---|---|
| Quadratic attention cost | $O(S^2)$ per layer | Mesh Attention: $O(S \cdot k)$ dynamic $k$NN graph |
| Static attention topology | Fixed, fully connected | Dynamic graph rebuilt per layer from cosine similarity |
| Uniform token compute | All tokens use all $N$ layers | Adaptive Depth Routing: exit gate per token, avg 5.8/12 layers |
| Flat positional encoding | Position only | Temporal Decay: learned multiplicative semantic attenuation |
| No cross-sequence memory | Stateless | EMA Memory Anchors: 16 persistent fast-weight vectors |
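
To put the $O(S^2)$ versus $O(S \cdot k)$ comparison in numbers, the sketch below counts query-key score evaluations per layer for dense attention and for a mesh where each token attends to $k$ neighbors. The function names are illustrative and not part of the TMT codebase, and the count deliberately ignores the cost of building the $k$NN graph itself, which this README does not quantify.

```python
def dense_pairs(seq_len: int) -> int:
    """Query-key scores in fully connected attention: S * S."""
    return seq_len * seq_len


def mesh_pairs(seq_len: int, k: int) -> int:
    """Query-key scores when each token attends to at most k neighbors: S * k."""
    return seq_len * min(k, seq_len)


# Sequence length and graph_k taken from the Quick Start example.
S, k = 256, 8
print(dense_pairs(S))                       # 65536
print(mesh_pairs(S, k))                     # 2048
print(dense_pairs(S) // mesh_pairs(S, k))   # 32
```

At these settings the attention-score count drops by a factor of $S / k = 32$; the gap widens linearly as the sequence grows while $k$ stays fixed.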

Results

| Model | WikiText-2 PPL ↓ | WikiText-103 PPL ↓ | LongBench ↑ | Compute |
|---|---|---|---|---|
| Vanilla Transformer | 42.1 | 51.3 | 41.2 | 100% |
| Longformer | 39.6 | 47.2 | 49.8 | 62% |
| Mamba | 31.8 | 38.4 | 51.3 | 55% |
| RWKV | 33.1 | 40.9 | 48.7 | 50% |
| Full TMT | 29.4 | 36.1 | 53.4 | 48% |

All models: ~120M parameters. TMT trained for 10K steps on WikiText-2 (AdamW, cosine LR, seeds 42/1337/2024).


Architecture at a Glance

Input → Token Embedding + RoPE
      → [× 12 layers]
           MeshBuilder (kNN graph, cosine sim, top-k=8)
           Mesh Attention  O(S·k)  + Temporal Decay Encoding
           EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
           Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
           Exit Gate  σ(W_gate · x) > 0.85 → token frozen
      → LayerNorm → Tied Output Projection
      → Logits (B, S, V)
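
The first two steps of each layer (build a kNN graph from cosine similarity, then attend only along its edges) can be illustrated with a small pure-Python sketch. This is not the repository's MeshBuilder: the function names are hypothetical, queries, keys and values are the raw token vectors with no learned projections, causal masking is omitted, and the similarity search here is a naive all-pairs loop.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_mesh(x, k):
    """For each token, return the indices of its k most cosine-similar tokens.

    Assumption: a token may select itself, and no causal mask is applied.
    """
    neighbors = []
    for q in x:
        sims = [cosine(q, other) for other in x]
        order = sorted(range(len(x)), key=lambda j: sims[j], reverse=True)
        neighbors.append(order[:k])
    return neighbors


def mesh_attention(x, neighbors):
    """Single-head scaled dot-product attention restricted to graph neighbors."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in neighbors[i]]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [0.0] * d
        for w, j in zip(weights, neighbors[i]):
            for c in range(d):
                mixed[c] += (w / total) * x[j][c]
        out.append(mixed)
    return out


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mesh = build_mesh(tokens, k=2)
print(mesh)  # [[0, 1], [1, 0], [2, 3], [3, 2]]
out = mesh_attention(tokens, mesh)
print(out[0])  # a weighted mix of tokens 0 and 1 only
```

Each token mixes information from only its k neighbors, so the attention step costs S·k score evaluations. Rebuilding the graph in every layer lets the neighborhoods change as the hidden states evolve.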

Output fields (TMTOutput dataclass):

  • logits: (B, S, V) next-token predictions
  • exit_masks: list of (B, S) booleans, one per layer
  • confidences: gate confidence per token per layer
  • graph_edges: sparse kNN edge list from final layer
  • memory_state: (M, D) final EMA anchor states
  • decay_scalars: temporal decay weights applied
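
To show where exit_masks, confidences and memory_state could come from, here is a minimal sketch of the exit gate σ(W_gate · x) > 0.85 and of an exponential moving average update with β = 0.99. It rests on several assumptions not confirmed by this README: the gate uses a single weight vector with no bias, and the anchor is updated toward a generic "summary" vector whose definition is left open. All names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def exit_gate(x, w_gate, threshold=0.85):
    """Return (confidences, exit_mask) for one layer.

    x: list of token vectors; w_gate: one weight vector (assumed, no bias).
    A token whose confidence exceeds the threshold is marked as exited.
    """
    confidences = [sigmoid(sum(w * xi for w, xi in zip(w_gate, tok))) for tok in x]
    exit_mask = [c > threshold for c in confidences]
    return confidences, exit_mask


def ema_update(anchor, summary, beta=0.99):
    """EMA step: anchor <- beta * anchor + (1 - beta) * summary."""
    return [beta * a + (1.0 - beta) * s for a, s in zip(anchor, summary)]


x = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
conf, mask = exit_gate(x, w_gate=[1.0, 0.0])
print([round(c, 3) for c in conf])   # [0.881, 0.5, 0.119]
print(mask)                          # [True, False, False]

# Hypothetical summary vector; the real update target is not specified here.
anchor = ema_update([0.0, 0.0], summary=[1.0, 2.0])
print([round(a, 4) for a in anchor])  # [0.01, 0.02]
```

With β = 0.99 each update moves an anchor only 1% of the way toward the new summary, so anchor values change slowly.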

Quick Start

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
    vocab_size=50257,
    d_model=512,
    n_heads=8,
    n_layers=12,
    graph_k=8,
    exit_threshold=0.85,
    memory_anchors=16,
)
model = TMTModel(config)  # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

print(out.logits.shape)      # (1, 256, 50257)
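print(out.exit_masks[-1])    # (B, S) mask of tokens exited at layer 12

# Average exit layer: for each token, the first layer whose mask is True.
# Assumes each mask is a (B, S) boolean tensor; tokens that never exit
# are counted as using all layers.
n_layers = len(out.exit_masks)
masks = torch.stack([m.bool() for m in out.exit_masks])   # (n_layers, B, S)
first_exit = masks.float().argmax(dim=0) + 1              # 1-based index of first True
exit_layer = torch.where(masks.any(dim=0), first_exit,
                         torch.full_like(first_exit, n_layers))
avg_exit = exit_layer.float().mean().item()
print(f"Avg exit layer: {avg_exit:.2f}")  # reported for the trained model: ~5.8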

Training

python scripts/train.py \
  --dataset wikitext-2 \
  --model_size base \
  --steps 10000 \
  --lr 3e-4 \
  --batch_size 16 \
  --seq_len 256 \
  --exit_threshold 0.85 \
  --graph_k 8
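
The Results section mentions a cosine learning-rate schedule. As a rough illustration of what that means for the flags above (peak rate 3e-4 over 10,000 steps), the sketch below computes a plain cosine decay. Whether scripts/train.py uses warmup or a nonzero floor is not stated here, so the no-warmup, decay-to-zero form and the function name cosine_lr are assumptions.

```python
import math


def cosine_lr(step, total_steps=10_000, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup phase; steps outside [0, total_steps] are clamped.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000):
    print(step, f"{cosine_lr(step):.2e}")
# 0 3.00e-04
# 5000 1.50e-04
# 10000 0.00e+00
```

In PyTorch the same shape is available through torch.optim.lr_scheduler.CosineAnnealingLR, which would be the more usual choice inside a training loop.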

Ablation Summary

| Config | PPL ↓ | Compute | VRAM |
|---|---|---|---|
| Vanilla Transformer | 42.1 | 100% | 18.4 GB |
| + Mesh Attention only | 37.8 | 62% | 11.2 GB |
| + Temporal Decay only | 40.3 | 98% | 18.4 GB |
| + Adaptive Exit only | 39.6 | 51% | 18.4 GB |
| Mesh + Decay | 34.2 | 61% | 11.2 GB |
| Mesh + Exit | 35.1 | 50% | 11.2 GB |
| Full TMT | 29.4 | 48% | 11.2 GB |

The full combination yields superadditive gains: Full TMT's perplexity improvement over the vanilla baseline exceeds the sum of the three single-component improvements by 4.1 PPL.
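
The 4.1 PPL interaction effect can be checked directly from the ablation numbers. The snippet below only redoes the arithmetic on the reported perplexities; the variable names are illustrative.

```python
baseline = 42.1  # Vanilla Transformer PPL
single = {"mesh": 37.8, "decay": 40.3, "exit": 39.6}  # one component added at a time
full = 29.4      # Full TMT PPL

# Perplexity reduction from each component on its own.
individual_gains = {name: baseline - ppl for name, ppl in single.items()}
sum_individual = sum(individual_gains.values())
full_gain = baseline - full
interaction = full_gain - sum_individual

print(round(sum_individual, 1))  # 8.6
print(round(full_gain, 1))       # 12.7
print(round(interaction, 1))     # 4.1
```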


Citation

@misc{vigneshwar2026tmt,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
  author  = {Vigneshwar LK},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}

License

MIT License · © 2026 Vigneshwar LK
