How to use vigneshwar234/TemporalMesh-Transformer with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("vigneshwar234/TemporalMesh-Transformer", dtype="auto")

Author: Vigneshwar LK
Paper: DOI 10.5281/zenodo.20287197
Code: github.com/vignesh2027/TemporalMesh-Transformer
Live Demo: HuggingFace Space
Benchmarks: TMT-Benchmarks Dataset
TMT is a PyTorch transformer architecture that addresses five inefficiencies of standard transformers at once:
| Problem | Standard Transformer | TMT Solution |
|---|---|---|
| Quadratic attention cost | $O(S^2)$ per layer | Mesh Attention: $O(S \cdot k)$ dynamic $k$NN graph |
| Static attention topology | Fixed fully-connected | Dynamic graph rebuilt per-layer from cosine similarity |
| Uniform token compute | All tokens use all $N$ layers | Adaptive Depth Routing: exit gate per token, avg 5.8/12 layers |
| Flat positional encoding | Position only | Temporal Decay: learned multiplicative semantic attenuation |
| No cross-sequence memory | Stateless | EMA Memory Anchors: 16 persistent fast-weight vectors |
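
The Mesh Attention row can be illustrated with a minimal sketch in plain Python (no PyTorch needed to run it). It builds a kNN graph from cosine similarity and lets each token attend only to its k neighbors. The function names `cosine`, `knn_graph`, and `mesh_attention` are hypothetical, as are the choices to exclude self-edges and to skip learned query/key/value projections; none of this is taken from the repository. The graph construction here is a brute-force all-pairs scan, so only the attention step has the $O(S \cdot k)$ cost.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(x, k):
    """For each token i, return the indices of its k most similar tokens.

    Assumption: self-edges are excluded; ties are broken by lower index.
    """
    neighbors = []
    for i, xi in enumerate(x):
        sims = [(cosine(xi, xj), j) for j, xj in enumerate(x) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append([j for _, j in sims[:k]])
    return neighbors

def mesh_attention(x, k):
    """Scaled dot-product attention restricted to each token's k graph neighbors."""
    d = len(x[0])
    graph = knn_graph(x, k)
    out = []
    for i, nbrs in enumerate(graph):
        # Only k scores per token are computed, so S * k scores in total.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * x[j][c] for w, j in zip(weights, nbrs)) for c in range(d)])
    return out, graph

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
out, graph = mesh_attention(tokens, k=1)
print(graph)
```

For the configuration used later in this card (S = 256, k = 8), the attention step scores 256 × 8 = 2,048 token pairs per head instead of 256 × 256 = 65,536.
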
| Model | WikiText-2 PPL ↓ | WikiText-103 PPL ↓ | LongBench ↑ | Compute |
|---|---|---|---|---|
| Vanilla Transformer | 42.1 | 51.3 | 41.2 | 100% |
| Longformer | 39.6 | 47.2 | 49.8 | 62% |
| Mamba | 31.8 | 38.4 | 51.3 | 55% |
| RWKV | 33.1 | 40.9 | 48.7 | 50% |
| Full TMT | 29.4 | 36.1 | 53.4 | 48% |
All models: ~120M parameters. TMT trained for 10K steps on WikiText-2 (AdamW, cosine LR, seeds 42/1337/2024).
Input → Token Embedding + RoPE
  → [× 12 layers]
      MeshBuilder (kNN graph, cosine sim, top-k=8)
      Mesh Attention O(S·k) + Temporal Decay Encoding
      EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
      Dual-Stream FFN (syntax stream and semantic stream, sigmoid gate)
      Exit Gate σ(W_gate · x) > 0.85 → token frozen
  → LayerNorm → Tied Output Projection
  → Logits (B, S, V)
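
The EMA memory anchors in the diagram keep state across sequences with a decay of 0.99. The card does not say what each anchor is averaged toward, so the sketch below only shows the exponential-moving-average step itself. The function name `ema_update` and the per-anchor `targets` are hypothetical stand-ins, not the repository's API.

```python
def ema_update(anchors, targets, beta=0.99):
    """Move every anchor a fraction (1 - beta) of the way toward its target.

    anchors, targets: lists of M vectors of length D (the (M, D) memory state).
    What the targets are (for example, pooled hidden states) is an assumption.
    """
    return [
        [beta * a + (1.0 - beta) * t for a, t in zip(anchor, target)]
        for anchor, target in zip(anchors, targets)
    ]

anchors = [[0.0, 0.0], [1.0, 1.0]]   # M=2, D=2 for illustration (the card uses M=16)
targets = [[1.0, 2.0], [1.0, 1.0]]
for _ in range(100):
    anchors = ema_update(anchors, targets)
print(anchors[0])
```

With a decay of 0.99, an anchor covers 1 − 0.99^100 ≈ 63% of the distance to a fixed target after 100 updates, so the memory changes slowly relative to a single sequence.
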
Output fields (TMTOutput dataclass):
- `logits`: (B, S, V) next-token predictions
- `exit_masks`: list of (B, S) booleans, one per layer
- `confidences`: gate confidence per token per layer
- `graph_edges`: sparse kNN edge list from final layer
- `memory_state`: (M, D) final EMA anchor states
- `decay_scalars`: temporal decay weights applied

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch
config = TMTConfig(
vocab_size=50257,
d_model=512,
n_heads=8,
n_layers=12,
graph_k=8,
exit_threshold=0.85,
memory_anchors=16,
)
model = TMTModel(config) # ~120M params
tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)
print(out.logits.shape) # (1, 256, 50257)
print(out.exit_masks[-1]) # which tokens exited at layer 12
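# Fraction of tokens flagged as exited, averaged over layers (a value in [0, 1], not a layer index)
exit_frac = sum(m.float().mean() for m in out.exit_masks) / len(out.exit_masks)
print(f"Mean exited fraction: {exit_frac:.2f}")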
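
An average exit layer such as the reported 5.8 of 12 is a per-token depth, which can be derived from the gate confidences rather than from a mean over boolean masks. The following is a minimal plain-Python sketch for a single sequence. It assumes a token is frozen at the first layer whose confidence exceeds the threshold and that a token which never crosses it uses every layer. The function `exit_layers` and the nested-list layout are hypothetical and do not mirror `TMTOutput`.

```python
def exit_layers(confidences, threshold=0.85):
    """Return the 1-based exit layer of every token.

    confidences[l][t] is the gate confidence of token t after layer l (0-based).
    A token exits at the first layer where confidence > threshold; tokens that
    never cross the threshold are assigned the full depth.
    """
    n_layers = len(confidences)
    n_tokens = len(confidences[0])
    depths = []
    for t in range(n_tokens):
        depth = n_layers
        for layer in range(n_layers):
            if confidences[layer][t] > threshold:
                depth = layer + 1
                break
        depths.append(depth)
    return depths

# Illustrative confidences for 3 layers and 3 tokens (made-up values).
conf = [
    [0.90, 0.20, 0.10],
    [0.95, 0.86, 0.30],
    [0.99, 0.90, 0.40],
]
depths = exit_layers(conf)
avg_depth = sum(depths) / len(depths)
print(depths, avg_depth)
```
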
python scripts/train.py \
--dataset wikitext-2 \
--model_size base \
--steps 10000 \
--lr 3e-4 \
--batch_size 16 \
--seq_len 256 \
--exit_threshold 0.85 \
--graph_k 8
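
The benchmark note mentions a cosine learning-rate schedule. As a rough sketch of what that means for the values in the command above (peak 3e-4 over 10,000 steps), the function below computes a plain cosine decay. It assumes no warmup and a floor of zero, neither of which the card specifies, and `cosine_lr` is a hypothetical helper rather than part of `scripts/train.py`.

```python
import math

def cosine_lr(step, total_steps=10_000, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 2_500, 5_000, 10_000):
    print(step, cosine_lr(step))
```

Under these assumptions the rate starts at 3e-4, passes through half of the peak at the midpoint, and reaches zero at the final step.
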
| Config | PPL ↓ | Compute | VRAM |
|---|---|---|---|
| Vanilla Transformer | 42.1 | 100% | 18.4 GB |
| + Mesh Attention only | 37.8 | 62% | 11.2 GB |
| + Temporal Decay only | 40.3 | 98% | 18.4 GB |
| + Adaptive Exit only | 39.6 | 51% | 18.4 GB |
| Mesh + Decay | 34.2 | 61% | 11.2 GB |
| Mesh + Exit | 35.1 | 50% | 11.2 GB |
| Full TMT | 29.4 | 48% | 11.2 GB |
The full combination is superadditive: Full TMT improves perplexity over the vanilla baseline by 4.1 points more than the sum of the three single-component gains.
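
The interaction figure can be checked directly against the ablation table. The short script below uses only the perplexities listed there; the variable names are illustrative.

```python
# Perplexities copied from the ablation table.
baseline = 42.1
single = {"mesh": 37.8, "decay": 40.3, "exit": 39.6}
full = 29.4

# Gain of each component when enabled alone.
gains = {name: baseline - ppl for name, ppl in single.items()}
additive = sum(gains.values())     # expected gain if the effects simply added up
actual = baseline - full           # gain of the full model
interaction = actual - additive    # extra gain beyond the additive expectation

print(round(additive, 1), round(actual, 1), round(interaction, 1))
```

The single-component gains are 4.3, 1.8, and 2.5 points (8.6 in total), while the full model gains 12.7 points, which leaves the stated 4.1-point interaction effect.
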
@misc{vigneshwar2026tmt,
title = {TemporalMesh Transformer: Dynamic Graph Attention with
Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
author = {Vigneshwar LK},
year = {2026},
doi = {10.5281/zenodo.20287197},
url = {https://zenodo.org/records/20287390}
}
MIT License · © 2026 Vigneshwar LK