How to use Dan44788/NSTS with Transformers:

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Dan44788/NSTS", dtype="auto")
```
NSTS: Narrative Structure in Tiny Stories
A depth-controlled series of GPT-Neo models (1M–33M non-embedding parameters) trained from scratch on TinyStories subsets filtered by Flesch-Kincaid (FK) grade level, with checkpoints saved every 100 training steps. The series is released as a research dataset for mechanistic interpretability work on the boundary between fluency and narrative coherence.
Background
TinyStories (Eldan & Li, 2023) showed that transformer models below 10M parameters can produce grammatical, prompt-consistent text on constrained synthetic corpora. The evaluation framework assessed grammar, creativity, and prompt-following — local, sentence-level properties. Narrative coherence — causal chaining between events, goal persistence, referent consistency, narrative resolution — was not independently measured.
This series revisits that question directly. Using a purpose-built evaluation instrument sensitive to narrative structure, and a depth-controlled width series with dense checkpoint coverage, we find that fluency and narrative coherence dissociate sharply: fluency scales with model size and training duration; coherence plateaus at a low ceiling that corpus enrichment, model scaling, and extended training all fail to raise.
A companion mechanistic analysis (Paper 2) asks what the models do represent internally, and what geometry the optimiser settles into when training is complete.
The Fluency/Coherence Dissociation
The core finding is a sharp dissociation between two properties that previous evaluations treated as a single dimension.
Training on sentence-shuffled stories (red) vs. intact corpus (blue). Validation loss descends at nearly identical rates. Coherence plateaus at ~1.45 from step 500 and never recovers. The gap — 0.55 points, 6× the measurement noise floor — confirms that sentence order is load-bearing for coherence acquisition, and that the evaluation instrument is responding to genuine narrative structure rather than local sentence-level statistics.
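The sentence-shuffled control can be illustrated with a minimal sketch. It assumes a naive regex sentence splitter and a fixed seed; the function name and the example story are illustrative, and the splitter actually used to build the shuffled corpus is not specified here.

```python
import random
import re

# Naive boundary: whitespace that follows ., ! or ? (an assumption, not the authors' splitter).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def shuffle_sentences(story: str, seed: int = 0) -> str:
    """Return the story with the same sentences in a random order."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(story.strip()) if s]
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    rng.shuffle(sentences)
    return " ".join(sentences)

story = "Tom found a red ball. He threw it to his dog. The dog ran away with it. Tom laughed."
shuffled = shuffle_sentences(story, seed=1)
print(shuffled)
```

The shuffle keeps every sentence intact, so local sentence-level statistics are preserved while the order of events is destroyed.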
Seven interventions targeting candidate explanations each produced null results:
- Syntactic register complexity (Cond A vs Cond B): no effect on coherence ceiling
- Vocabulary distribution manipulation: no effect
- Extended training (doubled epochs): no effect
- Ground-truth prefix at generation time: no effect
- Recognition–generation gap test: models assign higher probability to their own output than to corpus continuations
- Structured narrative injection (5.6× causal connective density): no effect
- Explicit `[CAUSE]`/`[EFFECT]` structural markers in training: no effect
The consistent null pattern across interventions that varied corpus quality, training duration, and generation mechanics points to the training objective as the site of the limitation. Cross-entropy next-token prediction recovers locally predictive properties at high yield; globally distributed narrative structure is recovered partially and at low yield, and that yield does not improve when the corpus improves.
Causal Direction Preference
Paper 2 probes whether causal structure is represented at all, using a direction score methodology: log P(cause because effect) − log P(reversed).
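A minimal sketch of the direction score, following the template quoted above literally. The function names, the toy log-probabilities, and the summed-token scoring are assumptions for illustration; the exact probe construction in Paper 2 may differ. The optional Hugging Face scorer assumes a causal LM loaded with `AutoModelForCausalLM`, since `AutoModel` returns hidden states without logits.

```python
from typing import Callable

def direction_score(cause: str, effect: str,
                    logprob: Callable[[str], float],
                    connective: str = "because") -> float:
    """log P(cause <connective> effect) - log P(effect <connective> cause)."""
    forward = f"{cause} {connective} {effect}"
    reverse = f"{effect} {connective} {cause}"
    return logprob(forward) - logprob(reverse)

def hf_sequence_logprob(model, tokenizer, text: str) -> float:
    """Summed token log-probability of `text` under a causal LM.

    The first token is used only as context, so it is not scored.
    """
    import torch  # lazy import: direction_score itself needs no heavy dependencies
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return float(token_lp.sum())

# Toy scorer with made-up values, used only to show the arithmetic.
toy_scores = {
    "the glass fell because it broke": -10.5,
    "it broke because the glass fell": -12.0,
}
score = direction_score("the glass fell", "it broke", toy_scores.__getitem__)
print(score)
```

With a real model, `logprob` would be something like `lambda t: hf_sequence_logprob(model, tokenizer, t)`. A positive score means the model prefers the forward ordering.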
A causal direction preference is present above 1M parameters and scales monotonically with width in Condition A. The Condition B inversion above 10M — present from early in training, stable across extended epochs, not explained by token count or hyperparameters — remains an open result.
The preference is established by step 200–300 — before the fluency transition window identified in Paper 1. This rules out causal tracking as a downstream consequence of fluency acquisition.
Connective specificity
Replacing "because" with neutral connectives ("although", "and", "but") reduces the direction score, and that reduction grows with model width, confirming that the signal is not purely a clause-order artefact. Note that at 5M, the non-connective baseline ("today") still exceeds "because"; connective-specific routing emerges between 5M and 10M.
The because-advantage increases monotonically in Condition A (+0.028 at 1M → +0.289 at 33M).
Surface ordering, not causal structure
At the 10M capacity threshold where the connective-specific signal first emerges cleanly, the model goes positive on both probe types simultaneously — including probes where canonical ordering is causally inappropriate. A model sensitive to causal structure should discriminate between them. This one does not. What is present is a surface statistical regularity, not a representation of causal structure.
The Same Dissociation in Grammar
An independent BLiMP grammaticality probe (67k minimal pairs, log-probability comparisons, no rubric) reveals the same local/non-local split:
Grammar and fluency are tightly coupled across all runs and checkpoints. The pattern follows the same fast/slow phase structure as the coherence trajectories.
Local grammatical constraints (determiner-noun agreement, subject-verb agreement, causative alternation) scale cleanly with model width — the Rising group. Non-local constraints (distractor agreement, reflexive binding, NPI licensing scope) are flat at chance across all scales — the Flat group. The local/non-local boundary in grammar mirrors the fluency/coherence boundary in narrative.
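The log-probability comparison behind a minimal-pair probe can be sketched as follows. This is an illustration under assumptions: the scorer is passed in as a callable, and the two example pairs and their scores are invented rather than taken from BLiMP.

```python
from typing import Callable, Iterable, Tuple

def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]],
                          logprob: Callable[[str], float]) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the higher log-probability. Chance level is 0.5."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if logprob(good) > logprob(bad))
    return correct / len(pairs)

# Made-up scores standing in for a model; the second pair is scored the wrong way round.
toy_scores = {
    "The cats sleep.": -8.0,
    "The cats sleeps.": -9.5,
    "The girl near the boys runs.": -14.0,
    "The girl near the boys run.": -13.2,
}
pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("The girl near the boys runs.", "The girl near the boys run."),
]
accuracy = minimal_pair_accuracy(pairs, toy_scores.__getitem__)
print(accuracy)
```

Because the metric is a direct comparison of two log-probabilities, it needs no rubric or judge model, and a paradigm that stays near 0.5 is indistinguishable from guessing.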
Effective Rank Analysis
Paper 2 examines what geometry the optimiser settles into, independent of behavioural evaluation.
Left: absolute effective rank scales sublinearly with model width. Across the series, a 33× increase in nominal parameter count (an 8× increase in hidden width, from 64 to 512) produces approximately a 4× increase in effective rank. Right: the fraction of available dimensional space actually used falls monotonically with model size, from ~19% at 1M to ~10% at 33M.
The width^0.7 empirical fit against a linear reference. The gap between the two lines is unused capacity. Both conditions fall well below full utilisation at every scale.
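As a quick consistency check on the width^0.7 fit, the arithmetic can be worked through directly. The hidden sizes (64 and 512) come from the model table in this card; treating utilisation as rank divided by width is an assumption made for this sketch.

```python
hidden_small, hidden_large = 64, 512   # hidden sizes of the 1M and 33M models
exponent = 0.7                         # empirical exponent quoted above

width_ratio = hidden_large / hidden_small          # 8.0
rank_ratio = width_ratio ** exponent               # about 4.29
utilisation_ratio = width_ratio ** (exponent - 1)  # about 0.54

# Starting from ~19% utilisation at the smallest model:
predicted_large = 0.19 * utilisation_ratio         # about 0.10

print(f"rank grows by {rank_ratio:.2f}x, utilisation falls to {predicted_large:.1%}")
```

Under these assumptions an 8× change in width gives roughly a 4× change in effective rank and a drop from about 19% to about 10% utilisation, which matches the figures reported here.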
Rank declines from initialisation — training finds a compact solution rather than building one. The rank floor is established by step 300–400 (green band), before the behavioural coherence ceiling becomes measurable at step 800–1000 (red band). Rank trajectory is therefore a practical early diagnostic: the geometric signal precedes the behavioural one by several hundred steps.
Key observations from the rank analysis:
- Weight matrices operate in at most ~20% of available dimensional space throughout training
- This utilisation fraction declines to ~10% at 33M parameters
- The subspace is established within the first 300–400 training steps
- Condition B produces consistently lower effective rank than Condition A at every scale, consistent with lower per-token local predictability in longer stories
- Larger models do not use their additional dimensions — the corpus, not model capacity, is the binding constraint on rank
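The observations above depend on how effective rank is defined, which this card does not state. The sketch below assumes one common definition, the exponential of the Shannon entropy of the normalised singular values, and defines utilisation as that value divided by the smaller matrix dimension. Both function names are illustrative.

```python
import numpy as np

def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (an assumed definition, see lead-in)."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values only
    p = s / (s.sum() + eps)                      # normalise to a distribution
    p = p[p > eps]                               # drop numerically zero entries
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

def utilisation(weight: np.ndarray) -> float:
    """Effective rank as a fraction of the maximum possible rank."""
    return effective_rank(weight) / min(weight.shape)

full = np.eye(4)                                   # uses all 4 directions
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])   # uses a single direction
print(effective_rank(full), effective_rank(rank_one), utilisation(full))
```

Applied to each weight matrix of each checkpoint, a measure of this kind yields the rank trajectories and utilisation fractions discussed above, although the exact numbers depend on the definition chosen.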
Dataset Contents
- 10 standard training runs: 5 model sizes × 2 corpus conditions
- 6 extended epoch runs: 1M, 5M, 10M × both conditions, independent initialisations
- 437 total checkpoints at 100-step resolution
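The repository layout for these runs and checkpoints is not described in this card, so one cautious way to locate them is to list the repository files and group them by top-level folder. This sketch uses `list_repo_files` from `huggingface_hub`; the grouping helper and the example paths are hypothetical and exist only to exercise the helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def group_by_top_level(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group repository file paths by their first path component."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        top = path.split("/", 1)[0] if "/" in path else "."
        groups[top].append(path)
    return dict(groups)

def list_nsts_files(repo_id: str = "Dan44788/NSTS") -> List[str]:
    """Fetch the file listing from the Hub (requires network access)."""
    from huggingface_hub import list_repo_files  # lazy import
    return list(list_repo_files(repo_id))

# Hypothetical paths, not the real layout of the repository.
example_paths = [
    "README.md",
    "run_a/step_100/config.json",
    "run_a/step_200/config.json",
]
groups = group_by_top_level(example_paths)
print(groups)
```

Calling `group_by_top_level(list_nsts_files())` would show how the runs are actually organised before choosing what to pass to `from_pretrained`.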
| Size | Hidden | Non-emb params | Total params |
|---|---|---|---|
| 1M | 64 | ~1.4M | ~3.6M |
| 5M | 144 | ~5.3M | ~9.2M |
| 10M | 256 | ~10.7M | ~19.3M |
| 28M | 480 | ~28.4M | ~46.5M |
| 33M | 512 | ~33.4M | ~51.0M |
Architecture: GPT-Neo, 8 layers, 8 attention heads throughout. Depth held constant — differences across scales reflect width, not depth. No dropout.
Condition A: FK grade < 3 · mean sentence length 8 words
Condition B: FK grade 4–5 · mean sentence length 12 words
Vocabulary distribution held constant across both conditions by construction.
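For reference, the Flesch-Kincaid grade level used to define the two conditions is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a rough vowel-group syllable counter, which is an assumption; the filtering pipeline used for this series may count syllables and sentences differently.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

text = "The cat sat on the mat. The dog ran to the park."
grade = fk_grade(text)
print(round(grade, 2))
```

Very simple text can score below zero on this scale, so a story like the example falls comfortably inside the Condition A range.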
Open Questions
The Condition B inversion. Direction scores peak at 5M and decline monotonically above 10M in the harder syntactic register. The inversion is present from step 400 in training trajectories, stable across extended epoch runs, and not accounted for by token count, epoch count, or hyperparameter differences.
The layer 0 MLP. Activation patching produces negative recovery values at the layer 0 MLP sublayer — most pronounced in Condition A at larger models, developing over training. The attention sublayer at the same layer shows no such effect. The computational basis is unresolved.
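To make "negative recovery" concrete, here is a minimal sketch of a commonly used normalisation for activation patching. It is an assumption that the analysis uses this form: recovery is the share of the clean-versus-corrupted gap in some metric that is restored when a clean activation is patched into the corrupted run. The numbers are invented.

```python
def patching_recovery(clean: float, corrupted: float, patched: float) -> float:
    """Normalised recovery: 1.0 restores the clean metric, 0.0 changes nothing,
    and negative values mean the patch moved the metric away from clean."""
    return (patched - corrupted) / (clean - corrupted)

# Invented metric values (for example, a direction score) for illustration only.
clean_metric = 2.0
corrupted_metric = 0.5
patched_metric = 0.2   # patching made things worse than the corrupted baseline

recovery = patching_recovery(clean_metric, corrupted_metric, patched_metric)
print(recovery)
```

Under this reading, a negative value at the layer 0 MLP would mean that restoring the clean activation there pushes the output further from the clean behaviour than leaving the corrupted activation in place.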
No dominant circuit. The directional signal is distributed across positions and dimensions, with layer 0 contributing nearly twice the per-dimension signal of any deeper layer. No single head accounts for more than 4.5% of the total direction score. Where the causal preference is actually computed remains an open question.
The 1M capacity wall. The 1M model concentrates early updates in layers 0–1 without the corresponding rank differentiation that is visible at larger scales, and it hits a harder capacity wall than parameter count alone would predict.
Rank as diagnostic. The rank floor precedes the behavioural ceiling by several hundred steps. Whether this generalises as a practical early diagnostic to other constrained corpus training settings is untested.
Citation
@misc{2026fluency,
title = {Fluency Is Not Coherence: What Small Language Models Actually Learn},
year = {2026},
note = {Version 1.1}
}
@misc{2026belowtheceling,
title = {Fluency Is Not Coherence II: Below the Ceiling},
year = {2026},
note = {Version 0.8}
}










