Or run from Hugging Face: open https://colab.research.google.com/ → File → Open notebook → URL tab → paste
https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb
Demo
A self-contained, CPU-only Colab notebook is provided at
examples/SF_Cluster_Demo.ipynb. It installs the
package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
distribution and per-subset means, and writes A3M files ready for AF2. Expected
end-to-end runtime on a free Colab CPU instance: **2 minutes**.
SF-Cluster (workshop OSS release)
Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:
- mosaic: each subset mixes high / mid / low contrast-FI sequences.
- gradient: each subset is homogeneous within a contrast-FI quartile.
The contrast score is computed from a per-residue Frustration Index (FI)
matrix produced by FrustrAI-Seq
(HF model: leuschj/FrustrAI-Seq).
This package is dependency-light (numpy, scipy), provides a CLI, and is
designed to be a drop-in replacement for random / uniform MSA subsampling in
AF-Cluster-style pipelines.
Algorithm
Given a filtered MSA A of N sequences over L match-state columns, and a
per-residue FI matrix F ∈ ℝ^{N×L}:
1. Column variance: v_l = Var_i(F_{i,l}) over sequences.
2. High-variance mask: HV = {l : v_l ≥ percentile(v, 80)}, LV = ¬HV.
3. Contrast score per sequence: contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32): sort pool by contrast_hvlv, tri-stratify into low/mid/high terciles; for each subset s ∈ {0..11}, draw 11 high + 11 low + 10 mid with np.random.default_rng(seed=s).
5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32): split sorted pool into 4 quartiles; for each bin b ∈ {0..3} and s ∈ {0..2} draw 32 sequences from that bin only with np.random.default_rng(seed=10*b + s).
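The steps above can be sketched in plain NumPy. This is an illustrative sketch, not the package's implementation: the function names (contrast_score, mosaic_subsets, gradient_subsets, _draw) are hypothetical, the tercile and quartile boundaries are assumed to come from an even split of the sorted pool, population variance (ddof=0) is assumed, and the order of draws within a subset is a guess. The with-replacement fallback for small bins is stated in this README only for the gradient method; applying it to mosaic as well is an assumption.

```python
import numpy as np


def contrast_score(F, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    F = np.asarray(F, dtype=float)
    v = F.var(axis=0)                          # step 1: per-column variance over sequences
    hv = v >= np.percentile(v, hv_percentile)  # step 2: high-variance column mask
    if hv.all():
        raise ValueError("no low-variance columns; contrast is undefined")
    # step 3: per-sequence contrast, shape (N,)
    return F[:, hv].mean(axis=1) - F[:, ~hv].mean(axis=1)


def _draw(rng, members, k):
    # Assumption: fall back to sampling with replacement when a bin is too small.
    return rng.choice(members, size=k, replace=len(members) < k)


def mosaic_subsets(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: each subset mixes the high, low and mid terciles."""
    order = np.argsort(score, kind="stable")   # pool indices, ascending contrast
    low, mid, high = np.array_split(order, 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_subsets(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset is drawn from a single quartile."""
    order = np.argsort(score, kind="stable")
    subsets = []
    for b, members in enumerate(np.array_split(order, n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, members, size).tolist())
    return subsets


# Usage on synthetic data (random values stand in for a real FI matrix).
F_demo = np.random.default_rng(0).normal(size=(200, 50))
score_demo = contrast_score(F_demo)
mosaic = mosaic_subsets(score_demo)
gradient = gradient_subsets(score_demo)
print(len(mosaic), len(mosaic[0]), len(gradient), len(gradient[0]))  # prints: 12 32 12 32
```

The choice of ddof does not change the high-variance mask, because rescaling every column variance by the same factor leaves the percentile cut in place.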
Install
pip install -e .
Python ≥ 3.10. Dependencies: numpy, scipy.
Inputs
You need two files per case:
- A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
- A per-residue FI matrix .npy of shape (N_seq, L), where N_seq is the number of sequences in the A3M and L is the number of match-state columns.
The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see
https://github.com/leuschj/FrustrAI-Seq (model card:
https://huggingface.co/leuschj/FrustrAI-Seq) for inference instructions.
A reference usage pattern is documented in examples/run_demo.sh.
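Before building subsets it can help to confirm that the two inputs agree in shape. The sketch below is a standalone check, not part of the package: read_a3m, n_match_columns and check_inputs are hypothetical helper names. It assumes the usual A3M convention that every character other than a lowercase insertion letter (uppercase residues and '-' gaps) occupies a match-state column, that lines starting with '#' are header comments to skip, and that every record in the file, including the query, has a row in the FI matrix.

```python
import numpy as np


def read_a3m(lines):
    """Parse A3M text lines into (headers, sequences); a record may span several lines."""
    headers, seqs = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and '#' comment lines
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif headers:
            seqs[-1] += line              # sequence text, kept verbatim
    return headers, seqs


def n_match_columns(seq):
    # Assumption: anything that is not a lowercase insertion letter is a match-state column.
    return sum(1 for c in seq if not c.islower())


def check_inputs(seqs, fi):
    """Raise ValueError unless fi has shape (number of sequences, match-state columns)."""
    widths = {n_match_columns(s) for s in seqs}
    if len(widths) != 1:
        raise ValueError(f"inconsistent match-state widths: {sorted(widths)}")
    expected = (len(seqs), widths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected


# Usage with a tiny in-memory alignment and a placeholder FI matrix.
demo = ">query\nMKV-L\n>hit1\nMkvKVAL\n>hit2\n-KVaaGL\n"
_, demo_seqs = read_a3m(demo.splitlines())
print(check_inputs(demo_seqs, np.zeros((3, 5))))  # prints: (3, 5)
```

For real files, the lines could come from an open file handle and the matrix from np.load on the .npy path.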
CLI
sf-cluster build \
--a3m path/to/filtered.a3m \
--fi path/to/fi_matrix.npy \
--method mosaic \
--n-subsets 12 \
--subset-size 32 \
--seed 20260422 \
--out subsets/kaib_mosaic/
Outputs:
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats
Library
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient
pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
subsets = method_mosaic(score) # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
Each subset is a list of indices into pool.headers / pool.sequences.
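Because a subset is just a list of pool indices, turning one into A3M text is a matter of looking up headers and sequences. The following is a minimal sketch using plain lists in place of the pool object: format_subset_a3m and its query_index parameter are hypothetical, and headers are assumed to be stored without the leading '>'. AF2-style pipelines generally expect the query as the first record; whether the package's own writer handles that is not stated here, so the optional query_index is only one way to enforce it.

```python
def format_subset_a3m(indices, headers, sequences, query_index=None):
    """Return A3M text for the chosen pool indices; sequences are copied verbatim."""
    order = list(indices)
    if query_index is not None:
        # Hypothetical option: force the query to be the first record, listed once.
        order = [query_index] + [i for i in order if i != query_index]
    return "".join(f">{headers[i]}\n{sequences[i]}\n" for i in order)


# Usage with a toy pool; lowercase insertion letters pass through unchanged.
demo_headers = ["query", "hit_a", "hit_b"]
demo_sequences = ["MKV", "MkKV", "M-V"]
print(format_subset_a3m([2, 1], demo_headers, demo_sequences, query_index=0), end="")
```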
Reproducibility
All RNG draws use np.random.default_rng(seed=...) with method-specific
deterministic seeds (see Algorithm §4–§5). Re-running the same A3M + FI
matrix yields byte-identical subset assignments. The CLI also records a
provenance JSON ({method}_meta.json) capturing inputs, sizes, and the
package version.
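One way to check determinism independently is to fingerprint the subset assignments from two runs and compare. The sketch below uses stand-in draws rather than the package's builders: draw_subset and fingerprint are hypothetical names, and the pool size of 500 is arbitrary. Note that NumPy does not promise identical Generator streams across NumPy versions, so matching fingerprints should be expected only when the NumPy version is the same.

```python
import hashlib
import json

import numpy as np


def draw_subset(pool_size, size, seed):
    """Stand-in for a subset builder: a seeded draw of pool indices without replacement."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=size, replace=False).tolist()


def fingerprint(subsets):
    """SHA-256 over a canonical JSON encoding of the subset index lists."""
    payload = json.dumps(subsets, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two independent runs with the same seeds should give the same fingerprint.
run_a = [draw_subset(500, 32, seed=s) for s in range(12)]
run_b = [draw_subset(500, 32, seed=s) for s in range(12)]
print(fingerprint(run_a) == fingerprint(run_b))  # prints: True
```

The same fingerprint function could be applied to the index lists returned by method_mosaic or method_gradient.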
Limitations
- No frustration model included. You must run FrustrAI-Seq separately to obtain the (N_seq, L) FI matrix. This package only handles the scoring + subset-construction stage.
- No AF2 runner included. The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
- Only mosaic and gradient arms are open-sourced here. The other SF-Cluster arms (region_cluster, contrast_nc) require additional feature pipelines and are intentionally excluded from this workshop release.
- No re-sampling guarantee across subsets. A sequence can appear in multiple subsets (gradient draws from a single quartile with replacement if the quartile is smaller than subset_size).
- Empirical caveat (read this). Controlled comparison shows uniform subsampling performs equivalently on most Main-21 cases; see paper for boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.
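Since sequences may recur across subsets, and within a subset when a quartile is sampled with replacement, it can be useful to quantify the overlap before running AF2. The helpers below are a small sketch operating on the list-of-index-lists form returned by the subset builders; subset_membership_counts and duplicates_within are hypothetical names, not package functions.

```python
from collections import Counter


def subset_membership_counts(subsets):
    """Map each pool index to the number of distinct subsets that contain it."""
    counts = Counter()
    for subset in subsets:
        counts.update(set(subset))  # set(): count an index once per subset
    return counts


def duplicates_within(subsets):
    """Per subset, how many entries repeat an index already present in that subset."""
    return [len(subset) - len(set(subset)) for subset in subsets]


# Usage on toy subsets: index 2 is shared by all three, index 3 repeats inside one.
toy = [[0, 1, 2], [2, 3, 3], [2, 4, 5]]
print(subset_membership_counts(toy)[2], duplicates_within(toy))  # prints: 3 [0, 1, 0]
```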
Citation
If you use this code, please cite the SF-Cluster paper (forthcoming) and FrustrAI-Seq.
License
MIT. See LICENSE.