To run the demo notebook from Hugging Face: open https://colab.research.google.com/ → File → Open notebook → URL tab → paste https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb

Demo

A self-contained, CPU-only Colab notebook is provided at examples/SF_Cluster_Demo.ipynb. It installs the package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix, 200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score distribution and per-subset means, and writes A3M files ready for AF2. Expected end-to-end runtime on a free Colab CPU instance: 2 minutes.

SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:

mosaic — each subset mixes high / mid / low contrast-FI sequences.
gradient — each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI) matrix produced by FrustrAI-Seq (HF model: leuschj/FrustrAI-Seq).

This package is dependency-light (numpy, scipy), provides a CLI, and is designed to be a drop-in replacement for random / uniform MSA subsampling in AF-Cluster-style pipelines.

Algorithm

Given a filtered MSA A of N sequences over L match-state columns, and a per-residue FI matrix F ∈ ℝ^{N×L}:

1. Column variance: v_l = Var_i(F_{i,l}), taken over sequences i.
2. High-variance mask: HV = {l : v_l ≥ percentile(v, 80)}, LV = ¬HV.
3. Contrast score per sequence:

   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}

4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32): sort the pool by contrast_hvlv and tri-stratify it into low/mid/high terciles; for each subset s ∈ {0..11}, draw 11 high + 11 low + 10 mid with np.random.default_rng(seed=s).
5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32): split the sorted pool into 4 quartiles; for each bin b ∈ {0..3} and s ∈ {0..2}, draw 32 sequences from that bin only with np.random.default_rng(seed=10*b + s), giving 4 bins × 3 draws = 12 subsets.
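
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function names (contrast_hvlv_sketch, mosaic_sketch, gradient_sketch, _draw) are hypothetical, variance is taken as population variance, terciles and quartiles are formed by equal-count splits of the sorted pool, and a bin is sampled with replacement only when it holds fewer sequences than requested.

```python
import numpy as np


def contrast_hvlv_sketch(fi, hv_percentile=80.0):
    """Mean FI over high-variance columns minus mean FI over the remaining columns."""
    fi = np.asarray(fi, dtype=float)
    col_var = fi.var(axis=0)  # v_l: variance over sequences (ddof=0 is an assumption)
    hv = col_var >= np.percentile(col_var, hv_percentile)
    # If every column lands in HV, the LV mean is undefined (NaN); not handled here.
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)


def _draw(rng, bin_idx, k):
    """Draw k indices from one bin; replacement only if the bin is too small (assumption)."""
    return rng.choice(bin_idx, size=k, replace=len(bin_idx) < k)


def mosaic_sketch(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Each subset mixes high, low and mid contrast sequences; seed = subset index."""
    order = np.argsort(score, kind="stable")
    low, mid, high = np.array_split(order, 3)  # equal-count terciles (assumption)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_sketch(score, subset_size=32, n_bins=4, per_bin=3):
    """Each subset is drawn from a single quartile; seed = 10 * bin + draw."""
    order = np.argsort(score, kind="stable")
    bins = np.array_split(order, n_bins)  # equal-count quartiles (assumption)
    subsets = []
    for b, bin_idx in enumerate(bins):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, bin_idx, subset_size).tolist())
    return subsets
```

With an FI matrix of shape (N, L), contrast_hvlv_sketch returns an (N,) score vector, and both builders return 12 lists of 32 pool indices.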

Install

pip install -e .

Python ≥ 3.10. Dependencies: numpy, scipy.

Inputs

You need two files per case:

A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
A per-residue FI matrix .npy of shape (N_seq, L), where N_seq is the number of sequences in the A3M and L is the number of match-state columns.

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights — see https://github.com/leuschj/FrustrAI-Seq (model card: https://huggingface.co/leuschj/FrustrAI-Seq) for inference instructions. A reference usage pattern is documented in examples/run_demo.sh.
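
Before building subsets, it can help to confirm that the two inputs agree in shape. The sketch below is a hypothetical pre-flight check, not part of the package: read_a3m, match_length and check_inputs are illustrative names, and it assumes that match-state columns are the uppercase letters plus '-' gaps and that lines starting with '#' (as in ColabFold output) can be skipped.

```python
import numpy as np


def read_a3m(text):
    """Parse A3M text into (headers, sequences); insertion letters are kept as-is."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and ColabFold-style '#' lines
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line
    return headers, seqs


def match_length(seq):
    """Count match-state columns: uppercase residues and '-' gaps (assumption)."""
    return sum(1 for c in seq if c.isupper() or c == "-")


def check_inputs(a3m_text, fi):
    """Return (N_seq, L) if the FI matrix matches the A3M, else raise ValueError."""
    _, seqs = read_a3m(a3m_text)
    lengths = {match_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    expected = (len(seqs), lengths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected
```

In practice the arguments would come from something like open("filtered.a3m").read() and np.load("fi_matrix.npy").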

CLI

sf-cluster build \
    --a3m   path/to/filtered.a3m \
    --fi    path/to/fi_matrix.npy \
    --method mosaic \
    --n-subsets 12 \
    --subset-size 32 \
    --seed 20260422 \
    --out   subsets/kaib_mosaic/

Outputs:

subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats
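
The index file can be regrouped by subset for downstream analysis. The reader below is a sketch that assumes a tab-separated file with the four columns listed above in that order; whether the file carries a header row is not stated here, so the code skips one if it finds it. load_subset_index is a hypothetical helper, not a package function.

```python
import csv
from collections import defaultdict

COLUMNS = ("subset_id", "pool_index", "header", "score")  # assumed column order


def load_subset_index(handle):
    """Group (pool_index, header, score) rows by subset_id from a TSV file handle."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0] == COLUMNS[0]:
            continue  # skip blank lines and an optional header row
        subset_id, pool_index, header, score = row[:4]
        groups[subset_id].append((int(pool_index), header, float(score)))
    return dict(groups)
```

Usage would look like: with open("subsets/kaib_mosaic/mosaic_subset_index.tsv", newline="") as fh: index = load_subset_index(fh).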

Library

from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix)         # (N,) per-sequence
subsets = method_mosaic(score)                # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)

Each subset is a list of indices into pool.headers / pool.sequences.
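
Because subsets are plain index lists, turning one into A3M text only requires the pool's headers and sequences. The helper below is a hypothetical sketch that works on ordinary Python lists; it assumes headers may or may not already carry the leading '>' and leaves sequences (including lowercase insertion letters) untouched.

```python
def format_subset_a3m(headers, sequences, indices):
    """Return A3M text for the selected pool indices, in the order given."""
    lines = []
    for i in indices:
        header = headers[i]
        # Assumption: headers may be stored with or without the leading '>'.
        lines.append(header if header.startswith(">") else ">" + header)
        lines.append(sequences[i])  # insertion-state letters are kept verbatim
    return "\n".join(lines) + "\n"
```

The returned string can then be written with pathlib.Path("subset_000.a3m").write_text(...).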

Reproducibility

All RNG draws use np.random.default_rng(seed=...) with method-specific deterministic seeds (see the Mosaic and Gradient steps under Algorithm). Re-running the same A3M + FI matrix yields byte-identical subset assignments. The CLI also records a provenance JSON ({method}_meta.json) capturing inputs, sizes, and the package version.
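
The determinism claim rests on seeded NumPy generators. The snippet below is a small illustration rather than the package's code: draw is a hypothetical function, and the pool size of 100 is arbitrary.

```python
import numpy as np


def draw(seed, pool_size=100, k=32):
    """Draw k distinct indices from range(pool_size) with a freshly seeded generator."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False)


first = draw(seed=3)
second = draw(seed=3)
other = draw(seed=4)

same_seed_matches = np.array_equal(first, second)       # same seed, same draw
different_seed_matches = np.array_equal(first, other)   # different seed, different draw
```

NumPy does not promise identical Generator streams across releases, so exact agreement between machines presumably also depends on using the same NumPy version.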

Limitations

No frustration model included. You must run FrustrAI-Seq separately to obtain the (N_seq, L) FI matrix. This package only handles the scoring + subset-construction stage.
No AF2 runner included. The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
Only mosaic and gradient arms are open-sourced here. The other SF-Cluster arms (region_cluster, contrast_nc) require additional feature pipelines and are intentionally excluded from this workshop release.
No disjointness guarantee across subsets. A sequence can appear in more than one subset, and gradient draws from a single quartile with replacement if the quartile is smaller than subset_size, so a sequence can also repeat within one subset.
Empirical caveat (read this). A controlled comparison shows that uniform subsampling performs equivalently on most Main-21 cases; see the paper for the boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.
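
Since subsets are not guaranteed to be disjoint, it may be worth measuring how much they overlap before interpreting results. The two helpers below are hypothetical and operate on the list-of-index-lists form that the subset builders return.

```python
from collections import Counter


def subset_membership(subsets):
    """Map each pool index to the number of distinct subsets that contain it."""
    counts = Counter()
    for subset in subsets:
        counts.update(set(subset))  # count each subset at most once per index
    return counts


def duplicates_within(subset):
    """Number of repeated entries inside one subset (non-zero only with replacement)."""
    return len(subset) - len(set(subset))
```

For example, max(subset_membership(subsets).values()) gives the largest number of subsets that share a single sequence.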

Reproducing the main benchmark

The scoring/evaluation code, reference structures, region definitions, and the AlphaFold2 prediction sets for the headline arms are included so the main minority_hit_rate table can be regenerated on CPU (no GPU):

pip install numpy biopython pyyaml
python reproduce_benchmark.py --mode score        # re-score all 1440 prediction PDBs
python reproduce_benchmark.py --mode precomputed   # aggregate shipped evals.tsv (fast)

The driver re-runs the Biopython Superimposer scorer (eval/evaluate_prediction.py) over bench/ and asserts that each cell matches the published table. A prediction counts as a hit when Cα RMSD ≤ 3 Å on the common core AND pLDDT ≥ 70 overall AND pLDDT ≥ 70 in the switch region. Published values include mosaic-SF KaiB 0.95, GA98 0.925, GB98 0.1875 and AF-Cluster baseline 0.36/0.46/0.44; the full table and the GA_GB scoring caveat are in README_benchmark.md. All 15 cells reproduce exactly. TMalign is optional (extra columns only).
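
The hit criterion can be written as a short predicate. This is a sketch of the rule as stated above, not the code in eval/evaluate_prediction.py: is_hit and hit_rate are hypothetical names, and treating the hit rate as the plain fraction of predictions that pass is an assumption.

```python
def is_hit(ca_rmsd, plddt_overall, plddt_switch, rmsd_max=3.0, plddt_min=70.0):
    """True when all three conditions hold: RMSD, overall pLDDT, switch-region pLDDT."""
    return (
        ca_rmsd <= rmsd_max
        and plddt_overall >= plddt_min
        and plddt_switch >= plddt_min
    )


def hit_rate(predictions):
    """Fraction of (ca_rmsd, plddt_overall, plddt_switch) tuples that count as hits."""
    if not predictions:
        raise ValueError("no predictions to score")
    return sum(is_hit(*p) for p in predictions) / len(predictions)
```

A prediction with a low RMSD but a low-confidence switch region fails the test, which is why the switch-region pLDDT is tracked separately from the overall value.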

Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and FrustrAI-Seq.

License

MIT. See LICENSE.
