To run the demo notebook in Colab, open https://colab.research.google.com/ → File → Open notebook → URL tab, then paste https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb

Demo

A self-contained, CPU-only Colab notebook is provided at examples/SF_Cluster_Demo.ipynb. It installs the package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix, 200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score distribution and per-subset means, and writes A3M files ready for AF2. Expected end-to-end runtime on a free Colab CPU instance: **2 minutes**.

SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:

  • mosaic: each subset mixes high / mid / low contrast-FI sequences.
  • gradient: each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI) matrix produced by FrustrAI-Seq (HF model: leuschj/FrustrAI-Seq).

This package is dependency-light (numpy, scipy), provides a CLI, and is designed to be a drop-in replacement for random / uniform MSA subsampling in AF-Cluster-style pipelines.

Algorithm

Given a filtered MSA A of N sequences over L match-state columns, and a per-residue FI matrix F ∈ ℝ^{N×L}:

  1. Column variance: v_l = Var_i(F_{i,l}) over sequences.
  2. High-variance mask: HV = {l : v_l ≥ percentile(v, 80)}, LV = ¬HV.
  3. Contrast score per sequence:
    contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
    
  4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32): sort pool by contrast_hvlv, tri-stratify into low/mid/high terciles; for each subset s ∈ {0..11}, draw 11 high + 11 low + 10 mid with np.random.default_rng(seed=s).
  5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32): split sorted pool into 4 quartiles; for each bin b ∈ {0..3} and s ∈ {0..2} draw 32 sequences from that bin only with np.random.default_rng(seed=10*b + s).
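
The five steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's implementation: the function names, the use of population variance, the tercile and quartile splitting via np.array_split, and the with-replacement fallback in _draw are all assumptions.

```python
import numpy as np

N_SUBSETS = 12
TARGET_SIZE = 32


def contrast_score(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    fi = np.asarray(fi, dtype=float)
    col_var = fi.var(axis=0)  # step 1: variance over sequences, per column
    hv = col_var >= np.percentile(col_var, hv_percentile)  # step 2: HV mask
    lv = ~hv
    return fi[:, hv].mean(axis=1) - fi[:, lv].mean(axis=1)  # step 3


def _draw(rng, candidates, k):
    # Assumption: sample with replacement only when the bin is smaller than k.
    replace = len(candidates) < k
    return rng.choice(candidates, size=k, replace=replace).tolist()


def mosaic_subsets(score):
    """Step 4: each subset mixes 11 high + 11 low + 10 mid sequences."""
    order = np.argsort(score)  # pool indices, lowest score first
    low, mid, high = np.array_split(order, 3)
    subsets = []
    for s in range(N_SUBSETS):
        rng = np.random.default_rng(seed=s)
        subsets.append(_draw(rng, high, 11) + _draw(rng, low, 11) + _draw(rng, mid, 10))
    return subsets


def gradient_subsets(score):
    """Step 5: three subsets per quartile, each drawn from that quartile only."""
    order = np.argsort(score)
    subsets = []
    for b, members in enumerate(np.array_split(order, 4)):
        for s in range(3):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, members, TARGET_SIZE))
    return subsets
```

With a random (200, 50) matrix, both builders return 12 lists of 32 pool indices, and the first three gradient subsets contain only sequences from the lowest-scoring quartile.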

Install

pip install -e .

Python ≥ 3.10. Dependencies: numpy, scipy.

Inputs

You need two files per case:

  1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
  2. A per-residue FI matrix .npy of shape (N_seq, L), where N_seq is the number of sequences in the A3M and L is the number of match-state columns.
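
Before building subsets, it may help to confirm that the two inputs agree in shape. The sketch below is a hypothetical standalone check, not part of the package: read_a3m, match_state_length, and check_inputs are illustrative names, and it assumes that every A3M record (including the query) has a row in the FI matrix and that gap characters count as match-state columns.

```python
import numpy as np


def read_a3m(text):
    """Parse A3M text into parallel lists of headers and sequences."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a leading '#' comment line
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif headers:
            seqs[-1] += line
    return headers, seqs


def match_state_length(seq):
    # Lowercase letters are insertions; uppercase letters and '-' are match columns.
    return sum(1 for ch in seq if not ch.islower())


def check_inputs(a3m_text, fi):
    """Return (N_seq, L) if the FI matrix matches the A3M, else raise ValueError."""
    _, seqs = read_a3m(a3m_text)
    lengths = {match_state_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    expected = (len(seqs), lengths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected
```

In practice the text would come from the A3M file and the matrix from np.load on the .npy file.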

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see https://github.com/leuschj/FrustrAI-Seq (model card: https://huggingface.co/leuschj/FrustrAI-Seq) for inference instructions. A reference usage pattern is documented in examples/run_demo.sh.

CLI

sf-cluster build \
    --a3m   path/to/filtered.a3m \
    --fi    path/to/fi_matrix.npy \
    --method mosaic \
    --n-subsets 12 \
    --subset-size 32 \
    --seed 20260422 \
    --out   subsets/kaib_mosaic/

Outputs:

subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats

Library

from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix)         # (N,) per-sequence
subsets = method_mosaic(score)                # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)

Each subset is a list of indices into pool.headers / pool.sequences.
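
Since a subset is just a list of indices, turning one into A3M text is a short loop. The helper below is a hypothetical sketch that works on plain lists standing in for pool.headers and pool.sequences. It does not prepend the query sequence; whether the package's own writer does so is not stated here.

```python
def subset_to_a3m(headers, sequences, indices):
    """Render the selected records as A3M text, preserving sequences verbatim."""
    lines = []
    for i in indices:
        lines.append(">" + headers[i])
        lines.append(sequences[i])  # lowercase insertion letters are kept as-is
    return "\n".join(lines) + "\n"


# Toy stand-ins for pool.headers / pool.sequences.
headers = ["query", "hit_1", "hit_2"]
sequences = ["MKV", "M-aV", "MRV"]
a3m_text = subset_to_a3m(headers, sequences, [0, 2])
```

The resulting string can be written to disk with pathlib.Path(...).write_text(a3m_text).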

Reproducibility

All RNG draws use np.random.default_rng(seed=...) with method-specific deterministic seeds (see Algorithm, steps 4 and 5). Re-running with the same A3M and FI matrix yields byte-identical subset assignments. The CLI also records a provenance JSON ({method}_meta.json) capturing inputs, sizes, and the package version.
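
One way to check that two runs produced the same assignments is to hash them. The sketch below is illustrative only: fingerprint and toy_subsets are hypothetical helpers, and toy_subsets stands in for a real subset builder by drawing from a seeded generator in the same style.

```python
import hashlib
import json

import numpy as np


def fingerprint(subsets):
    """SHA-256 over a canonical JSON encoding of the subset index lists."""
    payload = json.dumps([[int(i) for i in s] for s in subsets]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def toy_subsets(n_pool=100, n_subsets=3, k=8):
    # One fresh generator per subset, seeded by the subset index.
    return [
        np.random.default_rng(seed=s).choice(n_pool, size=k, replace=False)
        for s in range(n_subsets)
    ]


run_a = toy_subsets()
run_b = toy_subsets()
same = fingerprint(run_a) == fingerprint(run_b)
```

Because each generator is re-created from its seed, the two fingerprints match on the same NumPy version.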

Limitations

  • No frustration model included. You must run FrustrAI-Seq separately to obtain the (N_seq, L) FI matrix. This package only handles the scoring + subset-construction stage.
  • No AF2 runner included. The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
  • Only mosaic and gradient arms are open-sourced here. The other SF-Cluster arms (region_cluster, contrast_nc) require additional feature pipelines and are intentionally excluded from this workshop release.
  • Subsets are not guaranteed to be disjoint. A sequence can appear in multiple subsets, and gradient draws from a single quartile with replacement if the quartile is smaller than subset_size.
  • Empirical caveat (read this). A controlled comparison shows that uniform subsampling performs equivalently on most Main-21 cases; see the paper for the boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.
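
To see how much the subsets overlap in a given run, a small diagnostic over the index lists is enough. overlap_report is a hypothetical helper, not a package function; it reports how many distinct pool sequences were used, which indices occur in more than one subset, and how many repeated draws each subset contains.

```python
from collections import Counter


def overlap_report(subsets):
    """Summarise sharing across subsets and repeated draws within each subset."""
    # Count each index once per subset so within-subset repeats do not inflate sharing.
    counts = Counter(i for subset in subsets for i in set(subset))
    shared = sorted(i for i, c in counts.items() if c > 1)
    within = [len(subset) - len(set(subset)) for subset in subsets]
    return {
        "n_unique": len(counts),
        "shared_indices": shared,
        "within_subset_repeats": within,
    }


report = overlap_report([[0, 1, 2], [2, 3, 3]])
```

A non-zero entry in within_subset_repeats would indicate that a bin was smaller than the requested subset size and was sampled with replacement.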

Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and FrustrAI-Seq.

License

MIT. See LICENSE.
