new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 26

EVA: Towards a universal model of the immune system

The effective application of foundation models to translational research in immune-mediated diseases requires multimodal patient-level representations that can capture complex phenotypes emerging from multicellular interactions. Yet most current biological foundation models focus only on single-cell resolution and are evaluated on technical metrics often disconnected from actual drug development tasks and challenges. Here, we introduce EVA, the first cross-species, multimodal foundation model of immunology and inflammation, a therapeutic area where shared pathogenic mechanisms create unique opportunities for transfer learning. EVA harmonizes transcriptomics data across species, platforms, and resolutions, and integrates histology data to produce rich, unified patient representations. We establish clear scaling laws, demonstrating that increasing model size and compute translates to improvements in both pretraining and downstream tasks performance. We introduce a comprehensive evaluation suite of 39 tasks spanning the drug development pipeline: zero-shot target efficacy and gene function prediction for discovery, cross-species or cross-diseases molecular perturbations for preclinical development, and patient stratification with treatment response prediction or disease activity prediction for clinical trials applications. We benchmark EVA against several state-of-the-art biological foundation models and baselines on these tasks, and demonstrate state-of-the-art results on each task category. Using mechanistic interpretability, we further identify biological meaningful features, revealing intertwined representations across species and technologies. We release an open version of EVA for transcriptomics to accelerate research on immune-mediated diseases.

  • 11 authors
·
Feb 10

HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity

Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedure, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.

  • 17 authors
·
May 19, 2025

Immunocto: a massive immune cell database auto-generated for histopathology

With the advent of novel cancer treatment options such as immunotherapy, studying the tumour immune micro-environment (TIME) is crucial to inform on prognosis and understand potential response to therapeutic agents. A key approach to characterising the TIME may be through combining (1) digitised microscopic high-resolution optical images of hematoxylin and eosin (H&E) stained tissue sections obtained in routine histopathology examinations with (2) automated immune cell detection and classification methods. In this work, we introduce a workflow to automatically generate robust single cell contours and labels from dually stained tissue sections with H&E and multiplexed immunofluorescence (IF) markers. The approach harnesses the Segment Anything Model and requires minimal human intervention compared to existing single cell databases. With this methodology, we create Immunocto, a massive, multi-million automatically generated database of 6,848,454 human cells and objects, including 2,282,818 immune cells distributed across 4 subtypes: CD4^+ T cell lymphocytes, CD8^+ T cell lymphocytes, CD20^+ B cell lymphocytes, and CD68^+/CD163^+ macrophages. For each cell, we provide a 64times64 pixels^2 H&E image at 40times magnification, along with a binary mask of the nucleus and a label. The database, which is made publicly available, can be used to train models to study the TIME on routine H&E slides. We show that deep learning models trained on Immunocto result in state-of-the-art performance for lymphocyte detection. The approach demonstrates the benefits of using matched H&E and IF data to generate robust databases for computational pathology applications.

  • 6 authors
·
Feb 27, 2025