S1-Omni-Image
A Unified Multimodal Model for Scientific Image Understanding, Generation, and Editing
English | 简体中文
📖 Introduction
S1-Omni-Image is a unified multimodal model developed by the ScienceOne team at the Chinese Academy of Sciences, for scientific scenarios. It supports scientific image understanding, scientific image generation, and scientific image editing within a unified model.
Built on the scientific multimodal reasoning foundation model S1-VL-32B, S1-Omni-Image adopts a unified Think-Before-Generate paradigm. Given a user instruction and optional input image, the model first produces task-oriented reasoning, a textual response, and a task-specific token. The hidden states from this reasoning process are then used to guide subsequent image generation or image editing. The model is optimized for scientific illustration generation and text rendering, and further unifies scientific image segmentation, medical image translation, and medical image super-resolution under an image-editing formulation.
S1-Omni-Image substantially improves scientific illustration generation over mainstream open-source models, achieves leading results on multiple scientific image editing benchmarks, and preserves the scientific image understanding and reasoning capability of S1-VL-32B.
📥 Models and Dataset
The model weights and the SciGenEdit-10K dataset are available on Hugging Face and ModelScope.
Model Weights
| Platform | Link |
|---|---|
| Hugging Face | ScienceOne-AI/S1-Omni-Image |
| ModelScope | ScienceOne-AI/S1-Omni-Image |
SciGenEdit-10K Dataset
| Platform | Link |
|---|---|
| Hugging Face | ScienceOne-AI/SciGenEdit-10K |
| ModelScope | ScienceOne-AI/SciGenEdit-10K |
🎨 Showcase
Scientific Image Generation
The following examples demonstrate S1-Omni-Image's representative capabilities in scientific image generation, including multi-disciplinary, multi-format, and text-rich scientific illustration generation.
Scientific Image Editing
The following examples demonstrate S1-Omni-Image's representative capabilities in scientific image editing, including scientific illustration editing, scientific image segmentation, medical image translation, and medical image super-resolution.
🧠 Model Architecture
The overall architecture of S1-Omni-Image is shown below. The model uses S1-VL-32B as the scientific multimodal reasoning foundation model. It understands user instructions, input images, and scientific context, and produces explicit reasoning traces, textual responses, and task-specific tokens. The image generation and editing module follows the MMDiT architecture and is initialized from the MMDiT weights of Qwen-Image-Edit. A reasoning-to-diffusion alignment layer then maps the hidden states from S1-VL-32B into the conditioning space of the MMDiT module, which drives the final image generation or editing process.
For text response and image understanding tasks, the model directly uses the VLM text branch to produce answers. For image generation and image editing tasks, the model emits <image_gen> or <image_edit> task tokens and uses the hidden states from the autoregressive generation process as diffusion conditions. This design avoids feeding only a short prompt into an image model; instead, it uses scientific reasoning to provide richer semantic and structural guidance for visual generation.
The model is trained in three stages:
- Stage I uses the full SciGenEdit dataset to train S1-VL-32B under the scientific reasoning paradigm, enabling it to produce task-oriented reasoning, textual responses, and task-specific tokens for scientific image tasks.
- Stage II trains the reasoning-to-diffusion alignment layer on pre-training data, mapping hidden states from S1-VL-32B into the conditioning space of the image generation module.
- Stage III jointly trains the alignment layer and image generation module on the image generation and editing data from SciGenEdit, enabling scientific reasoning hidden states to stably drive final image generation and editing.
🗂️ Training Data
We construct SciGenEdit, a training dataset covering three major task categories: scientific image understanding, scientific image generation, and scientific image editing. The full dataset contains approximately 314K samples. The image generation data targets scientific illustrations, structured diagrams, complex text rendering, and scientific visualization. The image editing data covers scientific illustration editing, medical and geographic image segmentation, medical image translation, and medical image super-resolution. The image understanding data is used to preserve scientific image understanding and Thinking-with-Images capabilities.
To support community research, we release SciGenEdit-10K, a public subset sampled from the full training data. It covers major task types and representative scientific scenarios, and can be used for model analysis, instruction-format reference, and future research on scientific image generation and editing.
🚀 Quick Start
S1-Omni-Image provides inference service code and an OpenAI Chat Completion-compatible API. For detailed environment setup, model loading, API parameters, and Python examples, please refer to the GitHub repository:
git clone https://github.com/ScienceOne-AI/S1-Omni-Image.git
cd S1-Omni-Image
pip install -e ".[server]"
After downloading the model weights, place the complete S1-Omni-Image/ model directory anywhere on disk and start the service:
s1-omni-image-serve \
--model /path/to/S1-Omni-Image \
--host 0.0.0.0 \
--port 8000
Once the service is running, open http://localhost:8000/ for the web interface, or call /v1/chat/completions for unified scientific image understanding, generation, and editing.
⚠️ Limitations
Although S1-Omni-Image is specifically optimized for scientific image generation and editing, the current version still has several limitations:
- Complex text rendering: Long text, dense annotations, and complex Chinese text may still contain misspellings, wrong characters, or blurred glyphs.
- Fine-grained local editing: Complex instructions, multi-object editing, and strongly constrained local modifications may still be insufficiently executed or spatially misaligned.
- General image aesthetics: The model focuses on scientific image tasks and may not outperform frontier general-purpose models on open-domain natural image generation or creative design.
- Professional reliability: In high-stakes medical or scientific scenarios, model outputs should be reviewed by domain experts and should not be directly used for diagnosis, experiments, or decision-making.
📄 License
This project is released under the Apache License 2.0. Please also comply with the licenses of the underlying foundation models, datasets, and third-party components.
📚 Citation
If you find S1-Omni-Image useful for your research or applications, please cite our work:
@article{li2026s1omniimage,
title={S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing},
author={Li, Qingxiao and Wang, Zikai and Wang, Qingli and Xu, Nan},
journal={arXiv preprint arXiv:2606.24441},
year={2026}
}
- Downloads last month
- 2