Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
๐ Accepted at ICLR 2026
XModBench is a comprehensive benchmark designed to evaluate the cross-modal capabilities and consistency of omni-language models. It systematically assesses model performance across multiple modalities (text, vision, audio) and various cognitive tasks, revealing critical gaps in current state-of-the-art models.
- ๐ฏ Multi-Modal Evaluation: Comprehensive testing across text, vision, and audio modalities
- ๐งฉ 5 Task Dimensions: Perception, Spatial, Temporal, Linguistic, and External Knowledge tasks
- ๐ 13 SOTA Models Evaluated: Including Gemini 2.5 Pro, Qwen2.5-Omni, EchoInk-R1, and more
- ๐ Consistency Analysis: Measures performance stability across different modal configurations
- ๐ฅ Human Performance Baseline: Establishes human-level benchmarks for comparison
The dataset is available on Hugging Face: RyanWW/XModBench
Counts below reflect the actual released dataset (HF RyanWW/XModBench), summed over the 6 modality configurations.
| Family | Subtask | Samples |
|---|---|---|
| Perception | finegrained | 6,000 |
| Perception | general_activities | 6,000 |
| Perception | instruments | 6,000 |
| Perception | instruments_comp | 3,000 |
| Perception | natures | 3,000 |
| Perception total | 24,000 | |
| Spatial | 3D_movements | 2,646 |
| Spatial | arrangements | 2,790 |
| Spatial | panaroma | 2,340 |
| Spatial total | 7,776 | |
| Linguistic | recognition | 4,032 |
| Linguistic | translation | 4,212 |
| Linguistic total | 8,244 | |
| Temporal | calculation | 3,000 |
| Temporal | count | 3,000 |
| Temporal | order | 3,000 |
| Temporal total | 9,000 | |
| Knowledge | emotion_classification | 4,200 |
| Knowledge | movie_matching | 1,200 |
| Knowledge | music_genre_classification | 6,000 |
| Knowledge | singer_identification | 900 |
| Knowledge total | 12,300 | |
| Grand total | 17 subtasks | 61,320 |
The benchmark covers all six configurations of Audio, Vision (image or video), and Text as condition โ answer options. Each configuration has the same 10,220 items (same semantics, permuted modality):
| Condition โ Options | Samples |
|---|---|
| Audio โ Text | 10,220 |
| Audio โ Vision | 10,220 |
| Text โ Audio | 10,220 |
| Text โ Vision | 10,220 |
| Vision โ Audio | 10,220 |
| Vision โ Text | 10,220 |
| Total | 61,320 |
XModBench-Lite: a balanced 6,000-sample subset (5 families ร 6 configs ร 200) for fast evaluation.
XModBench/
โโโ benchmark/
โ โโโ Data/ # Raw media files (audio, image, video)
โ โ โโโ vggss_audio_bench/ # VGGSound audio clips
โ โ โโโ landscape_audiobench/ # Landscape images
โ โ โโโ emotions/ # Emotion classification media
โ โ โโโ ...
โ โโโ tasks/ # Source QA JSON files, organised by subtask
โ โ โโโ 01_perception/
โ โ โ โโโ finegrained/ # 6 modality-combo JSON files, 1000 instances each
โ โ โ โโโ general_activities/
โ โ โ โโโ instruments/
โ โ โ โโโ instruments_comp/
โ โ โ โโโ natures/
โ โ โโโ 02_spatial/
โ โ โโโ 03_speech/
โ โ โโโ 04_temporal/
โ โ โโโ 05_Exteral/
โ โโโ results/ # Model evaluation results
โโโ models/ # โ
per-model eval scripts (run.py โฆ)
โ โโโ Qwen2.5-Omni/ Qwen3-Omni/ Qwen2.5-VL/ OmniVinci/ VITA/ โฆ
โ โโโ ... # upstream weights/impl install separately
โโโ scripts/ # data-processing helpers (process/, download/)
Per-model evaluation code lives in
models/โ eachmodels/<Model>/run.pyloads the benchmark, builds prompts, calls the model and scores. Only the XModBench-side scripts are tracked here; the upstream model weights/implementations are installed separately (see each model's upstream repo). For a turnkey reproducible path use the lmms-eval port.
#!/bin/bash
#SBATCH --job-name=VLM_eval
#SBATCH --output=log/job_%j.out
#SBATCH --error=log/job_%j.log
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
echo "Running on host: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
module load conda
# conda activate vlm
conda activate omni
export audioBench='/home/xwang378/scratch/2025/AudioBench'
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_audio_vision \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_vision_audio \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_vision_text \
# --sample 1000
# python $audioBench/scripts/run.py \
# --model gemini \
# --task_name perception/vggss_audio_text \
# --sample 1000
# Qwen2.5-Omni
# python $audioBench/scripts/run.py \
# --model qwen2.5_omni \
# --task_name perception/vggss_audio_text \
# --sample 1000
python $audioBench/scripts/run.py \
--model qwen2.5_omni \
--task_name perception/vggss_vision_text \
--sample 1000
We provide a fully reproducible evaluation path through lmms-eval (fork with XModBench tasks pre-integrated). The dataset is auto-downloaded from the HF Hub โ no manual data prep.
Why dedicated model wrappers? Each XModBench item places media in both the question stem and every answer option (up to 5 media per item). lmms-eval's simple model interface only attaches one media object per request, so omni models would silently see just the first media. We therefore add chat-style
*_interleavewrappers that feed the full interleaved prompt to the model. No upstream model file is modified.
git clone https://github.com/XingruiWang/lmms-eval.git
cd lmms-eval
pip install -e ".[all]"python -m lmms_eval \
--model qwen2_5_omni_interleave \
--model_args pretrained=Qwen/Qwen2.5-Omni-7B,device_map=auto,attn_implementation=flash_attention_2 \
--tasks xmod_bench_lite_a2t \
--batch_size 1 --limit 8 --log_samples \
--output_path logs/debugsubmit_lite.sh launches all 6 modality configs with a resource-aware GPU
profile (no-video configs on 1 GPU, video configs on 4) so the full sweep
fits one QoS allocation:
# Qwen2.5-Omni-7B
./submit_lite.sh qwen2_5_omni_interleave Qwen/Qwen2.5-Omni-7B qwenomni3
# Qwen3-Omni-30B-A3B (MoE; all configs need 4 GPU)
LIGHT_GRES=gpu:a5000:4 HEAVY_GRES=gpu:a5000:4 \
./submit_lite.sh qwen3_omni_interleave Qwen/Qwen3-Omni-30B-A3B-Instruct qwenomni3 \
device_map=auto,attn_implementation=flash_attention_2
# Level-2 metrics (by-config / by-family / disparity / imbalance)
python lmms_eval/tasks/xmod_bench/summarize.py \
--logs logs/xmod_bench_lite/results_qwen2_5_omni_interleave/TASKS=(xmod_bench_audio_text xmod_bench_text_audio \
xmod_bench_audio_image xmod_bench_image_audio \
xmod_bench_image_text xmod_bench_text_image \
xmod_bench_audio_video xmod_bench_text_video \
xmod_bench_video_audio xmod_bench_video_text)
python -m lmms_eval \
--model qwen2_5_omni_interleave \
--model_args pretrained=Qwen/Qwen2.5-Omni-7B,device_map=auto,attn_implementation=flash_attention_2 \
--tasks "${TASKS[$SLURM_ARRAY_TASK_ID]}" \
--batch_size 1 --log_samples \
--output_path logs/xmod_bench_full/resultsBy-config accuracy on XModBench-Lite via lmms-eval, vs. the paper's full-set numbers (Table 2). ฮ = Lite โ paper.
| Config | Qwen2.5-Omni (Lite) | paper (full) | ฮ | Qwen3-Omni (Lite) |
|---|---|---|---|---|
| Audio โ Text | 63.1 | 62.0 | +1.1 | 71.6 |
| Audio โ Vision | 49.8 | 48.0 | +1.8 | 52.0 |
| Text โ Audio | 59.2 | 55.4 | +3.8 | 62.5 |
| Text โ Vision | 62.5 | 59.6 | +2.9 | 67.0 |
| Vision โ Audio | 50.3 | 50.5 | โ0.2 | 55.6 |
| Vision โ Text | 76.4 | 76.3 | +0.1 | 83.1 |
- Qwen2.5-Omni reproduces the paper within |ฮ| < 5 on all 6 configurations on the lightweight 6k Lite split โ confirming the lmms-eval port is faithful.
- Qwen3-Omni (released after the paper) is reported here for the first time, using the identical wrapper/code path.
- Full-set (61,320-sample) lmms-eval runs use the same wrappers via the
Section 4 command; numbers are updated in
lmms_eval/tasks/xmod_bench/RESULTS.mdas runs complete.
Per-run logs include overall accuracy plus per-config / per-family /
per-subtask breakdowns; summarize.py emits the 17 Level-2 numbers
(6 by-config, 5 by-family, 3 modality-disparity, 3 directional-imbalance).
By-configuration accuracy (%) over the six modality directions; Avg. is the mean over the six (Table 2 of the paper).
| Model | AโT | AโV | TโA | TโV | VโA | VโT | Avg. |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 71.0 | 58.9 | 64.4 | 79.8 | 60.8 | 88.6 | 70.6 |
| Gemini 2.5 Flash | 62.6 | 51.2 | 55.1 | 75.7 | 51.9 | 86.0 | 63.7 |
| Gemini 2.0 Flash | 63.7 | 49.0 | 52.2 | 71.5 | 47.6 | 85.2 | 61.2 |
| EchoInk-R1 | 64.6 | 45.9 | 56.4 | 60.9 | 49.9 | 77.6 | 59.2 |
| Qwen2.5-Omni | 62.0 | 48.0 | 55.4 | 59.6 | 50.5 | 76.3 | 58.6 |
| Gemini 1.5 Pro | 52.4 | 38.2 | 48.6 | 70.4 | 40.7 | 79.9 | 55.0 |
| Baichuan-Omni-1.5 | 47.8 | 35.8 | 40.5 | 56.2 | 38.6 | 73.0 | 48.7 |
| VideoLLaMA 2 | 48.6 | 26.0 | 25.7 | 26.5 | 25.2 | 66.8 | 36.5 |
| VITA | 40.2 | 26.0 | 29.8 | 26.8 | 29.9 | 59.3 | 35.4 |
| Unified-IO 2 XXL | 37.4 | 25.0 | 31.2 | 37.8 | 26.7 | 39.9 | 33.0 |
| Unified-IO 2 XL | 33.3 | 27.0 | 27.1 | 32.9 | 26.5 | 37.4 | 30.7 |
| Unified-IO 2 | 28.9 | 24.0 | 25.4 | 32.0 | 25.7 | 32.7 | 28.1 |
| PandaGPT | 24.5 | 25.0 | 23.8 | 25.2 | 24.5 | 25.1 | 24.7 |
| No Context (random) | 25.1 | 24.3 | 25.4 | 24.8 | 25.3 | 25.7 | 25.1 |
| Human | 92.4 | 91.5 | 91.1 | 91.8 | 86.4 | 95.6 | 91.5 |
Vision-only models are evaluated only on textโvision configs: Qwen2.5-VL 67.4 Avg., InternVL-3.5 61.7 Avg. (omitted from the six-way table).
Balanced split (5 families ร 6 configs ร 200), evaluated through the lmms-eval port with interleaved-multimedia wrappers.
| Model | AโT | AโV | TโA | TโV | VโA | VโT | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B | 71.6 | 52.0 | 62.5 | 67.0 | 55.6 | 83.1 | 65.3 |
| Qwen2.5-Omni-7B | 63.1 | 49.8 | 59.2 | 62.5 | 50.3 | 76.4 | 60.2 |
| Baichuan-Omni-1.5 | 52.5 | 32.0 | 47.6 | 56.6 | 47.0 | 77.7 | 52.2 |
| OmniVinci | 62.2 | โ | โ | โ | โ | 78.8 | โ |
Qwen2.5-Omni matches its full-set paper numbers within 5 points on every configuration, confirming the port is faithful. Qwen3-Omni post-dates the paper (first reported here). OmniVinci runs on its single-media-condition configs; its 4-option configs hit VILA-internal limits (see
RESULTS.mdโ note: that file lives in the lmms-eval repo). New runs update as they complete.
- Strong Performance: Perception and linguistic tasks (~75% for best models)
- Weak Performance: Spatial (50.1%) and temporal reasoning (60.8%)
- Performance Drop: 15-25 points decrease in spatial/temporal vs. perception tasks
- Audio vs. Text: 20-49 point performance drop
- Audio vs. Vision: 33-point average gap
- Vision vs. Text: ~15-point disparity
- Consistency: Best models show 10-12 point standard deviation
- VisionโText: 9-17 point gaps between directions
- AudioโText: 6-8 point asymmetries
- Root Cause: Training data imbalance favoring image-to-text over inverse directions
If you use XModBench in your research, please cite our paper:
@article{wang2025xmodbench,
title={XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models},
author={Wang, Xingrui and others},
journal={arXiv preprint arXiv:2510.15148},
year={2025}
}This project is licensed under the MIT License - see the LICENSE file for details.
We thank all contributors and the research community for their valuable feedback and suggestions.
- Project Lead: Xingrui Wang
- Email: [xwang378@jhu.edu]
- Website: https://xingruiwang.github.io/projects/XModBench/
- Release Huggingface data
- Release data processing code
- Release data evaluation code
Note: XModBench is actively maintained and regularly updated with new models and evaluation metrics. For the latest updates, please check our releases page.
