IVEBench

Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

1Zhejiang University, 2Tencent Youtu Lab, 3Shanghai Jiao Tong University, 4University of Auckland, 5National University of Singapore

Highlights

Compared with existing video editing benchmarks, our proposed IVEBench offers the following key advantages:

(1) Comprehensive support for IVE methods: IVEBench is specifically designed to evaluate instruction-guided video editing (IVE) models while remaining compatible with traditional source-target prompt-based methods, ensuring broad applicability across editing paradigms;

(2) Diverse and semantically rich video corpus: The benchmark contains 600 high-quality source videos spanning seven semantic dimensions and thirty topics, with lengths ranging from 32 to 1,024 frames, providing wide coverage of real-world scenarios;

(3) Comprehensive editing taxonomy: IVEBench includes eight major editing categories and thirty-five subcategories, encompassing diverse editing types such as style, attribute, subject motion, camera motion, and visual effect editing, to fully represent instruction-guided behaviors;

(4) Integration of MLLM-based and traditional metrics: The evaluation protocol combines conventional objective indicators with multimodal large language model (MLLM)-based assessments across three dimensions (video quality, instruction compliance, and video fidelity) for more human-aligned and holistic evaluation;

(5) Extensive benchmarking of state-of-the-art models: We conduct a thorough quantitative and qualitative evaluation of leading IVE models, including InsV2V, AnyV2V, and StableV2V, as well as the multi-conditional video editing framework VACE, establishing a unified and fair standard for future research.
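As a minimal sketch, the three top-level dimensions (video quality, instruction compliance, video fidelity) can be combined into a single leaderboard score as a weighted average. Equal weighting is an assumption here, although it is consistent with the totals reported in the leaderboard below (e.g. for InsV2V); it is not a confirmed detail of the benchmark.

```python
# Hypothetical aggregation of IVEBench's three evaluation dimensions into a
# total score. Equal weights are an illustrative assumption, not a confirmed
# detail of the benchmark.
def total_score(video_quality: float,
                instruction_compliance: float,
                video_fidelity: float,
                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    dims = (video_quality, instruction_compliance, video_fidelity)
    return sum(w * d for w, d in zip(weights, dims))

# InsV2V's reported dimension scores from the leaderboard:
print(round(total_score(0.802357, 0.374118, 0.794976), 6))  # -> 0.65715
```

With equal weights this reproduces the reported totals for both InsV2V (0.65715) and VACE (0.616088), which suggests the three dimensions contribute equally.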

Teaser Image

Leaderboard

Comprehensive evaluation results of instruction-guided video editing methods on IVEBench.

Leaderboard (Short Subset)

VQ = Video Quality details, IC = Instruction Compliance details, VF = Video Fidelity details.

| # | Method | Total Score | Video Quality | Instruction Compliance | Video Fidelity | VQ: Subject Consistency | VQ: Background Consistency | VQ: Temporal Flickering | VQ: Motion Smoothness | VQ: VTSS | IC: Overall Semantic Consistency | IC: Phrase Semantic Consistency | IC: Instruction Satisfaction | IC: Quantity Accuracy | VF: Semantic Fidelity | VF: Motion Fidelity | VF: Content Fidelity |
|---|--------|-------------|---------------|------------------------|----------------|------------------------|---------------------------|------------------------|----------------------|---------|--------------------------------|--------------------------------|-----------------------------|----------------------|----------------------|--------------------|----------------------|
| 1 | InsV2V | 0.65715 | 0.802357 | 0.374118 | 0.794976 | 0.901442 | 0.944001 | 0.975373 | 0.975373 | 0.04835 | 0.240611 | 0.229095 | 3.1 | 0.2 | 0.952295 | 0.678833 | 4.125 |
| 2 | VACE | 0.616088 | 0.801204 | 0.267255 | 0.779804 | 0.916832 | 0.94867 | 0.959168 | 0.959168 | 0.048467 | 0.235583 | 0.215446 | 2.27 | 0.2 | 0.963778 | 0.883994 | 3.735 |
| 3 | AnyV2V | 0.55052 | 0.724021 | 0.355533 | 0.572005 | 0.836517 | 0.91594 | 0.969898 | 0.969898 | 0.028747 | 0.215527 | 0.228784 | 3.251852 | 0 | 0.796468 | 0.824666 | 2.651852 |
| 4 | StableV2V | 0.509333 | 0.693736 | 0.420937 | 0.413327 | 0.828293 | 0.905021 | 0.962683 | 0.962683 | 0.02092 | 0.204132 | 0.234756 | 3.44898 | 0.25 | 0.70401 | 0.773338 | 1.785714 |

Leaderboard (Long Subset)

VQ = Video Quality details, IC = Instruction Compliance details, VF = Video Fidelity details.

| # | Method | Total Score | Video Quality | Instruction Compliance | Video Fidelity | VQ: Subject Consistency | VQ: Background Consistency | VQ: Temporal Flickering | VQ: Motion Smoothness | VQ: VTSS | IC: Overall Semantic Consistency | IC: Phrase Semantic Consistency | IC: Instruction Satisfaction | IC: Quantity Accuracy | VF: Semantic Fidelity | VF: Motion Fidelity | VF: Content Fidelity |
|---|--------|-------------|---------------|------------------------|----------------|------------------------|---------------------------|------------------------|----------------------|---------|--------------------------------|--------------------------------|-----------------------------|----------------------|----------------------|--------------------|----------------------|
| 1 | InsV2V | 0.65715 | 0.802357 | 0.374118 | 0.794976 | 0.901442 | 0.944001 | 0.975373 | 0.975373 | 0.04835 | 0.240611 | 0.229095 | 3.1 | 0.2 | 0.952295 | 0.678833 | 4.125 |
| 2 | VACE | 0.616088 | 0.801204 | 0.267255 | 0.779804 | 0.916832 | 0.94867 | 0.959168 | 0.959168 | 0.048467 | 0.235583 | 0.215446 | 2.27 | 0.2 | 0.963778 | 0.883994 | 3.735 |
| 3 | AnyV2V | 0.55052 | 0.724021 | 0.355533 | 0.572005 | 0.836517 | 0.91594 | 0.969898 | 0.969898 | 0.028747 | 0.215527 | 0.228784 | 3.251852 | 0 | 0.796468 | 0.824666 | 2.651852 |
| 4 | StableV2V | 0.509333 | 0.693736 | 0.420937 | 0.413327 | 0.828293 | 0.905021 | 0.962683 | 0.962683 | 0.02092 | 0.204132 | 0.234756 | 3.44898 | 0.25 | 0.70401 | 0.773338 | 1.785714 |

Note: Higher values indicate better performance for all metrics.

Benchmark

Data Pipeline


Data acquisition and processing pipeline of IVEBench. 1) A curation process yielding 600 high-quality, diverse videos. 2) A carefully designed pipeline for generating comprehensive editing prompts.

Benchmark Statistics


Statistical distributions of IVEBench.

Benchmark Comparison


Attribute comparison with open-source video editing benchmarks. Our proposed IVEBench offers distinct advantages across key dimensions.

Experiments

Qualitative Visualization

Quantitative Visualization


IVEBench evaluation results of video editing models. We visualize the evaluation results of the four IVE models on 12 IVEBench metrics, normalizing the results per dimension for clearer comparison.
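The per-dimension normalization mentioned above can be sketched as min-max rescaling across methods, one common way to bring heterogeneous metrics (e.g. 0-1 similarity scores versus 1-5 MLLM ratings such as Instruction Satisfaction) onto a shared scale before plotting. The exact scheme IVEBench uses is not specified here; this is an illustrative assumption.

```python
# Min-max normalization of each metric across methods, an assumed (not
# confirmed) stand-in for IVEBench's per-dimension normalization.
def normalize_per_metric(scores: dict) -> dict:
    """scores: {metric_name: [one value per method]} -> same shape, in [0, 1]."""
    normalized = {}
    for metric, values in scores.items():
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # all-equal column: avoid division by zero
        normalized[metric] = [(v - lo) / span for v in values]
    return normalized

# "Content Fidelity" column (InsV2V, VACE, AnyV2V, StableV2V):
norm = normalize_per_metric({"Content Fidelity": [4.125, 3.735, 2.651852, 1.785714]})
```

After rescaling, the best method on each metric sits at 1.0 and the worst at 0.0, which makes radar-chart comparisons across metrics with different native ranges meaningful.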

BibTeX

@article{chen2025ivebenchmodernbenchmarksuite,
      title={IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment}, 
      author={Yinan Chen and Jiangning Zhang and Teng Hu and Yuxiang Zeng and Zhucun Xue and Qingdong He and Chengjie Wang and Yong Liu and Xiaobin Hu and Shuicheng Yan},
      journal={arXiv preprint arXiv:2510.11647},
      year={2025}
}