IVEBench

Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

1Zhejiang University, 2Tencent Youtu Lab, 3Shanghai Jiao Tong University, 4University of Auckland, 5National University of Singapore

Highlights

Compared with existing video editing benchmarks, our proposed IVEBench offers the following key advantages:

(1) Comprehensive support for IVE methods: IVEBench is specifically designed to evaluate instruction-guided video editing (IVE) models while remaining compatible with traditional source-target prompt-based methods, ensuring broad applicability across editing paradigms;

(2) Diverse and semantically rich video corpus: The benchmark contains 600 high-quality source videos spanning seven semantic dimensions and thirty topics, with lengths ranging from 32 to 1,024 frames, providing wide coverage of real-world scenarios;

(3) Comprehensive editing taxonomy: IVEBench includes eight major editing categories and thirty-five subcategories, encompassing diverse editing types such as style, attribute, subject motion, camera motion, and visual effect editing, to fully represent instruction-guided behaviors;

(4) Integration of MLLM-based and traditional metrics: The evaluation protocol combines conventional objective indicators with multimodal large language model (MLLM)-based assessments across three dimensions (video quality, instruction compliance, and video fidelity) for more human-aligned and holistic evaluation;

(5) Extensive benchmarking of state-of-the-art models: We conduct a thorough quantitative and qualitative evaluation of leading IVE models, including InsV2V, AnyV2V, and StableV2V, as well as the multi-conditional video editing framework VACE, establishing a unified and fair standard for future research.
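As a minimal sketch, the three top-level dimensions (video quality, instruction compliance, video fidelity) can be combined into a single leaderboard score as a weighted average. Equal weighting is an assumption here, although it is consistent with the totals reported in the leaderboard below (e.g. for InsV2V); it is not a confirmed detail of the benchmark.

```python
# Hypothetical aggregation of IVEBench's three evaluation dimensions into a
# total score. Equal weights are an illustrative assumption, not a confirmed
# detail of the benchmark.
def total_score(video_quality: float,
                instruction_compliance: float,
                video_fidelity: float,
                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    dims = (video_quality, instruction_compliance, video_fidelity)
    return sum(w * d for w, d in zip(weights, dims))

# InsV2V's reported dimension scores from the leaderboard:
print(round(total_score(0.802357, 0.374118, 0.794976), 6))  # -> 0.65715
```

With equal weights this reproduces the reported totals for both InsV2V (0.65715) and VACE (0.616088), which suggests the three dimensions contribute equally.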

Teaser Image

Leaderboard

Comprehensive evaluation results of instruction-guided video editing methods on IVEBench.

Leaderboard (Short Subset)

VQ = Video Quality details, IC = Instruction Compliance details, VF = Video Fidelity details.

| # | Method | Total Score | Video Quality | Instruction Compliance | Video Fidelity | VQ: Subject Consistency | VQ: Background Consistency | VQ: Temporal Flickering | VQ: Motion Smoothness | VQ: VTSS | IC: Overall Semantic Consistency | IC: Phrase Semantic Consistency | IC: Instruction Satisfaction | IC: Quantity Accuracy | VF: Semantic Fidelity | VF: Motion Fidelity | VF: Content Fidelity |
|---|--------|-------------|---------------|------------------------|----------------|------------------------|---------------------------|------------------------|----------------------|---------|--------------------------------|--------------------------------|-----------------------------|----------------------|----------------------|--------------------|----------------------|
| 1 | InsV2V | 0.65715 | 0.802357 | 0.374118 | 0.794976 | 0.901442 | 0.944001 | 0.975373 | 0.975373 | 0.04835 | 0.240611 | 0.229095 | 3.1 | 0.2 | 0.952295 | 0.678833 | 4.125 |
| 2 | VACE | 0.616088 | 0.801204 | 0.267255 | 0.779804 | 0.916832 | 0.94867 | 0.959168 | 0.959168 | 0.048467 | 0.235583 | 0.215446 | 2.27 | 0.2 | 0.963778 | 0.883994 | 3.735 |
| 3 | AnyV2V | 0.55052 | 0.724021 | 0.355533 | 0.572005 | 0.836517 | 0.91594 | 0.969898 | 0.969898 | 0.028747 | 0.215527 | 0.228784 | 3.251852 | 0 | 0.796468 | 0.824666 | 2.651852 |
| 4 | StableV2V | 0.509333 | 0.693736 | 0.420937 | 0.413327 | 0.828293 | 0.905021 | 0.962683 | 0.962683 | 0.02092 | 0.204132 | 0.234756 | 3.44898 | 0.25 | 0.70401 | 0.773338 | 1.785714 |

Leaderboard (Long Subset)

VQ = Video Quality details, IC = Instruction Compliance details, VF = Video Fidelity details.

| # | Method | Total Score | Video Quality | Instruction Compliance | Video Fidelity | VQ: Subject Consistency | VQ: Background Consistency | VQ: Temporal Flickering | VQ: Motion Smoothness | VQ: VTSS | IC: Overall Semantic Consistency | IC: Phrase Semantic Consistency | IC: Instruction Satisfaction | IC: Quantity Accuracy | VF: Semantic Fidelity | VF: Motion Fidelity | VF: Content Fidelity |
|---|--------|-------------|---------------|------------------------|----------------|------------------------|---------------------------|------------------------|----------------------|---------|--------------------------------|--------------------------------|-----------------------------|----------------------|----------------------|--------------------|----------------------|
| 1 | InsV2V | 0.65715 | 0.802357 | 0.374118 | 0.794976 | 0.901442 | 0.944001 | 0.975373 | 0.975373 | 0.04835 | 0.240611 | 0.229095 | 3.1 | 0.2 | 0.952295 | 0.678833 | 4.125 |
| 2 | VACE | 0.616088 | 0.801204 | 0.267255 | 0.779804 | 0.916832 | 0.94867 | 0.959168 | 0.959168 | 0.048467 | 0.235583 | 0.215446 | 2.27 | 0.2 | 0.963778 | 0.883994 | 3.735 |
| 3 | AnyV2V | 0.55052 | 0.724021 | 0.355533 | 0.572005 | 0.836517 | 0.91594 | 0.969898 | 0.969898 | 0.028747 | 0.215527 | 0.228784 | 3.251852 | 0 | 0.796468 | 0.824666 | 2.651852 |
| 4 | StableV2V | 0.509333 | 0.693736 | 0.420937 | 0.413327 | 0.828293 | 0.905021 | 0.962683 | 0.962683 | 0.02092 | 0.204132 | 0.234756 | 3.44898 | 0.25 | 0.70401 | 0.773338 | 1.785714 |

Note: Higher values indicate better performance for all metrics.

Benchmark

Data Pipeline


Data acquisition and processing pipeline of IVEBench. 1) A curation process yielding 600 high-quality, diverse videos. 2) A carefully designed pipeline for generating comprehensive editing prompts.

Benchmark Statistics


Statistical distributions of IVEBench.

Benchmark Comparison


Attribute comparison with open-source video editing benchmarks. Our proposed IVEBench offers distinct advantages across key dimensions.

Experiments

Qualitative Visualization

Quantitative Visualization


IVEBench evaluation results of video editing models. We visualize the evaluation results of the four IVE models on 12 IVEBench metrics, normalizing the results per dimension for clearer comparison.
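The per-dimension normalization mentioned above can be sketched as min-max rescaling across methods, one common way to bring heterogeneous metrics (e.g. 0-1 similarity scores versus 1-5 MLLM ratings such as Instruction Satisfaction) onto a shared scale before plotting. The exact scheme IVEBench uses is not specified here; this is an illustrative assumption.

```python
# Min-max normalization of each metric across methods, an assumed (not
# confirmed) stand-in for IVEBench's per-dimension normalization.
def normalize_per_metric(scores: dict) -> dict:
    """scores: {metric_name: [one value per method]} -> same shape, in [0, 1]."""
    normalized = {}
    for metric, values in scores.items():
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # all-equal column: avoid division by zero
        normalized[metric] = [(v - lo) / span for v in values]
    return normalized

# "Content Fidelity" column (InsV2V, VACE, AnyV2V, StableV2V):
norm = normalize_per_metric({"Content Fidelity": [4.125, 3.735, 2.651852, 1.785714]})
```

After rescaling, the best method on each metric sits at 1.0 and the worst at 0.0, which makes radar-chart comparisons across metrics with different native ranges meaningful.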

BibTeX

@article{chen2025ivebenchmodernbenchmarksuite,
      title={IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment}, 
      author={Yinan Chen and Jiangning Zhang and Teng Hu and Yuxiang Zeng and Zhucun Xue and Qingdong He and Chengjie Wang and Yong Liu and Xiaobin Hu and Shuicheng Yan},
      journal={arXiv preprint arXiv:2510.11647},
      year={2025}
}