The first large-scale dataset & benchmark for instruction-guided joint audio-visual editing, paired with an Agent-in-the-loop curation pipeline and a strong baseline model.
While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics, with a 26% relative gain in audio-visual synchrony over the strongest sequential alternative. All data, code, and model weights will be publicly released.
Source videos and their instruction-guided edited counterparts, spanning all five editing categories. Each pair modifies both the visual and audio streams simultaneously.
"Replace the central figure with a centenarian exhibiting white hair, deep facial wrinkles, and a frail physical presence while preserving the kitchen background, camera position, and dialogue delivery."
"Replace the person with a swamp monster featuring muddy green skin, decaying plant growths, and textured swamp-like details while preserving the original side lighting, slow camera push-in, background with blinds, and the subject's forward gaze and spoken word 'Yes' delivered in a guttural voice."
"Replace the person with a female while maintaining the same emotional expression, background, and audio delivery cues including choked speech, sniff, and sigh."
"Replace the subject's attire with a beige cardigan over a white collared blouse, add black-framed glasses, and style their hair into a neat bun while maintaining the original background, camera position, and spoken dialogue content."
"Replace the background with a dense forest hiking trail featuring a dirt path surrounded by tall trees and lush greenery while maintaining the subject's position and natural outdoor lighting."
"Replace the background with a vibrant spring garden scene filled with blooming pink and white flowers, cherry blossom branches, and lush greenery under bright sunlight."
"Replace the background with a clear, photorealistic view of Alcatraz Island while maintaining the same foreground subject's smoking actions, steady camera perspective, and vocal delivery."
"Replace the background with a bog landscape featuring wet, marshy terrain, scattered puddles, tall reeds, and an overcast sky while preserving the subject's position, attire, gestures, and vocal delivery unchanged from the original."
"Ensure the speaker delivers the line 'Profit margins require immediate strategic adjustments.' while maintaining all visual and non-speech audio elements unchanged."
"Replace the spoken content so the person says 'Shut off the water valve. Replace the old rubber washer.' while maintaining all visual and non-speech audio elements unchanged."
"Replace the spoken dialogue with the exact phrase: 'I need to grab milk, bread, and eggs now.'"
"Replace the spoken dialogue with 'Now string the beads carefully.'"
"Display the dimly lit room's background with the window showing blurred exterior lights and worn walls, accompanied by ambient atmospheric sounds and music without any human voices."
"The scene features the storefront with graffiti-covered windows, a dark door, and interior shelves visible through the glass, accompanied by ambient sounds without any human voices."
"The scene should depict an empty courtroom with wooden benches and paneled walls, accompanied by ambient background music without any human speech."
"The video displays the casino background with slot machines featuring 'Giant Jackpot' signage, showcasing only the stationary machines and ambient lighting without any individuals present."
"Include a woman in a black suit seated in the chair, maintaining her posture and presence within the environment, while retaining the background elements and original human vocalizations."
"Include a woman with long wavy hair wearing black attire and gold earrings in the chapel, with her speaking the phrase 'That place where my mom and I had some of our best talks' while preserving the floral displays, wooden coffins, and ambient musical score."
"Insert a person wearing a gray shirt and a red cord necklace seated in the vehicle while holding a metallic thermos, with the background seats, window, and ambient sounds remaining unchanged."
"A person with gray hair and glasses, dressed in a dark button-up shirt, sits in the brown leather chair facing the bookshelf while speaking, with the background and ambient music unchanged."
The first large-scale dataset for instruction-guided joint audio-visual editing. ~100K human-centric editing triplets across 5 categories at 720p resolution.
A closed-loop multi-agent quality control system with an Orchestrator (Claude) and Inspector (Gemini), raising data pass rate from 36% to 83%.
A 150-video benchmark with 6 human-aligned metrics, plus a LoRA-fine-tuned LTX-2.3 model that outperforms all baselines on five of the six metrics.
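The closed-loop quality control summarized above can be pictured as an inspect-and-revise loop. The sketch below is only an illustration under stated assumptions: the page does not specify how the Orchestrator and Inspector interact, so `curate`, `Verdict`, `inspect`, `revise`, and `max_revisions` are hypothetical names, and the toy callables stand in for the actual model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical inspector output: a pass/fail flag plus free-text feedback."""
    passed: bool
    feedback: str = ""


def curate(
    sample: str,
    inspect: Callable[[str], Verdict],
    revise: Callable[[str, str], str],
    max_revisions: int = 2,
) -> Optional[str]:
    """Return the first version of `sample` that passes inspection, else None."""
    for attempt in range(max_revisions + 1):
        verdict = inspect(sample)
        if verdict.passed:
            return sample
        if attempt < max_revisions:
            # Assumed behavior: the orchestrator regenerates using the feedback.
            sample = revise(sample, verdict.feedback)
    return None


def pass_rate(results: List[Optional[str]]) -> float:
    """Fraction of samples that ended up accepted."""
    return sum(r is not None for r in results) / len(results)


# Toy stand-ins for the inspector and orchestrator (not real model calls).
def toy_inspect(sample: str) -> Verdict:
    ok = sample.endswith("[synced]")
    return Verdict(passed=ok, feedback="" if ok else "audio out of sync")


def toy_revise(sample: str, feedback: str) -> str:
    return sample + " [synced]"


single_shot = [curate(s, toy_inspect, toy_revise, max_revisions=0)
               for s in ["clip_a", "clip_b [synced]"]]
with_loop = [curate(s, toy_inspect, toy_revise, max_revisions=1)
             for s in ["clip_a", "clip_b [synced]"]]
print(pass_rate(single_shot), pass_rate(with_loop))
```

In this toy setup, allowing one revision lifts the pass rate because rejected samples get a second chance instead of being discarded, which is the general idea behind a closed loop.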
The first large-scale, instruction-guided joint audio-visual editing dataset, built for human-centric video editing at scale.
| Dataset | Scale | Audio | Instruction | Agent QC |
|---|---|---|---|---|
| InsViE-1M | ~1M | | | |
| OpenVE-3M | ~3M | | | |
| AVI-Edit | ~73K | | | |
| JAVEdit-100k | ~103K | | | |
Alter subject appearance while synchronously updating voice style & timbre
Change environment or scene with ambient sound updated to match
Alter spoken content with lip motion synchronized to new speech
Remove a human subject together with the associated voice stream
Insert a human subject into a scene with the corresponding voice
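One way to represent an editing triplet in code is sketched below. It assumes a triplet consists of a source clip, an instruction, and an edited clip. The class and field names, the file paths, and three of the five category labels are hypothetical: only "subject editing" and "speech editing" are named on this page, and the rest are paraphrased from the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class EditCategory(Enum):
    """Five categories; labels other than subject/speech editing are assumed."""
    SUBJECT_EDITING = "subject_editing"        # appearance + voice style/timbre
    BACKGROUND_EDITING = "background_editing"  # scene + ambient sound
    SPEECH_EDITING = "speech_editing"          # spoken content + lip motion
    SUBJECT_REMOVAL = "subject_removal"        # person + associated voice removed
    SUBJECT_ADDITION = "subject_addition"      # person + corresponding voice added


@dataclass(frozen=True)
class EditTriplet:
    source_video: str   # path to the source clip (video with audio track)
    instruction: str    # natural-language editing instruction
    edited_video: str   # path to the edited clip (video with audio track)
    category: EditCategory


# Instruction text taken from the examples above; paths are placeholders.
example = EditTriplet(
    source_video="clips/source_0001.mp4",
    instruction="Replace the spoken dialogue with 'Now string the beads carefully.'",
    edited_video="clips/edited_0001.mp4",
    category=EditCategory.SPEECH_EDITING,
)
print(example.category.value)
```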
JAVEditBench evaluation across 150 videos using 6 human-aligned metrics.
| Method | Visual Quality ↑ | Audio Quality ↑ | AV Sync ↑ | Instruction Compliance ↑ | Video Fidelity ↑ | AV Quality ↑ |
|---|---|---|---|---|---|---|
| AVED | 0.0590 | 1.72 | 0.1641 | 2.95 | 3.87 | 2.93 |
| AVI-Edit | 0.0604 | 2.34 | 0.2721 | 3.49 | 3.89 | 3.86 |
| Sequential | 0.0563 | 2.35 | 0.2925 | 3.99 | 4.08 | 3.51 |
| JAVEdit (Ours) | 0.0596 | 2.42 | 0.3688 | 4.07 | 4.22 | 3.88 |
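The headline claims can be checked directly against the table above. The short script below copies the scores as listed, counts the metrics on which JAVEdit ranks first, and computes the relative AV Sync gain over the Sequential baseline. Variable and function names are illustrative.

```python
# Scores copied from the results table; higher is better on every metric.
METRICS = [
    "Visual Quality", "Audio Quality", "AV Sync",
    "Instruction Compliance", "Video Fidelity", "AV Quality",
]
SCORES = {
    "AVED":       [0.0590, 1.72, 0.1641, 2.95, 3.87, 2.93],
    "AVI-Edit":   [0.0604, 2.34, 0.2721, 3.49, 3.89, 3.86],
    "Sequential": [0.0563, 2.35, 0.2925, 3.99, 4.08, 3.51],
    "JAVEdit":    [0.0596, 2.42, 0.3688, 4.07, 4.22, 3.88],
}


def best_method(metric_index: int) -> str:
    """Name of the method with the highest score on the given metric."""
    return max(SCORES, key=lambda method: SCORES[method][metric_index])


wins = [name for i, name in enumerate(METRICS) if best_method(i) == "JAVEdit"]

sync = METRICS.index("AV Sync")
relative_gain = SCORES["JAVEdit"][sync] / SCORES["Sequential"][sync] - 1

print(f"JAVEdit leads on {len(wins)} of {len(METRICS)} metrics")
print(f"AV Sync gain over Sequential: {relative_gain:.1%}")
# JAVEdit leads on 5 of 6 metrics
# AV Sync gain over Sequential: 26.1%
```

The one metric JAVEdit does not lead is Visual Quality, where AVI-Edit scores 0.0604 against 0.0596.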
| Model | Scale | Agent QC | Visual Q ↑ | Audio Q ↑ | AV Sync ↑ | IC ↑ | Fidelity ↑ |
|---|---|---|---|---|---|---|---|
| JAVEdit-tiny | 5K | | 0.0574 | 2.38 | 0.2453 | 3.21 | 3.95 |
| JAVEdit-small | 15K | | 0.0579 | 2.44 | 0.2871 | 3.49 | 4.18 |
| JAVEdit w/o Agent | 100K | No | 0.0581 | 2.31 | 0.3012 | 3.61 | 4.05 |
| JAVEdit (Ours) | 100K | Yes | 0.0596 | 2.42 | 0.3688 | 4.07 | 4.22 |
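To isolate the effect of agent-based quality control at a fixed 100K scale, the two 100K rows can be compared metric by metric. The sketch below assumes the five numeric values in each row correspond, in order, to Visual Q, Audio Q, AV Sync, IC, and Fidelity. Names are illustrative.

```python
# The two 100K rows of the ablation table, in metric-column order (assumed).
METRICS = ["Visual Q", "Audio Q", "AV Sync", "IC", "Fidelity"]
without_agent = [0.0581, 2.31, 0.3012, 3.61, 4.05]  # JAVEdit w/o Agent
with_agent = [0.0596, 2.42, 0.3688, 4.07, 4.22]     # JAVEdit (Ours)


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


gains = {
    name: relative_change(b, a)
    for name, b, a in zip(METRICS, without_agent, with_agent)
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")

largest = max(gains, key=gains.get)
print(f"Largest relative gain: {largest}")
```

Under that column assumption, every metric improves with agent curation, and AV Sync shows the largest relative change.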
Evaluation metrics:
Side-by-side comparison of source videos and outputs from four methods across all editing categories.
@article{chen2026javedit,
  title   = {JAVEdit: Joint Audio-Visual Instruction-Guided Video Editing with an Agent-Curated Data Pipeline},
  author  = {Chen, Yinan and Lin, Chuming and Chen, Zhennan and Zeng, Yuxiang and Zhu, Junwei and Bi, Yali and Huang, Xijie and Xu, Chengming and Luo, Donghao and Xue, Zhucun and Hu, Xiaobin and Wang, Chengjie and Liu, Yong and Zhang, Jiangning and Yan, Shuicheng},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}