JAVEdit: Joint Audio-Visual
Instruction-Guided Video Editing
with Agentic Data Curation

The first large-scale dataset & benchmark for instruction-guided joint audio-visual editing, paired with an Agent-in-the-loop curation pipeline and a strong baseline model.

Yinan Chen1*,  Chuming Lin2*,  Zhennan Chen3,  Yuxiang Zeng4,  Junwei Zhu2,  Yali Bi1,  Xijie Huang5,  Chengming Xu2,  Donghao Luo2,  Zhucun Xue1,  Xiaobin Hu6,  Chengjie Wang2,  Yong Liu1,  Jiangning Zhang1,2📧,  Shuicheng Yan6

1 Zhejiang University 2 Youtu Lab, Tencent 3 Nanjing University 4 University of Auckland 5 Fudan University 6 National University of Singapore
* Equal contribution   📧 Corresponding author

Abstract

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics, with a 26% relative gain in audio-visual synchrony over the strongest sequential alternative. All data, code, and model weights will be publicly released.

Editing Examples from JAVEdit-100k

Source videos and their instruction-guided edited counterparts, spanning all five editing categories. Each pair modifies both the visual and audio streams simultaneously.

Source
Target
Instruction

"Replace the central figure with a centenarian exhibiting white hair, deep facial wrinkles, and a frail physical presence while preserving the kitchen background, camera position, and dialogue delivery."

Source
Target
Instruction

"Replace the person with a swamp monster featuring muddy green skin, decaying plant growths, and textured swamp-like details while preserving the original side lighting, slow camera push-in, background with blinds, and the subject's forward gaze and spoken word 'Yes' delivered in a guttural voice."

Source
Target
Instruction

"Replace the person with a female while maintaining the same emotional expression, background, and audio delivery cues including choked speech, sniff, and sigh."

Source
Target
Instruction

"Replace the subject's attire with a beige cardigan over a white collared blouse, add black-framed glasses, and style their hair into a neat bun while maintaining the original background, camera position, and spoken dialogue content."

Source
Target
Instruction

"Replace the background with a dense forest hiking trail featuring a dirt path surrounded by tall trees and lush greenery while maintaining the subject's position and natural outdoor lighting."

Source
Target
Instruction

"Replace the background with a vibrant spring garden scene filled with blooming pink and white flowers, cherry blossom branches, and lush greenery under bright sunlight."

Source
Target
Instruction

"Replace the background with a clear, photorealistic view of Alcatraz Island while maintaining the same foreground subject's smoking actions, steady camera perspective, and vocal delivery."

Source
Target
Instruction

"Replace the background with a bog landscape featuring wet, marshy terrain, scattered puddles, tall reeds, and an overcast sky while preserving the subject's position, attire, gestures, and vocal delivery unchanged from the original."

Source
Target
Instruction

"Ensure the speaker delivers the line 'Profit margins require immediate strategic adjustments.' while maintaining all visual and non-speech audio elements unchanged."

Source
Target
Instruction

"Replace the spoken content so the person says 'Shut off the water valve. Replace the old rubber washer.' while maintaining all visual and non-speech audio elements unchanged."

Source
Target
Instruction

"Replace the spoken dialogue with the exact phrase: 'I need to grab milk, bread, and eggs now.'"

Source
Target
Instruction

"Replace the spoken dialogue with 'Now string the beads carefully.'"

Source
Target
Instruction

"Display the dimly lit room's background with the window showing blurred exterior lights and worn walls, accompanied by ambient atmospheric sounds and music without any human voices."

Source
Target
Instruction

"The scene features the storefront with graffiti-covered windows, a dark door, and interior shelves visible through the glass, accompanied by ambient sounds without any human voices."

Source
Target
Instruction

"The scene should depict an empty courtroom with wooden benches and paneled walls, accompanied by ambient background music without any human speech."

Source
Target
Instruction

"The video displays the casino background with slot machines featuring 'Giant Jackpot' signage, showcasing only the stationary machines and ambient lighting without any individuals present."

Source
Target
Instruction

"Include a woman in a black suit seated in the chair, maintaining her posture and presence within the environment, while retaining the background elements and original human vocalizations."

Source
Target
Instruction

"Include a woman with long wavy hair wearing black attire and gold earrings in the chapel, with her speaking the phrase 'That place where my mom and I had some of our best talks' while preserving the floral displays, wooden coffins, and ambient musical score."

Source
Target
Instruction

"Insert a person wearing a gray shirt and a red cord necklace seated in the vehicle while holding a metallic thermos, with the background seats, window, and ambient sounds remaining unchanged."

Source
Target
Instruction

"A person with gray hair and glasses, dressed in a dark button-up shirt, sits in the brown leather chair facing the bookshelf while speaking, with the background and ambient music unchanged."

Three Pillars of JAVEdit

Dataset · 01

JAVEdit-100k

The first large-scale dataset for instruction-guided joint audio-visual editing. ~100K human-centric editing triplets across 5 categories at 720p resolution.

Pipeline · 02

Agent-in-the-Loop

A closed-loop multi-agent quality control system with an Orchestrator (Claude) and an Inspector (Gemini), raising the data pass rate from 36% to 83%.

Benchmark & Model · 03

JAVEditBench + JAVEdit

A 150-video benchmark with 6 human-aligned metrics, plus a LoRA fine-tuned LTX-2.3 model that achieves state-of-the-art performance.

~100K
Edit Triplets
5
Editing Tasks
83%
Pass Rate (Agent QC)
150
Benchmark Videos
6
Eval Metrics

JAVEdit-100k Dataset

The first large-scale, instruction-guided joint audio-visual editing dataset, built for human-centric video editing at scale.

What Makes JAVEdit-100k Unique?

  • The only dataset supporting joint audio-visual editing with free-form natural language instructions.
  • Covers five distinct editing categories: subject editing, background editing, speech editing, subject removal, and subject addition.
  • All videos are 1280×720, 121 frames, and 25 fps, carefully curated for strict cross-modal alignment.
  • Quality controlled via an agent-in-the-loop mechanism that automatically inspects, diagnoses, and fixes data quality issues, eliminating the need for costly manual review.

Dataset | Scale | Audio | Instruction | Agent QC
InsViE-1M | ~1M
OpenVE-3M | ~3M
AVI-Edit | ~73K
JAVEdit-100k | ~103K
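
The inspect, diagnose, and fix cycle described in the list above can be pictured as a bounded retry loop. The sketch below is only an illustration under stated assumptions: `Sample`, `run_qc`, `inspect`, `repair`, and `max_rounds` are hypothetical names, and the toy inspector stands in for a model-based Inspector. It is not the authors' pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for one generated editing triplet."""
    sample_id: str
    issues: List[str] = field(default_factory=list)  # defects still present


def run_qc(
    sample: Sample,
    inspect: Callable[[Sample], Optional[str]],  # returns a diagnosis, or None if clean
    repair: Callable[[Sample, str], Sample],     # returns a fixed copy for a diagnosis
    max_rounds: int = 3,                         # assumed retry budget
) -> bool:
    """Return True if the sample passes inspection within the retry budget."""
    for _ in range(max_rounds):
        diagnosis = inspect(sample)
        if diagnosis is None:
            return True
        sample = repair(sample, diagnosis)
    # One final inspection after the last repair attempt.
    return inspect(sample) is None


# Toy inspector and repairer: report the first remaining issue, then remove it.
def toy_inspect(s: Sample) -> Optional[str]:
    return s.issues[0] if s.issues else None


def toy_repair(s: Sample, diagnosis: str) -> Sample:
    return Sample(s.sample_id, [i for i in s.issues if i != diagnosis])


batch = [
    Sample("a"),
    Sample("b", ["lip-sync drift"]),
    Sample("c", ["background leak", "voice mismatch"]),
    Sample("d", ["issue 1", "issue 2", "issue 3", "issue 4"]),
]
first_pass = sum(toy_inspect(s) is None for s in batch) / len(batch)
after_loop = sum(run_qc(s, toy_inspect, toy_repair) for s in batch) / len(batch)
print(f"pass rate: {first_pass:.0%} -> {after_loop:.0%}")
```

The toy batch only shows how a closed loop raises the pass rate. Its numbers are made up and unrelated to the reported 36% to 83% improvement.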

Five Editing Categories

Subject Editing

Alter subject appearance while synchronously updating voice style & timbre

Background Editing

Change environment or scene with ambient sound updated to match

Speech Editing

Alter spoken content with lip motion synchronized to new speech

Subject Removal

Remove a human subject together with the associated voice stream

Subject Addition

Insert a human subject into a scene with the corresponding voice
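
As a rough picture of what one editing triplet might look like in code, the sketch below pairs the five categories with the clip format stated on this page (1280×720, 121 frames, 25 fps). The record layout, field names, and file paths are assumptions for illustration. The released dataset may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum


class EditCategory(Enum):
    SUBJECT_EDITING = "subject editing"
    BACKGROUND_EDITING = "background editing"
    SPEECH_EDITING = "speech editing"
    SUBJECT_REMOVAL = "subject removal"
    SUBJECT_ADDITION = "subject addition"


# Clip format stated on this page.
WIDTH, HEIGHT, NUM_FRAMES, FPS = 1280, 720, 121, 25


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical record: (source clip, instruction, target clip)."""
    source_path: str   # assumed field name
    target_path: str   # assumed field name
    instruction: str
    category: EditCategory

    @property
    def duration_seconds(self) -> float:
        # 121 frames at 25 fps is a little under five seconds.
        return NUM_FRAMES / FPS


example = EditTriplet(
    source_path="source.mp4",  # placeholder path
    target_path="target.mp4",  # placeholder path
    instruction="Replace the spoken dialogue with 'Now string the beads carefully.'",
    category=EditCategory.SPEECH_EDITING,
)
print(example.category.value, example.duration_seconds)
```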

Quantitative Results on JAVEditBench

JAVEditBench evaluation across 150 videos using 6 human-aligned metrics.

5/6
Metrics where JAVEdit ranks #1
+26%
Relative gain in AV Sync vs. Sequential
≥0.80
Spearman's ρ with human judgment
Method | Visual Quality ↑ | Audio Quality ↑ | AV Sync ↑ | Instruction Compliance ↑ | Video Fidelity ↑ | AV Quality ↑
AVED | 0.0590 | 1.72 | 0.1641 | 2.95 | 3.87 | 2.93
AVI-Edit | 0.0604 | 2.34 | 0.2721 | 3.49 | 3.89 | 3.86
Sequential | 0.0563 | 2.35 | 0.2925 | 3.99 | 4.08 | 3.51
JAVEdit (Ours) | 0.0596 | 2.42 | 0.3688 | 4.07 | 4.22 | 3.88
Table 1: Quantitative comparison on JAVEditBench. Bold = best. Sequential cascades Kiwi-Edit with HunyuanVideo-Foley.
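
The two headline figures (5 of 6 metrics ranked first, and a 26% relative gain in AV Sync over Sequential) can be re-derived from Table 1. The snippet below is a small check written for this page, with the scores copied from the table. The variable and function names are our own.

```python
# Scores copied from Table 1 (all metrics are higher-is-better).
METRICS = [
    "Visual Quality", "Audio Quality", "AV Sync",
    "Instruction Compliance", "Video Fidelity", "AV Quality",
]
SCORES = {
    "AVED":       [0.0590, 1.72, 0.1641, 2.95, 3.87, 2.93],
    "AVI-Edit":   [0.0604, 2.34, 0.2721, 3.49, 3.89, 3.86],
    "Sequential": [0.0563, 2.35, 0.2925, 3.99, 4.08, 3.51],
    "JAVEdit":    [0.0596, 2.42, 0.3688, 4.07, 4.22, 3.88],
}


def best_method(metric_index: int) -> str:
    """Name of the method with the highest score on one metric."""
    return max(SCORES, key=lambda method: SCORES[method][metric_index])


# Metrics on which JAVEdit has the top score.
wins = [name for i, name in enumerate(METRICS) if best_method(i) == "JAVEdit"]

# Relative gain in AV Sync over the Sequential baseline.
sync = METRICS.index("AV Sync")
relative_gain = SCORES["JAVEdit"][sync] / SCORES["Sequential"][sync] - 1

print(f"JAVEdit ranks first on {len(wins)} of {len(METRICS)} metrics")
print(f"AV Sync relative gain vs. Sequential: {relative_gain:.1%}")
```

The one metric where JAVEdit is not first is Visual Quality, where AVI-Edit scores 0.0604 against 0.0596.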

Ablation Study

Model | Scale | Agent QC | Visual Q ↑ | Audio Q ↑ | AV Sync ↑ | IC ↑ | Fidelity ↑
JAVEdit-tiny | 5K | | 0.0574 | 2.38 | 0.2453 | 3.21 | 3.95
JAVEdit-small | 15K | | 0.0579 | 2.44 | 0.2871 | 3.49 | 4.18
JAVEdit w/o Agent | 100K | | 0.0581 | 2.31 | 0.3012 | 3.61 | 4.05
JAVEdit (Ours) | 100K | | 0.0596 | 2.42 | 0.3688 | 4.07 | 4.22
Table 2: Ablation on JAVEditBench. Agent-in-the-loop QC and data scale are complementary — both are necessary for best performance.
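
To see how much the agent-in-the-loop QC contributes at the same 100K scale, the relative change between the "JAVEdit w/o Agent" and "JAVEdit (Ours)" rows can be computed per metric. This is a reader-side calculation on the numbers in Table 2, not an analysis taken from the paper.

```python
# Rows copied from Table 2, both trained at the 100K scale.
COLUMNS = ["Visual Q", "Audio Q", "AV Sync", "IC", "Fidelity"]
WITHOUT_AGENT = [0.0581, 2.31, 0.3012, 3.61, 4.05]  # JAVEdit w/o Agent
WITH_AGENT = [0.0596, 2.42, 0.3688, 4.07, 4.22]     # JAVEdit (Ours)

# Relative change per metric when agent QC is enabled.
relative_change = {
    name: after / before - 1
    for name, before, after in zip(COLUMNS, WITHOUT_AGENT, WITH_AGENT)
}

# Metric that benefits most from agent QC.
largest = max(relative_change, key=relative_change.get)

for name, change in relative_change.items():
    print(f"{name:10s} {change:+.1%}")
print("largest gain:", largest)
```

On these numbers every metric improves with agent QC, and AV Sync improves the most.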

Evaluation metrics:

Visual Quality (VTSS) · Audio Quality (UTMOSv2) · AV Sync (SyncNet) · Instruction Compliance · Video Fidelity · AV Quality (Qwen3-Omni)
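
The page reports Spearman's ρ ≥ 0.80 between the metrics and human judgment. For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of rank correlation: the Pearson correlation of average ranks. The score lists are made-up numbers for illustration only, not benchmark data, and the function names are our own.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Made-up scores for five hypothetical edited videos.
metric_scores = [2.9, 3.5, 4.0, 4.1, 3.2]
human_ratings = [2.0, 3.0, 4.5, 4.0, 3.5]
rho = spearman_rho(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f}")
```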

Qualitative Results on JAVEditBench

Side-by-side comparison of source videos and outputs from four methods across all editing categories.

Source
AVED
AVI-Edit
Sequential
JAVEdit (Ours)

BibTeX

@article{chen2026javedit,
  title     = {JAVEdit: Joint Audio-Visual Instruction-Guided Video Editing
               with an Agent-Curated Data Pipeline},
  author    = {Chen, Yinan and Lin, Chuming and Chen, Zhennan and Zeng, Yuxiang
               and Zhu, Junwei and Bi, Yali and Huang, Xijie and Xu, Chengming
               and Luo, Donghao and Xue, Zhucun and Hu, Xiaobin
               and Wang, Chengjie and Liu, Yong and Zhang, Jiangning
               and Yan, Shuicheng},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2026}
}