JAVEdit: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

Abstract

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five categories: subject editing, background editing, speech editing, subject removal, and subject addition. The dataset is constructed via four generation pipelines paired with an agent-in-the-loop quality control mechanism. To address the lack of standardized evaluation, we introduce JAVEditBench, a benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics, with a 26% relative gain in audio-visual synchrony over the strongest sequential alternative. All data, code, and model weights will be publicly released.

Editing Examples from JAVEdit-100k

Source videos and their instruction-guided edited counterparts, spanning all five editing categories. Each pair modifies both the visual and audio streams simultaneously.

Source

Target

Instruction

"Replace the central figure with a centenarian exhibiting white hair, deep facial wrinkles, and a frail physical presence while preserving the kitchen background, camera position, and dialogue delivery."

"Replace the person with a swamp monster featuring muddy green skin, decaying plant growths, and textured swamp-like details while preserving the original side lighting, slow camera push-in, background with blinds, and the subject's forward gaze and spoken word 'Yes' delivered in a guttural voice."

"Replace the person with a female while maintaining the same emotional expression, background, and audio delivery cues including choked speech, sniff, and sigh."

"Replace the subject's attire with a beige cardigan over a white collared blouse, add black-framed glasses, and style their hair into a neat bun while maintaining the original background, camera position, and spoken dialogue content."

"Replace the background with a dense forest hiking trail featuring a dirt path surrounded by tall trees and lush greenery while maintaining the subject's position and natural outdoor lighting."

"Replace the background with a vibrant spring garden scene filled with blooming pink and white flowers, cherry blossom branches, and lush greenery under bright sunlight."

"Replace the background with a clear, photorealistic view of Alcatraz Island while maintaining the same foreground subject's smoking actions, steady camera perspective, and vocal delivery."

"Replace the background with a bog landscape featuring wet, marshy terrain, scattered puddles, tall reeds, and an overcast sky while preserving the subject's position, attire, gestures, and vocal delivery unchanged from the original."

"Ensure the speaker delivers the line 'Profit margins require immediate strategic adjustments.' while maintaining all visual and non-speech audio elements unchanged."

"Replace the spoken content so the person says 'Shut off the water valve. Replace the old rubber washer.' while maintaining all visual and non-speech audio elements unchanged."

"Replace the spoken dialogue with the exact phrase: 'I need to grab milk, bread, and eggs now.'"

"Replace the spoken dialogue with 'Now string the beads carefully.'"

"Display the dimly lit room's background with the window showing blurred exterior lights and worn walls, accompanied by ambient atmospheric sounds and music without any human voices."

"The scene features the storefront with graffiti-covered windows, a dark door, and interior shelves visible through the glass, accompanied by ambient sounds without any human voices."

"The scene should depict an empty courtroom with wooden benches and paneled walls, accompanied by ambient background music without any human speech."

"The video displays the casino background with slot machines featuring 'Giant Jackpot' signage, showcasing only the stationary machines and ambient lighting without any individuals present."

"Include a woman in a black suit seated in the chair, maintaining her posture and presence within the environment, while retaining the background elements and original human vocalizations."

"Include a woman with long wavy hair wearing black attire and gold earrings in the chapel, with her speaking the phrase 'That place where my mom and I had some of our best talks' while preserving the floral displays, wooden coffins, and ambient musical score."

"Insert a person wearing a gray shirt and a red cord necklace seated in the vehicle while holding a metallic thermos, with the background seats, window, and ambient sounds remaining unchanged."

"A person with gray hair and glasses, dressed in a dark button-up shirt, sits in the brown leather chair facing the bookshelf while speaking, with the background and ambient music unchanged."

JAVEdit-100k Dataset

The first large-scale, instruction-guided joint audio-visual editing dataset, built for human-centric video editing at scale.

What Makes JAVEdit-100k Unique?

Only dataset supporting joint audio-visual editing with free-form natural language instructions.
Covers five distinct editing categories: subject editing, background editing, speech editing, subject removal, and subject addition.
All videos at 1280×720 / 121 frames / 25 fps, carefully curated for strict cross-modal alignment.
Quality controlled via an agent-in-the-loop mechanism that automatically inspects, diagnoses, and fixes data quality issues, eliminating the need for costly manual review.
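
As a quick illustration of the clip format listed above, the sketch below encodes the stated resolution, frame count, and frame rate and derives the clip duration from them. The names `ClipSpec`, `JAVEDIT_SPEC`, and `matches_spec` are hypothetical and are not part of any released JAVEdit code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Video format of a clip (hypothetical helper, not from the JAVEdit release)."""
    width: int
    height: int
    num_frames: int
    fps: float

    @property
    def duration_seconds(self) -> float:
        # Duration follows directly from frame count and frame rate.
        return self.num_frames / self.fps


# Values taken from the dataset description: 1280x720, 121 frames, 25 fps.
JAVEDIT_SPEC = ClipSpec(width=1280, height=720, num_frames=121, fps=25.0)


def matches_spec(width: int, height: int, num_frames: int, fps: float,
                 spec: ClipSpec = JAVEDIT_SPEC) -> bool:
    """Return True if a clip's properties agree with the expected format."""
    same_shape = (width, height, num_frames) == (spec.width, spec.height, spec.num_frames)
    return same_shape and abs(fps - spec.fps) < 1e-6


if __name__ == "__main__":
    print(f"Clip duration: {JAVEDIT_SPEC.duration_seconds:.2f} s")
    print(matches_spec(1280, 720, 121, 25.0))
```

Under these numbers each clip lasts 121 / 25 = 4.84 seconds.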

Dataset	Scale	Audio	Instruction	Agent QC
InsViE-1M	~1M
OpenVE-3M	~3M
AVI-Edit	~73K
JAVEdit-100k	~103K

Five Editing Categories

Subject Editing

Alter subject appearance while synchronously updating voice style & timbre

Background Editing

Change environment or scene with ambient sound updated to match

Speech Editing

Alter spoken content with lip motion synchronized to new speech

Subject Removal

Remove a human subject together with the associated voice stream

Subject Addition

Insert a human subject into a scene with the corresponding voice
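
One way to picture an editing triplet is as a small record holding a source clip, an instruction, and a target clip, tagged with one of the five categories above. The sketch below is an assumption about how such a record could be represented; the class names, field names, and file names are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass
from enum import Enum


class EditCategory(Enum):
    """The five editing categories named in the dataset description."""
    SUBJECT_EDITING = "subject_editing"
    BACKGROUND_EDITING = "background_editing"
    SPEECH_EDITING = "speech_editing"
    SUBJECT_REMOVAL = "subject_removal"
    SUBJECT_ADDITION = "subject_addition"


# What each category changes in the visual and audio streams,
# paraphrased from the category descriptions above.
CATEGORY_EFFECTS = {
    EditCategory.SUBJECT_EDITING: ("subject appearance", "voice style and timbre"),
    EditCategory.BACKGROUND_EDITING: ("environment or scene", "ambient sound"),
    EditCategory.SPEECH_EDITING: ("lip motion", "spoken content"),
    EditCategory.SUBJECT_REMOVAL: ("human subject removed", "associated voice removed"),
    EditCategory.SUBJECT_ADDITION: ("human subject inserted", "corresponding voice added"),
}


@dataclass(frozen=True)
class EditTriplet:
    """Hypothetical record layout: (source, instruction, target) plus a category."""
    source_video: str   # path to the source clip (placeholder value below)
    instruction: str    # free-form natural language instruction
    target_video: str   # path to the edited clip (placeholder value below)
    category: EditCategory


# Instruction text copied from the examples on this page; the category label
# is our reading of the instruction, and the file names are placeholders.
example = EditTriplet(
    source_video="source.mp4",
    instruction="Replace the spoken dialogue with 'Now string the beads carefully.'",
    target_video="target.mp4",
    category=EditCategory.SPEECH_EDITING,
)

if __name__ == "__main__":
    visual, audio = CATEGORY_EFFECTS[example.category]
    print(f"{example.category.value}: visual -> {visual}; audio -> {audio}")
```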

Quantitative Results on JAVEditBench

JAVEditBench evaluation across 150 videos using 6 human-aligned metrics.

5/6: metrics where JAVEdit ranks #1
+26%: relative gain in AV Sync vs. Sequential
≥0.80: Spearman's ρ with human judgment
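
For readers unfamiliar with Spearman's ρ, the sketch below shows how rank agreement between an automatic metric and human ratings can be computed. The scores are invented toy values for illustration only and are not data from JAVEditBench; the function names are likewise hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (NOT from the paper): automatic scores and human ratings for 5 clips.
toy_metric_scores = [0.12, 0.25, 0.31, 0.18, 0.40]
toy_human_ratings = [2, 3, 5, 1, 4]

if __name__ == "__main__":
    print(f"rho = {spearman_rho(toy_metric_scores, toy_human_ratings):.2f}")
```

A value near 1 means the metric orders clips almost the same way human raters do.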

Method	Visual Quality ↑	Audio Quality ↑	AV Sync ↑	Instruction Compliance ↑	Video Fidelity ↑	AV Quality ↑
AVED	0.0590	1.72	0.1641	2.95	3.87	2.93
AVI-Edit	0.0604	2.34	0.2721	3.49	3.89	3.86
Sequential	0.0563	2.35	0.2925	3.99	4.08	3.51
JAVEdit (Ours)	0.0596	2.42	0.3688	4.07	4.22	3.88

Table 1: Quantitative comparison on JAVEditBench. Bold = best. Sequential cascades Kiwi-Edit with HunyuanVideo-Foley.
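
The headline figures can be checked directly against Table 1. The sketch below copies the table values, finds the best method per metric, and computes the relative AV Sync gain over the Sequential baseline; the variable and function names are ours and purely illustrative.

```python
METRICS = [
    "Visual Quality", "Audio Quality", "AV Sync",
    "Instruction Compliance", "Video Fidelity", "AV Quality",
]

# Scores copied from Table 1 (higher is better for every metric).
SCORES = {
    "AVED":       [0.0590, 1.72, 0.1641, 2.95, 3.87, 2.93],
    "AVI-Edit":   [0.0604, 2.34, 0.2721, 3.49, 3.89, 3.86],
    "Sequential": [0.0563, 2.35, 0.2925, 3.99, 4.08, 3.51],
    "JAVEdit":    [0.0596, 2.42, 0.3688, 4.07, 4.22, 3.88],
}


def best_method_per_metric(scores, metrics):
    """Map each metric name to the method with the highest score."""
    return {
        metric: max(scores, key=lambda name: scores[name][i])
        for i, metric in enumerate(metrics)
    }


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline


best = best_method_per_metric(SCORES, METRICS)
javedit_wins = sum(1 for winner in best.values() if winner == "JAVEdit")

sync = METRICS.index("AV Sync")
sync_gain = relative_gain(SCORES["JAVEdit"][sync], SCORES["Sequential"][sync])

if __name__ == "__main__":
    print(f"JAVEdit ranks first on {javedit_wins} of {len(METRICS)} metrics")
    print(f"Relative AV Sync gain vs. Sequential: {sync_gain:.1%}")
```

The only metric JAVEdit does not lead is Visual Quality, where AVI-Edit scores 0.0604 against 0.0596.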

Ablation Study

Model	Scale	Visual Q ↑	Audio Q ↑	AV Sync ↑	IC ↑	Fidelity ↑
JAVEdit-tiny	5K	0.0574	2.38	0.2453	3.21	3.95
JAVEdit-small	15K	0.0579	2.44	0.2871	3.49	4.18
JAVEdit w/o Agent	100K	0.0581	2.31	0.3012	3.61	4.05
JAVEdit (Ours)	100K	0.0596	2.42	0.3688	4.07	4.22

Table 2: Ablation on JAVEditBench. Agent-in-the-loop QC and data scale are complementary; both are necessary for best performance.
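
To make the ablation easier to read, the sketch below copies the Table 2 values and computes per-metric differences between two rows, for example the full model against the variant without agent QC at the same 100K scale. The helper names are hypothetical.

```python
ABLATION_METRICS = ["Visual Q", "Audio Q", "AV Sync", "IC", "Fidelity"]

# Values copied from Table 2 (higher is better for every metric).
ABLATION = {
    "JAVEdit-tiny":      {"scale": "5K",   "Visual Q": 0.0574, "Audio Q": 2.38, "AV Sync": 0.2453, "IC": 3.21, "Fidelity": 3.95},
    "JAVEdit-small":     {"scale": "15K",  "Visual Q": 0.0579, "Audio Q": 2.44, "AV Sync": 0.2871, "IC": 3.49, "Fidelity": 4.18},
    "JAVEdit w/o Agent": {"scale": "100K", "Visual Q": 0.0581, "Audio Q": 2.31, "AV Sync": 0.3012, "IC": 3.61, "Fidelity": 4.05},
    "JAVEdit (Ours)":    {"scale": "100K", "Visual Q": 0.0596, "Audio Q": 2.42, "AV Sync": 0.3688, "IC": 4.07, "Fidelity": 4.22},
}


def metric_deltas(model_a, model_b):
    """Per-metric difference (model_a minus model_b), rounded to 4 decimals."""
    return {
        m: round(ABLATION[model_a][m] - ABLATION[model_b][m], 4)
        for m in ABLATION_METRICS
    }


def best_model_per_metric():
    """Map each metric to the ablation row with the highest score."""
    return {
        m: max(ABLATION, key=lambda name: ABLATION[name][m])
        for m in ABLATION_METRICS
    }


# Effect of agent-in-the-loop QC at a fixed 100K scale.
qc_effect = metric_deltas("JAVEdit (Ours)", "JAVEdit w/o Agent")

if __name__ == "__main__":
    for metric, delta in qc_effect.items():
        print(f"{metric}: {delta:+.4f}")
    print(best_model_per_metric())
```

With these values, adding agent QC at 100K improves every listed metric, with AV Sync rising by 0.0676. The full model leads on four of the five metrics; Audio Q is marginally higher for JAVEdit-small (2.44 against 2.42).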

Evaluation metrics:

Visual Quality (VTSS)
Audio Quality (UTMOSv2)
AV Sync (SyncNet)
Instruction Compliance
Video Fidelity
AV Quality (Qwen3-Omni)

Qualitative Results on JAVEditBench

Side-by-side comparison of source videos and outputs from four methods across all editing categories.

Each case shows five clips in order: Source, AVED, AVI-Edit, Sequential, JAVEdit (Ours).

BibTeX

@article{chen2026javedit,
  title     = {JAVEdit: Joint Audio-Visual Instruction-Guided Video Editing
               with an Agent-Curated Data Pipeline},
  author    = {Chen, Yinan and Lin, Chuming and Chen, Zhennan and Zeng, Yuxiang
               and Zhu, Junwei and Bi, Yali and Huang, Xijie and Xu, Chengming
               and Luo, Donghao and Xue, Zhucun and Hu, Xiaobin
               and Wang, Chengjie and Liu, Yong and Zhang, Jiangning
               and Yan, Shuicheng},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2026}
}
