ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
This is the official code used to train ReWatch-R1. Note that this repository contains only the reinforcement learning stage.
[2026/01/26] ReWatch-R1 has been accepted by ICLR 2026.
Use our model for video reasoning! Please use transformers==4.56.0 and qwen_vl_utils.
Please download our model ReWatch-R1.
It is recommended to use the video parameters from the paper (up to 192 frames, at a resolution of 128*28*28 pixels per frame).
For the best results, you must provide the video's duration in the prompt (for example, 00:00-10:00); timestamps should be in MM:SS format.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "ReWatch-R1"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    padding_side="left",
    truncation_side="right",
)
video_path = "videos/example.mp4"
video_duration = 600
question = "What happened from [05:00] to [05:10]?"
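# Sampling settings recommended in the paper: up to 192 frames at 2 fps, with a
# pixel budget of 128*28*28 per frame (video_duration above is in seconds).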
total_pixels = 12288*28*28
min_pixels = 128*28*28
max_pixels = 128*28*28
fps = 2.0
max_frames = 192
video_config = {
    "type": "video",
    "video": video_path,
    "total_pixels": total_pixels,
    "min_pixels": min_pixels,
    "max_pixels": max_pixels,
    "fps": fps,
    "max_frames": max_frames,
}
react_prompt = """You are a video understanding expert. You are given a video and a question. You need to answer the question based on the video content. Please answer the question step by step. When you need more video details, you will re-watch the relevant clips and use <action> and </action> to mark the actions, and use <observation> and </observation> to mark the visual details you observe. When you have enough information to determine the final answer, you will wrap the final answer in <answer> and </answer>.
**Video Information and Question:**
- **Video Duration**: {video_duration}
- **Question**: {question}"""
def seconds_to_timestamp(seconds):
    """Convert a number of seconds to a timestamp string (MM:SS)."""
    minutes = seconds // 60
    seconds = seconds % 60
    return f"{minutes:02d}:{seconds:02d}"
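# For the 600-second example video, the prompt duration below is "00:00-10:00".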
duration_str = f"00:00-{seconds_to_timestamp(video_duration)}"
instruction = react_prompt.format(video_duration=duration_str, question=question)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        video_config,
        {"type": "text", "text": instruction},
    ]},
]
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
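# qwen_vl_utils samples the video frames and returns extra per-video kwargs
# (e.g. the effective fps) that must be forwarded to the processor below.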
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    max_length=16384,
    truncation=True,
    do_sample_frames=False,
    **video_kwargs,
)
inputs = inputs.to("cuda")
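# Greedy decoding; allow a generous budget for the ReAct-style reasoning trace.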
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096, use_cache=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
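# ReWatch-R1 wraps its final answer in <answer>...</answer> tags, as instructed
# by the ReAct-style prompt above. A minimal sketch for extracting that span;
# the helper name and regex are our own illustration, not part of the repo.
import re

def extract_final_answer(text):
    # Take the last <answer>...</answer> span in case earlier ones appear.
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

final_answer = extract_final_answer(output_text[0])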
print(output_text)

Please follow the steps below to start your video RL training!
First, download our cold-started model ReWatch-R1-SFT.
Second, download our QA data and Caption data.
Then prepare the QA data in the following format.
The training set is a JSONL file; each line follows one of the two formats below (see the minimal writer sketch after the examples).
- multiple-choice format
{
  "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice",
  "question_type": "temporal_localization",
  "multiple_choice": true,
  "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain",
  "data_type": "video",
  "videos": "1Yc9DM8j378.mp4",
  "answer": "D",
  "answer_str": "A silver chain",
  "duration": 972,
  "duration_str": "00:00-16:12"
}
- open-ended format
{
  "problem_id": "1Yc9DM8j378.mp4_temporal_localization_open_end",
  "question_type": "temporal_localization",
  "multiple_choice": false,
  "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?",
  "data_type": "video",
  "videos": "1Yc9DM8j378.mp4",
  "answer": "A silver chain",
  "answer_str": "A silver chain",
  "duration": 972,
  "duration_str": "00:00-16:12"
}
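To sanity-check a data file in this layout, a minimal sketch like the following can be used; the output path train.jsonl and the abbreviated record contents are illustrative assumptions, not fixed by the repo.

import json

# Hypothetical output path; point the data path in configs/config.yaml at it.
out_path = "train.jsonl"

records = [
    {
        "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice",
        "question_type": "temporal_localization",
        "multiple_choice": True,
        "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain",
        "data_type": "video",
        "videos": "1Yc9DM8j378.mp4",
        "answer": "D",
        "answer_str": "A silver chain",
        "duration": 972,
        "duration_str": "00:00-16:12",
    },
]

# One JSON object per line, as expected for a JSONL training set.
with open(out_path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")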
This is the most crucial step: modify the configuration file at configs/config.yaml, updating the data path and model path to your local paths, and edit every place marked "TODO:".
If you want to customize your data loading logic, you can further modify the RLHFDataset class in the verl/utils/dataset.py file.
Then run the following script to start the RL training!
bash scripts/train_single_node.sh

To train on multiple nodes, run

bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES

where TRAIN_SCRIPT is the script used to train on a single node and NNODES is the number of nodes required.
For example,
bash scripts/srun_multi_nodes.sh scripts/train_single_node.sh 2

- Long-RL: the codebase we built upon. Thanks for their wonderful work.
@misc{zhang2025rewatchr1boostingcomplexvideo,
  title={ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis},
  author={Congzhi Zhang and Zhibin Wang and Yinchao Ma and Jiawei Peng and Yihan Wang and Qiang Zhou and Jun Song and Bo Zheng},
  year={2025},
  eprint={2509.23652},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.23652},
}