ReWatch-R1

Paper Project Page Model Dataset

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

This is the official code used to train ReWatch-R1. Note that this repository contains only the reinforcement learning stage of training.

🔥 News

[2026/01/26] ReWatch-R1 has been accepted by ICLR 2026.

Using ReWatch-R1 for Inference

Use our model for video reasoning! Please use transformers==4.56.0 and qwen_vl_utils.
Please download our model ReWatch-R1.
We recommend the video parameters from the paper (up to 192 frames, at a resolution of 128*28*28 pixels per frame).
For best results, you must provide the video duration in the prompt (for example, 00:00-10:00), with timestamps in MM:SS format.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "ReWatch-R1"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",  # place the model on GPU so it matches inputs.to("cuda") below
)

processor = AutoProcessor.from_pretrained(
    model_path, 
    trust_remote_code=True,
    use_fast=True,
    padding_side="left",
    truncation_side="right",
)

video_path = "videos/example.mp4"
video_duration = 600
question = "What happened from [05:00] to [05:10]?"

# Video sampling parameters recommended in the paper:
# up to 192 frames at 2 fps, with 128*28*28 pixels per frame
total_pixels = 12288*28*28
min_pixels = 128*28*28
max_pixels = 128*28*28
fps = 2.0
max_frames = 192

video_config = {
    "type": "video",
    "video": video_path,
    "total_pixels": total_pixels,
    "min_pixels": min_pixels,
    "max_pixels": max_pixels,
    "fps": fps,
    "max_frames": max_frames
}

react_prompt = """You are a video understanding expert. You are given a video and a question. You need to answer the question based on the video content. Please answer the question step by step. When you need more video details, you will re-watch the relevant clips and use <action> and </action> to mark the actions, and use <observation> and </observation> to mark the visual details you observe. When you have enough information to determine the final answer, you will wrap the final answer in <answer> and </answer>.

**Video Information and Question:**
- **Video Duration**: {video_duration}
- **Question**: {question}"""

def seconds_to_timestamp(seconds):
    """将秒数转换为时间戳字符串 (MM:SS)"""
    minutes = seconds // 60
    seconds = seconds % 60
    return f"{minutes:02d}:{seconds:02d}"

duration_str = f"00:00-{seconds_to_timestamp(video_duration)}"
instruction = react_prompt.format(video_duration=duration_str, question=question)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        video_config,
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    max_length=16384,
    truncation=True,
    do_sample_frames=False,
    **video_kwargs,
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096, use_cache=True)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
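
The prompt above instructs the model to interleave <action> and <observation> tags while reasoning and to wrap its final answer in <answer> and </answer>. Below is a minimal sketch for pulling the final answer out of the decoded output; the extract_answer helper is our own illustration, not part of the released code.

import re

def extract_answer(decoded: str) -> str:
    """Return the text inside the last <answer>...</answer> pair, or the full output if none is found."""
    matches = re.findall(r"<answer>(.*?)</answer>", decoded, flags=re.DOTALL)
    return matches[-1].strip() if matches else decoded.strip()

final_answer = extract_answer(output_text[0])
print(final_answer)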

Quick Start for RL Training

Please follow the steps below to start your video RL training!

Prepare the data and model

First, download our cold-start model ReWatch-R1-SFT.
Second, download our QA data and Caption data.
Then prepare the QA data in the following format: the training set is a JSONL file, and each line is a JSON object in one of the two formats below (a small validation sketch follows the examples).

  • multiple-choice format
{
    "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice", 
    "question_type": "temporal_localization", 
    "multiple_choice": true, 
    "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain", 
    "data_type": "video", 
    "videos": "1Yc9DM8j378.mp4", 
    "answer": "D", 
    "answer_str": "A silver chain", 
    "duration": 972, 
    "duration_str": "00:00-16:12"
}
  • open-ended format
{
    "problem_id": "1Yc9DM8j378.mp4_temporal_localization_open_end", 
    "question_type": "temporal_localization", 
    "multiple_choice": false, 
    "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?", 
    "data_type": "video", 
    "videos": "1Yc9DM8j378.mp4", 
    "answer": "A silver chain", 
    "answer_str": "A silver chain", 
    "duration": 972, 
    "duration_str": "00:00-16:12"
}
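
The two examples above can serve as a quick sanity check when assembling your own training file. The following is an illustrative sketch only; the file name train.jsonl, the records list, and the key set are our own assumptions based on the examples above, not part of the released code. It validates the required keys and writes one JSON object per line.

import json

# Records shaped like the multiple-choice / open-ended examples above (hypothetical content)
records = [
    {
        "problem_id": "example.mp4_temporal_localization_multiple_choice",
        "question_type": "temporal_localization",
        "multiple_choice": True,
        "problem": "At 01:40, what is Man 2 wearing?\nA: A beaded necklace\nB: A silver chain",
        "data_type": "video",
        "videos": "example.mp4",
        "answer": "B",
        "answer_str": "A silver chain",
        "duration": 972,
        "duration_str": "00:00-16:12",
    },
]

required_keys = {
    "problem_id", "question_type", "multiple_choice", "problem", "data_type",
    "videos", "answer", "answer_str", "duration", "duration_str",
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        missing = required_keys - record.keys()
        assert not missing, f"missing keys: {missing}"
        f.write(json.dumps(record, ensure_ascii=False) + "\n")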

Modify the training configuration

This is the most crucial step. Please modify the configuration file at configs/config.yaml by updating the data path and model path to your local paths.

Please modify all the places marked "TODO:".

If you want to customize your data loading logic, you can further modify the RLHFDataset class in the verl/utils/dataset.py file.

Training

Then run the following script to start the RL training!

Single node

bash scripts/train_single_node.sh

Multi-node

bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES

where TRAIN_SCRIPT is the single-node training script and NNODES is the number of nodes to use.

For example,

bash scripts/srun_multi_nodes.sh scripts/train_single_node.sh 2

Acknowledgement

  • Long-RL: the codebase we built upon. Thanks for their wonderful work.

Citation

@misc{zhang2025rewatchr1boostingcomplexvideo,
  title={ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis},
  author={Congzhi Zhang and Zhibin Wang and Yinchao Ma and Jiawei Peng and Yihan Wang and Qiang Zhou and Jun Song and Bo Zheng},
  year={2025},
  eprint={2509.23652},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.23652},
}
