ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
This is the official code used to train ReWatch-R1. Note that this repository contains only the reinforcement learning stage.
[2026/01/26] ReWatch-R1 has been accepted by ICLR 2026.
Use our model for video reasoning! Please use transformers==4.56.0 and qwen_vl_utils.
Please download our model ReWatch-R1.
It is recommended to use the video parameters from the paper (up to 192 frames, at a resolution of 128*28*28 pixels per frame).
For the best results, you must provide the video's duration in the prompt (for example, 00:00-10:00); timestamps should be in MM:SS format.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "ReWatch-R1"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
    padding_side="left",
    truncation_side="right",
)
video_path = "videos/example.mp4"
video_duration = 600
question = "What happened from [05:00] to [05:10]?"
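# Sampling settings recommended in the paper: up to 192 frames at 2 fps, with a
# pixel budget of 128*28*28 per frame (video_duration above is in seconds).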
total_pixels = 12288*28*28
min_pixels = 128*28*28
max_pixels = 128*28*28
fps = 2.0
max_frames = 192
video_config = {
    "type": "video",
    "video": video_path,
    "total_pixels": total_pixels,
    "min_pixels": min_pixels,
    "max_pixels": max_pixels,
    "fps": fps,
    "max_frames": max_frames,
}
react_prompt = """You are a video understanding expert. You are given a video and a question. You need to answer the question based on the video content. Please answer the question step by step. When you need more video details, you will re-watch the relevant clips and use <action> and </action> to mark the actions, and use <observation> and </observation> to mark the visual details you observe. When you have enough information to determine the final answer, you will wrap the final answer in <answer> and </answer>.
**Video Information and Question:**
- **Video Duration**: {video_duration}
- **Question**: {question}"""
def seconds_to_timestamp(seconds):
    """Convert a number of seconds to a timestamp string (MM:SS)."""
    minutes = seconds // 60
    seconds = seconds % 60
    return f"{minutes:02d}:{seconds:02d}"
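# For the 600-second example video, the prompt duration below is "00:00-10:00".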
duration_str = f"00:00-{seconds_to_timestamp(video_duration)}"
instruction = react_prompt.format(video_duration=duration_str, question=question)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        video_config,
        {"type": "text", "text": instruction},
    ]},
]
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
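# qwen_vl_utils samples the video frames and returns extra per-video kwargs
# (e.g. the effective fps) that must be forwarded to the processor below.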
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    max_length=16384,
    truncation=True,
    do_sample_frames=False,
    **video_kwargs,
)
inputs = inputs.to("cuda")
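# Greedy decoding; allow a generous budget for the ReAct-style reasoning trace.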
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096, use_cache=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
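# ReWatch-R1 wraps its final answer in <answer>...</answer> tags, as instructed
# by the ReAct-style prompt above. A minimal sketch for extracting that span;
# the helper name and regex are our own illustration, not part of the repo.
import re

def extract_final_answer(text):
    # Take the last <answer>...</answer> span in case earlier ones appear.
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

final_answer = extract_final_answer(output_text[0])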
print(output_text)

Please follow the steps below to start your video RL training!
First, download our cold-started model ReWatch-R1-SFT.
Second, download our QA data and Caption data.
Then prepare the QA data in the following format.
The training set is a JSONL file; each line follows one of the two formats below (see the minimal writer sketch after the examples).
- multiple-choice format
{
  "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice",
  "question_type": "temporal_localization",
  "multiple_choice": true,
  "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain",
  "data_type": "video",
  "videos": "1Yc9DM8j378.mp4",
  "answer": "D",
  "answer_str": "A silver chain",
  "duration": 972,
  "duration_str": "00:00-16:12"
}
- open-ended format
{
  "problem_id": "1Yc9DM8j378.mp4_temporal_localization_open_end",
  "question_type": "temporal_localization",
  "multiple_choice": false,
  "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing, a detail consistent with Abram's later appearance in the bar?",
  "data_type": "video",
  "videos": "1Yc9DM8j378.mp4",
  "answer": "A silver chain",
  "answer_str": "A silver chain",
  "duration": 972,
  "duration_str": "00:00-16:12"
}
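To sanity-check a data file in this layout, a minimal sketch like the following can be used; the output path train.jsonl and the abbreviated record contents are illustrative assumptions, not fixed by the repo.

import json

# Hypothetical output path; point the data path in configs/config.yaml at it.
out_path = "train.jsonl"

records = [
    {
        "problem_id": "1Yc9DM8j378.mp4_temporal_localization_multiple_choice",
        "question_type": "temporal_localization",
        "multiple_choice": True,
        "problem": "At 01:40, what specific piece of jewelry is Man 2 described as wearing?\nA: A beaded necklace\nB: A large gold medallion\nC: A leather wristband\nD: A silver chain",
        "data_type": "video",
        "videos": "1Yc9DM8j378.mp4",
        "answer": "D",
        "answer_str": "A silver chain",
        "duration": 972,
        "duration_str": "00:00-16:12",
    },
]

# One JSON object per line, as expected for a JSONL training set.
with open(out_path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")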
This is the most crucial step: modify the configuration file at configs/config.yaml, updating the data path and model path to your local paths, and edit every place marked "TODO:".
If you want to customize your data loading logic, you can further modify the RLHFDataset class in the verl/utils/dataset.py file.
Then run the following script to start the RL training!
bash scripts/train_single_node.sh

To train on multiple nodes, run

bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES

where TRAIN_SCRIPT is the script used to train on a single node and NNODES is the number of nodes required.
For example,
bash scripts/srun_multi_nodes.sh scripts/train_single_node.sh 2

- Long-RL: the codebase we built upon. Thanks for their wonderful work.
@misc{zhang2025rewatchr1boostingcomplexvideo,
  title={ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis},
  author={Congzhi Zhang and Zhibin Wang and Yinchao Ma and Jiawei Peng and Yihan Wang and Qiang Zhou and Jun Song and Bo Zheng},
  year={2025},
  eprint={2509.23652},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.23652},
}