This work was tested with Python 3.8.12, CUDA 11.3, and Ubuntu 18.04.
conda create -n CICR python=3.8
conda activate CICR
conda install pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
# You should also download nltk_data.
python -c "import nltk; nltk.download('all')"The structure of the data folder is as follows:
data
├── charades
│ ├── annotations
│ │ ├── charades_sta_test.txt
│ │ ├── charades_sta_train.txt
│ │ ├── Charades_v1_test.csv
│ │ ├── Charades_v1_train.csv
│ │ ├── CLIP_tokenized_count.txt
│ │ ├── GloVe_tokenized_count.txt
│ │ └── glove.pkl
│ ├── charades_query_object_glove_train.pkl
│ ├── charades_query_subject_glove_train.pkl
│ └── charades_query_relation_glove_train.pkl
├── Charades-CD
│ ├── charades_test_iid.json
│ ├── charades_test_ood.json
│ ├── charades_train.json
│ ├── charades_val.json
│ ├── charades_query_object_glove_train.pkl
│ ├── charades_query_subject_glove_train.pkl
│ ├── charades_query_relation_glove_train.pkl
│ └── glove.pkl -> ../charades/annotations/glove.pkl
├── Charades-CG
│ ├── novel_composition.json
│ ├── novel_word.json
│ ├── test_trivial.json
│ ├── train.json
│ ├── CLIP_tokenized_count.txt -> ../charades/annotations/CLIP_tokenized_count.txt
│ └── glove.pkl -> ../charades/annotations/glove.pkl
├── qvhighlights
│ ├── annotations
│ │ ├── CLIP_tokenized_count.txt
│ │ ├── highlight_test_release.jsonl
│ │ ├── highlight_train_release.jsonl
│ │ ├── highlight_val_object.jsonl
│ │ └── highlight_val_release.jsonl
│ ├── qvhighlights_query_relation_train.pkl
│ ├── qvhighlights_query_subject_train.pkl
│ └── qvhighlights_query_object_train.pkl
├── tacos
│ ├── annotations
│ │ ├── CLIP_tokenized_count.txt
│ │ ├── GloVe_tokenized_count.txt
│ │ ├── test.json
│ │ ├── train.json
│ │ └── val.json
│ ├── tacos_query_object_glove_train.pkl
│ ├── tacos_query_relation_glove_train.pkl
│ └── tacos_query_subject_glove_train.pkl

All extracted features are converted to hdf5 files for more efficient storage. You can use the provided Python script ./data/npy2hdf5.py to convert *.npy or *.npz files to an hdf5 file.
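The conversion done by ./data/npy2hdf5.py can be sketched roughly like this (an illustrative stand-in, not the repository script itself; the one-file-per-video layout and the `features` array name inside .npz files are assumptions):

```python
import glob
import os

import h5py
import numpy as np

def npy_dir_to_hdf5(src_dir, dst_path, npz_key="features"):
    """Pack every .npy/.npz file in src_dir into one hdf5 file,
    keyed by the file stem (typically the video id)."""
    with h5py.File(dst_path, "w") as h5:
        for path in sorted(glob.glob(os.path.join(src_dir, "*.np[yz]"))):
            key = os.path.splitext(os.path.basename(path))[0]
            if path.endswith(".npz"):
                arr = np.load(path)[npz_key]  # assumed array name in the .npz
            else:
                arr = np.load(path)
            h5.create_dataset(key, data=arr, compression="gzip")
```

Compressed per-video datasets keep the file small while still allowing random access by video id at training time.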
These files are built for masked language modeling in FW-MESM, and they can be generated by running
python -m data.tokenized_count
- CLIP_tokenized_count.txt: Column 1 is the word_id tokenized by the CLIP tokenizer; column 2 is the number of times the word_id appears in the whole dataset.
- GloVe_tokenized_count.txt: Column 1 is the split word in a sentence; column 2 is its tokenized id for GloVe; column 3 is the number of times the word appears in the whole dataset.
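Given the column layouts described above, the two count files can be parsed with a few lines like these (a minimal sketch; whitespace-separated columns and the helper names are assumptions, not part of the repository):

```python
def load_clip_counts(path):
    # CLIP_tokenized_count.txt: <word_id> <count> per line.
    counts = {}
    with open(path) as f:
        for line in f:
            word_id, count = line.split()
            counts[int(word_id)] = int(count)
    return counts

def load_glove_counts(path):
    # GloVe_tokenized_count.txt: <word> <glove_id> <count> per line.
    counts = {}
    with open(path) as f:
        for line in f:
            word, glove_id, count = line.split()
            counts[word] = (int(glove_id), int(count))
    return counts
```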
- CLIP+SlowFast: We use the features provided by MESM.
- I3D: We use the features provided by VSLNet.
- VGG: We use the features provided by 2D-TAN.
We use the official feature files for the QVHighlights dataset from Moment-DETR, and merge them into clip_image.hdf5 and slowfast.hdf5.
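Once merged, the two files can be read per video id roughly as follows (a hedged sketch assuming one hdf5 dataset per video id; the class name and the channel-wise concatenation are illustrative, not the repository's actual dataloader):

```python
import h5py
import numpy as np

class QVHighlightsFeatures:
    """Sketch of reading the merged feature files at training time."""

    def __init__(self, clip_path="clip_image.hdf5", sf_path="slowfast.hdf5"):
        self.clip = h5py.File(clip_path, "r")
        self.slowfast = h5py.File(sf_path, "r")

    def __getitem__(self, vid):
        clip_feat = np.asarray(self.clip[vid])
        sf_feat = np.asarray(self.slowfast[vid])
        # Truncate to a common number of clips before concatenating channels.
        t = min(len(clip_feat), len(sf_feat))
        return np.concatenate([clip_feat[:t], sf_feat[:t]], axis=1)
```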
Features are obtained from MESM.
You can run train.py with arguments on the command line:
CUDA_VISIBLE_DEVICES=0 python train.py {--args}
Or run with a config file as input:
CUDA_VISIBLE_DEVICES=0 python train.py --config_file ./config/charades/VGG_GloVe.json
You can run eval.py with arguments on the command line:
CUDA_VISIBLE_DEVICES=0 python eval.py {--args}
Or run with a config file as input:
CUDA_VISIBLE_DEVICES=0 python eval.py --config_file ./config/charades/VGG_GloVe_eval.json
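The typical mechanics behind a --config_file option look like this (a hypothetical sketch, not the repository's parser; --lr and --batch_size are placeholder options): values from the JSON file fill in the argparse defaults, and flags given explicitly on the command line still win.

```python
import argparse
import json

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_file", type=str, default=None)
    parser.add_argument("--lr", type=float, default=1e-4)       # placeholder
    parser.add_argument("--batch_size", type=int, default=32)   # placeholder
    args = parser.parse_args(argv)
    if args.config_file:
        # Config values become the new defaults, then re-parse so that
        # explicit command-line flags override the config file.
        with open(args.config_file) as f:
            parser.set_defaults(**json.load(f))
        args = parser.parse_args(argv)
    return args
```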