TheWebConf 2026 Study Repo
This repository provides supplementary material used in our experiments on video ad classification using Multimodal LLMs.
It includes the annotated dataset, metadata sources, transcription pipelines, experimental notebooks, and the annotation codebook.
Contains the final annotated dataset.
Each row includes:
- 🎬 Video ID
- 🏷️ Primary Label
- 🏷️ Secondary Label
- ✏️ English (Translated) Transcription
- ✏️ Native Transcription
- 🗂️ Metadata fields (title, tags, thumbnail, channelTitle, description)
- 🌍 Languages — indicates unavailable or region-restricted videos
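As an illustration, rows with the fields above can be read with a few lines of standard-library Python. This is a minimal sketch: the sample data and the exact column headers here are illustrative and may differ from those in the released CSV.

```python
import csv
import io

# Illustrative one-row sample mirroring the fields listed above;
# the real dataset ships as a CSV with one row per annotated video.
SAMPLE = """video_id,primary_label,secondary_label,english_transcription,native_transcription,title,language
abc123,Automotive,Luxury,"Drive the future today.","Fahre heute die Zukunft.",New Car Ad,de
"""

def load_rows(csv_text):
    """Parse the annotated dataset into a list of dicts, one per video."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = load_rows(SAMPLE)
print(rows[0]["video_id"], rows[0]["primary_label"])
```

In the repository itself you would pass the contents of ground_truth.csv instead of the inline sample.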
Notebooks used for dataset construction, metadata enrichment, and experimental evaluation.
- 📥 YouTube Videos Download.ipynb — Raw video collection & filtering
- 🧾 Download Video Metadata.ipynb — Metadata retrieval & preprocessing
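The metadata notebook presumably queries the YouTube Data API v3 for the snippet fields listed earlier (title, tags, thumbnail, channelTitle, description). A hedged sketch of assembling such a request URL, with the function name and key placeholder being illustrative:

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3/videos"

def build_metadata_url(video_ids, api_key):
    """Build a YouTube Data API v3 videos.list URL requesting the
    'snippet' part, which carries title, tags, thumbnails,
    channelTitle, and description."""
    params = {
        "part": "snippet",
        "id": ",".join(video_ids),  # the API accepts comma-separated IDs
        "key": api_key,
    }
    return f"{API_BASE}?{urlencode(params)}"

url = build_metadata_url(["abc123", "def456"], "YOUR_API_KEY")
```

The actual notebook may batch IDs or use a client library; this only shows the shape of the request.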
Run, evaluate, and reproduce experiments on:
- 🗣️ Transcription-only models
- 🏷️ Metadata-only models
- 🔀 Multimodal fusion pipelines
- 🔮 Gemini-based baselines
- 🔬 Ablations & sampling strategies
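As an illustration of the fusion setup above, transcription and metadata can be flattened into a single text input before prompting a model. This is a minimal sketch under assumed conventions; the function name and formatting are hypothetical, not the repository's actual implementation.

```python
def build_fusion_input(transcription, metadata):
    """Concatenate a video's transcription and its metadata fields
    into one text block suitable for a single classification prompt."""
    meta_lines = [f"{key}: {value}" for key, value in metadata.items()]
    return (
        "TRANSCRIPTION:\n" + transcription
        + "\n\nMETADATA:\n" + "\n".join(meta_lines)
    )

example = build_fusion_input(
    "Drive the future today.",
    {"title": "New Car Ad", "tags": "car, ev", "channelTitle": "AutoChannel"},
)
```

Transcription-only and metadata-only runs would simply pass one of the two sources instead of both.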
Set the directories in the notebooks to point to the required CSV files. Note that the video IDs, transcriptions, and labels are combined into a single CSV.
(All notebook files are listed inside the folder.)
Final annotation guide used by human labelers.
Contains:
- 📚 Definitions for all primary & secondary labels
- 🖼️ Label examples
- ⚠️ Edge cases & annotation rules
Contains the base chain-of-thought (CoT) prompt used across all experiments.
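The released prompt file contains the actual wording; purely as an illustration of how a base CoT template can be filled per video, consider the following sketch (all prompt wording and names here are hypothetical):

```python
# Hypothetical base CoT template; the repository's real prompt wording
# lives in the released prompt file and will differ from this.
BASE_COT_PROMPT = (
    "You are classifying a video advertisement.\n"
    "Think step by step:\n"
    "1. Summarize what the ad is selling.\n"
    "2. Identify the most likely primary label.\n"
    "3. Identify a secondary label if one applies.\n"
    "Answer with: primary=<label>, secondary=<label>\n\n"
    "INPUT:\n{video_input}"
)

def render_prompt(video_input):
    """Fill the base CoT template with the per-video input text."""
    return BASE_COT_PROMPT.format(video_input=video_input)

prompt = render_prompt("title: New Car Ad")
```

Reusing one template across all experiments keeps the CoT instructions constant while only the per-video input varies.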
Additional supplementary files supporting the dataset and experiments.
- ✅ Multilingual transcriptions (native + translated)
- ✅ Unified metadata integration
- ✅ Pre-labeled dataset for replication and benchmarking
- ✅ Modular and customizable pipeline
- ✅ Transparent labeling methodology via codebook
- 📄 Load the dataset from ground_truth.csv
- 🧪 Use the Python notebooks to:
- Reproduce experiments
- Extend analyses
- Implement additional modeling pipelines
- 📘 Consult Updated Codebook.pdf for label semantics
For issues or questions, please reach out to any of the following email addresses: