nsgLUMS/LLMs-for-Online-Child-Safety

LLMs-for-Online-Child-Safety

TheWebConf 2026 Study Repo

Dataset and Code Supplement

This repository provides supplementary material used in our experiments on video ad classification using Multimodal LLMs.
It includes the annotated dataset, metadata sources, transcription pipelines, experimental notebooks, and the annotation codebook.


📁 Repository Structure


1. ground_truth.csv

Contains the final annotated dataset.

Each row includes:

  • 🎬 Video ID
  • 🏷️ Primary Label
  • 🏷️ Secondary Label
  • ✏️ English (Translated) Transcription
  • ✏️ Native Transcription
  • 🗂️ Metadata fields (title, tags, thumbnail, channelTitle, description)
  • 🌍 Languages — indicates unavailable or region-restricted videos
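A minimal sketch of reading the file and tallying labels, using only the Python standard library. The column name `primary_label` in the usage note is an assumption for illustration; the actual headers in ground_truth.csv may differ, so inspect them first. The helper names `load_rows` and `label_counts` are hypothetical and not part of the repository.

```python
import csv
from collections import Counter
from pathlib import Path


def load_rows(csv_path):
    """Read the annotated CSV into a list of dicts, one dict per video."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def label_counts(rows, label_column):
    """Count occurrences of each value in `label_column`, skipping blank cells."""
    counts = Counter()
    for row in rows:
        # DictReader yields None for missing trailing cells, so guard against it.
        value = (row.get(label_column) or "").strip()
        if value:
            counts[value] += 1
    return counts
```

Typical use would be `rows = load_rows("ground_truth.csv")`, then `print(rows[0].keys())` to see the real column headers, then `label_counts(rows, "primary_label")` with the header name substituted as needed.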

2. PythonNotebooks/

Notebooks used for dataset construction, metadata enrichment, and experimental evaluation.

🔹 Retrieval Pipelines

  • 📥 YouTube Videos Download.ipynb — Raw video collection & filtering
  • 🧾 Download Video Metadata.ipynb — Metadata retrieval & preprocessing

🔹 Experiment Notebooks

Run, evaluate, and reproduce experiments on:

  • 🗣️ Transcription-only models
  • 🏷️ Metadata-only models
  • 🔀 Multimodal fusion pipelines
  • 🔮 Gemini-based baselines
  • 🔬 Ablations & sampling strategies

Set the directory paths in each notebook so that they point to the required CSVs. Note that video IDs, transcriptions, and labels are combined in a single CSV.
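As a sketch of the directory setup, a small helper can resolve the CSV path once and fail early with a clear message if the path is wrong. The variable `DATA_DIR` and the function `resolve_csv` are hypothetical names, not taken from the notebooks; the default filename assumes the combined CSV is ground_truth.csv.

```python
from pathlib import Path


def resolve_csv(data_dir, filename="ground_truth.csv"):
    """Return the full path to `filename` inside `data_dir`, or raise if it is missing."""
    path = Path(data_dir).expanduser() / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {filename} in {path.parent}; update the data directory."
        )
    return path
```

In a notebook this might be used as `DATA_DIR = "~/LLMs-for-Online-Child-Safety"` followed by `csv_path = resolve_csv(DATA_DIR)`, with the directory adjusted to wherever the repository was cloned.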

(All notebook files are listed inside the folder.)


3. Updated Codebook.pdf

Final annotation guide used by human labelers.

Contains:

  • 📚 Definitions for all primary & secondary labels
  • 🖼️ Label examples
  • ⚠️ Edge cases & annotation rules

4. Base Prompt.pdf

Contains the base chain-of-thought (CoT) prompt used across all experiments.


5. Miscellaneous

Additional supplementary files supporting the dataset and experiments.


🧩 Key Features

  • ✅ Multilingual transcriptions (native + translated)
  • ✅ Unified metadata integration
  • ✅ Pre-labeled dataset for replication and benchmarking
  • ✅ Modular and customizable pipeline
  • ✅ Transparent labeling methodology via codebook

🚀 How to Use This Repository

  1. 📄 Load the dataset from ground_truth.csv
  2. 🧪 Use the Python notebooks to:
    • Reproduce experiments
    • Extend analyses
    • Implement additional modeling pipelines
  3. 📘 Consult Updated Codebook.pdf for label semantics
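When extending the analyses, model predictions can be scored against the ground-truth labels. The sketch below computes accuracy and macro-averaged F1 in plain Python. These two metrics are an assumption chosen for illustration; this README does not state which metrics the notebooks report, and the function names are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over every label seen in gold or predicted."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for any seen label.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

Both functions take two equal-length lists of label strings, for example the gold labels read from ground_truth.csv and the labels returned by a model for the same videos in the same order.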

Contact

For issues or questions, email any of the following addresses:

  1. eman.nabeel@lums.edu.pk
  2. 26100270@lums.edu.pk
  3. 26100107@lums.edu.pk
