This repository contains the code and resources for a fact-checking project developed as part of the NLP for Society (NLP4Soc) course at TU Delft (DSAIT4100, 2025).
This project evaluates the faithfulness of fact-checking explanations generated by various large language models. It includes:
- Data processing pipelines for multiple fact-checking datasets
- Inference scripts for generating fact-checking responses from LLMs
- Multiple evaluation frameworks for assessing faithfulness, including RAGAS, G-EVAL, Human-in-the-Loop, and entailment-based methods
- Detailed analysis and visualization notebooks
The interactive Human-in-the-Loop (HITL) evaluation allows users to assess the faithfulness of model-generated explanations. See the GUI screenshot below for an example of the interface used to collect human feedback on model outputs.

- archive/: Original datasets and preparation scripts
  - Three fact-checking datasets: Hover, PolitiHop, and CovidFact
  - Prepared and unified dataset versions
- experiments/: Evaluation notebooks and utilities
  - Various faithfulness evaluation methods (RAGAS, G-EVAL, Entailment)
  - Analysis outputs and merged results
- inference/: Scripts for generating model outputs
  - Fact-checking pipelines for both local and API-based models
  - Raw and converted output formats
- misc_notebooks/: Analysis and utility notebooks
  - Results processing, visualization, and final analysis
This project uses UV as the package manager.
- Install UV (if not already installed):

      pip install uv

  or follow the UV installation guide.
- Install Ollama for local model inference.
- Create a virtual environment with notebook kernel support:

      uv venv

- Install dependencies:

      uv sync

- Activate the environment:

      # On Windows
      .venv\Scripts\activate
      # On Unix/macOS
      source .venv/bin/activate
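Once the environment is active and an Ollama server is running, a quick way to confirm that a local model responds is to send it a single prompt. The sketch below is not taken from the repository's inference scripts; it assumes Ollama is listening on its default local port and that a model tag such as `mistral:7b` has already been pulled. The helper names `build_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send one prompt to a local Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server replies with a single JSON object
        # whose "response" field holds the generated text.
        return json.loads(response.read())["response"]


# Example call (requires a running server and a pulled model; the tag is an assumption):
# print(generate("mistral:7b", "Reply with the single word: ready"))
```

If the call fails with a connection error, the Ollama server is most likely not running or is bound to a different address.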
- GPT-4o: Proprietary model accessed via OpenAI API
- DeepSeek-R1:32B: Distilled and quantized version available via Ollama (MIT License)
- Mistral-7B: Apache 2.0 Licensed model, used via Ollama
- RoBERTa-Large-MNLI: Apache 2.0 Licensed model for entailment classification
- Llama 3 8B: Meta Llama 3 Community License (permitted for research use)
- RAGAS: Apache 2.0 Licensed framework for automated evaluation of Retrieval-Augmented Generation
- G-EVAL: MIT Licensed framework for evaluation of natural language generation
- Hover: MIT Licensed dataset for fact-checking
- PolitiHop: Publicly available dataset for political claim verification
- CovidFact: Research dataset for COVID-19 fact extraction and verification
The main experiment notebooks are in the experiments/ directory:
- entailment_evaluation.ipynb: Entailment-based faithfulness evaluation
- geval_evaluation.ipynb: G-EVAL framework evaluation
- ragas_evaluation.ipynb: RAGAS framework evaluation
- human_in_the_loop_evaluation.ipynb: Evaluation with human feedback
Final analysis and visualizations can be found in misc_notebooks/final_analysis.ipynb.
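To give a rough idea of what an entailment-based faithfulness score can look like, here is a minimal sketch. It is one common formulation, not necessarily the one used in entailment_evaluation.ipynb: the explanation is split into sentences, each sentence is checked against the evidence with an NLI classifier, and the score is the fraction of sentences labeled as entailed. The function names, the naive sentence splitter, and the label string `"ENTAILMENT"` are assumptions made for illustration; `toy_nli` is a stand-in so the example runs without downloading a model.

```python
import re
from typing import Callable, List

# An NLI function maps (premise, hypothesis) to a label such as
# "ENTAILMENT", "NEUTRAL", or "CONTRADICTION" (label names are assumed).
NliFn = Callable[[str, str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def faithfulness_score(evidence: str, explanation: str, nli: NliFn) -> float:
    """Fraction of explanation sentences that the evidence entails."""
    sentences = split_sentences(explanation)
    if not sentences:
        return 0.0
    entailed = sum(1 for s in sentences if nli(evidence, s) == "ENTAILMENT")
    return entailed / len(sentences)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in classifier: 'entailed' only if the hypothesis appears verbatim in the premise."""
    needle = hypothesis.rstrip(".!?").lower()
    return "ENTAILMENT" if needle in premise.lower() else "NEUTRAL"


evidence = "The bridge opened in 1932. It spans the harbour."
explanation = "The bridge opened in 1932. It was painted red."
score = faithfulness_score(evidence, explanation, toy_nli)
print(score)
```

In a real run, `toy_nli` would be replaced by a function that wraps an actual NLI model such as RoBERTa-Large-MNLI and returns its predicted label for the evidence and sentence pair.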
- Samuel Goldie
- Jorden van Schijndel
- Franciszek Latała
- Razo van Berkel
- Deniz Çetin
