This document describes how to collect citations for new models (like LES and EDMF) and integrate them into the existing analysis pipeline.
The workflow consists of three main steps:
- Convert team papers from .docx to JSON format
- Scrape citations using academic APIs
- Integrate into the existing LLM analytics and dashboard pipeline
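The three steps map onto three scripts, so a whole model can be processed from one short wrapper. A minimal sketch, assuming the script names and flags used throughout this guide (the wrapper itself is illustrative, not part of the toolkit):

```python
import subprocess

def pipeline_commands(model: str, docx_path: str) -> list[list[str]]:
    """Build the three pipeline commands for one model, in order:
    convert team papers, scrape citations, integrate into the pipeline."""
    return [
        ["python", "team_papers_converter.py", docx_path, model,
         "-o", f"{model}_team_papers.json"],
        ["python", "citation_scraper.py", f"{model}_team_papers.json",
         "-o", f"{model}_citations.json", "--max-citations", "1000"],
        ["python", "pipeline_integration.py", model, f"{model}_citations.json",
         "--team-papers", f"{model}_team_papers.json"],
    ]

def run_pipeline(model: str, docx_path: str) -> None:
    """Run conversion, scraping, and integration; stop on the first failure."""
    for cmd in pipeline_commands(model, docx_path):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("LES", "team papers/LES/LES team papers.docx")` runs all three steps for LES from the `pubclassifier/` directory.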
Install the required packages first:

```bash
# For citation scraping
pip install requests scholarly crossref-commons

# For DOCX processing (optional)
pip install python-docx

# Ensure Ollama is installed and running (for LLM analysis)
ollama serve
```

You have team papers in .docx format in the `team papers/` folder. Convert them to JSON first.
```bash
cd pubclassifier

# Convert LES team papers
python team_papers_converter.py "team papers/LES/LES team papers.docx" LES -o LES_team_papers.json

# Convert EDMF team papers
python team_papers_converter.py "team papers/EDMF/EDMF team papers.docx" EDMF -o EDMF_team_papers.json
```

If the .docx files are unavailable, enter papers manually or import them from a CSV file instead:

```bash
# Manual entry for LES
python team_papers_converter.py LES -o LES_team_papers.json --format manual

# Manual entry for EDMF
python team_papers_converter.py EDMF -o EDMF_team_papers.json --format manual

# CSV import for LES
python team_papers_converter.py LES_papers.csv LES -o LES_team_papers.json --format csv
```

Use the citation scraper to collect papers that cite your team papers.
```bash
# Scrape citations for LES
python citation_scraper.py LES_team_papers.json -o LES_citations.json --max-citations 1000

# Scrape citations for EDMF
python citation_scraper.py EDMF_team_papers.json -o EDMF_citations.json --max-citations 1000
```

The scraper will:
- Search for each team paper using Semantic Scholar and CrossRef APIs
- Find citing papers for each team paper
- Save all citations in JSON format compatible with the existing pipeline
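At its core, a citation lookup is a single call to the Semantic Scholar Graph API. A standard-library sketch of the idea (the real scraper adds retries, a CrossRef fallback, and rate limiting; `fetch_citations` is an illustrative helper, not the scraper's actual function):

```python
import json
import urllib.error
import urllib.request

S2_API = "https://api.semanticscholar.org/graph/v1"

def citations_url(doi: str, limit: int = 100) -> str:
    """Graph API endpoint listing papers that cite the paper with this DOI."""
    return f"{S2_API}/paper/DOI:{doi}/citations?fields=title,year,externalIds&limit={limit}"

def fetch_citations(doi: str) -> list[dict]:
    """Return citing papers for `doi`, or [] if the paper is not indexed."""
    try:
        with urllib.request.urlopen(citations_url(doi), timeout=30) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # DOI not in Semantic Scholar
            return []
        raise
    return [item["citingPaper"] for item in payload.get("data", [])]
```

Each element of `data` in the response wraps the citing paper under a `citingPaper` key, which is why the list comprehension unwraps it.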
Example output:

```text
Scraping Statistics:
  Team papers processed: 15
  Papers found: 12
  Papers not found: 3
  Total citations collected: 450
```
Use the integration script to add your new model to the complete pipeline.
```bash
# Full integration for LES
python pipeline_integration.py LES LES_citations.json --team-papers LES_team_papers.json

# Full integration for EDMF
python pipeline_integration.py EDMF EDMF_citations.json --team-papers EDMF_team_papers.json
```

This will:
- Copy citations to `../LLM_paper_analytics/data/LES.json`
- Run LLM analysis to categorize citations (creates `LES_analyzed.json`)
- Copy analyzed data to the dashboard (`../science-model-dashboard/src/data/`)
- Update model configuration files
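The file movement in those steps amounts to a pair of copies into fixed locations. A simplified sketch of the layout, using the paths listed above (`integration_targets` is illustrative; the real script also runs the LLM analysis and updates `modelConfig.js`):

```python
import shutil
from pathlib import Path

def integration_targets(model: str, base: str = "..") -> dict[str, Path]:
    """Where a model's files end up after integration."""
    root = Path(base)
    return {
        "raw": root / "LLM_paper_analytics" / "data" / f"{model}.json",
        "analyzed": root / "LLM_paper_analytics" / "results" / f"{model}_analyzed.json",
        "dashboard": root / "science-model-dashboard" / "src" / "data" / f"{model}_analyzed.json",
    }

def copy_into_pipeline(model: str, citations_json: str, base: str = "..") -> None:
    """Copy raw citations in; copy the analyzed file out to the dashboard if it exists."""
    targets = integration_targets(model, base)
    shutil.copy(citations_json, targets["raw"])
    if targets["analyzed"].exists():
        shutil.copy(targets["analyzed"], targets["dashboard"])
```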
```bash
# Skip LLM analysis (if Ollama is not available)
python pipeline_integration.py LES LES_citations.json --no-analysis

# Skip dashboard integration
python pipeline_integration.py LES LES_citations.json --no-dashboard

# Just validate the integration
python pipeline_integration.py LES LES_citations.json --validate-only
```

After integration, start the dashboard to view your new model:
```bash
cd ../science-model-dashboard
npm install
npm start
```

Navigate to:
- http://localhost:3000/science-model-dashboard/LES
- http://localhost:3000/science-model-dashboard/EDMF
- DOCX Conversion Problems

  ```bash
  # Install python-docx
  pip install python-docx

  # Or use manual entry
  python team_papers_converter.py LES -o LES_team_papers.json --format manual
  ```
- API Rate Limiting

  The scraper has built-in delays, but if you hit limits:
  - Wait a few minutes between runs
  - Use `--max-citations` with smaller numbers

  ```bash
  python citation_scraper.py LES_team_papers.json -o LES_citations.json --max-citations 100
  ```
- Ollama Not Available

  ```bash
  # Skip LLM analysis for now
  python pipeline_integration.py LES LES_citations.json --no-analysis

  # Or install Ollama (download from https://ollama.ai) and run the analysis later
  ollama pull deepseek-r1:671b
  ```
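  To check whether Ollama is already running, its HTTP API answers on port 11434; the `/api/tags` endpoint lists installed models. A small probe sketch (`ollama_available` is a hypothetical helper):

```python
import urllib.error
import urllib.request

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server answers at `base_url`."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):  # not running / refused / timed out
        return False
```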
- Papers Not Found

  - Check the paper titles in your team papers JSON
  - Ensure DOIs are included when available
  - Some papers may not be in the Semantic Scholar/CrossRef databases
If automatic conversion doesn't work well, you can manually edit the JSON files:
```json
{
  "model_name": "LES",
  "team_papers_source": "manual",
  "extraction_date": "2025-11-11",
  "papers": [
    {
      "title": "Large Eddy Simulation of...",
      "authors": ["Smith, J.", "Jones, A."],
      "year": 2020,
      "doi": "10.1234/example",
      "venue": "Journal of Climate"
    }
  ]
}
```

After integration, the files are organized as follows:

```text
LLM_paper_analytics/
├── data/
│   ├── LES.json              # Raw citations
│   └── EDMF.json
└── results/
    ├── LES_analyzed.json     # LLM-analyzed citations
    └── EDMF_analyzed.json

science-model-dashboard/
└── src/
    ├── data/
    │   ├── LES_analyzed.json # Dashboard data
    │   └── EDMF_analyzed.json
    └── config/
        └── modelConfig.js    # Updated with new models

pubclassifier/
├── LES_team_papers.json      # Team papers
├── LES_citations.json        # Scraped citations
├── EDMF_team_papers.json
└── EDMF_citations.json
```
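Before scraping a hand-edited file, it is worth checking its fields against the example schema above; a missing title or author list will make lookups fail silently. A small validator sketch (field names are taken from the example JSON; `check_team_papers` is a hypothetical helper, not part of the toolkit):

```python
REQUIRED_FIELDS = ("title", "authors", "year")  # doi and venue are optional but recommended

def check_team_papers(data: dict) -> list[str]:
    """Return human-readable problems; an empty list means the file looks OK."""
    problems = []
    if not data.get("model_name"):
        problems.append("missing top-level 'model_name'")
    for i, paper in enumerate(data.get("papers", [])):
        for field in REQUIRED_FIELDS:
            if not paper.get(field):
                problems.append(f"paper {i}: missing '{field}'")
        if not paper.get("doi"):
            problems.append(f"paper {i}: no DOI (lookup will fall back to title search)")
    return problems
```

Load the file with `json.load` and print whatever `check_team_papers` returns before running the scraper.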
```bash
# 1. Convert team papers
cd pubclassifier
python team_papers_converter.py "team papers/LES/LES team papers.docx" LES -o LES_team_papers.json

# 2. Scrape citations (this may take 10-30 minutes, depending on the number of papers)
python citation_scraper.py LES_team_papers.json -o LES_citations.json

# 3. Integrate into the pipeline (requires Ollama running)
python pipeline_integration.py LES LES_citations.json --team-papers LES_team_papers.json

# 4. Start the dashboard
cd ../science-model-dashboard
npm start

# 5. View results at http://localhost:3000/science-model-dashboard/LES
```

If you want to run custom LLM analysis:
```bash
cd ../LLM_paper_analytics
python src/citation_analyzer.py data/LES.json -o results/LES_custom.json -m "Large Eddy Simulation methodology" --model "llama3.3:70b"
```

To process several models in one pass, create a batch script:

```bash
for model in LES EDMF; do
    echo "Processing $model..."
    python citation_scraper.py ${model}_team_papers.json -o ${model}_citations.json
    python pipeline_integration.py $model ${model}_citations.json --team-papers ${model}_team_papers.json
done
```

To add more citations to an existing model:
```bash
# Scrape additional citations
python citation_scraper.py LES_additional_papers.json -o LES_additional_citations.json

# Merge with existing data (you may need to manually combine the JSON files)
# Then re-run the integration
python pipeline_integration.py LES LES_combined_citations.json
```
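The manual merge mentioned above can be scripted: concatenate the citation lists and drop duplicates. A sketch assuming each citation is a dict with a `doi` or `title` key (adjust the key names to match your citation JSON; `merge_citations` is illustrative):

```python
def citation_key(citation: dict) -> str:
    """Prefer the DOI as a dedup key; fall back to a normalized title."""
    return (citation.get("doi") or citation.get("title", "")).strip().lower()

def merge_citations(*citation_lists: list[dict]) -> list[dict]:
    """Concatenate citation lists, keeping the first copy of each paper."""
    seen, merged = set(), []
    for citations in citation_lists:
        for citation in citations:
            key = citation_key(citation)
            if key and key not in seen:
                seen.add(key)
                merged.append(citation)
    return merged
```

Load the old and new citation files with `json.load`, pass both lists to `merge_citations`, and write the result out as `LES_combined_citations.json` before re-running the integration.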