A sophisticated news scraper that collects, filters, and presents energy, AI, and blockchain news from multiple sources with intelligent keyword matching and static website hosting.
- Smart Filtering: Uses word boundary regex matching to ensure only relevant articles are collected
- Multiple Sources: Scrapes from BBC, CNN, Guardian, Al Jazeera, TechCrunch, Ars Technica, and more
- Legislation Tracking: Collects legislative content from US Congress, UK Parliament, EU, Australia, Brazil, and South Africa
- Prediction Markets: Fetches political prediction markets from Polymarket with country-level tagging
- Static Website: Automatically generates browsable HTML indexes for each collection date
- S3 Integration: Stores content in AWS S3 with date-organized folders
- Idempotent: Prevents duplicate processing of articles
- Progress Tracking: Saves progress to resume interrupted collections
- Install Dependencies:
  pip install -r requirements.txt
- Configure AWS:
  aws configure
- Run the Scraper:
  python3 news_scraper.py
- View Results:
  - Master Index: http://news-collection-website.s3-website-us-east-1.amazonaws.com/
  - Date-specific: http://news-collection-website.s3-website-us-east-1.amazonaws.com/news/YYYY-MM-DD/
The scraper targets these keywords:
- Energy: renewable, solar, wind, nuclear, battery, grid, power, energy
- AI: artificial intelligence, machine learning, AI, ML, neural network, deep learning
- Blockchain: blockchain, cryptocurrency, bitcoin, ethereum, crypto, DeFi, Web3
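Word-boundary matching is what keeps a short keyword such as AI from matching the letters "ai" inside an unrelated word. The sketch below shows one way to do it with Python's re module. The keyword subset, the case-sensitive handling of acronyms, and the function names are assumptions for illustration, not the scraper's actual code.

```python
import re

# Hypothetical subset of the keyword lists above, for illustration only.
KEYWORDS = {
    "energy": ["solar", "wind", "grid"],
    "ai": ["artificial intelligence", "machine learning", "AI", "ML"],
    "blockchain": ["bitcoin", "crypto", "DeFi", "Web3"],
}

# Assumption: short acronyms are matched case-sensitively; everything else
# is matched case-insensitively.
CASE_SENSITIVE = {"AI", "ML"}


def compile_keyword(keyword: str) -> re.Pattern:
    """Compile a keyword into a whole-word pattern using \\b anchors."""
    flags = 0 if keyword in CASE_SENSITIVE else re.IGNORECASE
    return re.compile(r"\b" + re.escape(keyword) + r"\b", flags)


PATTERNS = {
    topic: [compile_keyword(word) for word in words]
    for topic, words in KEYWORDS.items()
}


def match_topics(text: str) -> list[str]:
    """Return the sorted topics whose keywords appear as whole words in text."""
    return sorted(
        topic
        for topic, patterns in PATTERNS.items()
        if any(pattern.search(text) for pattern in patterns)
    )
```

With these patterns, a headline containing "Traitors" or "Windows" is not collected, because "ai" and "wind" only occur inside longer words, while "New AI model forecasts solar output" matches both the ai and energy topics.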
newsroom/
├── news_scraper.py # Main news scraper script
├── legislation_scraper.py # Legislative feed scraper
├── polymarket_scraper.py # Polymarket prediction markets scraper
├── article_tagger.py # Geographic and topic tagging
├── news_storage.py # Shared S3 storage utilities
├── requirements.txt # Python dependencies
├── README.md # This file
├── lambda/
│ ├── lambda_news_scraper.py # Lambda entry handler
│ └── lambda_wrapper.py # Orchestrates all scrapers
├── scripts/
│ └── deploy_lambda.sh # Lambda deployment script
└── docs/
└── README_news_scraper.md # Detailed setup instructions
s3://news-collection-website/
├── index.html # Master index (all dates)
└── news/
└── YYYY-MM-DD/
├── index.html # Date-specific index (all articles for that date)
├── metadata/ # All article metadata (news + legislation)
├── content/ # All article content (news + legislation)
├── rss/
│ ├── metadata/ # RSS article metadata (legacy)
│ └── content/ # RSS article content (legacy)
└── direct/
├── metadata/ # Direct scrape metadata (legacy)
└── content/ # Direct scrape content (legacy)
Note: News, legislation, and Polymarket articles are all stored in the same metadata/ and content/ folders at the date level. The HTML index generation scans all metadata folders to include all articles. Use the source field or special_tags to filter by type.
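Because all three article types share the same folders, any consumer has to separate them by tag. A minimal sketch of that filtering follows, assuming each metadata record is a dict that carries an optional special_tags list. The record shape and both function names are hypothetical.

```python
def article_type(record: dict) -> str:
    """Classify a metadata record as 'legislation', 'prediction_market', or 'news'."""
    tags = record.get("special_tags") or []
    for special in ("legislation", "prediction_market"):
        if special in tags:
            return special
    # Assumption: records without a special tag are regular news articles.
    return "news"


def filter_by_type(records: list[dict], wanted: str) -> list[dict]:
    """Keep only the records of the requested type, preserving order."""
    return [record for record in records if article_type(record) == wanted]
```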
python3 news_scraper.py
python3 legislation_scraper.py
python3 polymarket_scraper.py
# Reprocess everything
python3 news_scraper.py --fresh
python3 legislation_scraper.py --fresh
News Scraper:
- Fetches articles from RSS feeds and direct scraping
- Filters articles by keyword matching (energy, AI, blockchain, etc.)
- Tags articles with geographic and topical information
- Saves to S3 with metadata and content
- Generates HTML indexes
Legislation Scraper:
- Fetches articles from legislative RSS feeds
- NO keyword filtering - collects ALL articles from feeds
- Tags articles with geographic information
- Always adds legislation tag to metadata
- Saves to same S3 bucket using shared storage utilities
- Articles appear in same HTML indexes (filter by legislation tag)
Polymarket Scraper:
- Fetches prediction markets from Polymarket Gamma API (no auth required)
- Filters for political/geopolitical markets using keyword matching
- Detects country mentions from market questions and descriptions
- Tags with prediction_market special tag and detected countries
- Saves to same S3 bucket using shared storage utilities
- Typically collects 100-150 political markets per run
Polymarket API: https://gamma-api.polymarket.com/markets
- No authentication required
- Supports pagination via limit and offset parameters
- Returns market data including question, prices, volume, liquidity
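Pagination with limit and offset can be sketched as a generator that keeps requesting pages until one comes back short or empty. This assumes the endpoint returns a JSON array of market objects. The function names, page size, and stop condition are illustrative choices, not the scraper's actual code.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

GAMMA_MARKETS_URL = "https://gamma-api.polymarket.com/markets"


def http_get_json(url: str):
    """Fetch a URL and decode the JSON body (no authentication needed)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_markets(
    page_size: int = 100,
    max_pages: int = 20,
    fetch: Callable[[str], list] = http_get_json,
) -> Iterator[dict]:
    """Yield markets page by page using the limit and offset parameters."""
    for page in range(max_pages):
        query = urlencode({"limit": page_size, "offset": page * page_size})
        batch = fetch(f"{GAMMA_MARKETS_URL}?{query}")
        if not batch:
            return
        yield from batch
        # Assumption: a short page means there is nothing left to fetch.
        if len(batch) < page_size:
            return
```

The fetch parameter exists so the loop can be exercised with a stub instead of live network calls. The max_pages cap guards against an endpoint that never returns a short page.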
All three scrapers use shared utilities:
- save_article(): Save article with metadata to S3
- exists_in_s3(): Check if file already exists
- upload_to_s3_if_not_exists(): Upload file if not already present
- get_today_folder(): Get today's folder path
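The check-then-upload pattern behind these helpers might look like the sketch below. The function names come from the list above, but the signatures, the UTC-based folder date, and the explicit client argument are assumptions. The client is expected to behave like a boto3 S3 client, of which only list_objects_v2 and put_object are used.

```python
from datetime import datetime, timezone

BUCKET = "news-collection-website"


def get_today_folder(now=None) -> str:
    """Return the date-organized prefix, e.g. 'news/2024-01-02/'."""
    # Assumption: the folder date is taken in UTC.
    now = now or datetime.now(timezone.utc)
    return f"news/{now:%Y-%m-%d}/"


def exists_in_s3(client, key: str, bucket: str = BUCKET) -> bool:
    """Check for an exact key by listing at most one object under that prefix."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def upload_to_s3_if_not_exists(client, key: str, body: bytes, bucket: str = BUCKET) -> bool:
    """Upload body under key unless it is already present; return True if uploaded."""
    if exists_in_s3(client, key, bucket):
        return False
    client.put_object(Bucket=bucket, Key=key, Body=body)
    return True
```

In real use the client would be created with boto3.client("s3"). Skipping keys that already exist is what makes repeated runs idempotent.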
Regular News Articles:
- ✅ Geographic tags (continents)
- ✅ Keyword matching
- ✅ Topic categories (energy, AI, blockchain, insurance, geopolitics)
Legislation Articles:
- ✅ Always tagged with special_tags: ['legislation']
- ✅ Geographic tags (continents) detected automatically
- ❌ No keyword matching (collects all from legislative feeds)
Polymarket Articles:
- ✅ Always tagged with special_tags: ['prediction_market']
- ✅ Country-level tagging (e.g., "United States", "Ukraine", "Mexico")
- ✅ Geographic tags (continents) detected automatically
- ✅ Keyword filtering for political/geopolitical markets
- ✅ Market data stored (volume, liquidity, prices, outcomes)
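Country-level tagging for Polymarket entries can be illustrated with a small alias table and whole-word matching. The alias table, the continent mapping, and the countries and continents field names below are hypothetical. Only special_tags and the prediction_market value come from this README.

```python
import re

# Hypothetical alias table; the real scraper is described as covering 40+ countries.
COUNTRY_ALIASES = {
    "United States": ["united states", "usa"],
    "Ukraine": ["ukraine", "ukrainian"],
    "Mexico": ["mexico", "mexican"],
}

# Hypothetical country-to-continent mapping used to derive geographic tags.
COUNTRY_TO_CONTINENT = {
    "United States": "North America",
    "Ukraine": "Europe",
    "Mexico": "North America",
}

COUNTRY_PATTERNS = {
    country: re.compile(
        r"\b(?:" + "|".join(re.escape(alias) for alias in aliases) + r")\b",
        re.IGNORECASE,
    )
    for country, aliases in COUNTRY_ALIASES.items()
}


def tag_market(question: str, description: str = "") -> dict:
    """Build the tag fields for one market from its question and description."""
    text = f"{question} {description}"
    countries = sorted(
        country
        for country, pattern in COUNTRY_PATTERNS.items()
        if pattern.search(text)
    )
    continents = sorted({COUNTRY_TO_CONTINENT[country] for country in countries})
    return {
        "special_tags": ["prediction_market"],
        "countries": countries,
        "continents": continents,
    }
```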
The newsroom system runs as an AWS Lambda function in us-east-1 that is triggered by:
- Manual Update: Click "Update" button in the UI (calls API Gateway endpoint)
- Scheduled: Daily at 11 PM Central Standard Time (5 AM UTC) via EventBridge
When triggered, all three scrapers run in sequence:
- news_scraper.py - Collects keyword-filtered news articles
- legislation_scraper.py - Collects unfiltered legislative articles
- polymarket_scraper.py - Collects political prediction markets
All scrapers share the same S3 bucket and save articles to the same date-organized folders, allowing them to appear together on the daily index pages.
cd /path/to/newsroom
./scripts/deploy_lambda.sh
The deploy script:
- Creates/updates the Lambda function in us-east-1
- Sets up EventBridge rule for daily scheduling
- Packages all scrapers and dependencies
lambda_wrapper.py forces FRESH_MODE=true before invoking each scraper, clearing their progress trackers so every Lambda run reprocesses that day's feeds end-to-end. When running locally you can mirror this behaviour by passing --fresh (or setting FRESH_MODE=true) if you need to bypass idempotency manually.
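The behaviour described above can be sketched as follows: set FRESH_MODE before each scraper call, run the scrapers in order, and record failures without aborting the remaining ones. This is a hedged illustration, not the contents of lambda_wrapper.py. The run_all and fresh_mode_enabled names, the callable-per-scraper interface, and the result strings are assumptions.

```python
import os
from typing import Callable


def fresh_mode_enabled(argv: list[str]) -> bool:
    """Return True if --fresh was passed or FRESH_MODE=true is set."""
    return "--fresh" in argv or os.environ.get("FRESH_MODE", "").lower() == "true"


def run_all(scrapers: dict[str, Callable[[], int]]) -> dict[str, str]:
    """Run each scraper in insertion order, forcing fresh mode every time."""
    results = {}
    for name, run in scrapers.items():
        # Force a full reprocess regardless of any saved progress.
        os.environ["FRESH_MODE"] = "true"
        try:
            count = run()
            results[name] = f"ok ({count} items)"
        except Exception as exc:
            # One failing scraper should not stop the others.
            results[name] = f"failed: {exc}"
    return results
```

Each scraper is passed in as a zero-argument callable that returns an item count, so the sequencing logic can be tested without AWS or network access.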
- Polymarket Integration: New scraper for political prediction markets with country-level tagging
- Country Detection: Polymarket articles tagged with specific countries (40+ countries supported)
- Deploy Script Fix: Added explicit us-east-1 region to all Lambda/EventBridge commands
- Enhanced Keyword Matching: Fixed false positives by implementing word boundary regex
- Quality Control: Reduced collection from 700+ low-quality articles to ~50 high-quality articles
- Better Filtering: Eliminated irrelevant content like "Celebrity Traitors" matching "AI"
- Legislation Scraper: Separate scraper for legislative content (bypasses keyword filtering, respects date filtering)
- Shared Storage: Extracted common S3 operations to reusable utilities
- Date Filtering: Legislation scraper now filters by past 24 hours (same timeframe as news scraper)
- Error Handling: Improved logging and error handling in Lambda wrapper
MIT License - See LICENSE file for details.