Newsroom - AI-Powered News Collection System

A sophisticated news scraper that collects, filters, and presents energy, AI, and blockchain news from multiple sources with intelligent keyword matching and static website hosting.

Features

Smart Filtering: Uses word boundary regex matching to ensure only relevant articles are collected
Multiple Sources: Scrapes from BBC, CNN, Guardian, Al Jazeera, TechCrunch, Ars Technica, and more
Legislation Tracking: Collects legislative content from US Congress, UK Parliament, EU, Australia, Brazil, and South Africa
Prediction Markets: Fetches political prediction markets from Polymarket with country-level tagging
Static Website: Automatically generates browsable HTML indexes for each collection date
S3 Integration: Stores content in AWS S3 with date-organized folders
Idempotent: Prevents duplicate processing of articles
Progress Tracking: Saves progress to resume interrupted collections

Quick Start

Install Dependencies:
```
pip install -r requirements.txt
```
Configure AWS:
```
aws configure
```
Run the Scraper:
```
python3 news_scraper_final.py
```
View Results:
- Master Index: http://news-collection-website.s3-website-us-east-1.amazonaws.com/
- Date-specific: http://news-collection-website.s3-website-us-east-1.amazonaws.com/news/YYYY-MM-DD/

Configuration

The scraper targets these keywords:

Energy: renewable, solar, wind, nuclear, battery, grid, power, energy
AI: artificial intelligence, machine learning, AI, ML, neural network, deep learning
Blockchain: blockchain, cryptocurrency, bitcoin, ethereum, crypto, DeFi, Web3

File Structure

newsroom/
├── news_scraper.py          # Main news scraper script
├── legislation_scraper.py   # Legislative feed scraper
├── polymarket_scraper.py    # Polymarket prediction markets scraper
├── article_tagger.py        # Geographic and topic tagging
├── news_storage.py          # Shared S3 storage utilities
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── lambda/
│   ├── lambda_news_scraper.py  # Lambda entry handler
│   └── lambda_wrapper.py       # Orchestrates all scrapers
├── scripts/
│   └── deploy_lambda.sh     # Lambda deployment script
└── docs/
    └── README_news_scraper.md  # Detailed setup instructions

S3 Bucket Structure

s3://news-collection-website/
├── index.html                    # Master index (all dates)
└── news/
    └── YYYY-MM-DD/
        ├── index.html           # Date-specific index (all articles for that date)
        ├── metadata/            # All article metadata (news + legislation)
        ├── content/             # All article content (news + legislation)
        ├── rss/
        │   ├── metadata/        # RSS article metadata (legacy)
        │   └── content/         # RSS article content (legacy)
        └── direct/
            ├── metadata/        # Direct scrape metadata (legacy)
            └── content/         # Direct scrape content (legacy)

Note: News, legislation, and Polymarket articles are all stored in the same metadata/ and content/ folders at the date level. The HTML index generation scans all metadata folders to include all articles. Use the source field or special_tags to filter by type.

Usage Examples

Run News Scraper Only

python3 news_scraper.py

Run Legislation Scraper Only

python3 legislation_scraper.py

Run Polymarket Scraper Only

python3 polymarket_scraper.py

Fresh Mode (Bypass Idempotency)

# Reprocess everything
python3 news_scraper.py --fresh
python3 legislation_scraper.py --fresh

How It Works

News Scraper (`news_scraper.py`)

Fetches articles from RSS feeds and direct scraping
Filters articles by keyword matching (energy, AI, blockchain, etc.)
Tags articles with geographic and topical information
Saves to S3 with metadata and content
Generates HTML indexes

Legislation Scraper (`legislation_scraper.py`)

Fetches articles from legislative RSS feeds
NO keyword filtering - collects ALL articles from feeds
Tags articles with geographic information
Always adds legislation tag to metadata
Saves to same S3 bucket using shared storage utilities
Articles appear in same HTML indexes (filter by legislation tag)

Polymarket Scraper (`polymarket_scraper.py`)

Fetches prediction markets from Polymarket Gamma API (no auth required)
Filters for political/geopolitical markets using keyword matching
Detects country mentions from market questions and descriptions
Tags with prediction_market special tag and detected countries
Saves to same S3 bucket using shared storage utilities
Typically collects 100-150 political markets per run

Polymarket API: https://gamma-api.polymarket.com/markets

No authentication required
Supports pagination via limit and offset parameters
Returns market data including question, prices, volume, liquidity

Shared Storage (`news_storage.py`)

All three scrapers use shared utilities:

save_article(): Save article with metadata to S3
exists_in_s3(): Check if file already exists
upload_to_s3_if_not_exists(): Upload file if not already present
get_today_folder(): Get today's folder path

Tagging System

Regular News Articles:

✅ Geographic tags (continents)
✅ Keyword matching
✅ Topic categories (energy, AI, blockchain, insurance, geopolitics)

Legislation Articles:

✅ Always tagged with special_tags: ['legislation']
✅ Geographic tags (continents) detected automatically
❌ No keyword matching (collects all from legislative feeds)

Polymarket Articles:

✅ Always tagged with special_tags: ['prediction_market']
✅ Country-level tagging (e.g., "United States", "Ukraine", "Mexico")
✅ Geographic tags (continents) detected automatically
✅ Keyword filtering for political/geopolitical markets
✅ Market data stored (volume, liquidity, prices, outcomes)

Lambda Deployment

The newsroom system runs as an AWS Lambda function in us-east-1 that is triggered by:

Manual Update: Click "Update" button in the UI (calls API Gateway endpoint)
Scheduled: Daily at 11PM Central Time (5AM UTC) via EventBridge

When triggered, all three scrapers run in sequence:

news_scraper.py - Collects keyword-filtered news articles
legislation_scraper.py - Collects unfiltered legislative articles
polymarket_scraper.py - Collects political prediction markets

All scrapers share the same S3 bucket and save articles to the same date-organized folders, allowing them to appear together on the daily index pages.

Deploying Updates

cd /path/to/newsroom
./scripts/deploy_lambda.sh

The deploy script:

Creates/updates the Lambda function in us-east-1
Sets up EventBridge rule for daily scheduling
Packages all scrapers and dependencies

lambda_wrapper.py forces FRESH_MODE=true before invoking each scraper, clearing their progress trackers so every Lambda run reprocesses that day's feeds end-to-end. When running locally you can mirror this behaviour by passing --fresh (or setting FRESH_MODE=true) if you need to bypass idempotency manually.

Recent Improvements

Polymarket Integration: New scraper for political prediction markets with country-level tagging
Country Detection: Polymarket articles tagged with specific countries (40+ countries supported)
Deploy Script Fix: Added explicit us-east-1 region to all Lambda/EventBridge commands
Enhanced Keyword Matching: Fixed false positives by implementing word boundary regex
Quality Control: Reduced collection from 700+ low-quality articles to ~50 high-quality articles
Better Filtering: Eliminated irrelevant content like "Celebrity Traitors" matching "AI"
Legislation Scraper: Separate scraper for legislative content (bypasses keyword filtering, respects date filtering)
Shared Storage: Extracted common S3 operations to reusable utilities
Date Filtering: Legislation scraper now filters by past 24 hours (same timeframe as news scraper)
Error Handling: Improved logging and error handling in Lambda wrapper

License

MIT License - See LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.sep		.sep
__pycache__		__pycache__
docs		docs
lambda		lambda
scripts		scripts
templates		templates
tests		tests
utils		utils
.gitignore		.gitignore
README.md		README.md
article_tagger.py		article_tagger.py
economy_politics_scraper.py		economy_politics_scraper.py
index.html		index.html
legislation_scraper.py		legislation_scraper.py
news_scraper.py		news_scraper.py
news_storage.py		news_storage.py
newsroom_dynamodb.py		newsroom_dynamodb.py
polymarket_scraper.py		polymarket_scraper.py
requirements.txt		requirements.txt
requirements_lambda.txt		requirements_lambda.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Newsroom - AI-Powered News Collection System

Features

Quick Start

Configuration

File Structure

S3 Bucket Structure

Usage Examples

Run News Scraper Only

Run Legislation Scraper Only

Run Polymarket Scraper Only

Fresh Mode (Bypass Idempotency)

How It Works

News Scraper (`news_scraper.py`)

Legislation Scraper (`legislation_scraper.py`)

Polymarket Scraper (`polymarket_scraper.py`)

Shared Storage (`news_storage.py`)

Tagging System

Lambda Deployment

Deploying Updates

Recent Improvements

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Newsroom - AI-Powered News Collection System

Features

Quick Start

Configuration

File Structure

S3 Bucket Structure

Usage Examples

Run News Scraper Only

Run Legislation Scraper Only

Run Polymarket Scraper Only

Fresh Mode (Bypass Idempotency)

How It Works

News Scraper (news_scraper.py)

Legislation Scraper (legislation_scraper.py)

Polymarket Scraper (polymarket_scraper.py)

Shared Storage (news_storage.py)

Tagging System

Lambda Deployment

Deploying Updates

Recent Improvements

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

News Scraper (`news_scraper.py`)

Legislation Scraper (`legislation_scraper.py`)

Polymarket Scraper (`polymarket_scraper.py`)

Shared Storage (`news_storage.py`)

Packages