Columbia_Capstone-KPMG/
├── configs/
│   └── ingest_parse.yaml        # Config files (parameters for pipelines)
│
├── data/
│   ├── raw/                     # Raw input documents (NEVER commit to git)
│   └── processed/               # Outputs generated by parsing/processing
│
├── docker/
│   ├── .env.example             # Template for env variables (Neo4j credentials and configs)
│   └── docker-compose.yml       # Compose file to spin up the Neo4j container
│
├── docs/                        # Notes, research findings, design docs
│
├── scripts/                     # CLI entry scripts (team runs these)
│   ├── do_asterisk_chunking.py
│   ├── do_fix_size_chunking.py
│   ├── do_semantic_chunking.py
│   ├── doc_converter.py
│   ├── ingest_graph.py
│   ├── ingestion_parse.py
│   ├── reset_graph.py
│   └── test_neo4j.py
│
├── src/                         # Core source code (modularized)
│   ├── doc_2_docx/              # Conversion logic (.doc → .docx; Windows-dependent)
│   │   └── ...
│   │
│   ├── healthcare_rag_llm/      # Main package
│   │   ├── doc_parsing/         # Parsing PDF/DOCX (tables, watermarks, images)
│   │   ├── chunking/            # Text-splitting strategies
│   │   ├── embedding/           # Embedding models
│   │   ├── graph_builder/       # Knowledge-graph builder (see README.md for usage notes)
│   │   ├── pipelines/           # Orchestrated end-to-end workflows
│   │   ├── utils/               # Helper functions (I/O, logging, common tools)
│   │   └── ...                  # Placeholder for future modules
│   │
│   └── __init__.py              # Makes src a Python package
│
├── .gitignore                   # Ignores data/, .venv/, logs, etc.
├── pyproject.toml               # Dependency config
└── README.md                    # (this file)
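The chunking scripts above implement different text-splitting strategies. As a rough illustration of the fixed-size approach with overlap, here is a minimal sketch — the function name and defaults are hypothetical, and the actual implementation in `src/healthcare_rag_llm/chunking/` may differ:

```python
def fix_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap,
    so context at chunk boundaries is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-char document yields 3 chunks; consecutive chunks share 50 chars.
chunks = fix_size_chunks("a" * 1200, chunk_size=500, overlap=50)
```

The overlap is a common trade-off: larger overlap preserves more boundary context but increases the number of chunks to embed and store.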
Put all raw Medicaid PDFs/DOCs under:

- macOS/Linux: `data/raw/Childrens Evolution of Care/`
- Windows: `data\raw\Childrens Evolution of Care\`

Parsed outputs go to:

- macOS/Linux: `data/processed/`
- Windows: `data\processed\`

Do NOT commit files under `data/` (git will ignore them). If you need to share data, use Google Drive/SharePoint.
Python:

- Required: Python 3.9 - 3.12
- Recommended: Python 3.11
- Not supported: Python 3.13+ (PyTorch compatibility)

GPU (optional, recommended):

- Supported: NVIDIA GPUs (GTX 10 series and newer), e.g. RTX 4090, RTX 3080, RTX 2070, GTX 1660
- Performance: roughly 10-20x faster than CPU
- Unsupported GPUs automatically fall back to CPU

CPU:

- Any modern multi-core processor
- 8 GB+ RAM recommended
- Expect longer processing times

Performance comparison (5000 chunks):

- GPU (RTX 2070): ~35 seconds
- CPU (Ryzen 7): ~250-500 seconds
📖 See INSTALLATION.md for detailed requirements and setup instructions.
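The GPU-to-CPU fallback described above can be expressed as a small device-selection helper. This is a sketch, assuming PyTorch; the helper name is hypothetical, and when `torch` is not installed it simply reports CPU:

```python
def pick_device() -> str:
    """Return "cuda" when a usable NVIDIA GPU is present, otherwise "cpu"."""
    try:
        import torch  # optional dependency: a missing torch means CPU-only
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(f"Running on: {pick_device()}")
```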
- Create a branch for each feature/task:

      git checkout -b feature/<short-description>

  Examples: feature/update-doc-parsing, bugfix/ocr-path

- Commit frequently, in small logical chunks:

      git add <files>
      git commit -m "Clear message: what & why"

- Sync with main before pushing:

      git checkout main
      git pull origin main
      git checkout feature/your-branch
      git merge main

- Push your branch:

      git push origin feature/your-branch

- Open a Pull Request (PR) on GitHub:
  - Always make the PR into main
  - Request at least 1 teammate as reviewer
  - Merge only after review

- Delete the branch after merge:
  - On GitHub → “Delete branch”
  - Locally: git branch -d feature/your-branch
- Clone repo & create a virtual environment:

  macOS/Linux:

      git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
      cd Columbia_Capstone-KPMG
      python3 -m venv .venv
      source .venv/bin/activate

  Windows:

      git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
      cd Columbia_Capstone-KPMG
      python -m venv .venv
      .\.venv\Scripts\activate
- Install PyTorch (GPU support recommended):

      # Universal installation (works on all systems)
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
      # Uses the GPU automatically if one is available, CPU otherwise

- Install other dependencies:

      pip install -e .

      # Download NLTK data (required for semantic chunking)
      python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
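Semantic chunking operates on sentence boundaries, which is why the NLTK punkt data is required. As a minimal sketch of that step — the helper name is hypothetical, and it falls back to a naive regex split when NLTK or its data is unavailable:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Tokenize text into sentences, preferring NLTK's punkt model."""
    try:
        from nltk.tokenize import sent_tokenize  # needs punkt / punkt_tab data
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Fallback: naive split on sentence-ending punctuation
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("Medicaid covers children. Eligibility varies by state."))
```

A semantic chunker would then group adjacent sentences whose embeddings are similar, rather than cutting at a fixed character count.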
- Install system dependencies:

  macOS:

      brew install tesseract
      brew install --cask libreoffice

  Linux:

      sudo apt-get install tesseract-ocr libreoffice

  Windows:
  - Tesseract OCR
  - Microsoft Word (via COM) or LibreOffice
- Verify installation:

      python -c "import torch; print(f'Device: {\"GPU - \" + torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU\"}')"
Put sample documents under data/raw/...

Sample run:

macOS/Linux:

    python scripts/ingestion_parse.py --config configs/ingest_parse.yaml

Windows:

    python scripts\ingestion_parse.py --config configs\ingest_parse.yaml
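The `--config` flag points at `configs/ingest_parse.yaml`. The actual keys are defined in that file; the fragment below is only an illustrative sketch of what such a pipeline config might contain, and every key name here is an assumption:

```yaml
# Illustrative sketch only — the real configs/ingest_parse.yaml may use different keys.
input_dir: data/raw/Childrens Evolution of Care
output_dir: data/processed
ocr:
  enabled: true
  engine: tesseract
chunking:
  strategy: semantic        # e.g. fix_size, semantic, asterisk
  chunk_size: 500
  overlap: 50
```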