Columbia_Capstone-KPMG/
├── configs/
│   └── ingest_parse.yaml        # Config files (parameters for pipelines)
│
├── data/
│   ├── raw/                     # Raw input documents (NEVER commit to git)
│   └── processed/               # Outputs generated by parsing/processing
│
├── docker/
│   ├── .env.example             # Template for env variables (Neo4j credentials and configs)
│   └── docker-compose.yml       # Compose file to spin up the Neo4j container
│
├── docs/                        # Notes, research findings, design docs
│
├── scripts/                     # CLI entry scripts (team runs these)
│   ├── do_asterisk_chunking.py
│   ├── do_fix_size_chunking.py
│   ├── do_semantic_chunking.py
│   ├── doc_converter.py
│   ├── ingest_graph.py
│   ├── ingestion_parse.py
│   ├── reset_graph.py
│   └── test_neo4j.py
│
├── src/                         # Core source code (modularized)
│   ├── doc_2_docx/              # Conversion logic (.doc → .docx; Windows-dependent)
│   │   └── ...
│   │
│   ├── healthcare_rag_llm/      # Main package
│   │   ├── doc_parsing/         # Parsing PDF/DOCX (tables, watermarks, images)
│   │   ├── chunking/            # Text-splitting strategies
│   │   ├── embedding/           # Embedding models
│   │   ├── graph_builder/       # Knowledge-graph builder (see README.md for usage notes)
│   │   ├── pipelines/           # Orchestrated end-to-end workflows
│   │   ├── utils/               # Helper functions (I/O, logging, common tools)
│   │   └── ...                  # Placeholder for future modules
│   │
│   └── __init__.py              # Makes src a Python package
│
├── .gitignore                   # Ignores data/, .venv/, logs, etc.
├── pyproject.toml               # Dependency config
└── README.md                    # (this file)
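The chunking scripts above implement different text-splitting strategies. As a rough illustration of the fixed-size approach with overlap, here is a minimal sketch — the function name and defaults are hypothetical, and the actual implementation in `src/healthcare_rag_llm/chunking/` may differ:

```python
def fix_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap,
    so context at chunk boundaries is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-char document yields 3 chunks; consecutive chunks share 50 chars.
chunks = fix_size_chunks("a" * 1200, chunk_size=500, overlap=50)
```

The overlap is a common trade-off: larger overlap preserves more boundary context but increases the number of chunks to embed and store.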
Put all raw Medicaid PDFs/DOCs under:

- macOS/Linux: `data/raw/Childrens Evolution of Care/`
- Windows: `data\raw\Childrens Evolution of Care\`

Parsed outputs go to:

- macOS/Linux: `data/processed/`
- Windows: `data\processed\`

Do NOT commit files under `data/` (git will ignore them). If you need to share data, use Google Drive/SharePoint.
Python:

- Required: Python 3.9 - 3.12
- Recommended: Python 3.11
- Not supported: Python 3.13+ (PyTorch compatibility)

GPU (optional, recommended):

- Supported: NVIDIA GPUs (GTX 10 series and newer), e.g. RTX 4090, RTX 3080, RTX 2070, GTX 1660
- Performance: roughly 10-20x faster than CPU
- Unsupported GPUs automatically fall back to CPU

CPU:

- Any modern multi-core processor
- 8 GB+ RAM recommended
- Expect longer processing times

Performance comparison (5000 chunks):

- GPU (RTX 2070): ~35 seconds
- CPU (Ryzen 7): ~250-500 seconds
📖 See INSTALLATION.md for detailed requirements and setup instructions.
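The GPU-to-CPU fallback described above can be expressed as a small device-selection helper. This is a sketch, assuming PyTorch; the helper name is hypothetical, and when `torch` is not installed it simply reports CPU:

```python
def pick_device() -> str:
    """Return "cuda" when a usable NVIDIA GPU is present, otherwise "cpu"."""
    try:
        import torch  # optional dependency: a missing torch means CPU-only
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(f"Running on: {pick_device()}")
```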
- Create a branch for each feature/task:

      git checkout -b feature/<short-description>

  Examples: feature/update-doc-parsing, bugfix/ocr-path

- Commit frequently, in small logical chunks:

      git add <files>
      git commit -m "Clear message: what & why"

- Sync with main before pushing:

      git checkout main
      git pull origin main
      git checkout feature/your-branch
      git merge main

- Push your branch:

      git push origin feature/your-branch

- Open a Pull Request (PR) on GitHub:
  - Always make the PR into main
  - Request at least 1 teammate as reviewer
  - Merge only after review

- Delete the branch after merge:
  - On GitHub → “Delete branch”
  - Locally: git branch -d feature/your-branch
- Clone repo & create a virtual environment:

  macOS/Linux:

      git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
      cd Columbia_Capstone-KPMG
      python3 -m venv .venv
      source .venv/bin/activate

  Windows:

      git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
      cd Columbia_Capstone-KPMG
      python -m venv .venv
      .\.venv\Scripts\activate
- Install PyTorch (GPU support recommended):

      # Universal installation (works on all systems)
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
      # Uses the GPU automatically if one is available, CPU otherwise

- Install other dependencies:

      pip install -e .

      # Download NLTK data (required for semantic chunking)
      python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
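Semantic chunking operates on sentence boundaries, which is why the NLTK punkt data is required. As a minimal sketch of that step — the helper name is hypothetical, and it falls back to a naive regex split when NLTK or its data is unavailable:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Tokenize text into sentences, preferring NLTK's punkt model."""
    try:
        from nltk.tokenize import sent_tokenize  # needs punkt / punkt_tab data
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Fallback: naive split on sentence-ending punctuation
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("Medicaid covers children. Eligibility varies by state."))
```

A semantic chunker would then group adjacent sentences whose embeddings are similar, rather than cutting at a fixed character count.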
- Install system dependencies:

  macOS:

      brew install tesseract
      brew install --cask libreoffice

  Linux:

      sudo apt-get install tesseract-ocr libreoffice

  Windows:
  - Tesseract OCR
  - Microsoft Word (via COM) or LibreOffice
- Verify installation:

      python -c "import torch; print(f'Device: {\"GPU - \" + torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU\"}')"
Put sample documents under data/raw/...

Sample run:

macOS/Linux:

    python scripts/ingestion_parse.py --config configs/ingest_parse.yaml

Windows:

    python scripts\ingestion_parse.py --config configs\ingest_parse.yaml
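The `--config` flag points at `configs/ingest_parse.yaml`. The actual keys are defined in that file; the fragment below is only an illustrative sketch of what such a pipeline config might contain, and every key name here is an assumption:

```yaml
# Illustrative sketch only — the real configs/ingest_parse.yaml may use different keys.
input_dir: data/raw/Childrens Evolution of Care
output_dir: data/processed
ocr:
  enabled: true
  engine: tesseract
chunking:
  strategy: semantic        # e.g. fix_size, semantic, asterisk
  chunk_size: 500
  overlap: 50
```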