Columbia_Capstone-KPMG

Project Organization

Columbia_Capstone-KPMG/
│── configs/                 
│   └── ingest_parse.yaml       # Config files (parameters for pipelines)
│
│── data/                      
│   ├── raw/                    # Raw input documents (NEVER commit to git)
│   └── processed/              # Outputs generated by parsing/processing
│
│── docker/                      
│   ├── .env.example            # Template for env variables (Neo4j credentials and configs)
│   └── docker-compose.yml      # Compose file to spin up Neo4j container
│
│── docs/                       # Notes, research findings, design docs
│
│── scripts/                    # CLI entry scripts (team runs these)
│   ├── do_asterisk_chunking.py
│   ├── do_fix_size_chunking.py
│   ├── do_semantic_chunking.py
│   ├── doc_converter.py
│   ├── ingest_graph.py
│   ├── ingestion_parse.py
│   ├── reset_graph.py
│   └── test_neo4j.py
│
│── src/                        # Core source code (modularized)
│   ├── doc_2_docx/             # Conversion logic (.doc → .docx; Windows-dependent)
│   │   └── ...
│   │
│   ├── healthcare_rag_llm/     # Main package
│   │   ├── doc_parsing/        # Parsing PDF/Docx (tables, watermarks, images)
│   │   │   └── ...
│   │   │
│   │   ├── chunking/           # Text splitting strategies
│   │   │   └── ...
│   │   │
│   │   ├── embedding/          # Embedding models
│   │   │   └── ...
│   │   │
│   │   ├── graph_builder/      # Knowledge Graph builder (refer to README.md for usage note)
│   │   │   └── ...
│   │   │
│   │   ├── pipelines/          # Orchestrated workflows (end-to-end)
│   │   │   └── ...
│   │   │
│   │   ├── utils/              # Helper functions (I/O, logging, common tools)
│   │   │   └── ...
│   │   │
│   │   └── ...                 # Placeholder for future modules
│   │
│   └── __init__.py             # Makes src a Python package
│
│── .gitignore                  # Ignores data/, .venv/, logs, etc.
│── pyproject.toml              # Dependency config
│── README.md                   # (this file)

Data Handling

  • Put all raw Medicaid PDFs/Docs under:

    • macOS/Linux:

      data/raw/Childrens Evolution of Care/
      
    • Windows:

      data\raw\Childrens Evolution of Care\
      
  • Parsed outputs go to:

    • macOS/Linux:

      data/processed/
      
    • Windows:

      data\processed\
      
  • DO NOT commit files under data/ (git will ignore them).

  • If you need to share data: use Google Drive/SharePoint.
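
Since the same directories are spelled with `/` on macOS/Linux and `\` on Windows, a script can build them with `pathlib` so one code path serves both. The following is a minimal sketch, assuming the script runs from the repository root; `REPO_ROOT` and `list_raw_documents` are hypothetical names for illustration, not part of the project's code.

```python
from pathlib import Path

# Assumption: the current working directory is the repository root.
REPO_ROOT = Path.cwd()

# Directory names taken from the Data Handling rules above.
RAW_DIR = REPO_ROOT / "data" / "raw" / "Childrens Evolution of Care"
PROCESSED_DIR = REPO_ROOT / "data" / "processed"


def list_raw_documents(raw_dir: Path) -> list[Path]:
    """Return PDF and Word files under raw_dir, sorted by path (hypothetical helper)."""
    if not raw_dir.is_dir():
        return []
    suffixes = {".pdf", ".doc", ".docx"}
    return sorted(
        p for p in raw_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

Calling `list_raw_documents(RAW_DIR)` returns an empty list when the folder does not exist yet, so the same call is safe on a fresh clone.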

System Requirements

Python Version

  • Required: Python 3.9 - 3.12
  • Recommended: Python 3.11
  • Not supported: Python 3.13+ (PyTorch compatibility)
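
A script can check the interpreter against this range before doing any work. The sketch below assumes the 3.9 to 3.12 bounds stated above; the function name `python_version_supported` is hypothetical and not part of the project's code.

```python
import sys

# Bounds taken from the requirements above: 3.9 through 3.12 inclusive.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)


def python_version_supported(version=None) -> bool:
    """Return True if (major, minor) falls within the supported range.

    With no argument, the running interpreter's version is checked.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION
```

For example, `python_version_supported((3, 11))` is `True` and `python_version_supported((3, 13))` is `False`.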

Hardware

GPU (Highly Recommended for Performance)

  • Supported: NVIDIA GPUs (GTX 10 series and newer)
    • Examples: RTX 4090, RTX 3080, RTX 2070, GTX 1660
  • Performance: 10-20x faster than CPU
  • Unsupported GPUs: Automatically fall back to CPU
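
The CPU fallback can be illustrated with PyTorch's `torch.cuda.is_available()`. This is a sketch of how such a fallback is commonly written, not necessarily how the project's embedding code selects its device; the helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can use an NVIDIA GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed wherever PyTorch accepts a device name, for example `model.to(pick_device())`.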

CPU Only (Works but Slower)

  • Any modern multi-core processor
  • 8GB+ RAM recommended
  • Expect longer processing times

Performance Comparison:

  • GPU (RTX 2070): ~35 seconds for 5000 chunks
  • CPU (Ryzen 7): ~250-500 seconds for 5000 chunks
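
To put these two figures in the same units, the short calculation below derives throughput and the implied speedup. It uses only the numbers quoted in the comparison above; the variable names are illustrative.

```python
CHUNKS = 5000
GPU_SECONDS = 35           # RTX 2070 figure quoted above
CPU_SECONDS = (250, 500)   # Ryzen 7 range quoted above

gpu_throughput = CHUNKS / GPU_SECONDS
cpu_throughput = tuple(CHUNKS / s for s in CPU_SECONDS)
speedup = tuple(s / GPU_SECONDS for s in CPU_SECONDS)

print(f"GPU: ~{gpu_throughput:.0f} chunks/s")                               # GPU: ~143 chunks/s
print(f"CPU: ~{cpu_throughput[1]:.0f}-{cpu_throughput[0]:.0f} chunks/s")    # CPU: ~10-20 chunks/s
print(f"Speedup: ~{speedup[0]:.1f}x-{speedup[1]:.1f}x")                     # Speedup: ~7.1x-14.3x
```

For this particular GPU/CPU pair the measured speedup is therefore roughly 7x to 14x; other hardware will differ.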

📖 See INSTALLATION.md for detailed requirements and setup instructions.

Git Workflow (Team Rules)

  1. Create a branch for each feature/task

    git checkout -b feature/<short-description>
    

    Examples:

    • feature/update-doc-parsing
    • bugfix/ocr-path
  2. Commit frequently, in small logical chunks

    git add <files>
    git commit -m "Clear message: what & why"
    
  3. Sync with main before pushing

    git checkout main
    git pull origin main
    git checkout feature/your-branch
    git merge main
    
  4. Push your branch

    git push origin feature/your-branch
    
  5. Open a Pull Request (PR) on GitHub

    • Always make a PR into main
    • Request at least 1 teammate as reviewer
    • Merge only after review
  6. Delete branch after merge

    • On GitHub → “Delete branch”

    • Locally:

      git branch -d feature/your-branch
      

Quickstart

  1. Clone the repo and create a virtual environment:
  • macOS/Linux

    git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
    cd Columbia_Capstone-KPMG
    python3 -m venv .venv
    source .venv/bin/activate
  • Windows

    git clone git@github.com:xiaojiangwu12338/Columbia_Capstone-KPMG.git
    cd Columbia_Capstone-KPMG
    python -m venv .venv
    .\.venv\Scripts\activate
  2. Install PyTorch (GPU support recommended):

    # Universal installation (works on all systems)
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

    # This automatically uses GPU if available, CPU otherwise
  3. Install other dependencies:

    pip install -e .

    # Download NLTK data (required for semantic chunking)
    python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
  4. Install system dependencies:
  • macOS

    brew install tesseract libreoffice
  • Linux

    sudo apt-get install tesseract-ocr libreoffice
  • Windows

  5. Verify installation:

    python -c "import torch; print(f'Device: {\"GPU - \" + torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU\"}')"
  6. Put sample documents under data/raw/...

  7. Sample run:

  • macOS/Linux

    python scripts/ingestion_parse.py --config configs/ingest_parse.yaml
  • Windows

    python scripts\ingestion_parse.py --config configs\ingest_parse.yaml
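
Before the sample run, it can help to confirm that the system dependencies and the config file are in place. The sketch below is a hypothetical preflight check, not a script shipped with the repository. It assumes Tesseract is exposed on `PATH` as `tesseract` and LibreOffice as either `soffice` or `libreoffice`; the actual executable names may differ by platform and install method.

```python
import shutil
from pathlib import Path

# Assumption: these are the executable names the installed packages expose on PATH.
REQUIRED_TOOLS = {
    "Tesseract OCR": ("tesseract",),
    "LibreOffice": ("soffice", "libreoffice"),
}


def missing_tools(required=REQUIRED_TOOLS) -> list[str]:
    """Return display names of tools for which no candidate executable is on PATH."""
    return [
        name
        for name, candidates in required.items()
        if not any(shutil.which(exe) for exe in candidates)
    ]


def preflight(config_path: Path) -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = [f"{name} not found on PATH" for name in missing_tools()]
    if not config_path.is_file():
        problems.append(f"config file not found: {config_path}")
    return problems
```

Run from the repository root, `preflight(Path("configs") / "ingest_parse.yaml")` returns a list of problems to fix before invoking `scripts/ingestion_parse.py`.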
