NMBGMR/aquiferIE


AquiferIE

Information extractor for New Mexico aquifer reports — ingests PDFs, generates embeddings, answers structured questions, ranks study areas, and exports tidy outputs.


Highlights

  • Report ingestion from curated links and basin folders.
  • Automated text extraction & embeddings for fast retrieval-QA.
  • Question sets for hydrogeologic properties, brackish/salinity, and bounding boxes.
  • Ranking workflows for reports and study areas with plotting utilities.
  • GIS helpers to rasterize/smooth polygons and manage basin shapefiles.
  • Optional self-evaluation pass to check/grade AI answers.
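
The retrieval step behind "retrieval-QA" can be illustrated with a minimal sketch. It assumes each report chunk is stored with an embedding vector and that chunks are ranked by cosine similarity to the question's embedding. This is a common approach, not necessarily what embeddingsIE.py does. The function names, toy texts, and toy vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=3):
    """Rank (text, vector) pairs by similarity to query_vec; return the k best as (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data for illustration only; real embeddings have hundreds of dimensions or more.
chunks = [
    ("chunk about transmissivity", [0.9, 0.1, 0.0]),
    ("chunk about salinity", [0.1, 0.9, 0.1]),
    ("chunk about study-area extent", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # stands in for the embedding of a hydrogeologic-properties question
best = top_k_chunks(query, chunks, k=2)
for score, text in best:
    print(f"{score:.3f}  {text}")
```

The top-ranked chunks would then be passed to the model as context for the structured question.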

Repository Layout (key files)

  • aquiferie_runner.py — orchestrates end-to-end runs.
  • main.py — core pipeline entry (extraction → retrieval-QA → outputs).
  • openai_api_client.py — thin wrapper for model calls (embeddings/Q&A).
  • embeddingsIE.py — build/search embeddings for report text.
  • gen_bboxes.py — derive/collect bounding boxes for study areas.
  • hydrogeoprops.py — hydrogeologic-properties question/answer helpers.
  • brackishwater.py — brackish/salinity insight extraction.
  • rank_reports.py, rank_studyareas.py — scoring & ranking utilities.
  • plot_rankings.py — quick plots for rankings.
  • rasterize_polys.py, smooth_shapefiles.py — GIS post-processing scripts.
  • view_insights.py — browse/inspect generated insight tables.
  • self_evaluation.py — optional answer quality checks.
  • Data and prompts:
    • aquiferie_report_links.csv (report URLs/paths)
    • aquiferie_insight_prompts.csv and .txt variants
    • rank_insights_questions.txt
    • bounding_boxes.* (shapefile set)

Prerequisites

  • Python ≥ 3.9
  • System deps (recommended):
    • poppler (or equivalent) for some PDF extraction backends
    • gdal, geos, proj if using GIS helpers
  • OpenAI API key exposed in your environment as OPENAI_API_KEY
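
A quick preflight check can catch a missing key or an old interpreter before any API calls are made. The sketch below is not part of the repository; `check_prerequisites` is a hypothetical helper that tests only the two requirements listed above that can be verified from Python alone.

```python
import os
import sys


def check_prerequisites(env=None, version_info=None):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info
    problems = []
    # The README asks for Python 3.9 or newer.
    if tuple(version_info[:2]) < (3, 9):
        problems.append("Python 3.9 or newer is required")
    # The key must be present and non-empty in the environment.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in check_prerequisites():
    print("preflight:", problem)
```

System libraries such as poppler or GDAL are not checked here, since how they are detected depends on the platform and on which backends you use.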

Installation

# clone
git clone https://github.com/marissafichera/aquiferie.git
cd aquiferie

Step-by-Step Workflow

  • 1) Build embeddings over reports

    • Inputs: aquiferie_report_links.csv

    • Outputs: out/embeddings/aquifer_insights_embeddings.csv

    • Command:

      python embeddingsIE.py \
        --links-csv aquiferie_report_links.csv \
        --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv
  • 2) Run retrieval-QA with structured prompts

    • Inputs: aquiferie_insight_prompts.csv, embeddings CSV

    • Outputs: out/insights/answers.csv

    • Command:

      python main.py \
        --prompts-file aquiferie_insight_prompts.csv \
        --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv \
        --answers-csv out/insights/answers.csv
  • 3) Rank reports and/or study areas

    • Inputs: out/insights/answers.csv

    • Outputs: out/ranking/report_ranks.csv, out/ranking/studyarea_ranks.csv

    • Commands:

      python rank_reports.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/ranking/report_ranks.csv
      
      python rank_studyareas.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/ranking/studyarea_ranks.csv
  • 4) Plot and inspect results

    • Plot rankings:

      python plot_rankings.py \
        --ranking-csv out/ranking/report_ranks.csv \
        --out-png out/figures/Figure_1.png
    • Browse insights:

      python view_insights.py \
        --answers-csv out/insights/answers.csv
  • 5) Optional: GIS post-processing

    • Rasterize polygons:

      python rasterize_polys.py \
        --in-shp some_polys.shp \
        --out-tif out/rasters/polys.tif
    • Smooth shapefiles:

      python smooth_shapefiles.py \
        --in-shp some_polys.shp \
        --out-shp out/smoothed/polys_smoothed.shp
  • 6) Optional: Self-evaluation of answers

    • Inputs: out/insights/answers.csv

    • Outputs: out/insights/answers_scored.csv (example)

    • Command:

      python self_evaluation.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/insights/answers_scored.csv
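
The core steps above (1 to 3) can be chained from a single script. The sketch below only assembles the commands exactly as written in this README and runs them in order. It is not a description of how aquiferie_runner.py works internally. `build_pipeline` and `run_pipeline` are hypothetical names, and the sketch assumes the scripts accept the flags shown above and are run from the repository root.

```python
import subprocess
import sys


def build_pipeline(out_dir="out"):
    """Return (step_name, script_args) pairs mirroring steps 1 to 3 of the workflow."""
    embeddings = f"{out_dir}/embeddings/aquifer_insights_embeddings.csv"
    answers = f"{out_dir}/insights/answers.csv"
    return [
        ("embeddings", ["embeddingsIE.py",
                        "--links-csv", "aquiferie_report_links.csv",
                        "--embeddings-csv", embeddings]),
        ("retrieval_qa", ["main.py",
                          "--prompts-file", "aquiferie_insight_prompts.csv",
                          "--embeddings-csv", embeddings,
                          "--answers-csv", answers]),
        ("rank_reports", ["rank_reports.py",
                          "--answers-csv", answers,
                          "--out-csv", f"{out_dir}/ranking/report_ranks.csv"]),
        ("rank_studyareas", ["rank_studyareas.py",
                             "--answers-csv", answers,
                             "--out-csv", f"{out_dir}/ranking/studyarea_ranks.csv"]),
    ]


def run_pipeline(steps, dry_run=True):
    """Run each step with the current interpreter; with dry_run=True, only print the commands."""
    commands = []
    for name, args in steps:
        cmd = [sys.executable, *args]
        commands.append(cmd)
        if dry_run:
            print(f"[{name}] {' '.join(cmd)}")
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


commands = run_pipeline(build_pipeline(), dry_run=True)
```

Running with `dry_run=True` first is a cheap way to confirm paths before spending API calls. Whether the scripts create missing output directories is not stated here, so you may need to create `out/` subfolders beforehand.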

Configuration

  • Environment variables

    • OPENAI_API_KEY — required for embeddings and Q&A.
  • Inputs

    • Basin subfolders with PDFs.

    • Prompt sets:

      • aquiferie_insight_prompts.csv (or its .txt variant)
    • Optional shapes:

      • bounding_boxes.* (SHP/DBF/SHX/PRJ)
  • Outputs (typical)

    • In basin subfolders: aquifer_insights_selfeval.csv
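
Because typical outputs land in separate basin subfolders, it can help to gather them into one table. The sketch below is a hypothetical helper, not part of the repository. It assumes one aquifer_insights_selfeval.csv per basin folder directly under a common root, and it makes no assumption about the column names beyond adding a `basin` column. The demo columns (`question`, `answer`) and folder names are invented for illustration.

```python
import csv
import tempfile
from pathlib import Path


def collect_selfeval(root, filename="aquifer_insights_selfeval.csv"):
    """Read <root>/<basin>/<filename> for every basin folder and tag each row with its basin."""
    rows = []
    for path in sorted(Path(root).glob(f"*/{filename}")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["basin"] = path.parent.name  # overwrites any existing 'basin' column
                rows.append(row)
    return rows


# Demo with throwaway files; the basin names and columns are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for basin in ("basin_a", "basin_b"):
        folder = Path(tmp) / basin
        folder.mkdir()
        with (folder / "aquifer_insights_selfeval.csv").open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["question", "answer"])
            writer.writerow(["q1", f"answer from {basin}"])
    rows = collect_selfeval(tmp)

for row in rows:
    print(row["basin"], row["question"], row["answer"])
```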

Contributing

  • Issues and PRs are welcome.
  • Use feature branches, write clear commit messages, and keep PRs focused.

License

  • If the repository has no license file, add one (e.g., MIT) before broad reuse or distribution.

Acknowledgments

  • Thanks to the open-source Python and geospatial ecosystems (pandas, numpy, FAISS, scikit-learn, PyMuPDF/PDF tooling, GeoPandas/Shapely/GDAL/PROJ) that make this workflow possible.

About

OpenAI "information extractor" for New Mexico aquifer reports.
