Information extractor for New Mexico aquifer reports — ingests PDFs, generates embeddings, answers structured questions, ranks study areas, and exports tidy outputs.
- Report ingestion from curated links and basin folders.
- Automated text extraction & embeddings for fast retrieval-QA.
- Question sets for hydrogeologic properties, brackish/salinity, and bounding boxes.
- Ranking workflows for reports and study areas with plotting utilities.
- GIS helpers to rasterize/smooth polygons and manage basin shapefiles.
- Optional self-evaluation pass to check/grade AI answers.
- aquiferie_runner.py: orchestrates end-to-end runs.
- main.py: core pipeline entry (extraction → retrieval-QA → outputs).
- openai_api_client.py: thin wrapper for model calls (embeddings/Q&A).
- embeddingsIE.py: build/search embeddings for report text.
- gen_bboxes.py: derive/collect bounding boxes for study areas.
- hydrogeoprops.py: hydrogeologic-properties question/answer helpers.
- brackishwater.py: brackish/salinity insight extraction.
- rank_reports.py, rank_studyareas.py: scoring & ranking utilities.
- plot_rankings.py: quick plots for rankings.
- rasterize_polys.py, smooth_shapefiles.py: GIS post-processing scripts.
- view_insights.py: browse/inspect generated insight tables.
- self_evaluation.py: optional answer quality checks.
- Data and prompts:
  - aquiferie_report_links.csv (report URLs/paths)
  - aquiferie_insight_prompts.csv and .txt variants
  - rank_insights_questions.txt
  - bounding_boxes.* (shapefile set)
- Python ≥ 3.9
- System deps (recommended):
  - poppler (or equivalent) for some PDF extraction backends
  - gdal, geos, proj if using GIS helpers
- OpenAI API key exposed in your environment as OPENAI_API_KEY
# clone
git clone https://github.com/marissafichera/aquiferie.git
cd aquiferie
1) Build embeddings over reports
- Inputs: aquiferie_report_links.csv
- Outputs: out/embeddings/aquifer_insights_embeddings.csv
- Command:

      python embeddingsIE.py \
        --links-csv aquiferie_report_links.csv \
        --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv
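The retrieval side of this step can be pictured with a minimal sketch. It assumes each report chunk is stored alongside its embedding vector and that chunks are compared by cosine similarity; the actual storage layout and search method in embeddingsIE.py may differ. The names `cosine_similarity` and `top_k_chunks` and the toy three-dimensional vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, rows, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first.

    rows is an iterable of (text, vector) pairs; this layout is an assumption,
    not the documented format of the embeddings CSV.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data standing in for embedded report chunks (hypothetical values).
rows = [
    ("chunk about transmissivity", [0.9, 0.1, 0.0]),
    ("chunk about salinity", [0.1, 0.9, 0.0]),
    ("chunk about study-area extent", [0.0, 0.1, 0.9]),
]
best = top_k_chunks([1.0, 0.2, 0.0], rows, k=2)
```

Real embedding vectors have hundreds or thousands of dimensions, so a vectorized or indexed search is usually preferable to this pure-Python loop.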
2) Run retrieval-QA with structured prompts
- Inputs: aquiferie_insight_prompts.csv, embeddings CSV
- Outputs: out/insights/answers.csv
- Command:

      python main.py \
        --prompts-file aquiferie_insight_prompts.csv \
        --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv \
        --answers-csv out/insights/answers.csv
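As a rough illustration of how a structured question and retrieved excerpts might be combined before a model call, here is a minimal sketch. The function `build_prompt`, its `max_chars` budget, and the instruction wording are all assumptions made for this example; the prompt format used by main.py is not documented here and may look quite different.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Number the retrieved chunks, stop at a character budget, then append the question."""
    context_lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        line = f"[{i}] {chunk.strip()}"
        if used + len(line) > max_chars:
            break  # keep the prompt within the (hypothetical) context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer using only the excerpts below. "
        "If the excerpts do not contain the answer, reply 'not reported'.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy excerpts; real ones would come from the embeddings search.
prompt = build_prompt(
    "What is the reported transmissivity?",
    [
        "Pumping tests are summarized in the report appendix.",
        "Water levels were measured in monitoring wells.",
    ],
)
```

Numbering the excerpts makes it possible to ask the model to cite which excerpt supports its answer, which can help when reviewing the output table.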
3) Rank reports and/or study areas
- Inputs: out/insights/answers.csv
- Outputs: out/ranking/report_ranks.csv, out/ranking/studyarea_ranks.csv
- Commands:

      python rank_reports.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/ranking/report_ranks.csv

      python rank_studyareas.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/ranking/studyarea_ranks.csv
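One simple way to think about ranking is by answer coverage: the share of questions for which a report yielded a usable answer. The sketch below illustrates that idea only. The column names `report`, `question`, and `answer`, the "not reported" sentinel, and the function `rank_by_coverage` are assumptions; the scoring actually used by rank_reports.py and rank_studyareas.py may weigh questions differently.

```python
import csv
import io
from collections import defaultdict


def rank_by_coverage(csv_text):
    """Rank reports by the fraction of questions with a non-empty, non-sentinel answer."""
    answered = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        report = row["report"]
        total[report] += 1
        if row["answer"].strip().lower() not in ("", "not reported"):
            answered[report] += 1
    scores = {report: answered[report] / total[report] for report in total}
    # Highest coverage first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy answers table with hypothetical column names.
toy_answers = """report,question,answer
report_a,transmissivity,reported
report_a,salinity,not reported
report_b,transmissivity,reported
report_b,salinity,reported
"""
ranking = rank_by_coverage(toy_answers)
```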
4) Plot and inspect results
- Plot rankings:

      python plot_rankings.py \
        --ranking-csv out/ranking/report_ranks.csv \
        --out-png out/figures/Figure_1.png

- Browse insights:

      python view_insights.py \
        --answers-csv out/insights/answers.csv
5) Optional: GIS post-processing
- Rasterize polygons:

      python rasterize_polys.py \
        --in-shp some_polys.shp \
        --out-tif out/rasters/polys.tif

- Smooth shapefiles:

      python smooth_shapefiles.py \
        --in-shp some_polys.shp \
        --out-shp out/smoothed/polys_smoothed.shp
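To give a feel for what polygon smoothing does, the sketch below applies Chaikin's corner-cutting algorithm to a closed ring of vertices. This is a common smoothing technique chosen for illustration; it is not necessarily the method smooth_shapefiles.py uses. The function name `chaikin_smooth` is hypothetical, and the ring is assumed to be given without a repeated closing vertex (shapefile rings usually repeat the first vertex, so drop it before calling and re-append it afterwards).

```python
def chaikin_smooth(ring, iterations=1):
    """Smooth a closed polygon ring with Chaikin corner cutting.

    Each edge (p0, p1) is replaced by two points at 1/4 and 3/4 of its
    length, so every pass doubles the vertex count and rounds the corners.
    """
    points = list(ring)
    for _ in range(iterations):
        smoothed = []
        n = len(points)
        for i in range(n):
            x0, y0 = points[i]
            x1, y1 = points[(i + 1) % n]  # wrap around to close the ring
            smoothed.append((0.75 * x0 + 0.25 * x1, 0.75 * y0 + 0.25 * y1))
            smoothed.append((0.25 * x0 + 0.75 * x1, 0.25 * y0 + 0.75 * y1))
        points = smoothed
    return points


# A 4 x 4 square as a toy study-area outline.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
smoothed_square = chaikin_smooth(square, iterations=1)
```

Two or three iterations are usually enough to remove blocky corners; more passes mostly add vertices without a visible change.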
6) Optional: Self-evaluation of answers
- Inputs: out/insights/answers.csv
- Outputs: out/insights/answers_scored.csv (example)
- Command:

      python self_evaluation.py \
        --answers-csv out/insights/answers.csv \
        --out-csv out/insights/answers_scored.csv
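A very cheap sanity check on an answer is to measure how much of its vocabulary appears in the source excerpt it was drawn from. The sketch below shows that lexical-overlap idea only; it is an assumption for illustration and is not a description of how self_evaluation.py grades answers (which may, for example, ask a model to do the grading). The function name `support_score` is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its distinct alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(answer, source_text):
    """Fraction of distinct answer tokens that also occur in the source text."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(source_text)) / len(answer_tokens)


# Toy strings; a low score flags an answer worth checking by hand.
score = support_score("aquifer unconfined", "The aquifer is confined")
```

A check like this cannot judge correctness, only whether the answer strays far from the wording of the source, so it is best used to prioritize manual review.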
Environment variables
- OPENAI_API_KEY: required for embeddings and Q&A.
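Because every embedding and Q&A call depends on the key, it can be worth failing fast before a long run starts. A minimal sketch, assuming nothing about how openai_api_client.py reads the key; the helper `require_env` is hypothetical:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise if it is missing or blank."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline.")
    return value


# Typical use at the top of a script:
#     api_key = require_env("OPENAI_API_KEY")
```

Passing a plain dictionary as `environ` makes the helper easy to exercise in tests without touching the real environment.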
Inputs
- Basin subfolders with PDFs.
- Prompt sets: aquiferie_insight_prompts.txt
- Optional shapes: bounding_boxes.* (SHP/DBF/SHX/PRJ)
Outputs (typical)
- in basin subfolders: aquifer_insights_selfeval.csv
- in basin subfolders:
- Issues and PRs are welcome.
- Use feature branches, write clear commit messages, and keep PRs focused.
- If a license file is absent, please add one (e.g., MIT) before broad reuse/distribution.
- Thanks to the open-source Python and geospatial ecosystems (pandas, numpy, FAISS, scikit-learn, PyMuPDF/PDF tooling, GeoPandas/Shapely/GDAL/PROJ) that make this workflow possible.