GitHub - NMBGMR/aquiferIE: OpenAI "information extractor" for New Mexico aquifer reports.

AquiferIE

Information extractor for New Mexico aquifer reports — ingests PDFs, generates embeddings, answers structured questions, ranks study areas, and exports tidy outputs.

Highlights

Report ingestion from curated links and basin folders.
Automated text extraction & embeddings for fast retrieval-QA.
Question sets for hydrogeologic properties, brackish/salinity, and bounding boxes.
Ranking workflows for reports and study areas with plotting utilities.
GIS helpers to rasterize/smooth polygons and manage basin shapefiles.
Optional self-evaluation pass to check/grade AI answers.

Repository Layout (key files)

aquiferie_runner.py — orchestrates end-to-end runs.
main.py — core pipeline entry (extraction → retrieval-QA → outputs).
openai_api_client.py — thin wrapper for model calls (embeddings/Q&A).
embeddingsIE.py — build/search embeddings for report text.
gen_bboxes.py — derive/collect bounding boxes for study areas.
hydrogeoprops.py — hydrogeologic-properties question/answer helpers.
brackishwater.py — brackish/salinity insight extraction.
rank_reports.py, rank_studyareas.py — scoring & ranking utilities.
plot_rankings.py — quick plots for rankings.
rasterize_polys.py, smooth_shapefiles.py — GIS post-processing scripts.
view_insights.py — browse/inspect generated insight tables.
self_evaluation.py — optional answer quality checks.
Data and prompts:
- aquiferie_report_links.csv (report URLs/paths)
- aquiferie_insight_prompts.csv and .txt variants
- rank_insights_questions.txt
- bounding_boxes.* (shapefile set)

Prerequisites

Python ≥ 3.9
System deps (recommended):
- poppler (or equivalent) for some PDF extraction backends
- gdal, geos, proj if using GIS helpers
OpenAI API key exposed in your environment as OPENAI_API_KEY
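
A script can fail fast when the key is missing instead of erroring partway through a run. The following is a minimal sketch, not code taken from this repository; `require_api_key` is a hypothetical helper name.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is unset.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the pipeline."
        )
    return key
```

Calling `require_api_key()` at the top of an entry script would surface a missing key before any PDFs are processed.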

Installation

```
# clone
git clone https://github.com/marissafichera/aquiferie.git
cd aquiferie
```

Step-by-Step Workflow

1) Build embeddings over reports

Inputs: aquiferie_report_links.csv
Outputs: out/embeddings/aquifer_insights_embeddings.csv

Command:

```
python embeddingsIE.py \
  --links-csv aquiferie_report_links.csv \
  --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv
```
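
Report text is usually split into chunks before it is embedded. How embeddingsIE.py handles this is not documented here, so the following is only a sketch under assumptions: `chunk_text`, the chunk size, and the overlap are all hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows (illustrative only).

    chunk_size and overlap are assumed values, not the repository's settings.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of embedding some text twice.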

2) Run retrieval-QA with structured prompts

Inputs: aquiferie_insight_prompts.csv, embeddings CSV
Outputs: out/insights/answers.csv

Command:

```
python main.py \
  --prompts-file aquiferie_insight_prompts.csv \
  --embeddings-csv out/embeddings/aquifer_insights_embeddings.csv \
  --answers-csv out/insights/answers.csv
```
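
The retrieval half of retrieval-QA can be pictured as a nearest-neighbour search over the stored embeddings. This is a generic sketch using cosine similarity in plain Python, not the repository's implementation; `top_k_chunks` and the `(text, embedding)` row shape are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, rows, k=3):
    """Return the k (score, text) pairs most similar to the query vector.

    rows is an iterable of (text, embedding) pairs, a hypothetical
    in-memory form of the embeddings CSV.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The retrieved chunks would then be placed in the prompt alongside each structured question before the model is asked to answer.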

3) Rank reports and/or study areas

Inputs: out/insights/answers.csv
Outputs: out/ranking/report_ranks.csv, out/ranking/studyarea_ranks.csv

Commands:

```
python rank_reports.py \
  --answers-csv out/insights/answers.csv \
  --out-csv out/ranking/report_ranks.csv
```

```
python rank_studyareas.py \
  --answers-csv out/insights/answers.csv \
  --out-csv out/ranking/studyarea_ranks.csv
```
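
The scoring rule used by the ranking scripts is not described in this README. As one plausible illustration only, the sketch below ranks groups by their mean score; the `report` and `score` field names and the function name are assumptions, not the repository's schema.

```python
from collections import defaultdict


def rank_by_mean_score(rows, key="report", score="score"):
    """Rank groups by mean score, highest first (illustrative only).

    rows is an iterable of dicts; field names are hypothetical.
    Ties are broken alphabetically by group name.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += float(row[score])
        bucket[1] += 1
    means = {name: total / count for name, (total, count) in totals.items()}
    ordered = sorted(means.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, mean) for rank, (name, mean) in enumerate(ordered, start=1)]
```

The same function could rank study areas by passing a different `key`, assuming the answers table carries a study-area column.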

4) Plot and inspect results

Plot rankings:

```
python plot_rankings.py \
  --ranking-csv out/ranking/report_ranks.csv \
  --out-png out/figures/Figure_1.png
```

Browse insights:

```
python view_insights.py \
  --answers-csv out/insights/answers.csv
```

5) Optional: GIS post-processing

Rasterize polygons:

```
python rasterize_polys.py \
  --in-shp some_polys.shp \
  --out-tif out/rasters/polys.tif
```
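
Conceptually, rasterizing a polygon means deciding for each grid cell whether it falls inside the shape. The script itself presumably relies on GDAL or a similar library; the pure-Python sketch below only illustrates the idea with a ray-casting test at cell centres, and every name in it is hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray-casting test; ring is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(ring, xmin, ymax, cell, ncols, nrows):
    """Burn a polygon into a 0/1 grid by testing each cell centre.

    Row 0 is the top (northern) row, as in most raster formats.
    """
    grid = []
    for r in range(nrows):
        yc = ymax - (r + 0.5) * cell
        grid.append([
            1 if point_in_polygon(xmin + (c + 0.5) * cell, yc, ring) else 0
            for c in range(ncols)
        ])
    return grid
```

Testing cell centres is the simplest burn rule; production tools also offer an "all touched" mode that marks every cell the polygon intersects.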

Smooth shapefiles:

```
python smooth_shapefiles.py \
  --in-shp some_polys.shp \
  --out-shp out/smoothed/polys_smoothed.shp
```
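
The smoothing method used by smooth_shapefiles.py is not stated here. Chaikin corner cutting is one common way to soften blocky polygon outlines, shown below purely as an illustration; `chaikin_smooth` is a hypothetical name and this may not be the algorithm the script uses.

```python
def chaikin_smooth(ring, iterations=1):
    """Smooth a closed ring of (x, y) vertices by Chaikin corner cutting.

    ring lists each vertex once (no repeated closing vertex). Each pass
    replaces every edge with two points at 1/4 and 3/4 of its length,
    doubling the vertex count.
    """
    pts = list(ring)
    for _ in range(iterations):
        smoothed = []
        n = len(pts)
        for i in range(n):
            (x1, y1), (x2, y2) = pts[i], pts[(i + 1) % n]
            smoothed.append((0.75 * x1 + 0.25 * x2, 0.75 * y1 + 0.25 * y2))
            smoothed.append((0.25 * x1 + 0.75 * x2, 0.25 * y1 + 0.75 * y2))
        pts = smoothed
    return pts
```

Each pass shrinks the polygon slightly, so a small number of iterations is usually enough for display purposes.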

6) Optional: Self-evaluation of answers

Inputs: out/insights/answers.csv
Outputs: out/insights/answers_scored.csv (example)

Command:

```
python self_evaluation.py \
  --answers-csv out/insights/answers.csv \
  --out-csv out/insights/answers_scored.csv
```
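
A self-evaluation pass can be thought of as mapping a grader over the answers table. The sketch below injects the grader as a plain callable so it runs without any model call; the `question`, `answer`, and `grade` field names are assumptions, and in a real run the grader would wrap a request to the model.

```python
def score_answers(rows, grader):
    """Attach a grade to each answer row using an injected grader callable.

    grader(question, answer) returns a score. Field names are hypothetical
    and the input rows are left unmodified.
    """
    scored = []
    for row in rows:
        graded = dict(row)
        graded["grade"] = grader(row["question"], row["answer"])
        scored.append(graded)
    return scored


def nonempty_grader(question, answer):
    """Toy stand-in grader: 1 if the answer has any content, else 0."""
    return 1 if answer.strip() else 0
```

Keeping the grader separate from the loop makes it easy to swap a toy check like `nonempty_grader` for a model-backed one.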

Configuration

Environment variables
- OPENAI_API_KEY — required for embeddings and Q&A.
Inputs
- Basin subfolders with PDFs.
- Prompt sets:
  - aquiferie_insight_prompts.txt
- Optional shapes:
  - bounding_boxes.* (SHP/DBF/SHX/PRJ)
Outputs (typical)
- In basin subfolders: aquifer_insights_selfeval.csv
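
To see which basin subfolders actually contain PDFs before a run, a short directory scan is enough. This is a sketch under the assumption that each top-level folder is a basin and that PDFs may sit in a nested `reports` folder; `find_basin_pdfs` is a hypothetical helper.

```python
from pathlib import Path


def find_basin_pdfs(root):
    """Map each top-level subfolder of root to its PDFs, searched recursively.

    Hidden folders and folders without PDFs are omitted. Illustrative only.
    """
    found = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir() or folder.name.startswith("."):
            continue
        pdfs = sorted(
            p for p in folder.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
        if pdfs:
            found[folder.name] = pdfs
    return found
```

Run against the repository root, this would also pick up non-basin folders such as `testing` if they hold PDFs, so the result may need filtering.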

Contributing

Issues and PRs are welcome.
Use feature branches, write clear commit messages, and keep PRs focused.

License

If a license file is absent, please add one (e.g., MIT) before broad reuse/distribution.

Acknowledgments

Thanks to the open-source Python and geospatial ecosystems (pandas, numpy, FAISS, scikit-learn, PyMuPDF/PDF tooling, GeoPandas/Shapely/GDAL/PROJ) that make this workflow possible.

