StudyAI — AI Study Assistant

StudyAI turns your own course materials into source-grounded study sets — quizzes, flashcards, explanations, and open-response Q&A. Upload a PDF, Word document, image, or pasted notes; the app extracts the text, finds the passages most relevant to your topic, and uses a language model to generate study content where every answer is traced back to the snippet it came from.

Built for the SDEV378 Applied AI final project: a working proof of concept with three ML components cooperating end to end (extraction → retrieval → generation), plus an optional fourth (automatic topic and difficulty tagging).

Team: Maxwell Maslov, Huma Khomidov

Features

Multiple input types — PDF, Word (.docx), images (.png/.jpg/...), .txt / .md / .csv, or pasted text.
Four study modes — multiple-choice quiz, flashcards, explanations, and open-response Q&A.
Grounded answers — the model may only use passages retrieved from your material, and each item links to its source snippet.
Automatic tagging — detects topics and a suggested difficulty to seed retrieval.
Graceful degradation — without an API key it still extracts, indexes, retrieves, and shows the relevant source passages.
Local-first: text extraction, embeddings, and vector search all run on your machine; only generation (and tagging, when it uses Groq) calls a hosted LLM.
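
The grounding and graceful-degradation behavior listed above could look roughly like the sketch below. It is not taken from the repository: `Snippet`, `build_grounded_prompt`, and `build_study_set` are hypothetical names, and `llm` is a stand-in for whatever function actually calls Groq.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Snippet:
    """A retrieved passage plus an id the UI can link back to (hypothetical)."""
    snippet_id: str
    text: str


def build_grounded_prompt(topic: str, snippets: list[Snippet]) -> str:
    """Label each passage with its id so the model can cite it."""
    sources = "\n".join(f"[{s.snippet_id}] {s.text}" for s in snippets)
    return (
        f"Topic: {topic}\n"
        "Use ONLY the sources below and cite a source id for every answer.\n"
        f"Sources:\n{sources}"
    )


def build_study_set(
    topic: str,
    snippets: list[Snippet],
    llm: Optional[Callable[[str], str]] = None,
) -> dict:
    """Generate with the LLM when one is configured, else return sources only."""
    sources = [{"id": s.snippet_id, "text": s.text} for s in snippets]
    if llm is None:
        # No API key configured: still surface the retrieved passages.
        return {"mode": "source-only", "content": None, "sources": sources}
    return {
        "mode": "generated",
        "content": llm(build_grounded_prompt(topic, snippets)),
        "sources": sources,
    }
```

Passing `llm=None` models the no-API-key case: the caller still gets the relevant passages back, just without generated content.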

How it works

| Stage | Technology | Role |
| --- | --- | --- |
| 1. Extraction | PyMuPDF, python-docx, pytesseract (optional) | Read PDFs, Word files, text files, and OCR images/scanned pages into clean text |
| 2. Retrieval | sentence-transformers (`all-MiniLM-L6-v2`) + ChromaDB | Chunk the text, embed it locally, and search for the passages most relevant to your topic |
| 3. Generation | GroqCloud (`llama-3.1-8b-instant`) | Write quizzes, flashcards, explanations, and Q&A grounded strictly in the retrieved snippets |
| 4. Classification (optional) | Groq, or a local keyword + readability heuristic | Tag topics and an overall difficulty for the document |
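
To make the retrieval stage concrete, here is a dependency-free sketch of the chunk, embed, search idea. It is a deliberate simplification, not the app's code: word-count vectors stand in for the MiniLM embeddings, a sorted list stands in for ChromaDB, and the function names and chunk sizes are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase token counts (stand-in for MiniLM vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_passages(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

A real embedding model also matches paraphrases that share no words with the query, which this toy version cannot do; the overall shape of the pipeline is the same.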

Requirements

Python 3.10+
A free GroqCloud API key (optional; needed only for AI-generated content)
Tesseract OCR (optional, only for reading images and scanned PDFs)

Setup

1. Clone and install dependencies

git clone https://github.com/mmaslov007/SDEV378-Final-AI.git
cd SDEV378-Final-AI

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

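All packages install into the local .venv folder only; nothing is installed system-wide. The commands above are written for Windows PowerShell; on macOS or Linux, activate the environment with source .venv/bin/activate instead.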

2. Configure your environment

copy .env.example .env

Edit .env and add your Groq API key (never commit this file):

| Variable | Required | Description |
| --- | --- | --- |
| `GROQ_API_KEY` | For AI generation | Your GroqCloud key. Without it, the app runs in source-only fallback mode. |
| `GROQ_MODEL` | No | LLM model id (default: `llama-3.1-8b-instant`). |
| `STUDY_ASSISTANT_CHROMA_PATH` | No | Where ChromaDB stores its local index (default: `.chroma`). |
| `TESSERACT_CMD` | No | Path to `tesseract.exe`, only if it isn't on your `PATH`. |
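
As an illustration of how these variables and their defaults fit together, the sketch below reads them from the process environment. The `Settings` class and `load_settings` function are hypothetical and may not match what `config.py` actually does; the sketch also assumes the values from .env have already been loaded into the environment by whatever mechanism the app uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: Optional[str]
    groq_model: str
    chroma_path: str
    tesseract_cmd: Optional[str]

    @property
    def llm_ready(self) -> bool:
        """True when a Groq key is present; otherwise expect fallback mode."""
        return self.groq_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, treating blank values as unset."""

    def get(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        groq_api_key=get("GROQ_API_KEY"),
        groq_model=get("GROQ_MODEL") or "llama-3.1-8b-instant",
        chroma_path=get("STUDY_ASSISTANT_CHROMA_PATH") or ".chroma",
        tesseract_cmd=get("TESSERACT_CMD"),
    )
```

Treating a blank `GROQ_API_KEY=` line as unset is a design choice in this sketch: it keeps an untouched copy of .env.example from looking like a configured key.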

3. (Optional) Install Tesseract for image OCR

Only needed for scanned images or picture-only PDFs. Normal PDFs, Word documents, text files, and pasted notes work without it.

# Windows
winget install --id tesseract-ocr.tesseract --exact --accept-source-agreements --accept-package-agreements

# macOS (Homebrew)
brew install tesseract

# Debian / Ubuntu
sudo apt-get install tesseract-ocr

Reopen your terminal afterward so PATH refreshes. If OCR still shows as unavailable, set TESSERACT_CMD in .env.
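
The lookup order described above (an explicit `TESSERACT_CMD` first, then `PATH`) can be sketched as follows. `resolve_tesseract` is a hypothetical helper, not a function from the repository, and falling back to `PATH` when the override points at a missing file is an assumption.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tesseract(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a usable Tesseract path, or None if OCR is unavailable."""
    override = env.get("TESSERACT_CMD", "").strip()
    if override and os.path.isfile(override):
        # An explicit setting wins when it points at a real file.
        return override
    # Otherwise search the directories on PATH for the executable.
    return shutil.which("tesseract")
```

With pytesseract, a resolved path is typically applied by assigning it to `pytesseract.pytesseract.tesseract_cmd` before running OCR.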

Running the app

.\.venv\Scripts\python.exe -m streamlit run app.py

The app opens in your browser at http://localhost:8501. The first retrieval run downloads the MiniLM embedding model (~90 MB) into your local cache. Press Ctrl+C in the terminal to stop the server.

Using it — the happy path

1. Material: upload a file, paste notes, or click Load Sample.
2. Extract Material: pulls the text and shows an editable preview.
3. Configure: choose a mode (quiz / flashcards / explanation / Q&A), topic, and difficulty.
4. Build Search Index: chunks and indexes the text. (Required before generating.)
5. Generate Study Set: produces the study content with linked sources.

Generate stays disabled until you build the index — that's the intended order, not a bug.

Reading the status bar

The top navigation shows live component status:

| Indicator | Meaning |
| --- | --- |
| OCR: off | Tesseract isn't installed. Fine unless you need image/scanned-PDF OCR. |
| LLM: ready / off | `ready` once `GROQ_API_KEY` is set; otherwise the app runs in fallback mode. |
| Retrieval: not built | No index built yet this session. Flips to `chromadb` after you click Build Search Index. |
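
For reference, the indicator logic in the table amounts to something like this sketch. `status_labels` is a hypothetical function, and the `ready` label for OCR is an assumption, since the table only documents the `off` state.

```python
from typing import Optional


def status_labels(
    ocr_available: bool,
    has_api_key: bool,
    retrieval_backend: Optional[str],
) -> dict[str, str]:
    """Map component state to the short labels shown in the navigation."""
    return {
        "OCR": "ready" if ocr_available else "off",  # "ready" is assumed
        "LLM": "ready" if has_api_key else "off",
        # None means no index has been built in this session yet.
        "Retrieval": retrieval_backend or "not built",
    }
```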

Validation

.\.venv\Scripts\python.exe -m unittest discover -s tests
.\.venv\Scripts\python.exe -m compileall study_assistant tests app.py
.\.venv\Scripts\python.exe scripts\check_setup.py

check_setup.py prints a quick diagnostic of installed dependencies, your Groq key status, and Tesseract availability.
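
If you want to see what a diagnostic like this involves, the sketch below checks the same three things using only the standard library. It is an independent illustration, not the contents of scripts/check_setup.py; the `MODULES` mapping and `diagnose` function are hypothetical.

```python
import importlib.util
import os
import shutil
from typing import Mapping

# Package name -> import name, for the libraries named under "How it works".
MODULES = {
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
    "sentence-transformers": "sentence_transformers",
    "ChromaDB": "chromadb",
    "Streamlit": "streamlit",
}


def diagnose(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which dependencies are importable and which settings are present."""
    report = {
        name: importlib.util.find_spec(module) is not None
        for name, module in MODULES.items()
    }
    report["GROQ_API_KEY set"] = bool(env.get("GROQ_API_KEY", "").strip())
    report["tesseract on PATH"] = shutil.which("tesseract") is not None
    return report


if __name__ == "__main__":
    for label, ok in diagnose().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {label}")
```

`find_spec` only checks that a module can be located without importing it, so the check stays fast even for heavy packages such as sentence-transformers.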

Project structure

app.py                       Streamlit entry point (UI + page flow)
study_assistant/
  extraction.py              File/text extraction + OCR
  retrieval.py               Chunking, embeddings, ChromaDB / in-memory search
  generation.py              Groq-backed quiz/flashcard/explanation/Q&A generation
  classification.py          Topic + difficulty tagging (LLM or heuristic)
  config.py                  Environment configuration
  ui.py                      Design system (theme, nav, hero, components)
tests/                       Unit tests
scripts/check_setup.py       Environment diagnostics
sample_materials/            Demo content
.streamlit/config.toml       Theme configuration

Troubleshooting

"PyMuPDF is not installed": Streamlit is running under a Python interpreter other than the one in .venv, so it can't see the installed packages. Stop any old Streamlit servers and relaunch with .\.venv\Scripts\python.exe -m streamlit run app.py.
OCR unavailable after installing Tesseract: reopen your terminal, run scripts\check_setup.py, and set TESSERACT_CMD in .env if needed.
Want a quick demo? Load sample_materials/sdev378_ai_study_notes.txt via the Load Sample button.

Privacy & data

StudyAI runs locally and writes only a .chroma/ index folder and a cached embedding model to disk. The only outbound network calls are to GroqCloud (for generation and, when Groq is used for it, topic and difficulty tagging) and Hugging Face (a one-time model download). Text you upload is sent to Groq's API to generate study content, so avoid uploading sensitive material.
