Feature/hybrid rag by calderonjesus · Pull Request #166 · UofT-DSI/deploying-ai

calderonjesus · 2026-06-14T03:05:57Z

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

This pull request significantly refactors and enhances the 02_7_vectordb_docker.ipynb lab notebook to streamline the embedding workflow, enrich data with metadata, and improve integration with environment variables and APIs. The changes automate the creation and loading of ChromaDB inputs, add structured metadata to each entry, and modernize the code to use configurable settings and improved response handling.

Key improvements and changes:

1. Metadata Enrichment and Data Pipeline Automation

Added functions to join and load structured metadata (album, artist, score, genre, label, year) from multiple Pitchfork source files, and inject this metadata into each ChromaDB input for richer and more filterable queries. [1] [2]
Automated the process of generating ChromaDB input JSONL files from batch API results, with an option to use precomputed files or regenerate from the API. [1] [2]
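A minimal sketch of the metadata injection described above, assuming each record carries a review identifier and the metadata has already been joined into a dictionary keyed by that identifier. The names `attach_metadata`, `to_jsonl`, `reviewid`, and the record layout are illustrative assumptions, not the notebook's actual code.

```python
import json

# Fields named in the PR description.
METADATA_FIELDS = ("album", "artist", "score", "genre", "label", "year")

def attach_metadata(records, metadata_by_id):
    """Return copies of the records with a `metadata` dict added to each one."""
    enriched = []
    for record in records:
        meta = metadata_by_id.get(record["reviewid"], {})
        # Drop missing values (assumption: the vector store rejects None metadata).
        clean = {k: meta[k] for k in METADATA_FIELDS if meta.get(k) is not None}
        enriched.append({**record, "metadata": clean})
    return enriched

def to_jsonl(rows):
    """Serialize rows as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(row) for row in rows)

# Hypothetical sample data for illustration only.
records = [{"reviewid": 1, "id": "1-0", "document": "A hazy, layered record."}]
metadata_by_id = {
    1: {"album": "Example Album", "artist": "Example Artist",
        "score": 8.1, "genre": "rock", "label": None, "year": 2004}
}
enriched = attach_metadata(records, metadata_by_id)
print(to_jsonl(enriched))
```

Keeping the metadata next to each document is what allows filtering on fields such as `year` or `genre` at query time.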

2. ChromaDB Integration and Environment Configuration

Refactored the ChromaDB setup to use environment variables for URLs, collection names, and embedding models, allowing flexible deployment and easier configuration. (F79ede5cL370R370, 01_materials/labs/02_7_vectordb_docker.ipynbL287-R407)
Updated the embedding function initialization to support both direct OpenAI API usage and an API gateway, controlled by an environment variable.
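The configuration approach above could look roughly like the following sketch. The environment variable names (`CHROMA_URL`, `CHROMA_COLLECTION`, `EMBEDDING_MODEL`, `USE_API_GATEWAY`) and the default values are assumptions made for illustration; the notebook's `.env` file may use different names.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorDBSettings:
    chroma_url: str
    collection_name: str
    embedding_model: str
    use_gateway: bool

def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), with fallbacks."""
    env = os.environ if env is None else env
    gateway_flag = env.get("USE_API_GATEWAY", "false").strip().lower()
    return VectorDBSettings(
        chroma_url=env.get("CHROMA_URL", "http://localhost:8000"),
        collection_name=env.get("CHROMA_COLLECTION", "pitchfork_reviews"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        # "1", "true", or "yes" (case-insensitive) selects the gateway path.
        use_gateway=gateway_flag in {"1", "true", "yes"},
    )

# Passing a plain dict makes the behaviour easy to check without touching os.environ.
settings = load_settings({"USE_API_GATEWAY": "True"})
print(settings)
```

The embedding function would then be initialized against either the OpenAI API or the gateway depending on `settings.use_gateway`.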

3. Batch Processing and Data Loading Improvements

Introduced batch processing utilities for handling multiple batches efficiently and saving results to disk, with progress tracking via tqdm.
Improved reporting of batch completion and error handling in batch status checks. [1] [2]
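As a rough illustration of the batch utilities described above, the sketch below splits items into batches, runs a handler on each, and writes one JSONL file per batch. The function names, file naming scheme, and `fake_handler` are hypothetical; the real notebook calls the batch API instead.

```python
import json
import tempfile
from pathlib import Path

try:
    from tqdm import tqdm  # progress bar, as mentioned in the description
except ImportError:
    # Fall back to a no-op wrapper so the sketch runs without tqdm installed.
    def tqdm(iterable, **kwargs):
        return iterable

def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def process_batches(items, batch_size, handler, out_dir):
    """Run handler on each batch and write its results to one JSONL file per batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    batches = make_batches(items, batch_size)
    for index, batch in enumerate(tqdm(batches, desc="batches")):
        path = out_dir / f"batch_{index:04d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in handler(batch):
                fh.write(json.dumps(row) + "\n")
        paths.append(path)
    return paths

# Hypothetical handler standing in for the real batch API call.
def fake_handler(batch):
    return [{"text": text, "length": len(text)} for text in batch]

with tempfile.TemporaryDirectory() as tmp:
    written = process_batches(["a", "bb", "ccc"], 2, fake_handler, tmp)
    file_names = [p.name for p in written]
    print(file_names)
```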

4. Query and Prompt Generation Enhancements

Modified retrieval and prompt generation to use metadata directly from ChromaDB results, eliminating the need for additional SQL/database lookups at query time. [1] [2]
Expanded the prompt with additional metadata fields for more informative context in generated responses.
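The idea of building the prompt straight from ChromaDB results can be sketched as follows. ChromaDB query results are dictionaries whose `documents` and `metadatas` entries hold one list per query; the prompt wording, the helper names, and the sample data are assumptions for illustration.

```python
def build_context(results):
    """Format the first query's hits from a Chroma-style result dict."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    blocks = []
    for doc, meta in zip(documents, metadatas):
        header = (
            f"Album: {meta.get('album', 'unknown')} | "
            f"Artist: {meta.get('artist', 'unknown')} | "
            f"Year: {meta.get('year', 'unknown')} | "
            f"Score: {meta.get('score', 'unknown')}"
        )
        blocks.append(f"{header}\n{doc}")
    return "\n\n".join(blocks)

def build_prompt(question, results):
    """Combine retrieved context and the user question into one prompt string."""
    return (
        "Answer using only the reviews below.\n\n"
        f"{build_context(results)}\n\n"
        f"Question: {question}"
    )

# Hypothetical result shaped like the output of a ChromaDB query.
sample_results = {
    "documents": [["A hazy, layered record."]],
    "metadatas": [[{"album": "Example Album", "artist": "Example Artist",
                    "year": 2004, "score": 8.1}]],
}
prompt = build_prompt("Which album is hazy?", sample_results)
print(prompt)
```

Because the metadata travels with each hit, no second lookup against a separate database is needed to label the context.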

5. API Usage and Output Handling

Updated the response generation to use a new client.responses.create method, supporting more flexible model and input configuration, and adjusted output handling to match the new API.
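A hedged sketch of the call pattern follows. In the OpenAI Python SDK, `client.responses.create` accepts `model` and `input`, and the returned object exposes the generated text as `output_text`. The wrapper name `generate_answer`, the default model, and the stub client (used here so the example runs offline) are assumptions.

```python
from types import SimpleNamespace

def generate_answer(client, prompt, model="gpt-4o-mini"):
    """Call the Responses API and return the generated text."""
    response = client.responses.create(model=model, input=prompt)
    return response.output_text

# Stub that mimics the shape of the SDK client, for offline illustration only.
# With the real SDK this would be: from openai import OpenAI; client = OpenAI()
class _StubResponses:
    def create(self, model, input):
        return SimpleNamespace(output_text=f"[{model}] echo: {input}")

stub_client = SimpleNamespace(responses=_StubResponses())
answer = generate_answer(stub_client, "Hello")
print(answer)
```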

These changes collectively make the notebook more robust, reproducible, and adaptable for both instructional and production use.

What did you learn from the changes you have made?

N/A

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

N/A

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

N/A

How were these changes tested?

N/A

A reference to a related issue in your repository (if applicable)

N/A

Checklist

I can confirm that my changes are working as intended

- Use UUID to ensure unique batch ids
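One way to generate such identifiers, shown as a small sketch; the `batch-` prefix and the helper name are assumptions.

```python
import uuid

def new_batch_id(prefix="batch"):
    """Return an identifier of the form '<prefix>-<32 hex characters>'."""
    return f"{prefix}-{uuid.uuid4().hex}"

ids = [new_batch_id() for _ in range(3)]
print(ids)
```

Random UUIDs avoid collisions between repeated runs without needing a shared counter.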

- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs API-based, API vs Gateway-based)

- Chunk by paragraphs
- Induce paragraph breaks every N sentences
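The chunking strategy named above can be sketched as follows. The sentence splitter is a deliberately naive regular expression (an assumption; abbreviations such as "Dr." would be split incorrectly), and the function names are illustrative.

```python
import re

def chunk_by_paragraphs(text):
    """Split text into paragraphs on blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def induce_paragraph_breaks(text, n=3):
    """Insert a paragraph break after every n sentences within each paragraph."""
    paragraphs = []
    for para in chunk_by_paragraphs(text):
        # Naive split: sentence-ending punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", para)
        for i in range(0, len(sentences), n):
            paragraphs.append(" ".join(sentences[i:i + n]))
    return "\n\n".join(paragraphs)

review = "One. Two. Three. Four. Five."
chunks = chunk_by_paragraphs(induce_paragraph_breaks(review, n=2))
print(chunks)
```

Inducing breaks first keeps long single-paragraph reviews from becoming one oversized chunk.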

Progressive lab teaching keyword filtering, metadata filtering, and LLM re-ranking on the pitchfork_reviews ChromaDB collection. All three techniques compose into a single hybrid_rag() pipeline. No SQL or new infrastructure required; builds on 02_7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
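A minimal sketch of how the three techniques might compose, not the notebook's actual `hybrid_rag()`. It relies on ChromaDB's `where_document` (`$contains`) and `where` query filters; the signature, the stub collection, and the word-overlap scorer standing in for an LLM re-ranker are all assumptions.

```python
def hybrid_rag(collection, query, keyword=None, where=None, rerank=None,
               n_results=10, top_k=3):
    """Vector search with optional keyword filter, metadata filter, and re-ranking."""
    kwargs = {"query_texts": [query], "n_results": n_results}
    if keyword:
        kwargs["where_document"] = {"$contains": keyword}  # keyword filter
    if where:
        kwargs["where"] = where  # metadata filter
    results = collection.query(**kwargs)
    hits = list(zip(results["documents"][0], results["metadatas"][0]))
    if rerank is not None:
        # Higher score first; in the lab an LLM would supply the score.
        hits = sorted(hits, key=lambda hit: rerank(query, hit[0]), reverse=True)
    return hits[:top_k]

# Stub with the same query() shape as a ChromaDB collection, for offline use.
class StubCollection:
    def __init__(self):
        self.last_kwargs = None

    def query(self, **kwargs):
        self.last_kwargs = kwargs
        return {
            "documents": [["slow guitars", "dream pop haze and reverb"]],
            "metadatas": [[{"year": 1991}, {"year": 2010}]],
        }

def overlap_score(query, document):
    """Toy re-ranker: number of words shared by the query and the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

collection = StubCollection()
top = hybrid_rag(collection, "dream pop albums", keyword="dream pop",
                 where={"year": {"$gte": 1990}}, rerank=overlap_score, top_k=1)
print(top)
```

Each stage is optional, so the baseline vector search and every combination of filters can be compared with the same function.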

'indie' is semantically redundant with the query: vector search already retrieves indie content, so the keyword filter produced identical results. 'shoegaze' is a specific subgenre term that forces a distinct result set, making the contrast with the baseline legible. Added a note explaining why semantically redundant keywords show no effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Update from shoegaze to dream pop
- The idea is to have terms that can be semantically similar to a broader and more diverse set of albums

… into feature/hybrid-rag

github-actions · 2026-06-14T03:06:10Z

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

vishnouvina

Only two comments:

- The numbering of the notebooks won't match #164.
- The embeddings-at-scale notebook includes the outputs of all cells, which hurts readability (some outputs are extra large).

- Format output
- Update prompts

- Enhance display output
- Fix questions

- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states.
- Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution.
- Modified batch description in `04_7_vectordb_docker.ipynb` for clarity.
- Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`.
- Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`.
- Updated `.env` file to maintain consistency in environment variable settings.

github-actions · 2026-06-14T18:19:18Z

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

github-actions · 2026-06-14T18:21:29Z

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

calderonjesus · 2026-06-14T18:24:33Z

This is messy. I only changed a handful of files, not the 32 stated above. Closing and reopening.

calderonjesus and others added 19 commits June 13, 2026 16:47

get client utils and .env file from update/notebooks branch

41cd98c

Update notebook

ece4f5b

- Use UUID to ensure unique batch ids

Update paths

422168f

Update vector db notebook

13935db

- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs API-based, API vs Gateway-based)

Update .env file

31291fb

Harmonize directory name

cc5823d

Update notebook

8dc0d27

Fix prose to match code.

74b65a0

Update .gitignore

3bec3fa

Update api endpoint from completions to response

5dc9329

Update chunking strategy

0743287

- Chunk by paragraphs
- Induce paragraph breaks every N sentences

Remove SQL Implementation and do all data operations in Chroma DB

1e5bfc1

Add Paragraph breaks

8c1e1d8

Update notebook from indie to shoegaze

20b8180

Update example to dream pop

add0165

Update example

c3071fc

- Update from shoegaze to dream pop
- The idea is to have terms that can be semantically similar to a broader and more diverse set of albums

Merge branch 'feature/hybrid-rag' of github.com:UofT-DSI/deploying-ai…

2ff0ad7

… into feature/hybrid-rag

calderonjesus requested a review from vishnouvina June 14, 2026 03:23

calderonjesus marked this pull request as ready for review June 14, 2026 03:23

vishnouvina mentioned this pull request Jun 14, 2026

Update/notebooks #164

Merged

vishnouvina reviewed Jun 14, 2026

calderonjesus added 6 commits June 14, 2026 14:14

get client utils and .env file from update/notebooks branch

355ecbe

Update notebook

258f996

- Use UUID to ensure unique batch ids

Update paths

364bdb0

Update vector db notebook

7b0b252

- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs API-based, API vs Gateway-based)

Update .env file

9a9c4bf

Harmonize directory name

b37826e

calderonjesus added 25 commits June 14, 2026 14:16

Update client

fe02182

Cleanup and rename

714654b

Clean up notebook

30a8ddc

Update tool

c084155

Update notebook

019d9f6

Update module name

ef61fc2

Update secrets template

5d83b05

Upgrade to latest packages versions under minimum Python 3.11

e11fc5a

Notebook 1.1 check

fad8340

Update notebook 1.2

87a5eae

- Format output
- Update prompts

Add references to OpenAI model comparisons

bfcaf1a

Update examples

9f2f59f

Update notebook 2

195e4b7

Remove output

177e677

Update notebook

c91242b

- Enhance display output
- Fix questions

Update examples

1eb2bd7

Update output quality

4b82fba

Remove unused file

6f974b3

Add example document

45029b8

Update notebook example

082b25a

Update .env and .gitignore

8496fde

Update defaults and fix typo

72e60ac

Update name

70f56e0

Merge branch 'feature/hybrid-rag' of github.com:UofT-DSI/deploying-ai…

c129f20

… into feature/hybrid-rag

Update notebook format

c74f492

calderonjesus closed this Jun 14, 2026

calderonjesus wants to merge 79 commits into main from feature/hybrid-rag