Feature/hybrid rag by calderonjesus · Pull Request #167 · UofT-DSI/deploying-ai

calderonjesus · 2026-06-14T18:37:51Z

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

This pull request introduces several improvements and corrections to the data preparation and embeddings-at-scale labs, primarily focusing on file organization, text chunking, and batch processing for embeddings. The changes ensure more consistent data paths, enhance text preprocessing by grouping sentences into paragraphs, and update the batching logic for embedding jobs. Additionally, environment configuration is updated to support embedding workflows.

Key changes:

File organization and path consistency:

Updated all file paths to use the pitchfork subdirectory for documents, ensuring consistent file organization throughout the codebase. [1] [2] [3] [4]

Text preprocessing and chunking:

Added a new function add_paragraph_breaks to group sentences into paragraphs separated by double newlines, and integrated it into the JSONL export process for the content table. This improves downstream text chunking for embeddings. [1] [2]
Updated the chunking logic to use paragraph-based chunking with chunk_size=2800, chunk_overlap=700, and paragraph/sentence separators, aligning chunk boundaries with natural paragraph breaks. [1] [2]
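The paragraph grouping and paragraph-aware chunking described above could look roughly like the following. This is a minimal sketch, not the notebook's code: the body of `add_paragraph_breaks`, the `sentences_per_paragraph` parameter, the regex sentence splitter, and the `chunk_by_paragraphs` helper are all assumptions made for illustration. Only the function name `add_paragraph_breaks` and the `chunk_size=2800` / `chunk_overlap=700` values come from the description.

```python
import re


def add_paragraph_breaks(text: str, sentences_per_paragraph: int = 5) -> str:
    """Group sentences into paragraphs separated by double newlines.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)


def chunk_by_paragraphs(text: str, chunk_size: int = 2800,
                        chunk_overlap: int = 700) -> list[str]:
    """Greedily pack whole paragraphs into chunks of about chunk_size characters.

    Trailing paragraphs that fit within chunk_overlap characters are repeated
    at the start of the next chunk. A single paragraph longer than chunk_size
    is kept whole in this sketch rather than split further.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry over as many trailing paragraphs as fit in the overlap budget.
            overlap: list[str] = []
            for prev in reversed(current):
                if len("\n\n".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            current = overlap
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the breaks are inserted before export, a splitter that prefers `"\n\n"` as its first separator can cut at paragraph boundaries instead of mid-sentence.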

Batch processing and embedding workflow:

Changed the output directory for embedding batch files to pitchfork/embeddings_batches/ and updated the glob pattern accordingly. [1] [2]
Improved batch file preparation by ensuring the output directory exists before writing files.
Enhanced the batch upload process by removing unnecessary print statements and generating unique batch IDs using uuid for better tracking. [1] [2]
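As a rough illustration of the batch changes listed above, the sketch below creates the output directory before writing and tags each batch with a `uuid`-based ID. The helper name `write_batch_file`, the `batch_<id>.jsonl` file naming, and the record shape are hypothetical; only the `pitchfork/embeddings_batches/` directory and the use of `uuid` come from the description.

```python
import json
import uuid
from pathlib import Path


def write_batch_file(records, out_dir="pitchfork/embeddings_batches"):
    """Write records as JSONL into out_dir and return (batch_id, path)."""
    out_dir = Path(out_dir)
    # Create the directory (and any parents) if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    # A random UUID keeps batch IDs unique across repeated runs.
    batch_id = uuid.uuid4().hex
    path = out_dir / f"batch_{batch_id}.jsonl"  # hypothetical naming scheme
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return batch_id, path
```

With this naming, a glob such as `out_dir.glob("batch_*.jsonl")` would pick up every prepared batch file.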

Code and output improvements:

Enhanced chunk inspection by displaying the first five chunks and their metadata using Markdown in notebook outputs.
Added missing imports and client initialization for file uploads.
Minor bug fixes, such as correcting index usage and notebook display settings. [1] [2] [3]

Environment configuration:

Added EMBEDDING_MODEL and CHROMA_URL variables to the .env file to support embedding workflows and local Chroma database connections.
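A notebook might read these two variables along the following lines. This is a sketch under assumptions: it presumes the `.env` values have already been loaded into the process environment by whatever mechanism the labs use, and the helper name `load_embedding_settings` is hypothetical.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("EMBEDDING_MODEL", "CHROMA_URL")


def load_embedding_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the embedding settings, failing early if one is missing."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: source[name] for name in REQUIRED_VARS}
```

Failing early on a missing variable tends to be easier to debug than a connection error surfacing later in the embedding workflow.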

What did you learn from the changes you have made?

N/A

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

N/A

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

N/A

How were these changes tested?

N/A

A reference to a related issue in your repository (if applicable)

N/A

Checklist

[X] I can confirm that my changes are working as intended

- Use UUID to ensure unique batch ids

- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API vs Gateway-based)

- Chunk by paragraphs - Induce paragraph breaks every N sentences

Progressive lab teaching keyword filtering, metadata filtering, and LLM re-ranking on the pitchfork_reviews ChromaDB collection. All three techniques compose into a single hybrid_rag() pipeline. No SQL or new infrastructure required — builds on 02_7. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
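How the three techniques might compose can be sketched in plain Python. This is not the lab's `hybrid_rag()`: the name `hybrid_rag_sketch`, the document shape, and the `score_fn` callable (a stand-in for an LLM re-ranker) are assumptions for illustration. In the lab the filters would presumably be expressed through the ChromaDB collection's query options rather than list comprehensions.

```python
from typing import Callable, Optional


def keyword_filter(docs: list[dict], keyword: str) -> list[dict]:
    """Keep documents whose text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [d for d in docs if kw in d["text"].lower()]


def metadata_filter(docs: list[dict], conditions: dict) -> list[dict]:
    """Keep documents whose metadata matches every key/value condition."""
    return [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in conditions.items())
    ]


def rerank(docs: list[dict], query: str,
           score_fn: Callable[[str, str], float], top_k: int) -> list[dict]:
    """Order documents by score_fn(query, text), highest first."""
    return sorted(docs, key=lambda d: score_fn(query, d["text"]), reverse=True)[:top_k]


def hybrid_rag_sketch(candidates: list[dict], query: str,
                      keyword: Optional[str] = None,
                      metadata: Optional[dict] = None,
                      score_fn: Optional[Callable[[str, str], float]] = None,
                      top_k: int = 3) -> list[dict]:
    """Apply the optional keyword filter, metadata filter, then re-ranking."""
    docs = list(candidates)
    if keyword:
        docs = keyword_filter(docs, keyword)
    if metadata:
        docs = metadata_filter(docs, metadata)
    if score_fn:
        return rerank(docs, query, score_fn, top_k)
    return docs[:top_k]


def word_overlap(query: str, text: str) -> float:
    """Toy scorer standing in for an LLM re-ranker: count shared words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

Each stage is optional, so the same function can reproduce the baseline, a single technique, or the full combination.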

'indie' is semantically redundant with the query — vector search already retrieves indie content, so the keyword filter produced identical results. 'shoegaze' is a specific subgenre term that forces a distinct result set, making the contrast with baseline legible. Added note explaining why semantically-redundant keywords show no effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
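The effect described above can be reproduced on a toy ranked list. The strings below are made up for illustration and are not taken from the pitchfork_reviews collection; `top_k_with_keyword` is a hypothetical helper that filters a similarity-ranked list before taking the top results.

```python
from typing import Optional


def top_k_with_keyword(ranked_texts: list[str], k: int,
                       keyword: Optional[str] = None) -> list[str]:
    """Return the first k texts, optionally keeping only those containing keyword."""
    if keyword is not None:
        kw = keyword.lower()
        ranked_texts = [t for t in ranked_texts if kw in t.lower()]
    return ranked_texts[:k]


# Toy results, already ordered by similarity to an "indie" style query.
ranked = [
    "indie rock album with jangly guitars",
    "an indie pop debut",
    "indie shoegaze wall of sound",
    "lo-fi indie bedroom recording",
    "shoegaze revival with heavy reverb",
]

baseline = top_k_with_keyword(ranked, 3)
redundant = top_k_with_keyword(ranked, 3, keyword="indie")
specific = top_k_with_keyword(ranked, 3, keyword="shoegaze")

print(baseline == redundant)  # the redundant keyword leaves the top results unchanged
print(baseline == specific)   # the specific keyword yields a different result set
```

When every top result already contains the keyword, the filter has nothing to remove, which is why a more specific term makes the contrast visible.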

- Update from shoegaze to dream pop - The idea is to have terms that can be semantically similar to a broader and more diverse set of albums

… into feature/hybrid-rag

- Format output - Update prompts

- Enhance display output - Fix questions

- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states.
- Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution.
- Modified batch description in `04_7_vectordb_docker.ipynb` for clarity.
- Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`.
- Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`.
- Updated `.env` file to maintain consistency in environment variable settings.
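Resetting execution counts and removing outputs can also be done programmatically. The sketch below works directly on the notebook JSON and assumes the standard notebook layout (a top-level `cells` list whose code cells carry `execution_count` and `outputs`); the helper names are hypothetical and this is not necessarily how the notebooks in this PR were cleaned.

```python
import json
from pathlib import Path


def clear_notebook_outputs(notebook: dict) -> dict:
    """Reset execution counts and drop outputs for every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None  # serialized as null in the .ipynb file
            cell["outputs"] = []
    return notebook


def clear_notebook_file(path) -> None:
    """Load a .ipynb file, clear its outputs, and write it back."""
    path = Path(path)
    notebook = json.loads(path.read_text(encoding="utf-8"))
    cleaned = clear_notebook_outputs(notebook)
    path.write_text(
        json.dumps(cleaned, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

Markdown cells are left untouched, so only execution state is removed from the diff.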

… into feature/hybrid-rag

github-actions · 2026-06-14T18:38:07Z

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

calderonjesus and others added 30 commits June 13, 2026 16:47

get client utils and .env file from update/notebooks branch

41cd98c

Update notebook

ece4f5b

- Use UUID to ensure unique batch ids

Update paths

422168f

Update vector db notebook

13935db

- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API vs Gateway-based)

Update .env file

31291fb

Harmonize directory name

cc5823d

Update notebook

8dc0d27

Fix prose to match code.

74b65a0

Update .gitignore

3bec3fa

Update api endpoint from completions to response

5dc9329

Update chunking strategy

0743287

- Chunk by paragraphs - Induce paragraph breaks every N sentences

Remove SQL Implementation and do all data operations in Chroma DB

1e5bfc1

Add Paragraph breaks

8c1e1d8

Update notebook from indie to shoegaze

20b8180

Update example to dream pop

add0165

Update example

c3071fc

- Update from shoegaze to dream pop - The idea is to have terms that can be semantically similar to a broader and more diverse set of albums

Merge branch 'feature/hybrid-rag' of github.com:UofT-DSI/deploying-ai…

2ff0ad7

… into feature/hybrid-rag

get client utils and .env file from update/notebooks branch

355ecbe

Update notebook

258f996

- Use UUID to ensure unique batch ids

Update paths

364bdb0

Update vector db notebook

7b0b252

- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API vs Gateway-based)

Update .env file

9a9c4bf

Harmonize directory name

b37826e

Update notebook

4d5cb93

Fix prose to match code.

0b901e1

Update .gitignore

fe0e9d5

Update api endpoint from completions to response

25d2132

Update chunking strategy

8ee6d19

- Chunk by paragraphs - Induce paragraph breaks every N sentences

calderonjesus added 23 commits June 14, 2026 14:16

Update module name

ef61fc2

Update secrets template

5d83b05

Upgrade to latest packages versions under minimum Python 3.11

e11fc5a

Notebook 1.1 check

fad8340

Update notebook 1.2

87a5eae

- Format output - Update prompts

Add references to OpenAI model comparisons

bfcaf1a

Update examples

9f2f59f

Update notebook 2

195e4b7

Remove output

177e677

Update notebook

c91242b

- Enhance display output - Fix questions

Update examples

1eb2bd7

Update output quality

4b82fba

Remove unused file

6f974b3

Add example document

45029b8

Update notebook example

082b25a

Update .env and .gitignore

8496fde

Update defaults and fix typo

72e60ac

Update name

70f56e0

Merge branch 'feature/hybrid-rag' of github.com:UofT-DSI/deploying-ai…

c129f20

… into feature/hybrid-rag

Update notebook format

c74f492

Merge branch 'main' into feature/hybrid-rag

426f38d

Clear outputs

42dd6ce

This file is now 04_8

2f44925

calderonjesus requested a review from vishnouvina June 14, 2026 19:23

vishnouvina approved these changes Jun 14, 2026

calderonjesus marked this pull request as ready for review June 14, 2026 20:01

calderonjesus merged commit 28711ce into main Jun 14, 2026
1 check passed

calderonjesus merged 82 commits into main from feature/hybrid-rag

2 participants
