
Feature/hybrid rag #166

Closed
calderonjesus wants to merge 79 commits into main from feature/hybrid-rag

Conversation

@calderonjesus

Collaborator

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

This pull request significantly refactors and enhances the 02_7_vectordb_docker.ipynb lab notebook to streamline the embedding workflow, enrich data with metadata, and improve integration with environment variables and APIs. The changes automate the creation and loading of ChromaDB inputs, add structured metadata to each entry, and modernize the code to use configurable settings and improved response handling.

Key improvements and changes:

1. Metadata Enrichment and Data Pipeline Automation

  • Added functions to join and load structured metadata (album, artist, score, genre, label, year) from multiple Pitchfork source files, and inject this metadata into each ChromaDB input for richer and more filterable queries. [1] [2]
  • Automated the process of generating ChromaDB input JSONL files from batch API results, with an option to use precomputed files or regenerate from the API. [1] [2]
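
The metadata join described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function names, the `reviewid` and `chunk_id` keys, and the record layout are all assumptions; only the six metadata field names come from the description.

```python
import json

# Field names follow the PR description; everything else here is hypothetical.
METADATA_FIELDS = ("album", "artist", "score", "genre", "label", "year")


def build_chroma_inputs(embeddings, metadata_by_id):
    """Attach metadata to each embedded chunk, keyed by a shared review id."""
    records = []
    for item in embeddings:
        meta = metadata_by_id.get(item["reviewid"], {})
        records.append({
            "id": item["chunk_id"],
            "document": item["text"],
            "embedding": item["embedding"],
            # Missing fields are skipped rather than stored as None.
            "metadata": {k: meta[k] for k in METADATA_FIELDS if meta.get(k) is not None},
        })
    return records


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
```

Keeping the metadata inside each record means the JSONL file can be loaded into the collection without consulting the source files again.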

2. ChromaDB Integration and Environment Configuration

  • Refactored the ChromaDB setup to use environment variables for URLs, collection names, and embedding models, allowing flexible deployment and easier configuration. (F79ede5cL370R370, 01_materials/labs/02_7_vectordb_docker.ipynbL287-R407)
  • Updated the embedding function initialization to support both direct OpenAI API usage and an API gateway, controlled by an environment variable.
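
As a rough sketch of the environment-driven configuration, assuming variable names that may differ from the ones in the notebook (`CHROMA_URL`, `CHROMA_COLLECTION`, `EMBEDDING_MODEL`, and `USE_API_GATEWAY` are all hypothetical, as are the defaults):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LabSettings:
    chroma_url: str
    collection_name: str
    embedding_model: str
    use_gateway: bool


def load_settings(env=None):
    """Read settings from the environment; all variable names are hypothetical."""
    env = os.environ if env is None else env
    flag = env.get("USE_API_GATEWAY", "false").strip().lower()
    return LabSettings(
        chroma_url=env.get("CHROMA_URL", "http://localhost:8000"),
        collection_name=env.get("CHROMA_COLLECTION", "pitchfork_reviews"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        # A single flag selects the gateway path over direct API access.
        use_gateway=flag in {"1", "true", "yes"},
    )
```

Passing a plain dict as `env` makes the function easy to exercise in a notebook cell without touching the real environment.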

3. Batch Processing and Data Loading Improvements

  • Introduced batch processing utilities for handling multiple batches efficiently and saving results to disk, with progress tracking via tqdm.
  • Improved reporting of batch completion and error handling in batch status checks. [1] [2]
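
A batch utility of this kind might look like the sketch below. The function names and the output file naming are assumptions, and `process_batch` stands in for whatever call produces the results of one batch.

```python
import json
from pathlib import Path

try:
    from tqdm import tqdm  # progress bar, as mentioned in the description
except ImportError:  # fall back to a no-op wrapper if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_and_save(items, batch_size, process_batch, out_dir):
    """Run process_batch on each batch and save its results as JSONL."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    batches = list(iter_batches(items, batch_size))
    for index, batch in enumerate(tqdm(batches, desc="batches")):
        path = out_dir / f"batch_{index:04d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for result in process_batch(batch):
                fh.write(json.dumps(result) + "\n")
        paths.append(path)
    return paths
```

Writing one file per batch keeps partial progress on disk if a later batch fails.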

4. Query and Prompt Generation Enhancements

  • Modified retrieval and prompt generation to use metadata directly from ChromaDB results, eliminating the need for additional SQL/database lookups at query time. [1] [2]
  • Expanded the prompt with additional metadata fields for more informative context in generated responses.
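
To illustrate building the prompt from metadata already present in the query results: ChromaDB returns `documents` and `metadatas` as one list per query, so the first query's hits sit at index 0. The function names and the prompt wording below are illustrative assumptions, not the notebook's actual prompt.

```python
def build_context(results):
    """Format the first query's hits from a Chroma-style result dict."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    blocks = []
    for text, meta in zip(documents, metadatas):
        meta = meta or {}
        header = (
            f"Album: {meta.get('album', 'unknown')} | "
            f"Artist: {meta.get('artist', 'unknown')} | "
            f"Score: {meta.get('score', 'n/a')} | "
            f"Year: {meta.get('year', 'n/a')}"
        )
        blocks.append(f"{header}\n{text}")
    return "\n\n".join(blocks)


def build_prompt(question, results):
    # The prompt wording is illustrative only.
    return (
        "Answer the question using only the reviews below.\n\n"
        f"{build_context(results)}\n\nQuestion: {question}"
    )
```

Because every field comes from the result dict, no second lookup is needed between retrieval and generation.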

5. API Usage and Output Handling

  • Updated the response generation to use a new client.responses.create method, supporting more flexible model and input configuration, and adjusted output handling to match the new API.
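
For reference, a minimal wrapper around `client.responses.create` could look like this. The wrapper name is hypothetical, and the model is left as a parameter because the description does not say which one the notebook uses.

```python
def generate_answer(client, prompt, model):
    """Call the Responses API and return the concatenated text output."""
    response = client.responses.create(model=model, input=prompt)
    # output_text is the SDK's convenience accessor for the text parts of the output.
    return response.output_text


# Usage (requires the openai package and an API key in the environment):
#   from openai import OpenAI
#   print(generate_answer(OpenAI(), "Summarize this review: ...", model="<model name>"))
```

Taking the client as an argument lets the same function work with either a direct client or one pointed at a gateway.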

These changes collectively make the notebook more robust, reproducible, and adaptable for both instructional and production use. [1] [2] [3] [4] [5] [6] [7] [8] F79ede5cL370R370, [9] [10] [11] [12] [13] [14]

What did you learn from the changes you have made?

N/A

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

N/A

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

N/A

How were these changes tested?

N/A

A reference to a related issue in your repository (if applicable)

N/A

Checklist

  • I can confirm that my changes are working as intended

calderonjesus and others added 19 commits June 13, 2026 16:47
- Use UUID to ensure unique batch ids
- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs API-based, API vs Gateway based)
- Chunk by paragraphs
- Induce paragraph breaks every N sentences
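
A hedged sketch of what the chunking and batch id items above might amount to. The function names, the default of five sentences, and the naive regex sentence splitter are assumptions rather than the committed code.

```python
import re
import uuid


def induce_paragraph_breaks(text, sentences_per_paragraph=5):
    """Insert a blank line after every N sentences of an unbroken text."""
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    groups = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(groups)


def chunk_by_paragraphs(text, sentences_per_paragraph=5):
    """Split on blank lines; induce breaks first if the text has none."""
    if "\n\n" not in text:
        text = induce_paragraph_breaks(text, sentences_per_paragraph)
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def new_batch_id(prefix="batch"):
    """Return a practically unique id built from a random UUID."""
    return f"{prefix}-{uuid.uuid4().hex}"
```

The splitter will mis-handle abbreviations such as "e.g.", which may or may not matter for review text.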
Progressive lab teaching keyword filtering, metadata filtering, and LLM
re-ranking on the pitchfork_reviews ChromaDB collection. All three
techniques compose into a single hybrid_rag() pipeline. No SQL or new
infrastructure required — builds on 02_7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
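
One way the three techniques could compose, sketched under assumptions: `where_document` with `$contains` and `where` are ChromaDB query filters, while the function names here are hypothetical and `score_fn` is a stand-in for the LLM re-ranker. This is not the lab's `hybrid_rag()` implementation.

```python
def build_query_kwargs(question, keyword=None, metadata_filter=None, n_results=10):
    """Assemble keyword arguments for a Chroma collection.query call."""
    kwargs = {"query_texts": [question], "n_results": n_results}
    if keyword:
        # Keyword filtering: keep only documents containing the term.
        kwargs["where_document"] = {"$contains": keyword}
    if metadata_filter:
        # Metadata filtering, e.g. {"score": {"$gte": 8.0}}.
        kwargs["where"] = metadata_filter
    return kwargs


def rerank(documents, score_fn, top_k=3):
    """Re-order candidates by a relevance score, highest first."""
    ranked = sorted(documents, key=score_fn, reverse=True)
    return ranked[:top_k]


def hybrid_rag_sketch(collection, question, score_fn, keyword=None,
                      metadata_filter=None, n_results=10, top_k=3):
    """Compose the three steps; score_fn stands in for the LLM re-ranker."""
    results = collection.query(**build_query_kwargs(
        question, keyword, metadata_filter, n_results))
    return rerank(results["documents"][0], score_fn, top_k)
```

Both filters are optional, so calling the function with neither reduces it to plain vector search followed by re-ranking.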
'indie' is semantically redundant with the query — vector search already
retrieves indie content, so the keyword filter produced identical results.
'shoegaze' is a specific subgenre term that forces a distinct result set,
making the contrast with baseline legible. Added note explaining why
semantically-redundant keywords show no effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update from shoegaze to dream pop
- The idea is to have terms that can be semantically similar to a broader and more diverse set of albums
@github-actions


Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

@calderonjesus calderonjesus marked this pull request as ready for review June 14, 2026 03:23
@vishnouvina vishnouvina mentioned this pull request Jun 14, 2026

@vishnouvina vishnouvina left a comment

Collaborator


Only two comments:

  • the numbering of the notebooks won't match 164
  • the embeddings at scale notebook still contains all of its cell outputs, which hurts readability (some outputs are very large)

- Use UUID to ensure unique batch ids
- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs API-based, API vs Gateway based)
- Format output
- Update prompts
- Enhance display output
- Fix questions
- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states.
- Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution.
- Modified batch description in `04_7_vectordb_docker.ipynb` for clarity.
- Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`.
- Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`.
- Updated `.env` file to maintain consistency in environment variable settings.

@calderonjesus

Collaborator Author

This is messy. I only changed a handful of files, not the 32 stated above. Closing and reopening.
