Feature/hybrid rag#166
- Use UUID to ensure unique batch ids
- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API vs Gateway-based)
- Chunk by paragraphs - Induce paragraph breaks every N sentences
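The commits above mention UUID batch ids and paragraph chunking with forced breaks every N sentences. A minimal sketch of how those two ideas could fit together follows; the function names, the naive regex sentence splitter, and the sample text are illustrative assumptions, not the notebook's actual code.

```python
import re
import uuid


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_paragraphs(text, max_sentences=3):
    """Split on blank lines, then force a break every `max_sentences` sentences."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        sentences = split_sentences(paragraph)
        for start in range(0, len(sentences), max_sentences):
            chunks.append(" ".join(sentences[start:start + max_sentences]))
    return chunks


def make_batches(chunks, batch_size=2):
    """Group chunks into batches, each tagged with a random UUID4 batch id."""
    return [
        {"batch_id": str(uuid.uuid4()), "chunks": chunks[i:i + batch_size]}
        for i in range(0, len(chunks), batch_size)
    ]


# Hypothetical sample text: one long paragraph and one short paragraph.
review = (
    "The opener is loud. The second track is quiet. The third drifts. "
    "The fourth returns to noise.\n\n"
    "The closing song is short."
)
chunks = chunk_by_paragraphs(review, max_sentences=3)
batches = make_batches(chunks, batch_size=2)
```

A random UUID per batch avoids id collisions when the notebook is re-run, which counter-based or timestamp-based ids do not guarantee.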
Progressive lab teaching keyword filtering, metadata filtering, and LLM re-ranking on the pitchfork_reviews ChromaDB collection. All three techniques compose into a single hybrid_rag() pipeline. No SQL or new infrastructure required — builds on 02_7. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
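To illustrate how the three techniques might compose, here is a self-contained sketch that runs on an in-memory candidate list instead of the `pitchfork_reviews` collection. Everything in it is an assumption for illustration: the `hybrid_rag_sketch` signature, the sample documents, and the word-overlap scorer that stands in for the LLM re-ranker. In ChromaDB itself, keyword and metadata filtering are typically expressed through the `where_document` and `where` arguments of `collection.query`.

```python
import re


def keyword_filter(docs, keyword):
    # Keep documents whose text contains the keyword (case-insensitive).
    kw = keyword.lower()
    return [d for d in docs if kw in d["text"].lower()]


def metadata_filter(docs, conditions):
    # Keep documents whose metadata matches every key/value pair.
    return [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in conditions.items())
    ]


def overlap_score(query, text):
    # Stand-in for an LLM relevance score (assumption): shared-word count.
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(tokenize(query) & tokenize(text))


def rerank(docs, query):
    return sorted(docs, key=lambda d: overlap_score(query, d["text"]), reverse=True)


def hybrid_rag_sketch(candidates, query, keyword=None, metadata=None, top_k=3):
    """Apply the optional filters in sequence, then re-rank what is left."""
    docs = candidates
    if keyword:
        docs = keyword_filter(docs, keyword)
    if metadata:
        docs = metadata_filter(docs, metadata)
    return rerank(docs, query)[:top_k]


# Hypothetical candidates, as if returned by a vector search.
candidates = [
    {"id": "a", "text": "A hazy dream pop record with washed-out guitars.",
     "metadata": {"genre": "rock"}},
    {"id": "b", "text": "Dream pop textures meet club rhythms.",
     "metadata": {"genre": "electronic"}},
    {"id": "c", "text": "A stripped-back folk album.",
     "metadata": {"genre": "folk"}},
    {"id": "d", "text": "Loud guitars and dream pop vocals collide.",
     "metadata": {"genre": "rock"}},
]

results = hybrid_rag_sketch(
    candidates,
    query="loud dream pop guitars",
    keyword="dream pop",
    metadata={"genre": "rock"},
)
result_ids = [d["id"] for d in results]
```

Each stage is optional, so the same function covers the baseline (no filters), a single technique, or all three together.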
'indie' is semantically redundant with the query — vector search already retrieves indie content, so the keyword filter produced identical results. 'shoegaze' is a specific subgenre term that forces a distinct result set, making the contrast with baseline legible. Added note explaining why semantically-redundant keywords show no effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update from shoegaze to dream pop - The idea is to have terms that can be semantically similar to a broader and more diverse set of albums
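The keyword choice discussed in these commits can be checked numerically. The sketch below, built on a made-up ranked corpus, compares the baseline top-k with the keyword-filtered top-k using Jaccard overlap: a keyword that already appears in every baseline hit leaves the result set unchanged, while a more selective term pulls in different documents. The corpus, the ranking, and the helper names are all hypothetical.

```python
def top_k_ids(ranked_docs, k=3, keyword=None):
    # ranked_docs is assumed to be sorted by vector similarity already.
    # The keyword filter is applied before truncating to k results.
    if keyword:
        ranked_docs = [d for d in ranked_docs if keyword.lower() in d["text"].lower()]
    return [d["id"] for d in ranked_docs[:k]]


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Hypothetical corpus, ordered by similarity to an indie-themed query.
ranked = [
    {"id": 1, "text": "An indie rock debut with jangly guitars."},
    {"id": 2, "text": "Indie dream pop with hazy vocals."},
    {"id": 3, "text": "A lo-fi indie record."},
    {"id": 4, "text": "Dream pop from a veteran electronic act."},
    {"id": 5, "text": "Orchestral dream pop ballads."},
    {"id": 6, "text": "An indie folk collection."},
]

baseline = top_k_ids(ranked)
indie_overlap = jaccard(baseline, top_k_ids(ranked, keyword="indie"))
dream_pop_overlap = jaccard(baseline, top_k_ids(ranked, keyword="dream pop"))
```

An overlap of 1.0 means the filter changed nothing, which is the situation the 'indie' example ran into.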
… into feature/hybrid-rag
Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.
vishnouvina left a comment
Only two comments:
- the notebook numbering won't match 164
- the embeddings at scale notebook still contains all of its cell outputs, which hurts readability (some outputs are extra large)
- Use UUID to ensure unique batch ids
- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API vs Gateway-based)
- Format output - Update prompts
- Enhance display output - Fix questions
- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states.
- Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution.
- Modified batch description in `04_7_vectordb_docker.ipynb` for clarity.
- Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`.
- Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`.
- Updated `.env` file to maintain consistency in environment variable settings.
… into feature/hybrid-rag
This is messy. I only changed a handful of files, not the 32 stated above. Closing and reopening.
What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)
This pull request significantly refactors and enhances the `02_7_vectordb_docker.ipynb` lab notebook to streamline the embedding workflow, enrich data with metadata, and improve integration with environment variables and APIs. The changes automate the creation and loading of ChromaDB inputs, add structured metadata to each entry, and modernize the code to use configurable settings and improved response handling.
Key improvements and changes:
1. Metadata Enrichment and Data Pipeline Automation
2. ChromaDB Integration and Environment Configuration
3. Batch Processing and Data Loading Improvements
… `tqdm`.
4. Query and Prompt Generation Enhancements
5. API Usage and Output Handling
… `client.responses.create` method, supporting more flexible model and input configuration, and adjusted output handling to match the new API.
These changes collectively make the notebook more robust, reproducible, and adaptable for both instructional and production use.
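For the `client.responses.create` call mentioned in the description, a minimal sketch of the calling pattern is shown below. The helper name, the model string, and the stub client are assumptions made so the example runs offline; the notebook's real client setup (direct API or gateway) is not shown in this PR description.

```python
from types import SimpleNamespace


def generate_answer(client, model, prompt):
    """Send one prompt through the Responses API and return its text output."""
    response = client.responses.create(model=model, input=prompt)
    # The OpenAI Python SDK exposes the concatenated text as `output_text`.
    return response.output_text


class _StubResponses:
    # Offline stand-in (assumption) that mimics the shape of the real call.
    def create(self, model, input):
        return SimpleNamespace(output_text=f"[{model}] echo: {input}")


stub_client = SimpleNamespace(responses=_StubResponses())
answer = generate_answer(stub_client, "example-model", "Summarize this review.")
```

With the real SDK, `stub_client` would be replaced by an `OpenAI()` instance from the `openai` package and `"example-model"` by whichever model name the notebook reads from its configuration.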
What did you learn from the changes you have made?
N/A
Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?
N/A
Were there any challenges? If so, what issue(s) did you face? How did you overcome it?
N/A
How were these changes tested?
N/A
A reference to a related issue in your repository (if applicable)
N/A
Checklist