Feature/hybrid rag #167
- Use UUID to ensure unique batch ids
- Define key constants generically - Remove dead code - Add optional execution paths (file-based vs API-based, API-based vs Gateway-based)
- Chunk by paragraphs - Induce paragraph breaks every N sentences
Progressive lab teaching keyword filtering, metadata filtering, and LLM re-ranking on the pitchfork_reviews ChromaDB collection. All three techniques compose into a single hybrid_rag() pipeline. No SQL or new infrastructure required — builds on 02_7. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'indie' is semantically redundant with the query — vector search already retrieves indie content, so the keyword filter produced identical results. 'shoegaze' is a specific subgenre term that forces a distinct result set, making the contrast with baseline legible. Added note explaining why semantically-redundant keywords show no effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update from shoegaze to dream pop - The idea is to have terms that can be semantically similar to a broader and more diverse set of albums
… into feature/hybrid-rag
- Format output - Update prompts
- Enhance display output - Fix questions
- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states. - Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution. - Modified batch description in `04_7_vectordb_docker.ipynb` for clarity. - Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`. - Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`. - Updated `.env` file to maintain consistency in environment variable settings.
… into feature/hybrid-rag
Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.
What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)
This pull request introduces several improvements and corrections to the data preparation and embeddings-at-scale labs, primarily focusing on file organization, text chunking, and batch processing for embeddings. The changes ensure more consistent data paths, enhance text preprocessing by grouping sentences into paragraphs, and update the batching logic for embedding jobs. Additionally, environment configuration is updated to support embedding workflows.
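The paragraph-grouping step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `add_paragraph_breaks_sketch`, the `sentences_per_paragraph` parameter, its default of 5, and the regex-based sentence splitter are all assumptions made for the example.

```python
import re


def add_paragraph_breaks_sketch(text: str, sentences_per_paragraph: int = 5) -> str:
    """Group sentences into paragraphs separated by double newlines.

    Hypothetical sketch: the name, parameter, and default are assumptions,
    not the PR's actual implementation.
    """
    if sentences_per_paragraph < 1:
        raise ValueError("sentences_per_paragraph must be at least 1")

    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    # Abbreviations such as "e.g." will be split incorrectly.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]

    # Join every N sentences into one paragraph.
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]

    # Double newlines give a downstream splitter a paragraph boundary to prefer.
    return "\n\n".join(paragraphs)
```

With paragraph boundaries in place, a splitter that tries `"\n\n"` before sentence-level separators can cut chunks at those boundaries instead of mid-paragraph.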
Key changes:
File organization and path consistency:
- Documents use the `pitchfork` subdirectory, ensuring consistent file organization throughout the codebase.
Text preprocessing and chunking:
- Added `add_paragraph_breaks` to group sentences into paragraphs separated by double newlines, and integrated it into the JSONL export process for the `content` table. This improves downstream text chunking for embeddings.
- Chunking uses `chunk_size=2800`, `chunk_overlap=700`, and paragraph/sentence separators, aligning chunk boundaries with natural paragraph breaks.
Batch processing and embedding workflow:
- Batch files use `pitchfork/embeddings_batches/`, and the glob pattern was updated accordingly.
- Batch ids use `uuid` for better tracking.
Code and output improvements:
Environment configuration:
- Added `EMBEDDING_MODEL` and `CHROMA_URL` variables to the `.env` file to support embedding workflows and local Chroma database connections.
What did you learn from the changes you have made?
N/A
Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?
N/A
Were there any challenges? If so, what issue(s) did you face? How did you overcome it?
N/A
How were these changes tested?
N/A
A reference to a related issue in your repository (if applicable)
N/A
Checklist