Feature/hybrid rag #167

Merged
calderonjesus merged 82 commits into main from feature/hybrid-rag
Jun 14, 2026
@calderonjesus

Collaborator

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

This pull request introduces several improvements and corrections to the data preparation and embeddings-at-scale labs, primarily focusing on file organization, text chunking, and batch processing for embeddings. The changes ensure more consistent data paths, enhance text preprocessing by grouping sentences into paragraphs, and update the batching logic for embedding jobs. Additionally, environment configuration is updated to support embedding workflows.

Key changes:

File organization and path consistency:

  • Updated all file paths to use the pitchfork subdirectory for documents, ensuring consistent file organization throughout the codebase.

Text preprocessing and chunking:

  • Added a new function add_paragraph_breaks to group sentences into paragraphs separated by double newlines, and integrated it into the JSONL export process for the content table. This improves downstream text chunking for embeddings.
  • Updated the chunking logic to use paragraph-based chunking with chunk_size=2800, chunk_overlap=700, and paragraph/sentence separators, aligning chunk boundaries with natural paragraph breaks.
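
A minimal sketch of the two steps above, for illustration only. The body of add_paragraph_breaks, its `sentences_per_paragraph` parameter, and the `chunk_by_paragraphs` helper are assumptions, not the code in this PR; the chunker is a simplified stand-in that packs whole paragraphs and does not fall back to sentence separators for oversized paragraphs.

```python
import re


def add_paragraph_breaks(text: str, sentences_per_paragraph: int = 5) -> str:
    """Group sentences into paragraphs separated by a blank line.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)


def chunk_by_paragraphs(text: str, chunk_size: int = 2800,
                        chunk_overlap: int = 700) -> list:
    """Greedily pack whole paragraphs into chunks of at most chunk_size
    characters, repeating trailing paragraphs (up to chunk_overlap
    characters) at the start of the next chunk.

    A single paragraph longer than chunk_size is kept whole in this sketch.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry over trailing paragraphs that fit in the overlap budget.
            carried = []
            for prev in reversed(current):
                if len("\n\n".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried paragraphs if they would overflow the new chunk.
            while carried and len("\n\n".join(carried + [para])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the breaks are inserted before export, the chunker can split on `"\n\n"` and keep each chunk aligned with a group of complete sentences.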

Batch processing and embedding workflow:

  • Changed the output directory for embedding batch files to pitchfork/embeddings_batches/ and updated the glob pattern accordingly.
  • Improved batch file preparation by ensuring the output directory exists before writing files.
  • Enhanced the batch upload process by removing unnecessary print statements and generating unique batch IDs using uuid for better tracking.
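
The batch preparation described above could look roughly like the sketch below. The `write_batches` helper, the `batch_<id>.jsonl` file naming, and the default `batch_size` are hypothetical; the PR only states that the directory is created before writing and that batch IDs come from uuid, so attaching the ID to the file name is an assumption.

```python
import json
import uuid
from pathlib import Path


def write_batches(records, out_dir, batch_size=100):
    """Write records to JSONL batch files, one uniquely named file per batch."""
    out_path = Path(out_dir)
    # Ensure the output directory exists before writing any file.
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), batch_size):
        # A random UUID keeps batch IDs unique across repeated runs.
        batch_id = uuid.uuid4().hex
        file_path = out_path / f"batch_{batch_id}.jsonl"
        with file_path.open("w", encoding="utf-8") as handle:
            for record in records[start:start + batch_size]:
                handle.write(json.dumps(record) + "\n")
        written.append(file_path)
    return written
```

With this layout, the files can be collected later with `Path("pitchfork/embeddings_batches").glob("batch_*.jsonl")`.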

Code and output improvements:

  • Enhanced chunk inspection by displaying the first five chunks and their metadata using Markdown in notebook outputs.
  • Added missing imports and client initialization for file uploads.
  • Minor bug fixes, such as correcting index usage and notebook display settings.

Environment configuration:

  • Added EMBEDDING_MODEL and CHROMA_URL variables to the .env file to support embedding workflows and local Chroma database connections.
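
As a rough illustration of how a notebook might consume these two variables, the sketch below reads them from the process environment and fails early if either is missing. The `load_embedding_settings` helper is hypothetical, and it assumes the .env file has already been loaded into the environment (for example with python-dotenv's `load_dotenv`).

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = ("EMBEDDING_MODEL", "CHROMA_URL")


def load_embedding_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the embedding settings, raising KeyError if any are unset."""
    source = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: source[key] for key in REQUIRED_KEYS}
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.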

What did you learn from the changes you have made?

N/A

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

N/A

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

N/A

How were these changes tested?

N/A

A reference to a related issue in your repository (if applicable)

N/A

Checklist

  • [x] I can confirm that my changes are working as intended

calderonjesus and others added 30 commits June 13, 2026 16:47
- Use UUID to ensure unique batch ids
- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs. API-based, API vs. Gateway-based)
- Chunk by paragraphs
- Induce paragraph breaks every N sentences
Progressive lab teaching keyword filtering, metadata filtering, and LLM
re-ranking on the pitchfork_reviews ChromaDB collection. All three
techniques compose into a single hybrid_rag() pipeline. No SQL or new
infrastructure required — builds on 02_7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
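
The composition described in the commit message above can be sketched in plain Python. This is not the lab's code: the candidates stand in for results already retrieved from the pitchfork_reviews collection, the helper names and signatures other than hybrid_rag are hypothetical, and `score_fn` is a placeholder for the LLM re-ranking step.

```python
def keyword_filter(docs, keyword):
    """Keep documents whose text contains the keyword (case-insensitive)."""
    return [d for d in docs if keyword.lower() in d["text"].lower()]


def metadata_filter(docs, conditions):
    """Keep documents whose metadata matches every key/value condition."""
    return [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in conditions.items())
    ]


def rerank(docs, score_fn):
    """Order documents by descending score; score_fn stands in for an LLM."""
    return sorted(docs, key=score_fn, reverse=True)


def hybrid_rag(candidates, keyword=None, metadata=None, score_fn=None, top_k=3):
    """Apply each optional stage in turn and return the top_k survivors."""
    docs = list(candidates)
    if keyword:
        docs = keyword_filter(docs, keyword)
    if metadata:
        docs = metadata_filter(docs, metadata)
    if score_fn:
        docs = rerank(docs, score_fn)
    return docs[:top_k]
```

Each stage only narrows or reorders the candidate list, which is why a keyword that the vector search already favors leaves the results unchanged.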
'indie' is semantically redundant with the query — vector search already
retrieves indie content, so the keyword filter produced identical results.
'shoegaze' is a specific subgenre term that forces a distinct result set,
making the contrast with baseline legible. Added note explaining why
semantically-redundant keywords show no effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update from shoegaze to dream pop
- The idea is to have terms that can be semantically similar to a broader and more diverse set of albums
- Use UUID to ensure unique batch ids
- Define key constants generically
- Remove dead code
- Add optional execution paths (file-based vs. API-based, API vs. Gateway-based)
- Chunk by paragraphs
- Induce paragraph breaks every N sentences
- Format output
- Update prompts
- Enhance display output
- Fix questions
- Updated execution counts to null in `04_7_vectordb_docker.ipynb` to reset cell states.
- Removed output data from specific cells in `04_7_vectordb_docker.ipynb` for cleaner execution.
- Modified batch description in `04_7_vectordb_docker.ipynb` for clarity.
- Added additional context in the helper functions section of `04_8_hybrid_rag.ipynb`.
- Ensured proper formatting and added a note about keywords in `04_8_hybrid_rag.ipynb`.
- Updated `.env` file to maintain consistency in environment variable settings.
@github-actions

Hello, thank you for your contribution. If you are a participant, please close this pull request and open it in your own forked repository instead of here. Please read the instructions on your onboarding Assignment Submission Guide more carefully. If you are not a participant, please give us up to 72 hours to review your PR. Alternatively, you can reach out to us directly to expedite the review process.

@calderonjesus calderonjesus marked this pull request as ready for review June 14, 2026 20:01
@calderonjesus calderonjesus merged commit 28711ce into main Jun 14, 2026
1 check passed
2 participants