Some Improvements
There are a few things we can do to improve our BM25 search. There's a trade-off for everything, and being able to discuss these trade-offs in the context of the documentation we index is really valuable.
- Stop word removal: reduces noise by removing common words like "the", "has", etc.
- Stemming: maps "walking" to "walk", which then allows a query to match "walk", "walks" and "walking". The two popular families are the Porter and Snowball stemmers; Snowball balances accuracy, speed and language support, whereas Porter is the classic, more aggressive and widely used option.
- Lower-casing: we're lower-casing text in our tokenizer (_tokenize -> re.findall(r"\b\w+\b", text.lower())). Lower-casing can improve recall but may reduce precision by conflating proper nouns and acronyms with common words, e.g. "Rain" vs "rain", "US" vs "us".
- Neither stop word removal nor stemming/lemmatization is a silver bullet. Configurable stop word, stemmer and lemmatization parameters would be best, as we could then enable or disable each one depending on the use case. For example, code or documentation with heavy symbol usage would have them disabled.
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from rank_bm25 import BM25Okapi
import numpy as np

@staticmethod
def _tokenize(text: str) -> list[str]:
    """Tokenize lower-cased text using NLTK's RegexpTokenizer.

    See: https://www.nltk.org/api/nltk.tokenize.regexp.html
    """
    if not isinstance(text, str):
        return []
    return RegexpTokenizer(r"\b\w+\b").tokenize(text.lower())

def _preprocess(self, text: str) -> list[str]:
    """Tokenize, drop stop words and stem.

    Documents and the query must share this pipeline, otherwise stemmed
    query tokens won't match the unstemmed index.
    """
    stemmer = SnowballStemmer(language="english")
    return [stemmer.stem(word) for word in self._tokenize(text)
            if word not in self._stop_words]

def _compute_bm25_scores(self, documents: list[str], query: str) -> np.ndarray:
    """Compute BM25 scores for documents against the query."""
    tokenized_docs = [self._preprocess(doc) for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    return bm25.get_scores(self._preprocess(query))
What To Start With
A nice way to test and see the impacts would be to try out the following modules and classes. The list isn't exhaustive; it's a balancing act that's highly dependent on the types of documents we index.
from rank_bm25 import BM25Plus, BM25Okapi, BM25L
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
Some Real World Examples
For a nice example you can refer to LlamaIndex:
- They implement both stop word removal and stemming.
from llama_index.retrievers.bm25 import BM25Retriever
Another example comes from Weaviate, where they use BM25F.
Pitfalls
Some common pitfalls exist with BM25. Below is just one example, and there is also a set of trade-offs around which variant of the algorithm you use. Sticking to BM25 is OK for now, as we have a diverse set of documents we ingest.
To see an example of the pitfalls which can occur, see this issue in the rank_bm25 GitHub repo: although we don't want to score "hello"-style terms highly, we do want to ensure we get rich context returned.