
Keyword Search: BM25 in _compute_bm25_scores #128

@aidancbtw

Description


Some Improvements

There are a few things we can do to improve our BM25 search. There's a trade-off for everything, and being able to discuss these trade-offs in the context of the documentation we index is really valuable.

  1. Stop word removal. This reduces noise by removing high-frequency function words like "the", "has", etc.
  2. Stemming, which would reduce "walking" to "walk", allowing a query to match "walk", "walks", and "walking". Two popular stemmers are Porter and Snowball: Snowball balances accuracy, speed, and language support, whereas Porter is the classic, more aggressive, and widely used choice.
  3. We're lower-casing text in our tokenizer: _tokenize -> re.findall(r"\b\w+\b", text.lower()). Lower-casing can improve recall but may reduce precision by conflating proper nouns and acronyms with common words, e.g. "Rain" vs. "rain", "US" vs. "us".
  4. Neither stop word removal, stemming, nor lemmatization is a silver bullet. Configurable stop word, stemmer, and lemmatization parameters would be best, so we can enable or disable each depending on the use case. For example, code or documentation with heavy symbol usage would have them disabled.
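A sketch of what that configurable pipeline could look like. The parameter names and the tiny stop word set below are hypothetical placeholders for illustration, not anything from our codebase:

```python
import re

from nltk.stem import SnowballStemmer

# Hypothetical minimal stop word list; in practice this would come from a
# configurable source (e.g. NLTK's stopwords corpus) and would be disabled
# entirely for symbol-heavy code or documentation.
STOP_WORDS = {"the", "a", "an", "has", "is", "to"}


def tokenize(
    text: str,
    *,
    lowercase: bool = True,
    remove_stop_words: bool = True,
    stem: bool = True,
) -> list[str]:
    """Tokenize text with each normalization step individually toggleable."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\b\w+\b", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        stemmer = SnowballStemmer(language="english")
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens
```

With everything enabled, "the"/"has" get dropped and "walking" maps to "walk"; for code search, all three flags could be switched off to keep tokens verbatim.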

Pseudocode @ keyword_search_service.py

    @staticmethod
    def _tokenize(text: str) -> list[str]:
        """Tokenize text using NLTK RegexpTokenizer word tokenization.
        
        See: https://www.nltk.org/api/nltk.tokenize.regexp.html
        """
        if not isinstance(text, str):
            return []
        return RegexpTokenizer(r"\b\w+\b").tokenize(text)

    def _compute_bm25_scores(self, documents: list[str], query: str) -> np.ndarray:
        """Compute BM25 scores for documents against the query.

        Applies the same tokenize -> stop-word filter -> stem pipeline to both
        the documents and the query, so their tokens stay comparable.
        """
        stemmer = SnowballStemmer(language="english")  # build once, not per word

        def preprocess(text: str) -> list[str]:
            return [
                stemmer.stem(word)
                for word in self._tokenize(text)
                if word not in self._stop_words
            ]

        tokenized_docs = [preprocess(doc) for doc in documents]
        bm25 = BM25Okapi(tokenized_docs)
        return bm25.get_scores(preprocess(query))

What To Start With

A nice way to test and see the impacts would be to try out the following modules and classes. The list isn't exhaustive; it's a balancing act that's highly dependent on the types of documents we index.

from rank_bm25 import BM25Plus, BM25Okapi, BM25L
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

Some Real World Examples

For a nice example, you can refer to LlamaIndex:

  • They implement both stop word removal and lemmatization.
from llama_index.retrievers.bm25 import BM25Retriever

Another example is from Weaviate, where they use BM25F.

Pitfalls

Some common pitfalls exist with BM25; below is just one example, and there is also a set of trade-offs around which algorithm variant you use. Sticking with BM25 is OK for now, as we ingest a diverse set of documents.

To see an example of the pitfalls that can occur, see this issue in the rank_bm25 GitHub repo: although we don't want to score "hello"-style terms, we do want to ensure we get rich context returned.
