Some Improvements
There are a few things we can do to improve our BM25 search. There's a trade-off for everything, and being able to discuss these trade-offs in the context of the documentation we index is really valuable.
- Stop word removal: reduces noise by removing common words like "the", "has", etc.
- Stemming: maps "walking" to "walk", which then allows a query to match "walk", "walks" and "walking". The two popular families are the Porter and Snowball stemmers; Snowball balances accuracy, speed and language support, whereas Porter is the classic, more aggressive and widely used option.
- Lower-casing: we're lower-casing text in our tokenizer (_tokenize -> re.findall(r"\b\w+\b", text.lower())). Lower-casing can improve recall but may reduce precision by conflating proper nouns and acronyms with common words, e.g. "Rain" vs "rain", "US" vs "us".
- Neither stop word removal nor stemming/lemmatization is a silver bullet. Configurable stop word, stemmer and lemmatization parameters would be best, as we could then enable or disable each one depending on the use case. For example, code or documentation with heavy symbol usage would have them disabled.
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from rank_bm25 import BM25Okapi
import numpy as np

@staticmethod
def _tokenize(text: str) -> list[str]:
    """Tokenize lower-cased text using NLTK's RegexpTokenizer.

    See: https://www.nltk.org/api/nltk.tokenize.regexp.html
    """
    if not isinstance(text, str):
        return []
    return RegexpTokenizer(r"\b\w+\b").tokenize(text.lower())

def _preprocess(self, text: str) -> list[str]:
    """Tokenize, drop stop words and stem.

    Documents and the query must share this pipeline, otherwise stemmed
    query tokens won't match the unstemmed index.
    """
    stemmer = SnowballStemmer(language="english")
    return [stemmer.stem(word) for word in self._tokenize(text)
            if word not in self._stop_words]

def _compute_bm25_scores(self, documents: list[str], query: str) -> np.ndarray:
    """Compute BM25 scores for documents against the query."""
    tokenized_docs = [self._preprocess(doc) for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    return bm25.get_scores(self._preprocess(query))
What To Start With
A nice way to test and see the impacts would be to try out the following modules and classes. The list isn't exhaustive; it's a balancing act that's highly dependent on the types of documents we index.
from rank_bm25 import BM25Plus, BM25Okapi, BM25L
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
Some Real World Examples
For a nice example you can refer to LlamaIndex:
- They implement both stop word removal and stemming.
from llama_index.retrievers.bm25 import BM25Retriever
Another example comes from Weaviate, where they use BM25F.
Pitfalls
Some common pitfalls exist with BM25. Below is just one example, and there is also a set of trade-offs around which variant of the algorithm you use. Sticking to BM25 is OK for now, as we have a diverse set of documents we ingest.
To see an example of the pitfalls which can occur, see this issue in the rank_bm25 GitHub repo: although we don't want to score "hello"-style terms highly, we do want to ensure we get rich context returned.