As an SRE, I often have to look for information in logs—a lot of logs, a ton of information, across many files. And generally, whether you're on the internet or just on your own computer, you're facing the same problem: you need to find information buried in an ever-growing pile of documents.
That brings us to a key problem: when I search for the word Panda on my computer, which file should come up first? The most recent one? No—the most relevant one! And figuring out whether a document is relevant to a search query is today’s topic.
Grab a seat, get yourself a coffee, tea, or hot chocolate. Let’s go!
Today's problem can be summarized like this: for any search, I want to assign a relevance score to my documents. The first document returned should be the one with the highest score!
Before we begin: a quick note. This article contains some mathematical formulas. Don’t worry—we’ll go through everything step by step, and I’ll explain things clearly so they’re easy to understand.
We'll start with a small list of documents that will guide us through this discussion.
Here they are:
- Doc 1: “A panda is a black and white animal”
- Doc 2: “The dog is white”
- Doc 3: “The cat is black”
- Doc 4: “The panda is neither a cat nor a dog”
- Doc 5: “The red panda is red”
- Doc 6: “Black is black, there's really no more hope I’m in the dark, I find it hard to believe Black is black, it's never too late Black is black, I still have hope Black is black, I still have hope Black is black, I still have hope”
This is a simple test set, but it already helps us see the problem with scoring.
Let’s start intuitively: if I search for the word “black”, which of documents 1, 3, and 6 is the most relevant?
Not easy, right?
Before We Go Further: Some Definitions
Token
Our documents—and even our search queries—aren’t made of “words,” but what we’ll refer to as tokens.
In this blog post, a word = a token. So nothing really changes, but since we’ll use the term token throughout the article, it’s better to clarify it now.
For example, document 1 contains the following tokens:
“A”, “panda”, “is”, “a”, “black”, “and”, “white”
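How exactly text gets split into tokens is up to the indexer. Here is a minimal sketch of what a Tokenize helper could look like (this helper is an assumption about how the token lists get filled; lowercasing and splitting on non-letters is enough for our toy corpus, while a real tokenizer would also handle stemming, stop words, accents, and so on):

using System.Text.RegularExpressions; // plus System.Linq for Where / ToList

public static List<string> Tokenize(string text)
{
    // Lowercase the text, then split on anything that is not a letter or a digit.
    return Regex.Split(text.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+")
                .Where(t => t.Length > 0)
                .ToList();
}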
Inverted Index
When we search for the word black in our documents, we don’t open them one by one to see if the word is inside.
Before any search happens, our application must index the documents—scan through them to extract the tokens in each. During this step, we build what’s called an inverted index.
An inverted index is a dictionary where:
- The key is a token
- The value is a list of documents that contain that token
For example, our inverted index might include keys like:
“the”, “panda”, “is”, “a”, “animal”, “black”, “and”, “white”, “dog”, “cat”, …
The key “panda” is associated with documents 1, 4, and 5.
The key “black” is associated with documents 1, 3, and 6.
And so on.
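To make this concrete, here is one way the index could be built at indexing time. It’s a sketch under the assumption that documents are identified by an id and tokenized with the Tokenize helper above:

// Inverted index: token -> ids of the documents that contain it.
var corpus = new Dictionary<string, string>
{
    ["1"] = "A panda is a black and white animal",
    ["2"] = "The dog is white",
    ["3"] = "The cat is black"
    // ... and so on for the rest of the test set
};

var invertedIndex = new Dictionary<string, List<string>>();

foreach (var entry in corpus)
{
    // Distinct() so a document is listed only once per token.
    foreach (var token in Tokenize(entry.Value).Distinct())
    {
        if (!invertedIndex.TryGetValue(token, out var docIds))
            invertedIndex[token] = docIds = new List<string>();

        docIds.Add(entry.Key);
    }
}

// With just these three documents: invertedIndex["black"] -> ["1", "3"]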
Here’s a quick look at the classes we’ll be working with.
First, the Document class:
public class Document
{
public string Id { get; set; }
public string Name { get; set; }
public string Path { get; set; }
public string Content { get; set; }
public int Length { get; set; }
public List<string> ContentTokens { get; set; }
public List<string> TitleTokens { get; set; }
public DateTime IndexedAt { get; set; } = DateTime.UtcNow;
}
And the search result, via the SearchResult class:
public class SearchResult
{
public string Id { get; set; }
public string Name { get; set; }
public string Path { get; set; }
public string ContentSnippet { get; set; }
public double Score { get; set; }
public DateTime IndexedAt { get; set; }
}
Our final output will be a list of SearchResult objects.
Scoring Algorithm – Simple and Basic
A basic first approach is to count how many times a token appears in the document.
If I search for the token “black”, it appears once in doc1, once in doc3, and 11 times in doc6.
Let’s assign a score of 1 for each occurrence of the token. This is very easy to do in C# with LINQ:
public int GetScore(string token, Document doc)
{
    // Count how many times the query token appears among the document's tokens.
    return doc.ContentTokens.Count(t => t == token.ToLowerInvariant());
}
Here, we’re counting how many times the token from our query appears in the document.
So we’ll get a score of:
- 1 for documents 1 and 3
- 11 for document 6
Alright… But beyond the fact that token count may not be the best metric, we quickly hit another issue.
If my query is “cat”, “black”, then doc6 will still score 11, while doc3, which contains both tokens, only scores 2.
Yet doc3 is the only document that matches both query tokens, so intuitively it should be considered more relevant!
To fix this first issue, we could add a boost to the score if the document contains all the tokens in the query.
But more importantly, we might want to reduce the impact of a token that appears many times in the same document.
Should a document that contains the token “black” 50 times have a score twice as high as one that contains it 25 times?
Probably not. It should be ahead, yes—but not by that much!
Also, documents 1 and 3 both have a score of 1 for the token “black”, and there's no way to break the tie with this simple method.
An Improved Version of the Scoring Function
We can modify our scoring function to:
- Calculate the score for all query tokens
- Limit the impact a single token can have on the final score
- Boost results that match multiple tokens
public int GetScore(List<string> queryTokens, Document doc)
{
    int totalScore = 0;
    int matchedTokens = 0;

    foreach (var token in queryTokens)
    {
        // Frequency of the query token in the document, capped at 5.
        int score = doc.ContentTokens.Count(t => t == token.ToLowerInvariant());
        if (score > 5)
            score = 5;

        if (score > 0)
            matchedTokens++;

        totalScore += score;
    }

    // Boost: 10 points per query token found in the document.
    totalScore += matchedTokens * 10;
    return totalScore;
}
From now on, a token can no longer influence the score beyond 5.
And the document gains 10 points for each query token it contains.
Let’s do a new test with the query “cat”, “black”:
Matching documents are: 1, 3, 4, and 6.
- Documents 1 and 4, which each contain only one of the query tokens, will have a score of 11: 1 point for token frequency, and 10 points for matching one token of the query.
- Document 3 will have a score of 22: 2 points for the presence of “cat” and “black”, and 20 points for matching 2 tokens in the query.
- Document 6 will have 15 points: 5 points for the frequency of the token “black” (capped at 5), and 10 points for matching 1 token.
Final result:
- Doc 3 → 22 points
- Doc 6 → 15 points
- Doc 1 and Doc 4 → 11 points
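As a quick sanity check, here is how that plays out in code, reusing the Tokenize sketch from earlier. Only documents 1, 3, and 4 are wired up for brevity; doc6 would come out at 15, as explained above:

var query = new List<string> { "cat", "black" };

var docs = new List<Document>
{
    new Document { Id = "1", Content = "A panda is a black and white animal" },
    new Document { Id = "3", Content = "The cat is black" },
    new Document { Id = "4", Content = "The panda is neither a cat nor a dog" }
};

foreach (var doc in docs)
{
    doc.ContentTokens = Tokenize(doc.Content);
    Console.WriteLine($"doc{doc.Id}: {GetScore(query, doc)} points");
}
// doc1: 11 points
// doc3: 22 points
// doc4: 11 points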
We’ve solved our first two problems… but we can do better!
Here, the real issue is the limit of 5.
It’s not so much the value “5” that’s problematic, but the fact that we have a linear score increase that suddenly stops at an arbitrary limit.
Diminishing Returns
To solve this problem, we won’t impose a fixed limit beyond which the frequency of a term no longer yields points, but instead apply diminishing returns. For example, with a decay of 0.97, the first occurrence is worth 1, the second 0.97, the third about 0.94, and so on.
Here is the formula we’ll use, where f is the frequency of the token in the document and decay is a constant slightly below 1:

freqScore(f) = 1 + decay + decay^2 + … + decay^(f-1) = (1 - decay^f) / (1 - decay)
Let’s keep that decay value of 0.97 and take the query “black”.
For a document where the token “black” appears 6 times, we get the following result:

(1 - 0.97^6) / (1 - 0.97) ≈ 5.57

The difference here is not striking (we would have had 6 before, now we have about 5.6). But let’s calculate the frequency score for a token that appears even more often:
- For 20 occurrences: ≈ 15.2
- For 50 occurrences: ≈ 26.1
- For 100 occurrences: ≈ 31.7
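If you want to check these numbers yourself, a few lines are enough (same decay of 0.97):

// Frequency score with diminishing returns: (1 - decay^f) / (1 - decay).
double decay = 0.97;

foreach (int f in new[] { 1, 6, 20, 50, 100 })
{
    double freqScore = (1 - Math.Pow(decay, f)) / (1 - decay);
    Console.WriteLine($"f = {f,3} -> {freqScore:F2}");
}
// f =   1 -> 1.00
// f =   6 -> 5.57
// f =  20 -> 15.21
// f =  50 -> 26.06
// f = 100 -> 31.75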
We now have a way to differentiate between two documents that don’t have the same number of occurrences of a token, without overly favoring a document that contains this token a very large number of times.
The impact of the term on the score decreases progressively.
We therefore get a non-linear progression, and avoid arbitrarily stopping the point at which a token’s frequency no longer has an effect.
public double GetScore(List<string> queryTokens, Document doc, double decay = 0.97)
{
    double totalScore = 0;
    int matchedTokens = 0;

    foreach (var token in queryTokens)
    {
        int frequency = doc.ContentTokens.Count(t => t == token.ToLowerInvariant());
        if (frequency > 0)
            matchedTokens++;

        // Diminishing returns: 1 + decay + decay^2 + ... + decay^(frequency - 1)
        double freqScore = (1 - Math.Pow(decay, frequency)) / (1 - decay);
        totalScore += freqScore;
    }

    // Double the boost when the document matches every token in the query.
    if (matchedTokens == queryTokens.Count)
        matchedTokens *= 2;

    totalScore += matchedTokens * 10;
    return totalScore;
}
But can we do better? Or rather: what other problems might still arise?
Let’s try the following query: “the”, “animal”
If we rely only on the frequency score (and we can, since no document matches both tokens, so the boost for the number of matched tokens has no effect here), the documents containing the token “the” all get the same score:
- doc2 – 1
- doc3 – 1
- doc4 – 1
- doc5 – 1
- doc6 – 1
And document 1, the only one containing the token “animal”, also gets a frequency score of 1.
So once again we end up with a lack of discrimination between the documents. One might think that, since each term appears only once per document, we can’t do better… but that’s wrong!
There is still a difference between these tokens: their frequency across the entire document set, not just within a single document.
The word “the” appears in almost every document, whereas “animal” appears in only one.
So there is a rarity difference — and we can take advantage of that.
TF-IDF
Until now, we have only calculated what is called TF (Term Frequency), which is the frequency of a token in a document. It is now time to include in our score the Inverse Document Frequency (IDF), to measure the rarity of a word across all documents.
The idea is that the rarer a word is, the more informative it is. Which is exactly the case here: the token “the”, present everywhere, doesn’t help refine the result.
To calculate our IDF, we will use the following formula (log here is the base-10 logarithm):

IDF(t) = log(N / nt)

Where t is the token, N the total number of documents in our corpus, and nt the number of documents containing this token.
Let’s calculate the IDF of “the”, which appears in 5 of our 6 documents: IDF(“the”) = log(6 / 5) ≈ 0.079.
Then of “animal”, which appears in only one: IDF(“animal”) = log(6 / 1) ≈ 0.778.
And let’s use this IDF in our score calculation, now called TF-IDF, which is simply the product of the two: TF-IDF(t, d) = TF(t, d) × IDF(t).
Since our TF is still equal to 1 here, the calculation is simplified: we only keep the IDF value.
The document scores for our query are now:
- doc1 – 0.778
- doc2 – 0.079
- doc3 – 0.079
- doc4 – 0.079
- doc5 – 0.079
- doc6 – 0.079
And if we had another document containing just the two tokens, it would rank first with a score of 0.857.
Notice that we no longer even need to artificially boost the score based on the number of matched tokens in the query — a boost which was arbitrary. And that’s much better!
We have the formula, now let’s move on to the code:
public double GetScore(List<string> queryTokens, Document doc, double decay = 0.97)
{
    double totalScore = 0;

    foreach (var token in queryTokens)
    {
        int frequency = doc.ContentTokens.Count(t => t == token.ToLowerInvariant());
        if (frequency == 0)
            continue;

        // df: number of documents containing the token, straight from the inverted index.
        int df = ReversedIndex[token].Count;

        // Smoothed IDF variant: natural log and a +1 in the denominator to avoid division by zero.
        // Absolute values differ from the base-10 log of the worked example, but the idea is the same.
        double idf = Math.Log((double)Documents.Count / (1 + df));

        // TF with diminishing returns, as before.
        double tf = (1 - Math.Pow(decay, frequency)) / (1 - decay);

        double tfidf = tf * idf;
        totalScore += tfidf;
    }

    return totalScore;
}
The TF calculation doesn’t change. For the IDF, we simply retrieve the number of documents containing the token (df) from our inverted index, which already holds this information.
The total number of documents is also already known at indexing time.
So we have our IDF, our TF, and therefore the final document score!
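To turn these scores into the SearchResult list promised at the beginning, a minimal ranking loop could look like this. It’s a sketch: the Search method and its snippet logic are assumptions, and Documents is assumed to be a dictionary of indexed documents keyed by id:

public List<SearchResult> Search(string query)
{
    var queryTokens = Tokenize(query);

    return Documents.Values
        .Select(doc => new { Doc = doc, Score = GetScore(queryTokens, doc) })
        .Where(x => x.Score > 0)
        .OrderByDescending(x => x.Score)
        .Select(x => new SearchResult
        {
            Id = x.Doc.Id,
            Name = x.Doc.Name,
            Path = x.Doc.Path,
            // Naive snippet: the first 100 characters of the content.
            ContentSnippet = x.Doc.Content.Length > 100
                ? x.Doc.Content.Substring(0, 100) + "..."
                : x.Doc.Content,
            Score = x.Score,
            IndexedAt = x.Doc.IndexedAt
        })
        .ToList();
}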
But… can we do even better?
Let’s look at the current limitations (even though we’re already doing pretty well).
One of the problems comes from TF: as we’ve seen, it can increase very quickly and skew the score.
Early on, we added a saturation limit to keep it in check, a limit that classic TF-IDF doesn’t normally include.
But despite this, the algorithm actually favors long documents (via TF), even though a long document is not necessarily more relevant.
Moreover, we have few parameters to tweak our results (except decay, which is also not present in the base version).
It’s therefore time to move on to a new algorithm: BM25.
Best Match 25
Best Match 25 is, as its name suggests, the 25th iteration of a ranking function designed to address the shortcomings of TF-IDF.
First, here is its formula:

score(d, q) = Σ over each token t in the query q of:
IDF(t) × ( f(t, d) × (k1 + 1) ) / ( f(t, d) + k1 × (1 - b + b × |d| / avgdl) )
No need to run away—if you’ve made it this far, you have all the tools in hand to understand it, and we’ll go through it together.
We already know the IDF, so no issue there. Let’s instead focus on what replaces the TF.
Let’s start with the numerator:
- f(t,d) is the frequency of term t in document d.
- k1 is a parameter that controls saturation, playing the same role as our decay earlier.
The smaller k1 is, the faster saturation kicks in, which reduces the impact of each additional occurrence of the token in a non-linear way.
For the denominator, we again find our term frequency and the variable k1 for saturation, along with a normalization factor related to the document length.
- |d| is the actual length of the document (number of tokens it contains).
- avgdl is the average length of our indexed documents.
- b is a parameter that adjusts the importance of the normalization.
So, when a document is longer than average, its score will tend to decrease. On the other hand, a short document will see its score increase more quickly.
Let’s look at this with an example. Let’s reuse our list of documents and exclude document 6, which was very long. We’ll search for the token “black”.
Let’s start by defining our parameters k and b.
- k1 will be 1.5
- b will be 0.75
(these are fairly common values for these parameters).
Let’s also compute the average length of our documents, which is: 8 + 4 + 4 + 9 + 5, i.e., 30. Divided by the number of documents (5), we get an average length of 6.
So avgdl equals 6.
The token “black” appears in doc1 and doc3, giving an IDF of ln(5 / 2) ≈ 0.916 (using the natural logarithm this time).
Let’s start by calculating the score for document 1. Its length |d| is 8.
Let’s plug the values into the numerator: the term frequency in the document is 1, and k1 is 1.5. We already know the IDF (0.916). So we get:

IDF × f(t, d) × (k1 + 1) = 0.916 × 1 × (1.5 + 1)

Or:

0.916 × 2.5

In the denominator, we can also replace the known values (tf and k1):

1 + 1.5 × (1 - b + b × |d| / avgdl)

We know the average document length and the current document length, so we can replace |d| and avgdl:

1 + 1.5 × (1 - b + b × 8 / 6)

b is known, it is 0.75:

1 + 1.5 × (1 - 0.75 + 0.75 × 8 / 6) = 1 + 1.5 × 1.25 = 2.875

So we get:

0.916 × 2.5 / 2.875 ≈ 0.796

A final score of 0.796 for document 1.
What about document 3?
Document 3 is very similar: the term frequency is still 1, and the IDF is the same. The only part that changes is the length ratio |d| / avgdl, which is now 4 / 6 instead of 8 / 6:

0.916 × 2.5 / (1 + 1.5 × (1 - 0.75 + 0.75 × 4 / 6)) = 0.916 × 2.5 / 2.125 ≈ 1.077

A final score of 1.077 for document 3, a better score than document 1!
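Before writing the full scoring function, you can reproduce these two numbers with a few hard-coded lines (single query term, values taken from the example above):

double k1 = 1.5, b = 0.75, avgdl = 6.0;
double idf = Math.Log(5.0 / 2.0);   // ≈ 0.916: "black" appears in 2 of the 5 documents

foreach (var (name, docLength) in new[] { ("doc1", 8.0), ("doc3", 4.0) })
{
    double tf = 1.0;
    double termScore = idf * (tf * (k1 + 1)) /
                       (tf + k1 * (1 - b + b * docLength / avgdl));
    Console.WriteLine($"{name}: {termScore:F2}");
}
// doc1: 0.80
// doc3: 1.08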
A lot of calculations were needed to demonstrate how BM25 works, but since the operations are quite simple, they’re actually very easy to translate into code:
public double GetScore(List<string> queryTokens, Document doc)
{
    double k1 = 1.5;   // term-frequency saturation
    double b = 0.75;   // strength of the document-length normalization

    double score = 0;

    // Average document length in tokens across the whole corpus (avgdl),
    // assuming Documents maps an id to its Document.
    double avgDocLength = Documents.Average(d => (double)d.Value.ContentTokens.Count);

    foreach (var token in queryTokens)
    {
        int tf = doc.ContentTokens.Count(t => t == token.ToLowerInvariant());
        if (tf == 0 || !ReversedIndex.ContainsKey(token))
            continue;

        int df = ReversedIndex[token].Count;
        double idf = Math.Log((double)Documents.Count / (1 + df));

        // Saturating TF with length normalization: the BM25 term weight.
        double norm = (tf * (k1 + 1)) /
                      (tf + k1 * (1 - b + b * (doc.ContentTokens.Count / avgDocLength)));

        score += idf * norm;
    }

    return score;
}
Congratulations, you now know how to calculate a relevance score for a piece of content and return the best document to your user first.
Of course, other signals can be taken into account, such as tokens found in the title, the document’s modification date, or even its location.
Here we focused on a document’s content, but the same algorithm can be adapted and extended to other fields, like the title.
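For instance, a common approach is to score the title as a second field and give it a higher weight. A possible sketch, where the 2.0 weight and the GetTitleScore helper are purely hypothetical:

// Hypothetical field weighting: the title counts more than the body.
double contentScore = GetScore(queryTokens, doc);       // BM25 over ContentTokens
double titleScore = GetTitleScore(queryTokens, doc);    // same idea over TitleTokens (assumed helper)
double finalScore = contentScore + 2.0 * titleScore;    // arbitrary boost for the title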