Let’s write a search engine, part 2 of 2

 from Red Blob Games’s Blog
Blog post: 30 Aug 2025

Last time we implemented a search feature that found all matching documents, sorted in some arbitrary order. This time I want to sort them in a more reasonable order. Here are some ideas, some which depend on the query and others which don’t:

I’m going to try building this:

What we’re trying to build

To keep the scope small, I’m going to stick to substring match and not other query matching algorithms:

Note that sometimes matching should be asymmetric, because the way we write things in queries is not the same way we write things in documents:

There’s a whole lot to explore with matching algorithms but I want to finish this project in a weekend so I will stick to case-insensitive substring match.

For some experiments I need access to headings, so I added headings to the search data. Some of my headings are title="heading" and some are <h3>heading</h3> so I extracted both:

fd --extension bxml | while read -r file; do
    echo "%%%% $file %%%%"
    perl -ne 'print "HEADER=$1\n" if /title="(.*)?"/' "$file"
    perl -ne 'print "HEADER=$2\n" if /<h(\d).*?>(.*?)<\/h\1>/' "$file"
    cat "$file" \
        | pandoc --read html --write plain --wrap none \
        | grep '[a-zA-Z0-9]'
done >all-pages.txt

I’m using regexps instead of xpath or xmllint to extract from my xhtml files, so it misses some things and improperly formats others. That’s ok for now. The main goal is to try out ranking, and then I can go back and improve the extraction if I want to pursue this project.

I then refactored the code into smaller functions to make experimentation easier. I also improved the formatting and made the search results clickable (opening in a new window).

function escapeHTML(text) { // wish this was built in
    const map = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    return text.replace(/[&<>"']/g, m => map[m]);
}

function parseIntoDocuments(text) {
    // results[0] will be "", then [1] is a filename, [2] is text, alternating
    const results = text.split(/%%%% (.*?) %%%%$/gm); 
    if (results[0] !== "") throw `expected first split to be empty ${results[0]}`;
    if (results[1].length > 100) "expected second split to be filename";
    let documents = [];
    for (let i = 1; i+1 < results.length; i += 2) {
        let filename = results[i];
        let lines = results[i+1].split("\n");
        lines = lines.map(line => line.trim()); // remove extra whitespace
        lines = lines.filter(line => line); // remove blank lines
        let title = (lines.find((line) => line.startsWith("HEADER=")) ?? "").slice(7) || filename;
        let url = filename.startsWith("redblobgames/")
            ? "https://www.redblobgames.com/" + filename.slice(13)
            : "http://www-cs-students.stanford.edu/~amitp/" + filename;
        url = url.replace(".bxml", ".html");
        url = url.replace("/index.html", "/");
        documents.push({url, title, lines});
    }
    return documents;
}
const words = await (await fetch("./all-pages.txt")).text();
const documents = parseIntoDocuments(words);

function search(query) {
    let results = []; // list of {document, snippet}
    // where document is {url, title, lines}
    // where snippet is a list of {line, matches, isHeader}
    if (query) {
        query = new RegExp(RegExp.escape(query), 'gi');
        for (let document of documents) {
            let snippet = [];
            for (let line of document.lines) {
                let isHeader = false;
                if (line.startsWith("HEADER=")) {
                    line = line.slice(7);
                    isHeader = true;
                }
                let matches = line.matchAll(query).toArray();
                if (matches.length > 0) {
                    snippet.push({line, matches, isHeader});
                }
            }
            if (snippet.length > 0) results.push({document, snippet});
        }
    }
    return results;
}

function formatSnippet(snippet) {
    // TODO: be able to sort snippets before picking the top matches
    const MAX_SNIPPETS = 8;
    let result = "… ";
    for (let {line, matches} of snippet.slice(0, MAX_SNIPPETS)) {
        let pos = 0;
        for (let m of matches) {
            result += escapeHTML(line.slice(pos, m.index)) + `<b>${m[0]}</b>`;
            pos = m.index + m[0].length;
        }
        result += `${escapeHTML(line.slice(pos))} … `;
    }
    return result;
}

function formatSearchResults(query, results) {
    return results
          .map(({document, snippet}) => 
              `<a target="_blank" href="${document.url}">
                  ${document.title}
               </a>
               <div class="url">
                  ${document.url}
               </div>
               <div class="snippet">
                  ${formatSnippet(snippet)}
               </div>`)
          .join("\n");
}

function setUpSearch(id, sortResults) {
    const MAX_DOCUMENTS = 10;
    const el = document.getElementById(id);
    const update = () => {
      let query = el.querySelector("input").value;
      let results = sortResults(search(query));
      results = results.slice(0, MAX_DOCUMENTS);
      el.querySelector(".results").innerHTML = formatSearchResults(query, results);
    };
    el.querySelector("input").addEventListener('input', update);
    update();
}

setUpSearch('search-4', (results) => results);

I wanted to test the refactored code before implementing ranking. This shows documents in some arbitrary order, like before:

Let’s try ranking by the number of matches in the document:

setUpSearch('search-5', (results) => {
   function score({document, snippet}) {
     let numberOfMatches = 0;
     for (let {matches} of snippet) {
       numberOfMatches += matches.length;
     }
     return numberOfMatches;
   }

   return results.toSorted((a, b) => score(b) - score(a));
});

This is certainly better. The top 5 results before were four hexagon articles and then one about colors. The top 5 results now all have lots of hexagon content.

Let’s try ranking header matches higher than other matches:

setUpSearch('search-6', (results) => {
   function score({document, snippet}) {
     let numberOfMatches = 0;
     for (let {matches, isHeader} of snippet) {
       numberOfMatches += matches.length;
       if (isHeader) numberOfMatches += 10; // needs tuning!
     }
     return numberOfMatches;
   }

   return results.toSorted((a, b) => score(b) - score(a));
});

I think this is even better than before. But it’s still mediocre. Should I prioritize matches that occur in a cluster (max matches per block) or matches that are spread out (number of matching blocks)? I don’t know! It will take a lot of testing. Ranking takes a lot of work. There’s so much more that can be done to make search better. Some ideas:

Tangent: I think the description of search results is underrated. Early web search engines used a static description provided by the author of the page (“meta tag” description) or by a curator (Yahoo / DMoz directories), or generated a description per page. Google generated a description by scanning the document to find the query words, so that you could see the keywords in context of the document. Being able to see the matches of your query instead of a generic description made people feel that Google was actually paying attention to the query words. This is a surprisingly hard problem that I don’t see much written about. (For many queries, I think this was more important than PageRank™)

Another question is how to scale this up. I decided not to here. All the words on all my web pages put together are only a few megabytes, smaller than the javascript alone on the NYTimes home page. I can load the entire web site and search over it. But if I had many more pages, I might want an inverted index[1], which can tell us which documents contain a word. See Let’s Build a Full Text Search Engine[2]. If I wanted to index pages written by other people, I might want ranking functions that were more resistant to spamming, such as using links between documents and the anchor text used to describe those links.

I’m not planning to implement these right now. I wanted to see how far I could get in a weekend. The search results I’m getting here are decent. If you want to play with it some more, I made a standalone version that uses the full page to show search results.

This was a fun project!

Email me , or comment here: