Let’s write a search engine, part 2 of 2

 from Red Blob Games’s Blog
DRAFT
Blog post: 30 Aug 2025

Last time we implemented a search feature that found all matching documents, sorted in alphabetical order of the URL. This time I want to sort them in a more reasonable order. Here are some ideas, some of which depend on the query and others that don’t:

What we’re trying to build

To keep the scope small, I’m going to stick to substring match and not other query matching algorithms:

Note that sometimes matching should be asymmetric, because the way we write things in queries is not the same way we write things in documents:

There’s a whole lot to explore with matching algorithms but I want to finish this project in a weekend so I will stick to case-insensitive substring match.

Another topic to explore is how to construct the description of each search result. Early web search engines used a static description: provided by the author of the page (the “meta tag” description), by a curator (the Yahoo and DMoz directories), or generated once per page. Google instead generated the description dynamically, scanning the document for the query words so that you could see the keywords in the context of the document.

Another topic is how to scale this up. All the words on all my web pages put together are only a few megabytes, smaller than the JavaScript alone on the NYTimes home page, so I can load the entire web site and search over it.

Let’s start by refactoring the code into smaller functions so that we can mix and match for each of our experiments. I also needed to extract the headings from the documents, so I first changed the file format to include the headings as separate lines:

fd --extension bxml | while read -r file; do
    # skip pages marked noindex or draft; -q suppresses fgrep's output
    # so the matched lines don't leak into all-pages.txt
    fgrep -q '<x:noindex' "$file" || fgrep -q '<x:draft' "$file" || (
        echo "%%%% $file %%%%"
        perl -ne 'print "HEADER=$1\n" if /title="(.*?)"/' "$file"
        perl -ne 'print "HEADER=$2\n" if /<h(\d).*?>(.*?)<\/h\1>/' "$file"
        cat "$file" \
         | pandoc --read html --write plain --wrap none \
         | grep '[a-zA-Z0-9]'
    )
done >all-pages.txt

I’m using regexps instead of XPath or xmllint to extract from my XHTML files, so it misses some things and improperly formats others. That’s ok for now. The main goal is to try out ranking; I can go back and improve the extraction later if I want to pursue this project.

While working on this part, I improved the formatting and made the search results clickable (opening in a new window).

function escapeHTML(text) { // wish this was built in
    const map = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    return text.replace(/[&<>"']/g, m => map[m]);
}
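A quick sanity check of the escaping, with the same replacement table inlined:

```javascript
// Same escaping table as escapeHTML above, just inlined for a standalone check.
const map = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'};
const escapeHTML = (text) => text.replace(/[&<>"']/g, m => map[m]);
console.log(escapeHTML(`<a href="x">'&'</a>`));
// &lt;a href=&quot;x&quot;&gt;&#39;&amp;&#39;&lt;/a&gt;
```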

function parseIntoDocuments(text) {
    // results[0] will be "", then [1] is a url, [2] is text, with url/text alternating
    const results = text.split(/%%%% (.*?) %%%%$/gm); 
    if (results[0] !== "") throw `expected first split to be empty ${results[0]}`;
    if (results[1].length > 100) throw `expected second split to be url ${results[1]}`;
    let documents = [];
    for (let i = 1; i+1 < results.length; i += 2) {
        let url = results[i];
        let lines = results[i+1].split("\n");
        lines = lines.map(line => line.trim()); // remove extra whitespace
        lines = lines.filter(line => line); // remove blank lines
        let title = (lines.find((line) => line.startsWith("HEADER=")) ?? "").slice(7) || url;
        if (url.startsWith("redblobgames/")) {
            url = "https://www.redblobgames.com/" + url.slice(13);
        } else {
            url = "http://www-cs-students.stanford.edu/~amitp/" + url;
        }
        url = url.replace(".bxml", ".html");
        url = url.replace("/index.html", "/");
        documents.push({url, title, lines});
    }
    return documents;
}
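To make the alternating url/text structure of that split concrete, here’s a tiny standalone example of the same split pattern, using made-up file names:

```javascript
// The split pattern from parseIntoDocuments, on made-up data:
// results[0] is "", then url/text pairs alternate.
const sample = "%%%% redblobgames/a.bxml %%%%\nHEADER=Page A\nbody text\n" +
               "%%%% redblobgames/b.bxml %%%%\nbody b\n";
const parts = sample.split(/%%%% (.*?) %%%%$/gm);
console.log(parts[1]); // "redblobgames/a.bxml"
console.log(parts[3]); // "redblobgames/b.bxml"
```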
const words = await (await fetch("./all-pages.txt")).text();
const documents = parseIntoDocuments(words);

function search(query) {
    let results = []; // list of {document, snippet}
    // where document is {url, title, lines}
    // where snippet is a list of {line, matches, isHeader}
    if (query) {
        query = new RegExp(RegExp.escape(query), 'gi');
        for (let document of documents) {
            let snippet = [];
            for (let line of document.lines) {
                let isHeader = false;
                if (line.startsWith("HEADER=")) {
                    line = line.slice(7);
                    isHeader = true;
                }
                let matches = line.matchAll(query).toArray();
                if (matches.length > 0) {
                    snippet.push({line, matches, isHeader});
                }
            }
            if (snippet.length > 0) results.push({document, snippet});
        }
    }
    return results;
}
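One caveat: RegExp.escape only landed in ECMAScript recently, so it may be missing on older runtimes. A small fallback (my own sketch, not part of the code above) escapes the metacharacters by hand:

```javascript
// Fallback for runtimes without RegExp.escape: backslash-escape every
// regexp metacharacter so the query is treated as a literal string.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

console.log(new RegExp(escapeRegExp("c++"), "gi").test("C++ tutorials")); // true
```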

function formatSnippet(snippet) {
    // TODO: be able to sort snippets before picking the top matches
    const MAX_SNIPPETS = 8;
    let result = "… ";
    for (let {line, matches} of snippet.slice(0, MAX_SNIPPETS)) {
        let pos = 0;
        for (let m of matches) {
            result += escapeHTML(line.slice(pos, m.index)) + `<b>${escapeHTML(m[0])}</b>`;
            pos = m.index + m[0].length;
        }
        result += `${escapeHTML(line.slice(pos))} … `;
    }
    return result;
}
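Here’s the highlighting loop by itself, on a single made-up line, with the HTML escaping left out for brevity:

```javascript
// Stripped-down version of the highlighting loop in formatSnippet,
// wrapping each match in <b> tags (escaping omitted here).
const line = "Hexagons and more hexagons";
const matches = [...line.matchAll(/hex/gi)];
let result = "", pos = 0;
for (const m of matches) {
    result += line.slice(pos, m.index) + `<b>${m[0]}</b>`;
    pos = m.index + m[0].length;
}
result += line.slice(pos);
console.log(result); // "<b>Hex</b>agons and more <b>hex</b>agons"
```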

function formatSearchResults(query, results) {
    return results
          .map(({document, snippet}) => 
              `<a target="_blank" href="${document.url}">
                  ${escapeHTML(document.title)}
               </a>
               <div class="url">
                  ${document.url}
               </div>
               <div class="snippet">
                  ${formatSnippet(snippet)}
               </div>`)
          .join("\n");
}

function setUpSearch(id, sortResults) {
    const MAX_DOCUMENTS = 10;
    const el = document.getElementById(id);
    const update = () => {
      let query = el.querySelector("input").value;
      let results = sortResults(search(query));
      results = results.slice(0, MAX_DOCUMENTS);
      el.querySelector(".results").innerHTML = formatSearchResults(query, results);
    };
    el.querySelector("input").addEventListener('input', update);
    update();
}

setUpSearch('search-4', (results) => results);

Type hexagon into this search box:

Ok, let’s try ranking by the number of matches in the document:

setUpSearch('search-5', (results) => {
   function score({document, snippet}) {
     let numberOfMatches = 0;
     for (let {matches} of snippet) {
       numberOfMatches += matches.length;
     }
     return numberOfMatches;
   }

   return results.toSorted((a, b) => score(b) - score(a));
});

This is certainly better. The top 5 results before were four hexagon articles and then one about colors. The top 5 results now all have lots of hexagon content.

Let’s try ranking header matches higher than other matches:

setUpSearch('search-6', (results) => {
   function score({document, snippet}) {
     let numberOfMatches = 0;
     for (let {matches, isHeader} of snippet) {
       numberOfMatches += matches.length;
       if (isHeader) numberOfMatches += 10; // needs tuning!
     }
     return numberOfMatches;
   }

   return results.toSorted((a, b) => score(b) - score(a));
});

I think this is even better than before. But it’s still mediocre. Ranking takes a lot of work. There’s so much more that can be done to make search better. Some ideas:

I’m not planning to implement these right now. I wanted to see how far I could get in a weekend. If you want to play with it some more, I made a standalone version that uses the full page to show search results.

This was a fun project!