Last time we implemented a search feature that found all matching documents, sorted alphabetically by URL. This time I want to sort them in a more reasonable order. Here are some ideas, some of which depend on the query and others which don’t:
- popular pages first
- newer (or older) pages first
- promote (or demote) certain folders manually
- promote (or demote) based on the type of content in the page
- pages that have more occurrences of the query string
- pages where the query matches are spread out (or concentrated)
- pages that have the query string in titles or headings

To keep the scope small, I’m going to stick to substring match rather than implement other query matching algorithms:
- word match ("graph pathfinding" matches "graph for pathfinding" and "pathfinding graph")
- ordered word match ("big dog" matches "big red dog" but not "dog was big")
- regexp match ("r.*dog" matches "red dog" but not "dog was red")
- wildcard match ("r* dog" matches "red dog" but not "dog was red")
- plurals / stemming ("hexagon" matches "hexagonal")
- stop words ("the algorithm" matches "an algorithm")
- spelling ("dikjstra" matches "dijkstra")
- synonym match ("mapgen" matches "map generator")
- case match ("OPEN set" matches "OPEN set" but not "open set")
- punctuation match ("can't" matches "can't" but not "cant")
- parent category match ("#mapgen" matches "#mapgen4")
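For a taste of what I’m skipping, here is what the first one, word match, might look like: split the query into words and require each word to appear somewhere in the text, in any order. This sketch is hypothetical, not part of the search code below:

```js
// hypothetical word match: every query word must appear somewhere, in any order
function wordMatch(query, text) {
    const haystack = text.toLowerCase();
    return query.toLowerCase().split(/\s+/).every(word => haystack.includes(word));
}

wordMatch("graph pathfinding", "graph for pathfinding"); // true
wordMatch("graph pathfinding", "pathfinding graph");     // true
wordMatch("big dog", "dog was big");                     // true (word match ignores order)
```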
Note that sometimes matching should be asymmetric, because the way we write things in queries is not the same way we write things in documents:
- "open set" should match both "OPEN set" and "open set", but "OPEN set" should match only "OPEN set"
- "cant" should match both "can't" and "cant", but "can't" should match only "can't"
- "mapgen" should match both "#mapgen" and "#mapgen4", but "#mapgen4" should match only "#mapgen4"
There’s a whole lot to explore with matching algorithms, but I want to finish this project in a weekend, so I will stick to case-insensitive substring match.
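If I did want the case asymmetry above, something like vim’s “smartcase” rule would fit: a lowercase query matches case-insensitively, but a query containing any uppercase letter must match exactly. A minimal sketch (the function is hypothetical, not part of my code):

```js
// "smartcase": lowercase queries are case-insensitive; queries with
// any uppercase letter must match case exactly
function smartcaseMatch(query, text) {
    const caseSensitive = /[A-Z]/.test(query);
    return caseSensitive
        ? text.includes(query)
        : text.toLowerCase().includes(query.toLowerCase());
}

smartcaseMatch("open set", "OPEN set"); // true
smartcaseMatch("OPEN set", "open set"); // false
```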
Another topic to explore is how to construct the description of each search result. Early web search engines used a static description, either provided by the author of the page (the “meta tag” description), written by a curator (the Yahoo / DMoz directories), or generated automatically per page. Google instead generated the description per query, scanning the document for the query words so that you could see the keywords in the context of the document.
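That query-based approach is essentially “keyword in context”: find the query in the document and show a window of text around it. A minimal sketch, assuming lowercasing doesn’t change string lengths (the function and its radius parameter are hypothetical):

```js
// keyword in context: a window of text around the first match, or null if no match
function keywordInContext(text, query, radius = 30) {
    const i = text.toLowerCase().indexOf(query.toLowerCase());
    if (i < 0) return null;
    return "…" + text.slice(Math.max(0, i - radius), i + query.length + radius) + "…";
}
```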
Another topic is how to scale this up. All the words on all my web pages put together are only a few megabytes, smaller than the JavaScript alone on the NYTimes home page, so I can load the entire site into the browser and search over it.
Let’s start by refactoring the code into smaller functions so that we can mix and match for each of our experiments. I also needed to extract the headings from the documents, so I first changed the file format to include the headings as separate lines:
```sh
fd --extension bxml | while read -r file; do
    # -q keeps the matched noindex/draft lines out of all-pages.txt
    fgrep -q '<x:noindex' "$file" || fgrep -q '<x:draft' "$file" || (
        echo "%%%% $file %%%%"
        perl -ne 'print "HEADER=$1\n" if /title="(.*)?"/' "$file"
        perl -ne 'print "HEADER=$2\n" if /<h(\d).*?>(.*?)<\/h\1>/' "$file"
        cat "$file" \
            | pandoc --read html --write plain --wrap none \
            | grep '[a-zA-Z0-9]'
    )
done >all-pages.txt
```
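For reference, here’s the shape of the output, with a marker line for each document followed by its headers and its plain text (the content shown here is made up):

```
%%%% redblobgames/grids/hexagons/index.bxml %%%%
HEADER=Hexagonal Grids
HEADER=Coordinate Systems
Hexagons are 6-sided polygons. …
```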
I’m using regexps instead of XPath or xmllint to extract from my XHTML files, so it misses some things and improperly formats others. That’s ok for now. The main goal is to try out ranking, and then I can go back and improve the extraction if I want to pursue this project.
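For comparison, an XPath-based version of the heading extraction might look something like this untested sketch, where local-name() is one way to sidestep the XHTML namespace issue; I didn’t go down that path:

```sh
# untested sketch: extract <h2> text regardless of namespace
xmllint --xpath '//*[local-name()="h2"]//text()' "$file"
```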
While working on this part, I improved the formatting and made the search results clickable (opening in a new window).
```js
function escapeHTML(text) { // wish this was built in
    const map = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'};
    return text.replace(/[&<>"']/g, m => map[m]);
}

function parseIntoDocuments(text) {
    // results[0] will be "", then [1] is a url, [2] is text, with url/text alternating
    const results = text.split(/%%%% (.*?) %%%%$/gm);
    if (results[0] !== "") throw `expected first split to be empty ${results[0]}`;
    if (results[1].length > 100) throw `expected second split to be url`;
    let documents = [];
    for (let i = 1; i + 1 < results.length; i += 2) {
        let url = results[i];
        let lines = results[i+1].split("\n");
        lines = lines.map(line => line.trim()); // remove extra whitespace
        lines = lines.filter(line => line); // remove blank lines
        let title = (lines.find((line) => line.startsWith("HEADER=")) ?? "").slice(7) || url;
        if (url.startsWith("redblobgames/")) {
            url = "https://www.redblobgames.com/" + url.slice(13);
        } else {
            url = "http://www-cs-students.stanford.edu/~amitp/" + url;
        }
        url = url.replace(".bxml", ".html");
        url = url.replace("/index.html", "/");
        documents.push({url, title, lines});
    }
    return documents;
}

const words = await (await fetch("./all-pages.txt")).text();
const documents = parseIntoDocuments(words);

function search(query) {
    let results = []; // list of {document, snippet}
                      // where document is {url, title, lines}
                      // and snippet is a list of {line, matches, isHeader}
    if (query) {
        // RegExp.escape and .toArray() are recent JS additions; needs an up-to-date browser
        query = new RegExp(RegExp.escape(query), 'gi');
        for (let document of documents) {
            let snippet = [];
            for (let line of document.lines) {
                let isHeader = false;
                if (line.startsWith("HEADER=")) {
                    line = line.slice(7);
                    isHeader = true;
                }
                let matches = line.matchAll(query).toArray();
                if (matches.length > 0) {
                    snippet.push({line, matches, isHeader});
                }
            }
            if (snippet.length > 0) results.push({document, snippet});
        }
    }
    return results;
}

function formatSnippet(snippet) {
    // TODO: be able to sort snippets before picking the top matches
    const MAX_SNIPPETS = 8;
    let result = "… ";
    for (let {line, matches} of snippet.slice(0, MAX_SNIPPETS)) {
        let pos = 0;
        for (let m of matches) {
            result += escapeHTML(line.slice(pos, m.index)) + `<b>${escapeHTML(m[0])}</b>`;
            pos = m.index + m[0].length;
        }
        result += `${escapeHTML(line.slice(pos))} … `;
    }
    return result;
}

function formatSearchResults(query, results) {
    return results
        .map(({document, snippet}) =>
            `<a target="_blank" href="${document.url}">${document.title}</a>
             <div class="url">${document.url}</div>
             <div class="snippet">${formatSnippet(snippet)}</div>`)
        .join("\n");
}

function setUpSearch(id, sortResults) {
    const MAX_DOCUMENTS = 10;
    const el = document.getElementById(id);
    const update = () => {
        let query = el.querySelector("input").value;
        let results = sortResults(search(query));
        results = results.slice(0, MAX_DOCUMENTS);
        el.querySelector(".results").innerHTML = formatSearchResults(query, results);
    };
    el.querySelector("input").addEventListener('input', update);
    update();
}

setUpSearch('search-4', (results) => results);
```
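As a sanity check, here’s what parseIntoDocuments produces for a tiny made-up input:

```js
const sample = "%%%% redblobgames/grids/hexagons/index.bxml %%%%\n"
             + "HEADER=Hexagonal Grids\n"
             + "Hexagons are 6-sided polygons.\n";
const docs = parseIntoDocuments(sample);
console.log(docs[0].url);   // https://www.redblobgames.com/grids/hexagons/
console.log(docs[0].title); // Hexagonal Grids
```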
Type hexagon into this search box:
Ok, let’s try ranking by the number of matches in the document:
```js
setUpSearch('search-5', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches} of snippet) {
            numberOfMatches += matches.length;
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
This is certainly better. The top 5 results before were four hexagon articles and then one about colors. The top 5 results now all have lots of hexagon content.
Let’s try ranking header matches higher than other matches:
```js
setUpSearch('search-6', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches, isHeader} of snippet) {
            numberOfMatches += matches.length;
            if (isHeader) numberOfMatches += 10; // needs tuning!
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
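That 10 is a magic number. If I were going to tune it, I might pull the weights out into parameters, something like this hypothetical variant of the score function:

```js
// hypothetical: a scorer factory with tunable weights
function makeScorer({matchWeight = 1, headerBonus = 10} = {}) {
    return ({document, snippet}) => {
        let score = 0;
        for (let {matches, isHeader} of snippet) {
            score += matchWeight * matches.length;
            if (isHeader) score += headerBonus;
        }
        return score;
    };
}

// usage: const score = makeScorer({headerBonus: 20});
//        results.toSorted((a, b) => score(b) - score(a));
```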
I think this is even better than before. But it’s still mediocre. Ranking takes a lot of work. There’s so much more that can be done to make search better. Some ideas:
- try out the other ranking ideas I listed at the top of the page
- extract documents correctly; there are going to be lots of little details to get right (data scientists know that “data cleaning” is one of the most time consuming parts of a project)
- split documents into blocks better (for example, <dt> and <dd> should go together, not separate blocks), or maybe get rid of the block idea entirely
- clean up html that snuck through (for example, I see <svg> and <a> links in the output, even though it’s supposed to be plain text)
- try the query matching rules I mentioned earlier, to handle words, phrases, ordering, synonyms, plurals, etc.
- set up a test suite of queries and hand-picked best results, so that I can evaluate the different ranking algorithms and tuning parameters
- use word2vec or a vector db (from the LLM world) to implement fuzzy matching or group related pages together
- show the best snippet of text instead of the first (this may turn out to be as tricky as ranking documents)
- optimize using an inverted index[1], which can tell us which documents contain a word; I have so few documents that I didn’t need this (see Let’s Build a Full Text Search Engine[2])
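On that last point, here’s a minimal sketch of what an inverted index over these documents could look like. It indexes whole words, so it wouldn’t directly support my substring matching; the function is illustrative, not something I wrote:

```js
// map each word to the set of document indices that contain it
function buildInvertedIndex(documents) {
    const index = new Map();
    documents.forEach((doc, i) => {
        for (const line of doc.lines) {
            for (const word of line.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
                if (!index.has(word)) index.set(word, new Set());
                index.get(word).add(i);
            }
        }
    });
    return index;
}

// candidate documents for a one-word query, without scanning everything:
// const candidates = buildInvertedIndex(documents).get("hexagon") ?? new Set();
```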
I’m not planning to implement these right now. I wanted to see how far I could get in a weekend. If you want to play with it some more, I made a standalone version that uses the full page to show search results.
This was a fun project!