Last time we implemented a search feature that found all matching documents, sorted in some arbitrary order. This time I want to sort them in a more reasonable order. Here are some ideas, some of which depend on the query and others which don’t:
- popular pages first
- newer (or older) pages first
- promote (or demote) certain folders manually
- promote (or demote) based on the type of content in the page
- pages that have more occurrences of the query string
- pages where the query matches are spread out (or concentrated)
- pages that have the query string in titles or headings
I’m going to try building this:

To keep the scope small, I’m going to stick to substring match and not other query matching algorithms:
- word match (`graph pathfinding` matches `graph for pathfinding` and `pathfinding graph`)
- ordered word match (`big dog` matches `big red dog` but not `dog was big`)
- regexp match (`r.*dog` matches `red dog` but not `dog was red`)
- wildcard match (`r* dog` matches `red dog` but not `dog was red`)
- plurals / stemming (`hexagon` matches `hexagons` and `hexagonal`)
- stop words (`the algorithm` matches `an algorithm`)
- spelling (`dikjstra` matches `dijkstra`)
- synonym match (`mapgen` matches `map generator`)
- case match (`OPEN set` matches `OPEN set` but not `open set`)
- punctuation match (`can't` matches `can't` but not `cant`)
- parent category match (`#mapgen` matches `#mapgen4`)
Note that sometimes matching should be asymmetric, because the way we write things in queries is not the same way we write things in documents:
- `open set` should match both `OPEN set` and `open set`, but `OPEN set` should match only `OPEN set`
- `cant` should match both `can't` and `cant`, but `can't` should match only `can't`
- `mapgen` should match both `#mapgen` and `#mapgen4`, but `#mapgen4` should match only `#mapgen4`
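I’m not implementing these, but as a sketch of how the case asymmetry could work: vim’s “smartcase” rule is case-sensitive only when the query itself contains an uppercase letter. A minimal sketch, using the same `RegExp.escape` the code below relies on:

```js
// Hypothetical "smartcase" matcher (not what I implemented):
// case-sensitive only if the query contains an uppercase letter,
// so "open set" matches both "OPEN set" and "open set",
// but "OPEN set" matches only "OPEN set".
function makeMatcher(query) {
    const caseSensitive = /[A-Z]/.test(query);
    return new RegExp(RegExp.escape(query), caseSensitive ? 'g' : 'gi');
}
```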
There’s a whole lot to explore with matching algorithms but I want to finish this project in a weekend so I will stick to case-insensitive substring match.
For some experiments I need access to headings, so I added headings to the search data. Some of my headings are `title="heading"` and some are `<h3>heading</h3>`, so I extracted both:
```sh
fd --extension bxml | while read -r file; do
    echo "%%%% $file %%%%"
    perl -ne 'print "HEADER=$1\n" if /title="(.*?)"/' "$file"
    perl -ne 'print "HEADER=$2\n" if /<h(\d).*?>(.*?)<\/h\1>/' "$file"
    cat "$file" \
        | pandoc --read html --write plain --wrap none \
        | grep '[a-zA-Z0-9]'
done >all-pages.txt
```
I’m using regexps instead of xpath or xmllint to extract from my xhtml files, so it misses some things and improperly formats others. That’s ok for now. The main goal is to try out ranking, and then I can go back and improve the extraction if I want to pursue this project.
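For reference, here’s the shape of `all-pages.txt` that the script produces. The filename and headings below are made-up examples, but the `%%%%` separators and `HEADER=` lines are the format the parser relies on:

```
%%%% redblobgames/grids/hexagons/index.bxml %%%%
HEADER=Hexagonal Grids
HEADER=Coordinate Systems
(plain-text body of the page, one block per line)
```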
I then refactored the code into smaller functions to make experimentation easier. I also improved the formatting and made the search results clickable (opening in a new window).
```js
function escapeHTML(text) { // wish this was built in
    const map = {'&': '&amp;', '<': '&lt;', '>': '&gt;',
                 '"': '&quot;', "'": '&#39;'};
    return text.replace(/[&<>"']/g, m => map[m]);
}

function parseIntoDocuments(text) {
    // results[0] will be "", then [1] is a filename, [2] is text, alternating
    const results = text.split(/%%%% (.*?) %%%%$/gm);
    if (results[0] !== "") throw `expected first split to be empty ${results[0]}`;
    if (results[1].length > 100) throw "expected second split to be filename";
    let documents = [];
    for (let i = 1; i+1 < results.length; i += 2) {
        let filename = results[i];
        let lines = results[i+1].split("\n");
        lines = lines.map(line => line.trim()); // remove extra whitespace
        lines = lines.filter(line => line);     // remove blank lines
        let title = (lines.find((line) => line.startsWith("HEADER=")) ?? "").slice(7)
                    || filename;
        let url = filename.startsWith("redblobgames/")
            ? "https://www.redblobgames.com/" + filename.slice(13)
            : "http://www-cs-students.stanford.edu/~amitp/" + filename;
        url = url.replace(".bxml", ".html");
        url = url.replace("/index.html", "/");
        documents.push({url, title, lines});
    }
    return documents;
}

const words = await (await fetch("./all-pages.txt")).text();
const documents = parseIntoDocuments(words);

function search(query) {
    let results = []; // list of {document, snippet}
    // where document is {url, title, lines}
    // where snippet is a list of {line, matches, isHeader}
    if (query) {
        query = new RegExp(RegExp.escape(query), 'gi');
        for (let document of documents) {
            let snippet = [];
            for (let line of document.lines) {
                let isHeader = false;
                if (line.startsWith("HEADER=")) {
                    line = line.slice(7);
                    isHeader = true;
                }
                let matches = line.matchAll(query).toArray();
                if (matches.length > 0) {
                    snippet.push({line, matches, isHeader});
                }
            }
            if (snippet.length > 0) results.push({document, snippet});
        }
    }
    return results;
}

function formatSnippet(snippet) {
    // TODO: be able to sort snippets before picking the top matches
    const MAX_SNIPPETS = 8;
    let result = "… ";
    for (let {line, matches} of snippet.slice(0, MAX_SNIPPETS)) {
        let pos = 0;
        for (let m of matches) {
            // bold each match; escape the matched text too
            result += escapeHTML(line.slice(pos, m.index))
                    + `<b>${escapeHTML(m[0])}</b>`;
            pos = m.index + m[0].length;
        }
        result += `${escapeHTML(line.slice(pos))} … `;
    }
    return result;
}

function formatSearchResults(query, results) {
    return results
        .map(({document, snippet}) =>
            `<a target="_blank" href="${document.url}">
               ${document.title}
             </a>
             <div class="url"> ${document.url} </div>
             <div class="snippet"> ${formatSnippet(snippet)} </div>`)
        .join("\n");
}

function setUpSearch(id, sortResults) {
    // wire an <input> up to live-update the .results element
    const MAX_DOCUMENTS = 10;
    const el = document.getElementById(id);
    const update = () => {
        let query = el.querySelector("input").value;
        let results = sortResults(search(query));
        results = results.slice(0, MAX_DOCUMENTS);
        el.querySelector(".results").innerHTML = formatSearchResults(query, results);
    };
    el.querySelector("input").addEventListener('input', update);
    update();
}

setUpSearch('search-4', (results) => results);
```
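A note on compatibility: this code uses `RegExp.escape` and the iterator `toArray()` method, both recent additions to JavaScript. On older browsers, `[...line.matchAll(query)]` works in place of `.toArray()`, and a rough fallback for `RegExp.escape` might be:

```js
// Rough fallback, only if RegExp.escape is missing; the built-in
// version handles more edge cases than this character class does.
if (!RegExp.escape) {
    RegExp.escape = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
```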
I wanted to test the refactored code before implementing ranking. This shows documents in some arbitrary order, like before:
Let’s try ranking by the number of matches in the document:
```js
setUpSearch('search-5', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches} of snippet) {
            numberOfMatches += matches.length;
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
This is certainly better. The top 5 results before were four hexagon articles and then one about colors. The top 5 results now all have lots of hexagon content.
Let’s try ranking header matches higher than other matches:
```js
setUpSearch('search-6', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches, isHeader} of snippet) {
            numberOfMatches += matches.length;
            if (isHeader) numberOfMatches += 10; // needs tuning!
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
I think this is even better than before. But it’s still mediocre. Should I prioritize matches that occur in a cluster (max matches per block) or matches that are spread out (number of matching blocks)? I don’t know! It will take a lot of testing; there’s a sketch of both signals after the list below. Ranking takes a lot of work. There’s so much more that can be done to make search better. Some ideas:
- try out the other ranking ideas I listed at the top of the page
- extract documents correctly; there are going to be lots of little details to get right (data scientists know that “data cleaning” is one of the most time consuming parts of a project)
- identify pages that are just linking to other pages (for example, my blog home page), and prefer returning search results from the linked-to pages
- identify pages that are older versions of other pages, and prefer returning the current pages
- trim snippets at word boundaries centered at the query matches instead of picking the beginning of the line (see the sketch after this list)
- split documents into blocks better (for example, <dt> and <dd> should go together, not separate blocks), or maybe get rid of the block idea entirely
- try the query matching rules I mentioned earlier, to handle words, phrases, ordering, synonyms, plurals, etc.
- set up a test suite of queries and hand-picked best results, so that I can evaluate the different ranking algorithms and tuning parameters
- change ranking philosophy from the set of best pages to the best set of pages
- use word2vec or a vector db (from the LLM world) to implement fuzzy matching or group related pages together
- generate a good snippet instead of the first N matches
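Here’s the clustered-versus-spread sketch I promised above. Both signals fall out of the snippet structure that `search()` already builds, so a test suite could compare them directly:

```js
// Two competing signals, computed from the same snippet list:
function clusterScore(snippet) {
    // most matches in any single block: favors concentrated matches
    return Math.max(...snippet.map(({matches}) => matches.length));
}
function spreadScore(snippet) {
    // number of blocks with at least one match: favors spread-out matches
    return snippet.length;
}
```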
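And the snippet-trimming sketch: center a window on a match, then widen it to whitespace so words don’t get cut in half. The 40-character radius is an arbitrary number I haven’t tuned:

```js
// Sketch: trim a line to a window around one regexp match object.
function trimAroundMatch(line, match, radius = 40) {
    let start = Math.max(0, match.index - radius);
    let end = Math.min(line.length, match.index + match[0].length + radius);
    // widen to whitespace so we don't cut words in half
    while (start > 0 && /\S/.test(line[start - 1])) start--;
    while (end < line.length && /\S/.test(line[end])) end++;
    return (start > 0 ? "… " : "") + line.slice(start, end)
         + (end < line.length ? " …" : "");
}
```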
Tangent: I think the description of search results is underrated. Early web search engines used a static description provided by the author of the page (“meta tag” description) or by a curator (Yahoo / DMoz directories), or generated a description per page. Google generated a description by scanning the document to find the query words, so that you could see the keywords in context of the document. Being able to see the matches of your query instead of a generic description made people feel that Google was actually paying attention to the query words. This is a surprisingly hard problem that I don’t see much written about. (For many queries, I think this was more important than PageRank™.)
Another question is how to scale this up. I decided not to here. All the words on all my web pages put together are only a few megabytes, smaller than the javascript alone on the NYTimes home page. I can load the entire web site and search over it. But if I had many more pages, I might want an inverted index[1], which can tell us which documents contain a word. See Let’s Build a Full Text Search Engine[2]. If I wanted to index pages written by other people, I might want ranking functions that were more resistant to spamming, such as using links between documents and the anchor text used to describe those links.
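As a sketch of what that looks like: an inverted index is a map from each word to the set of documents containing it (real ones also store positions and frequencies). Something like:

```js
// Minimal inverted index sketch: lowercase word -> set of document indices.
// Not used in this project; just illustrating the idea.
function buildInvertedIndex(documents) {
    const index = new Map();
    documents.forEach((doc, docId) => {
        for (const line of doc.lines) {
            for (const word of line.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
                if (!index.has(word)) index.set(word, new Set());
                index.get(word).add(docId);
            }
        }
    });
    return index;
}

// Usage: index.get("hexagon") returns the ids of candidate documents,
// so a query only scans those instead of every document.
```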
I’m not planning to implement these right now. I wanted to see how far I could get in a weekend. The search results I’m getting here are decent. If you want to play with it some more, I made a standalone version that uses the full page to show search results.
This was a fun project!