Last time we implemented a search feature that found all matching documents, sorted alphabetically by URL. This time I want to sort them in a more reasonable order. Here are some ideas, some of which depend on the query and others which don’t:
- popular pages first
- newer (or older) pages first
- promote (or demote) certain folders manually
- promote (or demote) based on the type of content in the page
- pages that have more occurrences of the query string
- pages where the query matches are spread out (or concentrated)
- pages that have the query string in titles or headings

To keep the scope small, I’m going to stick to substring match rather than implement other query matching algorithms:
- word match ("graph pathfinding" matches "graph for pathfinding" and "pathfinding graph")
- ordered word match ("big dog" matches "big red dog" but not "dog was big")
- regexp match ("r.*dog" matches "red dog" but not "dog was red")
- wildcard match ("r* dog" matches "red dog" but not "dog was red")
- plurals / stemming ("hexagon" matches "hexagonal")
- stop words ("the algorithm" matches "an algorithm")
- spelling ("dikjstra" matches "dijkstra")
- synonym match ("mapgen" matches "map generator")
- case match ("OPEN set" matches "OPEN set" but not "open set")
- punctuation match ("can't" matches "can't" but not "cant")
- parent category match ("#mapgen" matches "#mapgen4")
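For a taste of what I’m skipping, here is what the first one, word match, might look like: split the query into words and require each word to appear somewhere in the text, in any order. This sketch is hypothetical, not part of the search code below:

```js
// hypothetical word match: every query word must appear somewhere, in any order
function wordMatch(query, text) {
    const haystack = text.toLowerCase();
    return query.toLowerCase().split(/\s+/).every(word => haystack.includes(word));
}

wordMatch("graph pathfinding", "graph for pathfinding"); // true
wordMatch("graph pathfinding", "pathfinding graph");     // true
wordMatch("big dog", "dog was big");                     // true (word match ignores order)
```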
Note that sometimes matching should be asymmetric, because the way we write things in queries is not the same way we write things in documents:
- "open set" should match both "OPEN set" and "open set", but "OPEN set" should match only "OPEN set"
- "cant" should match both "can't" and "cant", but "can't" should match only "can't"
- "mapgen" should match both "#mapgen" and "#mapgen4", but "#mapgen4" should match only "#mapgen4"
There’s a whole lot to explore with matching algorithms, but I want to finish this project in a weekend, so I will stick to case-insensitive substring match.
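If I did want the case asymmetry above, something like vim’s “smartcase” rule would fit: a lowercase query matches case-insensitively, but a query containing any uppercase letter must match exactly. A minimal sketch (the function is hypothetical, not part of my code):

```js
// "smartcase": lowercase queries are case-insensitive; queries with
// any uppercase letter must match case exactly
function smartcaseMatch(query, text) {
    const caseSensitive = /[A-Z]/.test(query);
    return caseSensitive
        ? text.includes(query)
        : text.toLowerCase().includes(query.toLowerCase());
}

smartcaseMatch("open set", "OPEN set"); // true
smartcaseMatch("OPEN set", "open set"); // false
```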
Another topic to explore is how to construct the description of each search result. Early web search engines used a static description, either provided by the author of the page (the “meta tag” description), written by a curator (the Yahoo / DMoz directories), or generated automatically per page. Google instead generated the description per query, scanning the document for the query words so that you could see the keywords in the context of the document.
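That query-based approach is essentially “keyword in context”: find the query in the document and show a window of text around it. A minimal sketch, assuming lowercasing doesn’t change string lengths (the function and its radius parameter are hypothetical):

```js
// keyword in context: a window of text around the first match, or null if no match
function keywordInContext(text, query, radius = 30) {
    const i = text.toLowerCase().indexOf(query.toLowerCase());
    if (i < 0) return null;
    return "…" + text.slice(Math.max(0, i - radius), i + query.length + radius) + "…";
}
```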
Another topic is how to scale this up. All the words on all my web pages put together are only a few megabytes, smaller than the JavaScript alone on the NYTimes home page, so I can load the entire site into the browser and search over it.
Let’s start by refactoring the code into smaller functions so that we can mix and match for each of our experiments. I also needed to extract the headings from the documents, so I first changed the file format to include the headings as separate lines:
```sh
fd --extension bxml | while read -r file; do
    # -q keeps the matched noindex/draft lines out of all-pages.txt
    fgrep -q '<x:noindex' "$file" || fgrep -q '<x:draft' "$file" || (
        echo "%%%% $file %%%%"
        perl -ne 'print "HEADER=$1\n" if /title="(.*)?"/' "$file"
        perl -ne 'print "HEADER=$2\n" if /<h(\d).*?>(.*?)<\/h\1>/' "$file"
        cat "$file" \
            | pandoc --read html --write plain --wrap none \
            | grep '[a-zA-Z0-9]'
    )
done >all-pages.txt
```
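For reference, here’s the shape of the output, with a marker line for each document followed by its headers and its plain text (the content shown here is made up):

```
%%%% redblobgames/grids/hexagons/index.bxml %%%%
HEADER=Hexagonal Grids
HEADER=Coordinate Systems
Hexagons are 6-sided polygons. …
```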
I’m using regexps instead of XPath or xmllint to extract from my XHTML files, so it misses some things and improperly formats others. That’s ok for now. The main goal is to try out ranking, and then I can go back and improve the extraction if I want to pursue this project.
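For comparison, an XPath-based version of the heading extraction might look something like this untested sketch, where local-name() is one way to sidestep the XHTML namespace issue; I didn’t go down that path:

```sh
# untested sketch: extract <h2> text regardless of namespace
xmllint --xpath '//*[local-name()="h2"]//text()' "$file"
```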
While working on this part, I improved the formatting and made the search results clickable (opening in a new window).
```js
function escapeHTML(text) { // wish this was built in
    const map = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'};
    return text.replace(/[&<>"']/g, m => map[m]);
}

function parseIntoDocuments(text) {
    // results[0] will be "", then [1] is a url, [2] is text, with url/text alternating
    const results = text.split(/%%%% (.*?) %%%%$/gm);
    if (results[0] !== "") throw `expected first split to be empty ${results[0]}`;
    if (results[1].length > 100) throw `expected second split to be url`;
    let documents = [];
    for (let i = 1; i + 1 < results.length; i += 2) {
        let url = results[i];
        let lines = results[i+1].split("\n");
        lines = lines.map(line => line.trim()); // remove extra whitespace
        lines = lines.filter(line => line); // remove blank lines
        let title = (lines.find((line) => line.startsWith("HEADER=")) ?? "").slice(7) || url;
        if (url.startsWith("redblobgames/")) {
            url = "https://www.redblobgames.com/" + url.slice(13);
        } else {
            url = "http://www-cs-students.stanford.edu/~amitp/" + url;
        }
        url = url.replace(".bxml", ".html");
        url = url.replace("/index.html", "/");
        documents.push({url, title, lines});
    }
    return documents;
}

const words = await (await fetch("./all-pages.txt")).text();
const documents = parseIntoDocuments(words);

function search(query) {
    let results = []; // list of {document, snippet}
                      // where document is {url, title, lines}
                      // and snippet is a list of {line, matches, isHeader}
    if (query) {
        // RegExp.escape and .toArray() are recent JS additions; needs an up-to-date browser
        query = new RegExp(RegExp.escape(query), 'gi');
        for (let document of documents) {
            let snippet = [];
            for (let line of document.lines) {
                let isHeader = false;
                if (line.startsWith("HEADER=")) {
                    line = line.slice(7);
                    isHeader = true;
                }
                let matches = line.matchAll(query).toArray();
                if (matches.length > 0) {
                    snippet.push({line, matches, isHeader});
                }
            }
            if (snippet.length > 0) results.push({document, snippet});
        }
    }
    return results;
}

function formatSnippet(snippet) {
    // TODO: be able to sort snippets before picking the top matches
    const MAX_SNIPPETS = 8;
    let result = "… ";
    for (let {line, matches} of snippet.slice(0, MAX_SNIPPETS)) {
        let pos = 0;
        for (let m of matches) {
            result += escapeHTML(line.slice(pos, m.index)) + `<b>${escapeHTML(m[0])}</b>`;
            pos = m.index + m[0].length;
        }
        result += `${escapeHTML(line.slice(pos))} … `;
    }
    return result;
}

function formatSearchResults(query, results) {
    return results
        .map(({document, snippet}) =>
            `<a target="_blank" href="${document.url}">${document.title}</a>
             <div class="url">${document.url}</div>
             <div class="snippet">${formatSnippet(snippet)}</div>`)
        .join("\n");
}

function setUpSearch(id, sortResults) {
    const MAX_DOCUMENTS = 10;
    const el = document.getElementById(id);
    const update = () => {
        let query = el.querySelector("input").value;
        let results = sortResults(search(query));
        results = results.slice(0, MAX_DOCUMENTS);
        el.querySelector(".results").innerHTML = formatSearchResults(query, results);
    };
    el.querySelector("input").addEventListener('input', update);
    update();
}

setUpSearch('search-4', (results) => results);
```
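As a sanity check, here’s what parseIntoDocuments produces for a tiny made-up input:

```js
const sample = "%%%% redblobgames/grids/hexagons/index.bxml %%%%\n"
             + "HEADER=Hexagonal Grids\n"
             + "Hexagons are 6-sided polygons.\n";
const docs = parseIntoDocuments(sample);
console.log(docs[0].url);   // https://www.redblobgames.com/grids/hexagons/
console.log(docs[0].title); // Hexagonal Grids
```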
Type hexagon into this search box:
Ok, let’s try ranking by the number of matches in the document:
```js
setUpSearch('search-5', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches} of snippet) {
            numberOfMatches += matches.length;
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
This is certainly better. The top 5 results before were four hexagon articles and then one about colors. The top 5 results now all have lots of hexagon content.
Let’s try ranking header matches higher than other matches:
```js
setUpSearch('search-6', (results) => {
    function score({document, snippet}) {
        let numberOfMatches = 0;
        for (let {matches, isHeader} of snippet) {
            numberOfMatches += matches.length;
            if (isHeader) numberOfMatches += 10; // needs tuning!
        }
        return numberOfMatches;
    }
    return results.toSorted((a, b) => score(b) - score(a));
});
```
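That 10 is a magic number. If I were going to tune it, I might pull the weights out into parameters, something like this hypothetical variant of the score function:

```js
// hypothetical: a scorer factory with tunable weights
function makeScorer({matchWeight = 1, headerBonus = 10} = {}) {
    return ({document, snippet}) => {
        let score = 0;
        for (let {matches, isHeader} of snippet) {
            score += matchWeight * matches.length;
            if (isHeader) score += headerBonus;
        }
        return score;
    };
}

// usage: const score = makeScorer({headerBonus: 20});
//        results.toSorted((a, b) => score(b) - score(a));
```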
I think this is even better than before. But it’s still mediocre. Ranking takes a lot of work. There’s so much more that can be done to make search better. Some ideas:
- try out the other ranking ideas I listed at the top of the page
- extract documents correctly; there are going to be lots of little details to get right (data scientists know that “data cleaning” is one of the most time consuming parts of a project)
- split documents into blocks better (for example, <dt> and <dd> should go together, not separate blocks), or maybe get rid of the block idea entirely
- clean up html that snuck through (for example, I see <svg> and <a> links in the output, even though it’s supposed to be plain text)
- try the query matching rules I mentioned earlier, to handle words, phrases, ordering, synonyms, plurals, etc.
- set up a test suite of queries and hand-picked best results, so that I can evaluate the different ranking algorithms and tuning parameters
- use word2vec or a vector db (from the LLM world) to implement fuzzy matching or group related pages together
- show the best snippet of text instead of the first (this may turn out to be as tricky as ranking documents)
- optimize using an inverted index[1], which can tell us which documents contain a word; I have so few documents that I didn’t need this (see Let’s Build a Full Text Search Engine[2])
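On that last point, here’s a minimal sketch of what an inverted index over these documents could look like. It indexes whole words, so it wouldn’t directly support my substring matching; the function is illustrative, not something I wrote:

```js
// map each word to the set of document indices that contain it
function buildInvertedIndex(documents) {
    const index = new Map();
    documents.forEach((doc, i) => {
        for (const line of doc.lines) {
            for (const word of line.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
                if (!index.has(word)) index.set(word, new Set());
                index.get(word).add(i);
            }
        }
    });
    return index;
}

// candidate documents for a one-word query, without scanning everything:
// const candidates = buildInvertedIndex(documents).get("hexagon") ?? new Set();
```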
I’m not planning to implement these right now. I wanted to see how far I could get in a weekend. If you want to play with it some more, I made a standalone version that uses the full page to show search results.
This was a fun project!