Jeff Hube

Implementing Search on a Static Site

I wanted to add search to this site, but since it is a static site hosted on GitHub Pages, the functionality must be implemented on the client. This is where Lunr comes in.

Lunr is a full-text search library written in JavaScript, so it works in the browser and on the server. You can build an index of your content offline, export the index as JSON, and then upload it along with the rest of your content to wherever your site is hosted. When someone searches your site, the searching is performed in their browser using Lunr and the prebuilt index.

I'll walk through most of how search is implemented on this site, and you can see the finished result here.

Building the index

The first step is to build the search index. The details will depend on which static site generator you use, but for Metalsmith, you might end up with something like this.

First I find the files that I want to index and grab their url, title, and body.

var docs = Object.values(files)
        .filter((file) => file.lunr)
        .map((file) => ({
            url: file.path,
            title: file.title,
            body: extractText(file.contents.toString())
        }));
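
The snippet above relies on each file object having a path property, which may come from a plugin. If your setup does not provide one, the keys of Metalsmith's files object (paths relative to the source directory) can serve instead. The following is only a sketch under that assumption: collectDocs and toText are hypothetical names, and toText stands in for whatever text-extraction step you use.

```javascript
// Sketch: build the document list from Metalsmith's `files` object, using each
// key (the file's path relative to the source directory) as the URL.
// `collectDocs` and `toText` are hypothetical names used for illustration.
function collectDocs(files, toText) {
    return Object.keys(files)
        // Only index files flagged with `lunr`, as in the snippet above.
        .filter(function (key) { return files[key].lunr; })
        .map(function (key) {
            var file = files[key];
            return {
                // Normalize Windows separators and make the URL root-relative.
                url: '/' + key.replace(/\\/g, '/'),
                title: file.title,
                body: toText(file.contents.toString())
            };
        });
}
```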

I index the files after they have been converted from Markdown to HTML, but before they are inserted into the site template, so I use JSDOM to extract the text content from the HTML.

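var { JSDOM } = require('jsdom');

// Returns the text content of an HTML string with runs of whitespace collapsed.
function extractText(html) {
    // Recursively concatenate the contents of all text nodes under `node`.
    function helper(node) {
        if (node.nodeType === 3 /* TEXT_NODE */) {
            return node.textContent;
        }
        return Array.from(node.childNodes).map(helper).join('');
    }
    return helper(new JSDOM(html).window.document).replace(/[\n\r ]+/g, ' ');
}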

Then I create a search index over the title and body of each file. Each document must be given a unique identifier, which in this case is just its position in the docs array. When Lunr returns search results it includes the identifier, which is then used to locate the original file. I also whitelist position so that Lunr returns the positions of matches, which I use to display snippets containing the search term. Note that storing positions increases the size of the index.

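var lunr = require('lunr');

var index = lunr(function() {
    this.field('title');
    this.field('body');
    // Store match positions so the search page can build snippets.
    this.metadataWhitelist = ['position'];

    // Use each document's position in the docs array as its identifier.
    docs.forEach((doc, i) => {
        this.add({ id: i, title: doc.title, body: doc.body });
    });
});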

Lastly I export the index to JSON and write it to a file.

var indexJson = JSON.stringify(index.toJSON());
files['search/index.json'] = {
    contents: Buffer.from(indexJson)
};

The code above differs slightly from what I actually use. Alongside the index I include the text content of all indexed documents, which is used to build the contextual snippets mentioned earlier. I also write the index to a .js file so I can load it with <script src="...">.
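
A minimal sketch of that variation, not the exact code this site uses. It assumes a hypothetical global named searchData and a hypothetical output path search/index.js; buildSearchScript is likewise a made-up helper name, and serializedIndex stands in for the result of index.toJSON().

```javascript
// Hypothetical helper: bundle the serialized Lunr index with the document
// data needed to render results, as a script that defines one global.
function buildSearchScript(serializedIndex, docs) {
    var payload = {
        index: serializedIndex,
        // Keep only what the search page needs to render a result.
        docs: docs.map(function (doc) {
            return { url: doc.url, title: doc.title, body: doc.body };
        })
    };
    // JSON.stringify output can be used as a JavaScript expression in modern
    // engines, so it is assigned directly to the (assumed) global.
    return 'window.searchData = ' + JSON.stringify(payload) + ';';
}

// Usage inside the build step (path and names are assumptions):
// files['search/index.js'] = {
//     contents: Buffer.from(buildSearchScript(index.toJSON(), docs))
// };
```

The search page could then load the file with a script tag and pass window.searchData.index to lunr.Index.load.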

Next we need to add a search box. The search box simply redirects to the search page, supplying the search terms in the query string.

<form class="search-form" onsubmit="return doSearch()">
    <input id="search-input" class="search-input" type="text" placeholder="Search" />
</form>

function doSearch(evt) {
    var query = document.getElementById("search-input").value.trim();
    if (query) {
        window.location = "/search?q=" + encodeURIComponent(query);
    }
    return false;
}

The search page

Last is the search page itself. The search page grabs the search terms from the query string and then searches the index.

var term = decodeURIComponent((/[?&]q=([^&]*)/.exec(window.location.search) || [])[1] || "");
// ...
var index = lunr.Index.load(json);
var results = index.search(term);
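
As an alternative to the regular expression, the query string can be read with the standard URLSearchParams API, where the browser supports it. This is a sketch rather than what this site uses, and getQuery is a hypothetical helper name.

```javascript
// Alternative sketch: read the "q" parameter with URLSearchParams.
// `getQuery` is a hypothetical helper name.
function getQuery(search) {
    var params = new URLSearchParams(search);
    // get() returns null when the parameter is absent, so fall back to "".
    return (params.get('q') || '').trim();
}

// In the browser: var term = getQuery(window.location.search);
```

One difference from the regular expression: URLSearchParams also decodes a + in the query string as a space.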

Lunr returns an array like the one below, with one element for each matching document in the index. ref is the document's unique identifier, which in this case is just an array index (note that Lunr returns it as a string). score indicates how well the document matched the search term. matchData.metadata contains an entry for each term in the query, here 'Salesforce', which Lunr's stemmer reduces to "salesforc" in the index. Each entry lists the fields that matched (title and body) and the positions of those matches, given as [start, length] pairs.

[
    {
        "ref": "1",
        "score": 0.5428010575976667,
        "matchData": {
            "metadata": {
                "salesforc": {
                    "title": {
                        "position": [
                            [0, 11]
                        ]
                    },
                    "body": {
                        "position": [
                            [0, 10],
                            [276, 10],
                            [1132, 10],
                            [2210, 10],
                            [3190, 10],
                            [3236, 10]
                        ]
                    }
                }
            }
        }
    },
    {
        "ref": "2",
        "score": 0.45306026606028593,
        "matchData": {
            "metadata": {
                "salesforc": {
                    "title": {
                        "position": [
                            [0, 11]
                        ]
                    }
                }
            }
        }
    }
]
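
To turn an array like this back into something displayable, each ref has to be mapped to its document and the positions gathered per field. A rough sketch follows; describeResults is a hypothetical helper, and docs is assumed to be the same array, in the same order, that was indexed.

```javascript
// Hypothetical helper: join Lunr results with the original documents.
// `docs` is assumed to be the array that was indexed, in the same order.
function describeResults(results, docs) {
    return results.map(function (result) {
        // refs come back as strings, so convert back to an array index.
        var doc = docs[Number(result.ref)];
        var positions = { title: [], body: [] };
        var metadata = result.matchData.metadata;
        // Merge the positions of every matched term, grouped by field.
        Object.keys(metadata).forEach(function (term) {
            Object.keys(metadata[term]).forEach(function (field) {
                var found = metadata[term][field].position || [];
                positions[field] = (positions[field] || []).concat(found);
            });
        });
        return {
            url: doc.url,
            title: doc.title,
            score: result.score,
            positions: positions
        };
    });
}
```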

The last step is to display the results on the search page, which is left as an exercise for the reader 😉. In addition to displaying the title of any matching pages, I chose to display excerpts containing the search term and to highlight all occurrences of the search term. The excerpts are obtained by taking a number of characters before and after each match, stopping on word boundaries. There is some additional logic that prevents redundancy if two occurrences are close together.
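
For readers who want a starting point, here is one way the excerpt logic described above could look. It is a sketch under my own assumptions rather than the code this site runs: buildExcerpts and radius are hypothetical names, positions are [start, length] pairs as returned by Lunr, and windows are widened (not shrunk) to the nearest whitespace.

```javascript
// Hypothetical helper: cut excerpts around each match in `text`.
// `positions` is an array of [start, length] pairs; `radius` is the number of
// characters of context to aim for on each side of a match.
function buildExcerpts(text, positions, radius) {
    // Sort matches by start offset so nearby windows can be merged in one pass.
    var sorted = positions.slice().sort(function (a, b) { return a[0] - b[0]; });
    var windows = [];
    sorted.forEach(function (pos) {
        var start = Math.max(0, pos[0] - radius);
        var end = Math.min(text.length, pos[0] + pos[1] + radius);
        // Widen each edge until it sits next to whitespace or the end of the
        // text, so no word is cut in half.
        while (start > 0 && !/\s/.test(text[start - 1])) { start--; }
        while (end < text.length && !/\s/.test(text[end])) { end++; }
        var last = windows[windows.length - 1];
        if (last && start <= last.end) {
            // Matches that are close together share one excerpt.
            last.end = Math.max(last.end, end);
        } else {
            windows.push({ start: start, end: end });
        }
    });
    return windows.map(function (w) { return text.slice(w.start, w.end); });
}
```

Highlighting can then be layered on top by wrapping each match inside an excerpt in a tag such as mark, taking care to escape the surrounding text first.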

Once again, you can see the finished result here.

Cheers,

Jeff