Bleve: How to build a rocket-fast search engine?

Go/Golang is one of my favorite languages; I love the minimalism and how clean it is. The syntax is very compact, and the language tries very hard to keep things simple (I am a big fan of the KISS principle).

One of the major challenges I faced in recent times is building a fast search engine. Sure, there are options such as Solr and Elasticsearch; both work really well and are highly scalable. However, I needed to simplify search by making it faster and easier to deploy, with little to no dependencies.

I needed to return results quickly enough that they could be re-ranked afterwards. While C/Rust might be a good fit for this, I value development speed and productivity. Go is the best of both worlds, I guess.

In this article, I will go through a simple example of how you can build your own search engine using Go. You may be surprised: it's not as complicated as you think.

Golang: Python on steroids

I don't know why, but Golang feels like Python in a way. The syntax is very easy to grasp; maybe it's the lack of semicolons and brackets everywhere, or the lack of ugly try-catch statements. Maybe it's the awesome Go formatter. I don't know.

Anyway, since Golang generates a single self-contained binary, it's super easy to deploy to any production server. You simply "go build" and swap out the executable.

Which is exactly what I needed.

Do you Bleve?

No, that's not a typo 🙂. Bleve is a powerful, easy-to-use, and very flexible search library for Golang.

While, as a Go developer, you generally avoid third-party packages like the plague, sometimes a dependency makes sense. Bleve is fast, well-designed, and provides sufficient value to justify using it.

In addition, here is why I "Bleve":

  • Self-contained: one of the big advantages of Golang is the single binary, so I wanted to maintain that feel and not need an external DB or service to store and query documents. Bleve runs in memory and writes to disk, similar to SQLite.

  • Easy to extend. Since it's just Go code, I can easily tweak the library or extend it in my codebase as needed.

  • Fast: search results across 10 million+ documents take just 50-100ms, and this includes filtering.

  • Faceting: you cannot build a modern search engine without some level of faceting support. Bleve fully supports the common facet types, such as ranges and simple category counts.

  • Decent indexing speed: Bleve is somewhat slower than Solr, which can index 10 million documents in around 30 minutes where Bleve takes over an hour. Still, an hour or so is fast enough for my needs.

  • Good quality results: Bleve does well on keyword queries, and even some semantic-style searches work really well.

  • Fast startup: if you need to restart or deploy an update, Bleve is back up in mere milliseconds. Reads are not blocked while the index loads, so you can search the index without hiccups just milliseconds after a restart.

Setting up an index

In Bleve, an "Index" can be thought of as a database table or a collection (NoSQL). Unlike a regular SQL table, you do not need to specify every single column; you can basically get away with the default schema for most use cases.
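That said, if you ever want explicit control over how a particular field is indexed, Bleve does let you define mappings. A minimal sketch (the "book" document type name here is just an example):

// Optional: attach an explicit field mapping for "genre" to a
// document mapping, then register it on the index mapping.
bookMapping := bleve.NewDocumentMapping()
bookMapping.AddFieldMappingsAt("genre", bleve.NewTextFieldMapping())

mappings := bleve.NewIndexMapping()
mappings.AddDocumentMapping("book", bookMapping)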

To initialize a Bleve index, you can do the following:

mappings := bleve.NewIndexMapping()
index, err := bleve.NewUsing("/some/path/index.bleve",
    mappings, "scorch", "scorch", nil)

if err != nil {
   log.Fatal(err)
}

Bleve supports a few different index types, but after much fiddling I found that the "scorch" index type gives the best performance. If you leave off the last three arguments and use plain "bleve.New", Bleve just defaults to BoltDB.
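One practical note: on later runs the index will already exist on disk, so you would typically open it rather than recreate it. A small sketch of the usual open-or-create pattern:

index, err := bleve.Open("/some/path/index.bleve")
if err == bleve.ErrorIndexPathDoesNotExist {
    // First run: create the index with our mappings and the scorch type.
    mappings := bleve.NewIndexMapping()
    index, err = bleve.NewUsing("/some/path/index.bleve",
        mappings, "scorch", "scorch", nil)
}
if err != nil {
    log.Fatal(err)
}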

Adding documents

Adding documents to Bleve is a breeze. You basically can store any type of struct in the index:

type Book struct {
    ID                     int      `json:"id"`
    Name                   string   `json:"name"`
    Genre                  string   `json:"genre"`
}

b := Book{ID: 1234, Name: "Some creative title", Genre: "Young Adult"}
idStr := fmt.Sprintf("%d", b.ID)

// Index(id string, data interface{}) error
err = index.Index(idStr, b)

If you are indexing a large number of documents, it's better to use batching:

batch := index.NewBatch()

// books is whatever slice of documents you are loading.
for _, b := range books {
    idStr := fmt.Sprintf("%d", b.ID)
    batch.Index(idStr, b)

    // Flush once the batch is full, then start a fresh one.
    if batch.Size() >= 1000 {
        if err := index.Batch(batch); err != nil {
            // failed, try again or log etc...
        }
        batch = index.NewBatch()
    }
}

// Flush whatever is left over.
if batch.Size() > 0 {
    if err := index.Batch(batch); err != nil {
        // failed, try again or log etc...
    }
}

As you will notice, a complex task like batching records and writing them to the index is simplified by "index.NewBatch", which creates a container that temporarily holds documents to be indexed.

Thereafter you just check the size as you loop along and flush the batch once you reach the batch size limit.

Searching the index

Bleve exposes multiple different search query parsers that you can choose from depending on your search needs. To keep this article short and sweet, I am just going to use the standard query string parser.

searchParser := bleve.NewQueryStringQuery("chicken recipe books")

maxPerPage := 50
offset := 0

searchRequest := bleve.NewSearchRequestOptions(
    searchParser, maxPerPage, offset, false)

// By default bleve returns just the ID, here we specify
// - all the other fields we would like to return.
searchRequest.Fields = []string{"id", "name", "genre"}

searchResults, err := index.Search(searchRequest)

With just these few lines, you now have a powerful search engine that delivers good results with a low memory and resource footprint.

Here is a JSON representation of the search results; "hits" will contain the matching documents:

"status": {
"total": 5,
"failed": 0,
"successful": 5
},
"request": {},
"hits": [],
"total_hits": 19749,
"max_score": 2.221337297308545,
"took": 99039137,
"facets": null
}
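In Go, you read the same data straight off the result struct. A small sketch (the field names match the Book struct from earlier):

for _, hit := range searchResults.Hits {
    // Fields holds the stored values requested via searchRequest.Fields.
    fmt.Printf("%s (score %.2f): %v\n", hit.ID, hit.Score, hit.Fields["name"])
}
fmt.Printf("%d hits in %v\n", searchResults.Total, searchResults.Took)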

Faceting

As mentioned earlier, Bleve provides full faceting support out of the box, without having to set anything up in your schema. To facet on the book "Genre", for example, you can do the following:

//... build searchRequest -- see previous section.
// Add facets
genreFacet := bleve.NewFacetRequest("genre", 50)
searchRequest.AddFacet("genre", genreFacet)

searchResults, err := index.Search(searchRequest)

We extend our searchRequest from earlier with just two lines of code. "NewFacetRequest" takes two arguments:

  • Field: the field in our index to facet on (string).

  • Size: the maximum number of facet entries to return (integer). Thus in our example, it will return counts for at most 50 genres.

The above will now fill the "facets" in our search results.

Next, we simply add our facet to the search request via "AddFacet", which takes a "facet name" and the actual facet. The "facet name" is the key you will find this result set under in the search results.
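Reading the counts back out is just a map lookup on the results. A sketch (the exact shape of the "Terms" field can differ slightly between Bleve versions):

if genres, ok := searchResults.Facets["genre"]; ok {
    for _, term := range genres.Terms {
        fmt.Printf("%s: %d\n", term.Term, term.Count)
    }
}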

Advanced queries and filtering

While the "QueryStringQuery" parser can get you quite a bit of mileage, sometimes you need more complex queries, such as "one must match", where you match a search term against several fields and return results as long as at least one field matches.

You can use the "Disjunction" and "Conjunction" query types to accomplish this.

  • Conjunction Query: basically, it allows you to chain multiple queries together to form one giant query. A document must match all of the child queries.

  • Disjunction Query: this allows you to perform the "one must match" query mentioned above. You can pass in any number of queries and set the minimum number of child queries a document must match.

Disjunction Query example:


djQuery := bleve.NewDisjunctionQuery()
fields := []string{"title", "description", "author"}
// A document must match at least 1 of the child queries below.
djQuery.Min = 1

for _, field := range fields {
    query := bleve.NewMatchQuery(searchTerm)

    // Tell the query which field to match against.
    query.SetField(field)
    djQuery.AddQuery(query)
}

searchRequest := bleve.NewSearchRequestOptions(
    djQuery, maxPerPage, offset, false)

Similar to how we used "searchParser" earlier, we can now pass the "Disjunction Query" into the constructor for our "searchRequest".

While not exactly the same, this resembles the following SQL:

SELECT docs FROM docs WHERE title LIKE '%x%' OR description LIKE '%x%'
OR author LIKE '%x%';

ā„¹ļø You can also adjust how fuzzy you want the search to be by setting "query.Fuzziness=[0 or 1 or 2]"

Conjunction Query Example:

cjQuery := bleve.NewConjunctionQuery()

// Keyword search
searchQuery := bleve.NewMatchQuery(searchTerm)
searchQuery.SetField("name")

cjQuery.AddQuery(searchQuery)

// Price range search (NewNumericRangeQuery expects *float64;
// min is inclusive, max is exclusive by default).
minPrice := 100.0
maxPrice := 200.0
priceQuery := bleve.NewNumericRangeQuery(&minPrice, &maxPrice)
priceQuery.SetField("price")

cjQuery.AddQuery(priceQuery)

searchRequest := bleve.NewSearchRequestOptions(
    cjQuery, maxPerPage, offset, false)

You will notice the syntax is very similar; you can basically use the "Conjunction" and "Disjunction" queries interchangeably.

This will look similar to the following in SQL:

SELECT docs FROM docs
WHERE name LIKE '%x%' AND (price >= 100 AND price < 200)

In summary: use the "Conjunction Query" when a document must match all child queries, and the "Disjunction Query" when a document only needs to match some minimum number of child queries, not necessarily all of them.

Sharding

If you run into speed issues, Bleve also makes it possible to distribute your data across multiple index shards and then query those shards in one request, for example:


searchShardHandler := bleve.NewIndexAlias()
searchShardHandler.Add(indexes...)

searchResults, err := searchShardHandler.Search(searchRequest)

Sharding can become quite complex, but as you see above, Bleve takes away a lot of the pain: it automatically "merges" all the indexes, searches across them, and returns results in one result set, just as if you had searched a single index.

I have been using sharding to search across 100+ shards. The whole search process completes in a mere 100-200 milliseconds on average.

You can create shards as follows:

var indexes []bleve.Index

for i := 1; i < 5; i++ {
    indexShardName := fmt.Sprintf("/some/path/index_%d.bleve", i)
    index, err := bleve.NewUsing(indexShardName, mappings,
        "scorch", "scorch", nil)

    if err != nil {
        log.Fatal(err)
    }
    indexes = append(indexes, index)
}

Just be sure to create unique IDs for each document, or have some sort of predictable way of adding and updating documents without messing up the index.

An easy way to do this is to store a prefix containing the shard name in your source DB, or wherever you get the documents from. Every time you insert or update, you look up the prefix, which tells you which shard to call ".Index" on.
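As a rough sketch of that lookup (the names "shardByPrefix" and "indexFor" are made up for illustration):

func indexFor(shardByPrefix map[string]bleve.Index, prefix string) (bleve.Index, error) {
    idx, ok := shardByPrefix[prefix]
    if !ok {
        return nil, fmt.Errorf("unknown shard prefix %q", prefix)
    }
    return idx, nil
}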

Speaking of updating: simply calling "index.Index(idStr, b)" again with an existing ID will replace the stored document.

Conclusion

Using just the basic search techniques above and putting them behind Gin or the standard Go HTTP server, you can build quite a powerful search API and serve millions of requests without needing to roll out complex infrastructure.

One caveat though: Bleve does not cater for replication. However, since you can wrap all of this in an API, you can simply have a cron job that reads from your source and "blasts" out updates to all your Bleve servers using goroutines.
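A sketch of that fan-out, assuming each Bleve server exposes a hypothetical "/index" endpoint (the endpoint and "serverURLs" are assumptions, not part of Bleve):

// Push one update payload to every server concurrently.
func broadcastUpdate(serverURLs []string, payload []byte) {
    var wg sync.WaitGroup
    for _, u := range serverURLs {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            resp, err := http.Post(u+"/index", "application/json", bytes.NewReader(payload))
            if err != nil {
                log.Printf("update failed for %s: %v", u, err)
                return
            }
            resp.Body.Close()
        }(u)
    }
    wg.Wait()
}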

Alternatively, you can lock writes to disk for a few seconds and "rsync" the data across to replica indexes, although I don't advise doing so because you would probably also need to restart the Go binary each time.
