On Saturday Kevin Greene of Spantree gave an interesting talk about Elasticsearch.
Image credit: Niki Odolphie on Flickr
Elasticsearch is a distributed, RESTful, high-performance search engine built for the cloud. For free-text search, an Elasticsearch query is typically much faster than a query against a relational database, which (if driven by free-text user input) would probably need to do a full table scan for each query. That is bad in the base case and atrocious for Google-like "instant search" that populates while the user types.
GitHub uses Elasticsearch to index more than 8 million code repositories and critical event data, more than 20 terabytes of data.
"You can search for a project that uses Clojure as the primary language, and has had activity over the past month, and all this functionality is powered by Elasticsearch. You can do lots of queries on that data using Elasticsearch that a standard SQL database won’t support."—Tim Pease, Operations Engineer at Github
Elasticsearch has the following key features, Kevin told the audience:
- Wicked fast speed
- Highly configurable
- Free and open source
- Distributed
- Fault-tolerant
- Handles unstructured data
- Realtime analytics
Elasticsearch is NOT:
- A relational database
- A crawler
- A machine learner
- A user-facing search front-end
- Secure
Kevin Greene presenting his talk “Elasticsearch for, you know, search” at Flourish 2014.
Elasticsearch is document-oriented, so there is no need for an upfront schema definition. A schema can still be defined per type to customize the indexing process.
Elasticsearch is distributed and highly available, and its search is near real-time. Each index is split into multiple shards (the number of shards is configurable). Each shard can have one or more replicas, and read (search) operations can be served by the replicas.
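To make the shard and replica settings concrete, here is a minimal Python sketch that builds the JSON body used to set them when an index is created. `number_of_shards` and `number_of_replicas` are the Elasticsearch index settings; the helper names `build_index_settings` and `total_shard_copies` and the example values are assumptions made for illustration, and nothing is sent to a cluster.

```python
import json


def build_index_settings(shards, replicas):
    """Build an index-creation body with the given shard and replica counts."""
    if shards < 1 or replicas < 0:
        raise ValueError("need at least 1 shard and 0 or more replicas")
    return {
        "settings": {
            "number_of_shards": shards,      # primary shards the index is split into
            "number_of_replicas": replicas,  # copies kept of each primary shard
        }
    }


def total_shard_copies(shards, replicas):
    """Each primary shard exists once, plus once per replica."""
    return shards * (1 + replicas)


# Illustrative values only: 5 primary shards, 1 replica of each.
body = build_index_settings(5, 1)
print(json.dumps(body, indent=2))
print(total_shard_copies(5, 1))  # 10
```

The printed body is what would be sent when creating the index. With 5 shards and 1 replica, the cluster holds 10 shard copies in total, and searches can be spread across all of them.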
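Elasticsearch uses a formula called TF-IDF (term frequency-inverse document frequency) to score how relevant each indexed document is to a given query.
This is the formula for calculating the relevancy score, where N is the total number of documents and the document frequency is the number of documents that contain the term: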
TF-IDF relevancy score = (term frequency) * ln [N / (document frequency)]
For example, let's say we have a document with 100 words wherein the word "monkey" appears 5 times. The term frequency (TF) for monkey is then:
(5 / 100) = 0.05.
Now, let's say we have 1,000 documents and the word "monkey" appears in 10 of those. Then, the inverse document frequency (IDF) is calculated as:
ln(1,000 / 10) = ln(100) ≈ 4.6
Then the TF-IDF score for our document for the query "monkey" is the product of TF * IDF:
TF*IDF score = 0.05 * 4.6 = 0.23
In order to determine which document should be displayed first for a user's query "monkey," Elasticsearch compares the TF-IDF scores of all the documents and returns the document with the highest score as the first result.
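The arithmetic above can be checked with a short Python sketch. It implements only the simplified formula given in this post, not Elasticsearch's actual scoring code, and the function names and toy corpus are made up for illustration.

```python
import math


def document_frequency(term, docs):
    """Count how many documents contain the term at least once."""
    return sum(1 for words in docs if term in words)


def tf_idf(term, words, n_docs, doc_freq):
    """Score one document: (term frequency) * ln(N / document frequency)."""
    if not words or doc_freq == 0:
        return 0.0  # empty document, or term absent from the corpus
    term_frequency = words.count(term) / len(words)
    return term_frequency * math.log(n_docs / doc_freq)


def rank(term, docs):
    """Return document indices ordered from highest to lowest score."""
    doc_freq = document_frequency(term, docs)
    scores = [tf_idf(term, words, len(docs), doc_freq) for words in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy corpus matching the example: 1,000 documents of 100 words each,
# 10 of which contain "monkey"; the first contains it 5 times.
docs = [["monkey"] * 5 + ["banana"] * 95]
docs += [["monkey"] + ["banana"] * 99 for _ in range(9)]
docs += [["banana"] * 100 for _ in range(990)]

score = tf_idf("monkey", docs[0], len(docs), document_frequency("monkey", docs))
print(round(score, 2))          # 0.23
print(rank("monkey", docs)[0])  # 0
```

The document that mentions "monkey" five times scores 0.23 and is ranked first, ahead of the nine documents that mention it only once.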
You can find Elasticsearch on GitHub.
Here is a video of an earlier taping of Kevin's talk:
Video credit: Spantree and Kevin Greene on Vimeo