Elasticsearch transformed: what's new in 5.0?
Elasticsearch team lead Clinton Gormley explores the changes in the latest version of Elasticsearch
When Elasticsearch launched in 2012, it had a bold but clear ambition - to help organisations utilise all their data, in real time, to make real revenue-generating decisions. Today, built on Apache Lucene, Elasticsearch allows users to build new applications from scratch, harnessing the power of search as a competitive differentiator.
Elasticsearch is now the world's most popular full-text search and analytics engine. It is used by organisations such as the BBC, eBay, Groupon, The Guardian, Goldman Sachs, Microsoft, Uber, Yelp and Wikipedia, among others, to make light work of massive amounts of structured and unstructured data in a variety of use cases - search, security and logging, to name a few.
And Elastic has just announced its largest-ever product update: version 5.0 of the Elastic Stack. Since 2012, Elastic's software products - known to many as the ELK Stack - have been downloaded more than 75 million times. Announced with 5.0, Elastic Stack is now the official name of Elastic's open-source products - Elasticsearch, Kibana, Beats, and Logstash. X-Pack, meanwhile, refers to commercial features, such as security, alerting, monitoring, Graph analytics, and reporting.
So, what's new in Elasticsearch 5.0?
Total transformation
Built on Lucene 6.2.0, Elasticsearch 5.0 is an altogether different beast compared to earlier iterations. It changes many of the underlying fundamentals of how search is conducted, adds a host of brand new features, and is built on more solid foundations. It is faster, more secure, more resilient, and easier to use.
The first noticeable change within Elasticsearch 5.0 is the consistency across the stack, the ease of deployment, and how easy it is to get started. Historically, a lot of set-up, configuration and a touch of 'black magic' were required to make earlier versions work together. The first real advantage in Elasticsearch 5.0 is that these integration issues are resolved; deploying Elasticsearch with other components of Elastic Stack 5.0 should just work, and with minimal set-up.
Ingestion made easy
Getting data into Elasticsearch has become much simpler. Ingest Node is a new feature in Elasticsearch 5.0 that incorporates popular Logstash filters (grok, split, convert, date), implemented directly in Elasticsearch as processors. Ingesting log messages is as easy as configuring a pipeline with the Ingest API, and setting up Filebeat to forward a log file to Elasticsearch. It's even possible to run dedicated ingest nodes to separate the extraction and enrichment workload from search, aggregations, and indexing. Logstash, Elastic's existing data collection and normalisation engine, remains ideal when pulling from custom sources not handled by Beats, or when you want to route data to destinations other than Elasticsearch, such as Kafka, HDFS and others.
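As a rough sketch of what such a pipeline might look like, the Python snippet below builds the JSON body for a pipeline that chains three of the processors mentioned above. The field names (message, timestamp, response) and the choice of the COMBINEDAPACHELOG grok pattern are assumptions made for illustration, not something prescribed by Elasticsearch.

```python
import json


def build_access_log_pipeline():
    """Return the JSON body for an illustrative ingest pipeline.

    Assumes each incoming document carries a raw web access log line
    in a field called "message" (an assumption for this sketch).
    """
    return {
        "description": "Parse web access log lines (illustrative)",
        "processors": [
            # Extract named fields from the raw log line.
            {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}},
            # Parse the extracted timestamp string into a proper date.
            {"date": {"field": "timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}},
            # Store the HTTP status code as a number rather than a string.
            {"convert": {"field": "response", "type": "integer"}},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_access_log_pipeline(), indent=2))
```

The resulting body would be registered through the Ingest API with a request such as PUT _ingest/pipeline/access_logs, where access_logs is a hypothetical pipeline name.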
Improved indexing performance
The team spent more than a year completely re-engineering indexing performance to yield an improvement of up to 80 per cent in indexing throughput, and significantly reduced disk usage. Lucene 6 brings a sophisticated new Points data structure for numeric and geo-point fields - Block K-D trees. This radically changes the way numeric values are indexed and searched. Points are 36 per cent faster at query time and 71 per cent faster at index time, and they use 66 per cent less disk space and 85 per cent less memory.
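To give a feel for the idea behind a K-D tree, here is a toy, in-memory sketch in Python. It is emphatically not Lucene's implementation, which is disk-based and far more sophisticated; the function names and the leaf_size parameter are invented for this illustration. Points are split on alternating dimensions until they fit into small blocks, and a range search only descends into the halves that can overlap the query box.

```python
def build_tree(points, leaf_size=4, depth=0):
    """Recursively split points on alternating dimensions into small blocks."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    dim = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return {
        "dim": dim,
        "split": pts[mid][dim],  # left values are <= split, right values are >= split
        "left": build_tree(pts[:mid], leaf_size, depth + 1),
        "right": build_tree(pts[mid:], leaf_size, depth + 1),
    }


def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(l <= x <= h for l, x, h in zip(lo, p, hi))]
    dim, split = node["dim"], node["split"]
    found = []
    if lo[dim] <= split:  # the query box may reach into the left half
        found += range_search(node["left"], lo, hi)
    if hi[dim] >= split:  # the query box may reach into the right half
        found += range_search(node["right"], lo, hi)
    return found


if __name__ == "__main__":
    pts = [(x, (x * 7) % 11) for x in range(20)]
    tree = build_tree(pts, leaf_size=3)
    print(sorted(range_search(tree, (3, 2), (12, 8))))
```

The point of the structure is that whole blocks lying outside the requested range are never examined, which is what makes numeric and geo range queries cheap.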
The IP field type now supports both IPv6 and IPv4 addresses. The geo-point field now uses Lucene's new LatLonPoint, doubling geo-point query performance. All benchmarks run by Elastic are public, giving customers and the community transparency into the performance efforts undertaken by Elastic.
Scripting is painless
There's only so much you can accomplish in JSON, but scripting enables you to do a wide variety of other things. The history of scripting in Elasticsearch is a chequered one. It started with MVEL, then moved to Groovy when the MVEL project lost momentum. Groovy proved to be a security risk and it was not possible to sandbox it, which resulted in scripting being disabled by default.
Out of frustration, the team has developed a new scripting language called Painless, designed to be fast, safe, and secure, and enabled by default. Painless is four times faster than Groovy and getting faster. It also comes with nifty features like loop counters to prevent malicious scripts from being used for denial of service attacks.
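For a sense of what Painless looks like in practice, the Python sketch below builds the body of an update request that increments a counter on a document. The field name views and the parameter name count are hypothetical; the "inline" key is how script source was supplied in the 5.0 request format.

```python
def increment_views_body(count=1):
    """Build an update-request body that bumps a counter with a Painless script.

    "views" and "count" are hypothetical names chosen for this sketch.
    """
    return {
        "script": {
            "lang": "painless",
            # Values are passed in via params rather than being concatenated
            # into the script text, so the same script can be reused.
            "inline": "ctx._source.views += params.count",
            "params": {"count": count},
        }
    }


if __name__ == "__main__":
    print(increment_views_body(5))
```

Passing values through params rather than building a new script string each time is a deliberate choice here: the script text stays constant from one request to the next.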
Reducing the cost of deep pagination
Search pagination is easy if you just have one server. Paginating results from a cluster of nodes becomes progressively more expensive the deeper you paginate, as more results are required from each server to find just the 10 that are of interest. The new Search After feature (the search_after parameter) in Elasticsearch 5.0 changes all this by providing an efficient way to page through results continuously: the sort values of the last hit on one page are used to skip over previous results and return just the next page.
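The reasoning behind this approach can be simulated in a few lines of Python. This is a simplified model, not Elasticsearch code: each "shard" is a sorted list of (timestamp, doc_id) sort keys, and the function names are invented for the sketch. However deep the page, every shard hands back at most one page of hits that come after the cursor.

```python
import heapq
from bisect import bisect_right


def shard_page(shard_docs, after, size):
    """Return up to `size` sort keys from one shard, strictly after the cursor."""
    start = 0 if after is None else bisect_right(shard_docs, after)
    return shard_docs[start:start + size]


def search_after(shards, after, size):
    """Merge the per-shard pages and keep only the global next `size` hits."""
    pages = [shard_page(shard, after, size) for shard in shards]
    return list(heapq.merge(*pages))[:size]


if __name__ == "__main__":
    shards = [
        [(1, "a"), (4, "d"), (7, "g")],
        [(2, "b"), (5, "e"), (8, "h")],
        [(3, "c"), (6, "f"), (9, "i")],
    ]
    page1 = search_after(shards, None, 2)
    page2 = search_after(shards, page1[-1], 2)  # cursor = last hit of page 1
    print(page1, page2)
```

With classic from/size pagination each shard would instead have to return from + size hits, which is why the cost grows with depth. In a real request, the cursor is the array of sort values from the last hit of the previous page, and the sort should include a unique tie-breaker field.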
Search and aggregation
The search API has been refactored in 5.0, driving sophistication into the way search queries are executed. The complete overhaul of the result-caching mechanism means that large aggregations are only recalculated for indices that have been updated. What does this mean? Speed: repeated queries are now near-instant.
Scrolled search enables users to retrieve all the documents that match a query in batches, which is particularly useful if you want to process a billion documents. Now, scrolled searches can be divided up into slices that can be consumed in parallel - perfect for moving data in and out of Elasticsearch or to Hadoop and back again.
Many users want to stream data through Spark, for example, but Spark instances have a four gigabyte memory limit. If your document exceeds this limit you need to think of clever ways to split it up. Sliced scrolled search in Elasticsearch resolves this problem.
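As an illustration, the Python sketch below generates one search body per slice; each body could then be handed to a separate worker that opens its own scroll. The helper name is invented for this sketch, and the match_all query is just a placeholder.

```python
def sliced_scroll_bodies(query, num_slices):
    """Build one scroll request body per slice.

    Each body carries a "slice" section with the slice's id and the total
    number of slices, so the slices together cover every matching document.
    """
    if num_slices < 2:
        # A single slice is simply an ordinary scroll.
        raise ValueError("num_slices must be at least 2")
    return [
        {"slice": {"id": i, "max": num_slices}, "query": query}
        for i in range(num_slices)
    ]


if __name__ == "__main__":
    for body in sliced_scroll_bodies({"match_all": {}}, 4):
        print(body)
```

Each body would be sent as an initial scroll search (for example with a scroll=1m parameter), and each worker then keeps scrolling its own slice independently of the others.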
Percolator reverses the relationship between documents and queries. Normally, a query is used to find all matching documents. Percolator turns this around and instead answers the question "which queries match this document". House hunting is a good example of where this could be useful; customers can register the type of house they are interested in, which can be saved as a query.
Then, when a new house becomes available, the percolator can find the queries which the new house matches, and agents can alert customers. When a query is indexed, pertinent information is extracted from it and indexed too, so when a new document comes in it's easy to look up which queries might match and to check only those, instead of taking the brute force approach of checking all queries. This reduces memory usage and increases performance.
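The house-hunting example might be sketched as follows in Python. The field names (city, price, bedrooms), the document type house and the helper names are all hypothetical; the sketch also assumes that the index maps a field called query with the percolator type, which is where saved searches are stored.

```python
def saved_search(city, max_price, min_bedrooms):
    """A customer's wish list, stored as a document holding a query."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"city": city}},
                    {"range": {"price": {"lte": max_price}}},
                    {"range": {"bedrooms": {"gte": min_bedrooms}}},
                ]
            }
        }
    }


def percolate_request(house, doc_type="house"):
    """Search body asking: which saved queries match this new house?"""
    return {
        "query": {
            "percolate": {
                "field": "query",           # the field mapped as type percolator
                "document_type": doc_type,  # type used to parse the document
                "document": house,
            }
        }
    }


if __name__ == "__main__":
    print(saved_search("london", 500000, 2))
    print(percolate_request({"city": "london", "price": 450000, "bedrooms": 3}))
```

Each saved search is indexed like any other document; when a new listing arrives, the percolate request returns the saved searches it matches, and each hit identifies a customer to alert.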
Elastic Stack 5.0
The advances in Elasticsearch itself are just the first part of Elastic's gripping new story. Since its inception, Elasticsearch has invested in building new projects to ease the 'Getting Started' experience and add new capabilities. Some of these, like Logstash and Kibana, have become recognised tools in their own right.
One newer addition to the family, Beats, now delivers a suite of open source lightweight data shippers, for importing data types such as time series, metrics, network packet information, Windows metrics, and much more. These lightweight agents give users fast, ready-built connectivity to combine the power of search and analysis in new use cases. Beats even offers connectivity to Apache Kafka where messaging and queuing functionality is needed. With Beats, Elastic has gone further, creating libbeat, a common framework that lets any user build and maintain new Beats themselves. In addition to the Elastic-supported Beats - Filebeat, Metricbeat, Packetbeat, and Winlogbeat - the community has created more than 30 additional Beats.
Then there's Logstash 5.0 and Kibana 5.0, with updated designs, improved performance, and almost too much fun and functionality to mention in one short article. Kibana is the window to the Elastic Stack providing monitoring, security role & user management, graph exploration, and more with X-Pack. Also, the addition of Timelion offers an expressive user interface for interacting with time-series data, including direct APIs for connecting to live time series streams.
Kibana 5.0 also offers Console (formerly called Sense), an interface for interacting with Elasticsearch that includes syntax highlighting and tab completion, opening Kibana to a broader range of end users.
But by far the biggest change in Elastic Stack 5.0 is the simple fact that it is now a single unified stack, with simplified installs, "flick-the-switch" connectivity between all its elements, and unified release cycles.