What is Splice Machine?

2017-10-12T00:28:30+00:00

Splice Machine is a scale-out SQL RDBMS designed to power predictive applications. Like a Lambda Architecture, it is a hybrid of specialized compute engines, but the engines are tightly integrated and their complexity is hidden from the user.

Database Architecture

Splice Machine implements a Lambda Architecture with Apache Kafka, Apache HBase, and Apache Spark under the hood. These components are pre-integrated, which avoids the complexity of assembling a Lambda Architecture yourself. Users can work purely in SQL through standards such as JDBC and ODBC from any programming language or BI tool, or access Splice Machine programmatically through its Spark integration, manipulating result sets as Spark DataFrames directly in Scala, Python, and R.

A Flexible Scale-Out Architecture

Example of using Splice Machine in Java via JDBC

Life of a Query

You issue SQL to Splice Machine, and it creates a plan to execute the query on the cluster. Splice Machine uses a cost-based optimizer to determine the best access path to the data (base tables or indexes), the best join order, and the best distributed join algorithm. It also chooses the compute engine (HBase or Spark) that will execute the query, based on its assessment of the query and the data. Single-record lookups and updates and short range scans take advantage of HBase’s needle-in-the-haystack querying power, while long-running joins, aggregations, and groupings run on Spark.
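The engine choice described above can be pictured with a deliberately simplified sketch. This is not Splice Machine's optimizer: the `PlanEstimate` type, the `choose_engine` function, and the `ROW_THRESHOLD` cutoff are all hypothetical, and a real cost model would weigh far more than a row count. It only illustrates the idea that small, selective plans go to HBase and large scans go to Spark.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; the real cost model
# is not described in this document.
ROW_THRESHOLD = 100_000


@dataclass
class PlanEstimate:
    """A toy stand-in for what an optimizer might know about a plan."""
    description: str
    estimated_rows_scanned: int


def choose_engine(plan: PlanEstimate, threshold: int = ROW_THRESHOLD) -> str:
    """Route small, selective plans to HBase and large scans to Spark."""
    if plan.estimated_rows_scanned <= threshold:
        return "HBase"
    return "Spark"


lookup = PlanEstimate("single order lookup by key", 1)
analytic = PlanEstimate("join and aggregate all past orders", 250_000_000)

print(choose_engine(lookup))    # HBase
print(choose_engine(analytic))  # Spark
```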

Query Execution

For example, consider an inventory application where you have time-series data representing inventory levels over time based on expected supply and demand of orders and shipments.

Sample Built-in Visualization using Zeppelin

Often these applications require needle-in-the-haystack queries such as:

  • Find a specific order
  • How much is expected to be available on a given date (e.g., available-to-promise)
  • Which stockouts are projected

These queries require indexed structures that can return results on the order of milliseconds and thus Splice Machine executes these on HBase.

Available-to-Promise Query – Needle in the haystack type of query. Fast lookup on volumes of data.
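As a rough illustration of what such lookups look like in SQL, the sketch below runs an available-to-promise query and a projected-stockout query. Everything here is assumed for the example: the `inventory_projection` table, its columns, and the sample rows are hypothetical, and Python's built-in `sqlite3` module is used only as a stand-in so the sketch runs anywhere. Against Splice Machine the same kind of SQL would be issued over JDBC or ODBC.

```python
import sqlite3

# In-memory SQLite is a stand-in; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory_projection (
        item_id       INTEGER NOT NULL,
        proj_date     TEXT    NOT NULL,
        available_qty INTEGER NOT NULL,
        PRIMARY KEY (item_id, proj_date)
    )
""")
conn.executemany(
    "INSERT INTO inventory_projection VALUES (?, ?, ?)",
    [
        (101, "2017-11-01", 40),
        (101, "2017-11-02", 15),
        (101, "2017-11-03", -5),
        (202, "2017-11-01", 300),
    ],
)


def available_to_promise(conn, item_id, on_date):
    """Keyed lookup: projected quantity for one item on one date."""
    row = conn.execute(
        "SELECT available_qty FROM inventory_projection "
        "WHERE item_id = ? AND proj_date = ?",
        (item_id, on_date),
    ).fetchone()
    return None if row is None else row[0]


def projected_stockouts(conn):
    """Items and dates where the projected quantity is zero or negative."""
    return conn.execute(
        "SELECT item_id, proj_date FROM inventory_projection "
        "WHERE available_qty <= 0 ORDER BY item_id, proj_date"
    ).fetchall()


atp = available_to_promise(conn, 101, "2017-11-02")
stockouts = projected_stockouts(conn)
print(atp)        # 15
print(stockouts)  # [(101, '2017-11-03')]
```

The first query touches a single row through the primary key, which is the access pattern the text describes as running on HBase.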

On the other hand, consider the same inventory application where you want to learn a model of why shipments are late. This learning problem might require long-running queries that group all past orders into bins of “lateness” as input to a machine learning algorithm. Such queries may scan every past order and calculate how late it was relative to its original delivery date. The Splice Machine optimizer would use Spark to execute this SQL.
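A lateness-binning query of this kind might look like the following sketch. The `orders` table, its columns, the bin boundaries, and the sample rows are hypothetical, and `sqlite3` again serves only as a runnable stand-in. The `julianday` date function is SQLite-specific; another database would use its own date arithmetic.

```python
import sqlite3

# In-memory SQLite is a stand-in; schema and bins are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id               INTEGER PRIMARY KEY,
        original_delivery_date TEXT NOT NULL,
        actual_delivery_date   TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2017-09-01", "2017-09-01"),  # on time
        (2, "2017-09-01", "2017-09-03"),  # 2 days late
        (3, "2017-09-05", "2017-09-04"),  # 1 day early
        (4, "2017-09-10", "2017-09-20"),  # 10 days late
        (5, "2017-09-12", "2017-09-15"),  # 3 days late
    ],
)

# Full scan over every order: compute days late, bucket, then count.
LATENESS_SQL = """
    SELECT lateness_bin, COUNT(*) AS order_count
    FROM (
        SELECT CASE
                 WHEN days_late <= 0 THEN 'on time'
                 WHEN days_late <= 3 THEN '1-3 days late'
                 ELSE 'more than 3 days late'
               END AS lateness_bin
        FROM (
            SELECT CAST(julianday(actual_delivery_date)
                        - julianday(original_delivery_date) AS INTEGER)
                   AS days_late
            FROM orders
        )
    )
    GROUP BY lateness_bin
"""

bins = dict(conn.execute(LATENESS_SQL).fetchall())
print(bins)
```

Unlike a keyed lookup, this query has no selective predicate and must read the whole table, which is the access pattern the text describes as running on Spark.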

In both of the cases above, the developer just issues the SQL and the optimizer picks the right Lambda architecture engine to execute the query.

Ingestion

Splice Machine can ingest vast amounts of data with excellent throughput. The batch ingestion tool scales out linearly to petabyte scale, ingesting at 40 MB/sec per node. Splice Machine is also tightly integrated with Spark Structured Streaming and Kafka to enable real-time streaming.
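To put the per-node figure in perspective, here is a back-of-the-envelope calculation. It assumes perfectly linear scaling (as claimed above), decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), and a hypothetical 100-node cluster; `ingest_hours` is an illustrative helper, not part of any Splice Machine tool.

```python
MB = 10**6   # decimal megabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)


def ingest_hours(total_bytes, nodes, mb_per_sec_per_node=40.0):
    """Hours needed to ingest total_bytes, assuming linear scale-out."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    cluster_bytes_per_sec = nodes * mb_per_sec_per_node * MB
    return total_bytes / cluster_bytes_per_sec / 3600


hours = ingest_hours(1 * PB, nodes=100)
print(round(hours, 1))  # 69.4
```

Under these assumptions, 100 nodes ingest about 4 GB/sec in aggregate, so 1 PB takes roughly 69 hours, and doubling the node count halves that time.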

Hybrid Storage

In addition to the row-based storage of HBase, which enables needle-in-the-haystack queries, Splice Machine also supports tables stored in the columnar Apache Parquet and Apache ORC formats, as well as in Apache Avro. This lets customers keep data on low-cost block storage and benefit from the compact encoding of these formats, which lowers storage costs.
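The storage saving from columnar layouts comes from keeping similar values next to each other. The toy sketch below pivots a few hypothetical rows into columns and applies run-length encoding, used here as one simple representative technique. Parquet and ORC use more elaborate encodings, and this code does not read or write either format.

```python
from itertools import groupby

# Hypothetical rows: (date, warehouse, quantity)
rows = [
    ("2017-11-01", "WAREHOUSE_A", 40),
    ("2017-11-01", "WAREHOUSE_A", 15),
    ("2017-11-01", "WAREHOUSE_B", 7),
    ("2017-11-02", "WAREHOUSE_B", 7),
]


def to_columns(rows):
    """Pivot a row-oriented list of tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


columns = to_columns(rows)
encoded = [run_length_encode(col) for col in columns]

for col in encoded:
    print(col)
```

Columns with long runs of repeated values, such as the date and warehouse columns here, shrink to a handful of pairs, while a row-oriented layout would store every value in full.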
