Tightly-Coupled Analytics with Spark


Data engineers and data scientists are becoming increasingly comfortable with scale-out architectures such as Spark, using flexible programming languages such as Scala and Python to operate on their data. These developers often use notebook technologies such as Apache Zeppelin or Jupyter to create and share documents that contain live code, equations, visualizations, and explanatory text. They use notebooks for data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more.

We deliver a Splice Machine interpreter and Spark integration tightly embedded within Zeppelin notebooks. This enables data scientists to benefit simultaneously from the libraries delivered with Spark, such as MLlib and GraphX, while getting the ACID guarantees and superior query optimization of Splice Machine.

Splice Machine can return result sets as DataFrames and operate directly on Spark DataFrames, giving data scientists a durable, transactional store for Spark. This tight integration greatly reduces data movement and the ETL necessary to build models. It makes data engineers and scientists more productive and lets them create resilient data pipelines with the ability to roll back the database to a consistent state.
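For example, because writes execute as ACID transactions, a failed pipeline step can be rolled back rather than leaving partial results behind. Here is a minimal sketch of that pattern using standard JDBC; the connection URL and table name are placeholders, not values from this article:

```scala
import java.sql.DriverManager

// Placeholder JDBC URL -- substitute your cluster's host, port, and credentials.
val conn = DriverManager.getConnection(
  "jdbc:splice://localhost:1527/splicedb;user=splice;password=admin")
conn.setAutoCommit(false) // open an explicit transaction

try {
  val stmt = conn.createStatement()
  // Hypothetical staging step in a pipeline.
  stmt.executeUpdate("INSERT INTO SPLICE.PIPELINE_STAGING VALUES (1, 'loaded')")
  // ...additional pipeline steps...
  conn.commit()   // make all of the steps durable at once
} catch {
  case e: Exception =>
    conn.rollback() // restore the database to its last consistent state
    throw e
} finally {
  conn.close()
}
```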

Splice Machine has created a DataFrameReader and DataFrameWriter for Spark, enabling full in-process integration with SparkSQL. For example, here is a sketch that reads a Splice Machine table and filters it with a SparkSQL WHERE clause. The SplicemachineContext usage follows the Spark adapter's documented pattern; the connection URL and table names below are placeholders:
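```scala
import org.apache.spark.sql.SparkSession
import com.splicemachine.spark.splicemachine._

val spark = SparkSession.builder()
  .appName("splice-spark-example")
  .getOrCreate()

// Placeholder connection string.
val dbUrl = "jdbc:splice://localhost:1527/splicedb;user=splice;password=admin"
val splice = new SplicemachineContext(dbUrl)

// Read a Splice Machine table into a Spark DataFrame.
val customers = splice.df("SELECT * FROM SPLICE.CUSTOMERS")

// Expose it to SparkSQL and filter with a WHERE clause.
customers.createOrReplaceTempView("customers")
val active = spark.sql(
  "SELECT CUSTOMER_ID, NAME FROM customers WHERE STATUS = 'ACTIVE'")
active.show()
```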

Here is a subset of the Splice Machine Scala API that enables seamless manipulation of DataFrames in Spark. The method names below reflect the adapter's documented surface but should be verified against your version; the DataFrames and table names are placeholders:
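```scala
// Continuing with the `splice` context from the sketch above; `newRows`,
// `updatedRows`, and `staleRows` stand in for DataFrames whose columns
// match the target table.

// Insert the rows of a DataFrame into a table in a single ACID transaction.
splice.insert(newRows, "SPLICE.CUSTOMERS")

// Update existing rows, matched on the table's primary-key columns.
splice.update(updatedRows, "SPLICE.CUSTOMERS")

// Delete the rows identified by the DataFrame's primary-key values.
splice.delete(staleRows, "SPLICE.CUSTOMERS")

// Remove every row from a staging table.
splice.truncateTable("SPLICE.CUSTOMERS_STAGING")

// Check for a table before operating on it.
if (splice.tableExists("SPLICE.CUSTOMERS")) {
  val all = splice.df("SELECT * FROM SPLICE.CUSTOMERS")
  all.printSchema()
}
```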
