Spark SQL

Spark SQL

• Works with structured and semistructured data
• Structured data have a schema (a set of fields for each record)
• Spark provides:

• A DataFrame abstraction that simplifies working with structured datasets. DataFrames are similar to tables in a relational database

• Spark can read and write data in a variety of formats
• Spark lets you query data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors

• Under the hood, Spark SQL is based on an extension of the RDD model called a DataFrame. A DataFrame contains an RDD of Row objects, each of which represents a record

• DataFrames store data more efficiently than native RDDs by taking advantage of their schema. In addition, they support operations not available on RDDs, such as the ability to run SQL queries.
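To make the Row abstraction concrete, here is a minimal sketch (the field names and values are hypothetical) of creating a Row and reading its fields by name, just as DataFrame records allow:

# A Row is a record with named fields
>>> from pyspark.sql import Row
>>> person = Row(name="Alice", age=21)
>>> person.name   # fields of a Row are accessible as attributes
'Alice'
>>> person.age
21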

Linking with Spark SQL

• Additional dependencies are required
• Spark SQL can be built with or without support for Apache Hive (Hive support allows us to access Hive tables, user-defined functions, Hive serialization and deserialization formats, and the Hive query language)
• Python does not require any change. Scala and Java need the following dependency:

groupId = org.apache.spark
artifactId = spark-hive_2.10
version = 1.3.0

Linking with Spark SQL

When programming with Spark SQL there are two entry points:
• HiveContext provides access to HiveQL.
• The more basic SQLContext provides a subset of the Spark SQL support that does not depend on Hive.
Use sqlContext when using pyspark.

Using Spark SQL in Applications

We construct a HiveContext based on our SparkContext. This context provides additional functions for querying and interacting with Spark SQL data. Using the HiveContext we can build DataFrames, which represent our structured data, and operate on them with SQL or with normal RDD operations like map().

Initializing Spark SQL

# Import Spark SQL
>>> from pyspark.sql import HiveContext, Row
# Or if you can't include the Hive requirements
>>> from pyspark.sql import SQLContext, Row

Once we've added our imports, we need to create a HiveContext, or a SQLContext if we cannot bring in the Hive dependencies:

>>> sc = SparkContext(...)
>>> hiveCtx = HiveContext(sc)
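To see where this leads, here is a minimal sketch of building a DataFrame and querying it with SQL (the input file, table name, and field names are hypothetical):

# Load semistructured JSON into a DataFrame: an RDD of Rows plus a schema
>>> tweets = hiveCtx.jsonFile("tweets.json")
# Register the DataFrame as a temporary table so SQL queries can refer to it
>>> tweets.registerTempTable("tweets")
# Run a SQL query; the result is again a DataFrame
>>> topTweets = hiveCtx.sql(
...     "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")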
