Spark SQL
- Spark SQL works with structured and semistructured data. Structured data has a schema: a known set of fields for each record.
- Spark provides:
  - a DataFrame abstraction that simplifies working with structured data. DataFrames are similar to tables in a relational database
  - the ability to read and write data in a variety of structured formats
  - the ability to query data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors
- Under the hood, Spark SQL is based on an extension of the RDD model called a DataFrame: an RDD of Row objects, each representing a record.
- DataFrames store data more efficiently than native RDDs by taking advantage of their schema. They also support operations not available on RDDs, such as running SQL queries.
Linking with Spark SQL
- Additional dependencies are required.
- Spark SQL can be built with or without support for Apache Hive. Hive support gives access to Hive tables, user-defined functions, Hive serialization and deserialization formats, and the Hive query language (HiveQL).
- Python code requires no change. Scala and Java need the following dependency:

groupId = org.apache.spark
artifactId = spark-hive_2.10
version = 1.3.0
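For sbt builds, the same coordinates would be expressed as follows (a sketch assuming a Scala 2.10 project, so that %% resolves to the spark-hive_2.10 artifact):

```scala
// sbt equivalent of the Maven coordinates above
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0"
```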
Linking with Spark SQL
When programming with Spark SQL there are two entry points:
- HiveContext provides access to HiveQL and other Hive-dependent functionality.
- The more basic SQLContext provides a subset of the Spark SQL support that does not depend on Hive.
Use sqlContext when working in pySpark.
Using Spark SQL in Applications
We construct a HiveContext based on our SparkContext. This context provides additional functions for querying and interacting with Spark SQL data. Using the HiveContext we can build DataFrames, which represent our structured data, and operate on them with SQL or with normal RDD operations such as map().
Initializing Spark SQL
# Import Spark SQL
>>> from pyspark.sql import HiveContext, Row
# Or if you can't include the Hive requirements
>>> from pyspark.sql import SQLContext, Row

Once we've added our imports, we need to create a HiveContext, or a SQLContext if we cannot bring in the Hive dependencies:

>>> sc = SparkContext(...)
>>> hiveCtx = HiveContext(sc)