Spark SQL lets applications interact with the Spark engine using SQL and HiveQL. It represents structured data as Spark DataFrames, which are internally represented as Spark RDDs with an associated schema. Developers can mix SQL queries with code written in any of the language bindings Spark supports (Python, Java, Scala, and R), all within a single application. The idea behind providing an SQL layer on top of the Spark engine is to support the following:
• Writing less code
• Reading less code
• Letting the Spark Catalyst optimizer do most of the hard work of deciding where, when, and how to execute the query, so that it runs with low latency and does the least amount of work needed to obtain the results.
The Spark SQL library is composed of the following components:
• Data Source API: an API for loading data from, and saving data to, any data source.
• DataFrame API: a higher-level representation of data, used directly by application developers.
• SQL optimizer: a rule-based optimizer for the data transformations specified in SQL.
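To make "rule-based" concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's optimizer: the node classes `Scan`, `Filter`, and `Project` and both rewrite rules are hypothetical names chosen for illustration. Each rule matches a small pattern in the logical plan and replaces it with an equivalent, cheaper one; the driver applies the rules bottom-up until the plan stops changing.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy logical plan nodes (hypothetical; these are not Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    condition: str
    child: object

@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object

def merge_adjacent_filters(plan):
    """Rule: Filter(a, Filter(b, x)) becomes Filter('(b) AND (a)', x)."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        inner = plan.child
        return Filter(f"({inner.condition}) AND ({plan.condition})", inner.child)
    return plan

def collapse_projects(plan):
    """Rule: Project(cols, Project(_, x)) becomes Project(cols, x).

    Assumes projections only select plain column names (no expressions).
    """
    if isinstance(plan, Project) and isinstance(plan.child, Project):
        return Project(plan.columns, plan.child.child)
    return plan

RULES = (merge_adjacent_filters, collapse_projects)

def _optimize_once(plan):
    # Optimize the child first (bottom-up), then try every rule on this node.
    if isinstance(plan, Filter):
        plan = Filter(plan.condition, _optimize_once(plan.child))
    elif isinstance(plan, Project):
        plan = Project(plan.columns, _optimize_once(plan.child))
    for rule in RULES:
        plan = rule(plan)
    return plan

def optimize(plan):
    """Repeat full passes until a pass leaves the plan unchanged."""
    while True:
        new_plan = _optimize_once(plan)
        if new_plan == plan:
            return plan
        plan = new_plan

plan = Filter("age > 21", Filter("country = 'US'", Scan("people")))
print(optimize(plan))
# Filter(condition="(country = 'US') AND (age > 21)", child=Scan(table='people'))
```

The point of the sketch is the shape of the approach: the optimizer is a list of small, independent rewrite rules plus a loop, so new optimizations can be added without touching the existing ones.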
The Data Source API is a universal API for loading and saving data. It supports data sources such as Hive, NoSQL databases, and flat files, data formats such as Avro, JSON, and Parquet, and relational databases through JDBC. It allows third-party integration through Spark packages. For example, the built-in comma-separated value (CSV) data source allows for:
• Loading/Saving CSV data
• Automatic schema discovery within the CSV file
• Automatic data type inference
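The schema discovery and type inference listed above can be illustrated with a small plain-Python sketch. This is not how Spark implements it; the helper names `infer_type` and `infer_schema` and the three type names are assumptions made for the example. The idea is simply that column names come from the header row and each column gets the narrowest type that fits all of its values.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"  # nothing to infer from, fall back to string
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue  # at least one value does not fit, try a wider type
    return "string"

def infer_schema(csv_text):
    """Read column names from the header row and infer one type per column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    return [(name, infer_type(col)) for name, col in zip(header, columns)]

sample = "name,age,score\nAnn,34,91.5\nBob,29,78\n"
print(infer_schema(sample))
# [('name', 'string'), ('age', 'int'), ('score', 'double')]
```

In PySpark, the corresponding behavior is requested with the `header=True` and `inferSchema=True` options of `spark.read.csv`.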
The Data Source API can also automatically prune columns and push filters down to the source, which are the first steps in optimizing data access.
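A toy sketch of what column pruning and filter pushdown buy, written in plain Python. The class `ToyCsvSource`, its `scan` parameters, and the `cells_returned` counter are all hypothetical and exist only for this illustration; a real Spark data source has a different interface. The sketch compares how many values the source hands back to the engine with and without pushdown.

```python
import csv
import io

class ToyCsvSource:
    """Hypothetical source that accepts required columns and a pushed-down filter."""

    def __init__(self, csv_text):
        self._csv_text = csv_text
        self.cells_returned = 0  # counts values handed back to the engine

    def scan(self, required_columns=None, pushed_filter=None):
        reader = csv.DictReader(io.StringIO(self._csv_text))
        for row in reader:
            # Filter pushdown: drop non-matching rows inside the source.
            if pushed_filter is not None and not pushed_filter(row):
                continue
            # Column pruning: hand back only the requested columns.
            if required_columns is not None:
                row = {c: row[c] for c in required_columns}
            self.cells_returned += len(row)
            yield row

data = "name,age,city\nAnn,34,Oslo\nBob,29,Rome\nCid,41,Oslo\n"

# Without pushdown: the engine receives every cell, then filters and projects.
naive = ToyCsvSource(data)
naive_result = [{"name": r["name"]} for r in naive.scan() if r["city"] == "Oslo"]

# With pushdown: the source filters and prunes before returning anything.
pushed = ToyCsvSource(data)
pushed_result = list(pushed.scan(["name"], lambda r: r["city"] == "Oslo"))

print(naive_result == pushed_result)                # True
print(naive.cells_returned, pushed.cells_returned)  # 9 2
```

Both queries return the same rows, but the pushed-down version moves far less data from the source to the engine. A text format such as CSV still has to be parsed in full; the savings tend to be larger for columnar formats such as Parquet and for JDBC sources, where the source itself can skip unneeded columns and rows.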
The DataFrame is a distributed collection of rows organized into named columns.
A DataFrame is a data structure for structured data, whereas an RDD is a collection of arbitrary objects whose internal structure is opaque to Spark. In other words, a DataFrame is an RDD plus a schema.
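The "RDD plus schema" idea can be sketched in a few lines of plain Python. `ToyDataFrame` below is a hypothetical stand-in, not a Spark class: the row list plays the role of the RDD, and the schema is what lets the engine resolve columns by name instead of treating each row as an opaque object.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A schema-less collection: without more information these are opaque tuples.
rows = [("Ann", 34), ("Bob", 29)]

@dataclass
class ToyDataFrame:
    schema: List[Tuple[str, str]]  # (column name, type name) pairs
    rows: List[tuple]

    def select(self, *names):
        """Resolve column names to positions using the schema, then project."""
        index = {name: i for i, (name, _) in enumerate(self.schema)}
        positions = [index[n] for n in names]
        return ToyDataFrame(
            schema=[self.schema[p] for p in positions],
            rows=[tuple(r[p] for p in positions) for r in self.rows],
        )

people = ToyDataFrame(schema=[("name", "string"), ("age", "int")], rows=rows)
print(people.select("age").rows)    # [(34,), (29,)]
print(people.select("age").schema)  # [('age', 'int')]
```

In PySpark the same relationship is visible in the API: `spark.createDataFrame(rdd, schema)` attaches a schema to an RDD, and the `rdd` property of a DataFrame returns the underlying rows as an RDD.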
The DataFrame API in Spark is inspired by R data frames and the Python pandas library, both of which support processing and wrangling of structured data.
A Data Source API implementation returns DataFrames, which make it possible to combine data from multiple sources and give uniform access from the different language APIs. Having a single data structure allows users to build multiple domain-specific languages (DSLs) targeting different developers, while all such DSLs eventually use the same optimizer and code generator.
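As a rough illustration of several front ends sharing one back end, the plain-Python sketch below builds the same logical plan from a SQL string and from a method-chaining DSL. Everything here is hypothetical (`Plan`, `from_sql`, `Query`), and the regular expression handles only the single query shape `SELECT ... FROM ... [WHERE ...]`; it is not a SQL parser.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Plan:
    """Shared logical plan that every front end produces."""
    table: str
    columns: Tuple[str, ...]
    condition: Optional[str] = None

_SQL = re.compile(
    r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+?))?\s*",
    flags=re.IGNORECASE,
)

def from_sql(text):
    """Front end 1: a (very) restricted SQL string."""
    match = _SQL.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported query: {text!r}")
    columns = tuple(c.strip() for c in match.group(1).split(","))
    return Plan(match.group(2), columns, match.group(3))

class Query:
    """Front end 2: a method-chaining DSL."""

    def __init__(self, table, columns=("*",), condition=None):
        self.plan = Plan(table, tuple(columns), condition)

    def select(self, *columns):
        return Query(self.plan.table, columns, self.plan.condition)

    def where(self, condition):
        return Query(self.plan.table, self.plan.columns, condition)

sql_plan = from_sql("SELECT name FROM people WHERE age > 21")
dsl_plan = Query("people").where("age > 21").select("name").plan
print(sql_plan == dsl_plan)  # True
```

Because both front ends produce the same `Plan`, anything that consumes the plan (an optimizer, a code generator) only has to be written once.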