BigData SQL: Analytic Query Support in Spark SQL

Spark has supported analytic functions since version 1.4. The addition of windowing functions improves the expressiveness of both the DataFrame API and Spark SQL. Spark supports three basic kinds of analytic functions: ranking functions (rank, dense_rank, ntile, row_number), analytic functions (first_value, last_value, lead, lag), and aggregate functions.
In Spark SQL, window functions are used with the OVER clause.
In the DataFrame API, the function name is followed by a call to the over method, which takes a window specification.
Inside the OVER clause is the window specification, which consists of three parts: partitioning (which rows belong to the same partition as the current row), ordering (how the rows within a partition are ordered), and the frame (which rows are included, relative to the current row).
In SQL parlance, it looks something like OVER (PARTITION BY … ORDER BY … frame_type BETWEEN start AND end).
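For example, such a query can be submitted through spark.sql(). In the following minimal sketch, a SparkSession named spark is assumed, and the employees table with its name, dept, and salary columns is hypothetical:

# Rank employees by salary within each department (hypothetical table).
ranked = spark.sql("""
    SELECT name, dept, salary,
           rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees
""")
ranked.show()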
In the DataFrame API, the equivalent looks like the following:
windowSpec = Window.partitionBy(…).orderBy(…)
windowSpec = windowSpec.rowsBetween(start, end)
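For instance, the ranking query shown earlier translates to the DataFrame API as in the following sketch (the employees DataFrame and its columns are again hypothetical):

from pyspark.sql import Window
from pyspark.sql.functions import rank, desc

# Partition by department, order by salary descending, then rank each row.
windowSpec = Window.partitionBy("dept").orderBy(desc("salary"))
employees.withColumn("salary_rank", rank().over(windowSpec)).show()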
The frame specification is the most detailed part, with many options: the frame_type can be either ROWS or RANGE.
The start boundary can be UNBOUNDED PRECEDING, CURRENT ROW, or an offset with PRECEDING or FOLLOWING; the end boundary can be UNBOUNDED FOLLOWING, CURRENT ROW, or an offset with PRECEDING or FOLLOWING.
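To make the frame boundaries concrete, the following sketch computes a running total of salary per department, using the Window.unboundedPreceding and Window.currentRow constants available in Spark 2.1 and later (the employees DataFrame is the same hypothetical one as before):

from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_

# Frame covers all rows from the start of the partition up to the current row.
runningSpec = (Window.partitionBy("dept")
                     .orderBy("salary")
                     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
employees.withColumn("running_total", sum_("salary").over(runningSpec)).show()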
Spark 2.0 introduced built-in support for time windows. These behave very similarly to the time windows in Spark Streaming.
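As a minimal sketch, assuming a DataFrame events with a timestamp column and a numeric value column, a sliding time-window aggregation looks like this:

from pyspark.sql.functions import window, avg

# Ten-minute windows that slide every five minutes, averaged per window.
(events
 .groupBy(window("timestamp", "10 minutes", "5 minutes"))
 .agg(avg("value"))
 .show(truncate=False))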