Bigdata SQL: Analytic Functions in Apache Hive

Analytic/window functions are a category of functions that are like aggregate functions but scan multiple input rows to compute their outputs. Unlike GROUP BY , which shows
one result per group, analytic functions operate on groups of rows, called windows, expressed in an OVER clause as conditions.
Analytic capabilities are critical to a variety of data integration and analytic functions. Most relational databases have support for window and analytic functions
in their SQL repertoire. Starting from Hive version 0.13, Hive has been steadily adding support for common analytic window functions. This allows Hive to move beyond just
batch processing to a realm where it can be used to support complex analytics queries.
One minor but very important point to remember with regard to analytic functions is that they are evaluated after joins , WHERE , and GROUP BY clauses have been applied. Analytic functions are heavily used in finance and time series analytics, to provide trend, outlier, seasonality, and bucketed analysis. With the advent of IoT (Internet of Things)
and streaming data, applying analytic functions on an IoT data stream becomes a very important use case.
The typical syntax of an analytic query looks as follows:

SELECT function() OVER ( w )
FROM …
WINDOW w as ([ PARTITION BY … ]
[ ORDER BY … ]
[ ROWS|RANGE BETWEEN … ]) ;

Following is what each clause indicates:
• PARTITION BY separates the data into distinct separable groups, similar to GROUP BY .
• ORDER BY describes how the data should be ordered within a partition.
• RANGE|ROWS BETWEEN describes which rows within the partition apply to the current calculation. The default is all the rows prior to the current one.
• WINDOW enables creation of an alias for a particular window specification, so that it can be simply referenced in multiple places within the query.
• Analytic queries enable many types of calculations that would be difficult to express with plain SQL that involves calculations across rows.
• The OVER clause is required for calls to analytic functions such as LEAD() , RANK() , and FIRST_VALUE() .
• PARTITION BY is like the GROUP BY clause in the outermost block of a query. It divides rows into groups having identical values in
the columns concerned. These logical groupings are known as partitions.