Hive has incorporated support for aggregation features for GROUP BY , using grouping sets. Grouping sets is a concise way to implement advanced multiple GROUP BY operations
against the same set of data with multiple UNION and SELECT clauses.
For example, an SQL query with a grouping set looks like the following:
SELECT a,b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)
The preceding SQL is equivalent to the following SQL, with GROUP BY , UNION , and multiple SELECT statements:
SELECT a, null, SUM(c) FROM tab1 GROUP BY a
SELECT null, b, SUM(c) FROM tab1 GROUP BY b
The advantage of using a grouping set as opposed to multiple GROUP BY and UNION clauses is that it completes all processes in one stage of jobs, which is more efficient than
GROUP BY and UNION all having multiple stages.
Hive also includes support for CUBE , and ROLLUP , based queries to be specified directly in the SQL. ROLLUP enables a SELECT statement to calculate multiple levels of aggregations across dimensions. It is an extension of the GROUP BY clause, with high efficiency and low overhead.
The ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension. GROUP BY a, b, c with ROLLUP assumes that the hierarchy is “a” drilling down to “b” drilling down to “c.”
GROUP BY a, b, c, WITH ROLLUP
is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ( ))
CUBE takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE , there will be n 2 combinations of aggregations formed.
GROUP BY a, b, c WITH CUBE
This is equivalent to the following:
GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())