Bigdata SQL: Apache Hive Optimization by Using a Broadcast Join

Let s consider a simple SQL query with no aggregation and involving a relatively smaller table in the right-hand side of the JOIN clause (Inventory Table).

SELECT sales.item, sales.quantity,inventory.quantityleft
FROM sales JOIN inventory
ON (sales.productId = inventory.productID)

However, with the optimizations by the Tez engine, the same SQL query is executed efficiently, by broadcasting the smaller table to all the mapper tasks in the next stage, and there are no intermediate writes to disk

Let s take another SQL query, with aggregation and group by clause, that involves a relatively smaller table at the right-hand side of the JOIN clause.

SELECT SalesTable.item, SalesTable.quantity, SalesTable.AvgPrice, inventory.
quantityLeft
FROM (select item, quantity, avg(sales.sellPrice) as AvgPrice, FROM Sales
GROUPBY item, quantity) SalesTable
JOIN Inventory
ON (sales.itemId = inventory.itemID)

Without Tez, the SQL query would be executed using the MapReduce code. However, with the optimizations by the Tez engine, the same SQL query is executed efficiently, by broadcasting the smaller table to all the mapper tasks in the next stage, and there are no intermediate writes to disk