Big Data SQL: How Apache Hive Translates SQL into MapReduce

In this section, we will briefly discuss how the Hive server converts a SQL query into a MapReduce program, offering a few examples to help you understand the process.

Note: You will need to be familiar with the fundamentals of the MapReduce framework to follow this section.

Let us start with the most basic kind of SQL query:

Select Col1, Col2, Col3 from Table1 where Col4 = 'X'

This query scans all the rows of a table (a file in Hive), keeps only those rows that have Col4 = 'X', and returns three columns (Col1, Col2, and Col3) of the filtered rows.

This is a Map-only query, in which the Mapper code filters the rows based on the criterion Col4 == 'X' and emits Col1, Col2, and Col3. The Map code, in pseudocode, would look something like this:

Map(k, record) {
    // k is the key and record is the value, in the key-value paradigm of MapReduce
    if (record.Col4 == "X") {
        outRecord = <Col1, Col2, Col3>  // create a record with only the expected output columns
        collect(k, outRecord)           // emit the projected columns under the same key
    }
}
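
For comparison, here is a minimal sketch of what such a map-only job could look like in Java against the Hadoop MapReduce API. It assumes Table1 is stored as a comma-delimited text file with Col1 through Col4 as its first four fields; the class name and record layout are illustrative assumptions, since Hive itself executes a generated operator plan rather than per-query Java classes like this:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter-and-project: keep rows where Col4 == "X", emit Col1-Col3.
public class FilterMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Table1 is assumed to be comma-delimited: Col1,Col2,Col3,Col4
        String[] cols = value.toString().split(",");
        if (cols.length >= 4 && cols[3].equals("X")) {
            // Project only the expected output columns. No Reduce phase is
            // needed, so the driver would call job.setNumReduceTasks(0).
            String outRecord = cols[0] + "," + cols[1] + "," + cols[2];
            context.write(NullWritable.get(), new Text(outRecord));
        }
    }
}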

Let's consider a slightly more complex query, one that involves aggregation operators.

Select Col1, Sum(Col2), Avg(Col3) from Table1 where Col4 = 'X' group by Col1

This query requires some aggregation work (Sum and Average) and, hence, is not a Map-only query; it needs the Reduce side of the framework to produce the correctly aggregated results.

However, the Map side of this query remains very similar to the first one, except that, because we are grouping by Col1, the Map method must emit Col1 as the key. This key allows the MapReduce framework to shuffle all records with the same value of Col1 to the same reducer, which then works through that list of records and calculates the Sum of Col2 and the Average of Col3.

The pseudocode for this would look like this:

Map(k, record) {
    if (record.Col4 == "X") {
        outRecord = <Col2, Col3>    // create a record with only the expected output columns
        collect(Col1, outRecord)    // emit the record with Col1 as the key
    }
}

Reduce(k, listOfRecords) {
    Sum = Avg = 0
    foreach record in listOfRecords {
        Sum += record.Col2
        Avg += record.Col3
    }
    Avg = Avg / length(listOfRecords)
    outputRecord = <Sum, Avg>
    emit(k, outputRecord)
}
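
Again for comparison, a minimal Java sketch of this pattern against the Hadoop MapReduce API might look like the following. As before, the comma-delimited record layout and class names are assumptions for illustration, not what Hive actually generates:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: filter on Col4 and emit Col1 as the shuffle key so that all
// records with the same Col1 value reach the same reducer.
class GroupByMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");  // Col1,Col2,Col3,Col4
        if (cols.length >= 4 && cols[3].equals("X")) {
            // Key = Col1; value = "Col2,Col3" for the reducer to aggregate.
            context.write(new Text(cols[0]), new Text(cols[1] + "," + cols[2]));
        }
    }
}

// Reduce side: called once per distinct Col1 value, with all of its records.
class GroupByReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;        // running Sum(Col2)
        double col3Total = 0;  // running total of Col3, for the average
        long count = 0;
        for (Text value : values) {
            String[] cols = value.toString().split(",");
            sum += Double.parseDouble(cols[0]);
            col3Total += Double.parseDouble(cols[1]);
            count++;
        }
        double avg = col3Total / count;
        context.write(key, new Text(sum + "," + avg));
    }
}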

These were two very simple queries translated to MapReduce. Complex queries involving joins and aggregations often require multiple Mappers and Reducers across multiple stages, where each MapReduce job writes its output to disk (HDFS) and feeds it to the next job in the chain before the final results are available for use.
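
You can see this staging for yourself: Hive's EXPLAIN statement prints the plan for a query, including the stages it compiles to. And as a rough sketch of the chaining mechanism itself, a hand-written two-stage Hadoop driver might be wired together as follows; the paths are hypothetical, and the default identity Mapper/Reducer stand in for the per-stage logic a real plan would supply:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two MapReduce jobs chained through HDFS: stage 1 writes its output to an
// intermediate directory, which stage 2 then reads as its input.
public class ChainedQueryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("/data/table1");            // hypothetical paths
        Path intermediate = new Path("/tmp/query-stage1");
        Path output = new Path("/data/query-result");

        // Stage 1 (e.g., the join). The default identity Mapper/Reducer are
        // used here only to keep the sketch self-contained.
        Job stage1 = Job.getInstance(conf, "query-stage-1");
        stage1.setJarByClass(ChainedQueryDriver.class);
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2 (e.g., the aggregation) reads stage 1's output from disk.
        Job stage2 = Job.getInstance(conf, "query-stage-2");
        stage2.setJarByClass(ChainedQueryDriver.class);
        FileInputFormat.addInputPath(stage2, intermediate);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}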