Over the last few releases, starting with version 0.10, Hive has gone through multiple changes focused primarily on improving the performance of SQL queries. Some of these changes have resulted in 100–150x speed improvements. Hive was originally built around a one-size-fits-all MapReduce execution framework that was optimized for batch execution. While this solution worked for some workloads and applications, it turned out to be inefficient for many others, especially those that chained multiple MapReduce jobs or ran machine learning algorithms requiring iterative data processing.
Apache Tez, a new engine that works closely with YARN, generalizes the MapReduce paradigm with a framework based on expressing computations as a dataflow graph. Tez is not meant to be used directly by end users, but it enables developers to build end-user applications with much better performance and flexibility. By reducing latency, Tez helps Hive support both interactive and batch queries.
Hive has a setting, hive.execution.engine, that specifies the type of engine Hive should use to execute SQL queries. By default, Hive will revert to using plain MapReduce with no optimization. If set to tez, Hive will use the Tez engine, which in most cases results in queries executing faster.
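As a minimal sketch, the engine can be switched per session from the Hive shell or Beeline. The property name hive.execution.engine and its values mr and tez come from the Hive configuration reference rather than from the text above, and the default value can differ between Hive versions and distributions.

```sql
-- Show the engine currently in effect for this session.
SET hive.execution.engine;

-- Run subsequent queries in this session on Tez.
SET hive.execution.engine=tez;

-- Switch back to plain MapReduce, for example to compare run times.
SET hive.execution.engine=mr;
```

A SET issued this way lasts only for the current session. A cluster-wide default would normally go in hive-site.xml instead.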
The following sections offer a deeper look at some of these optimizations. Among the goals of Tez are low latency for interactive queries and improved throughput for batch queries. Tez optimizes Hive jobs by eliminating synchronization barriers and minimizing reads and writes to HDFS. Other changes include vectorization of queries, ORC columnar format support, use of a cost-based optimizer (CBO), and incorporation of LLAP (Live Long and Process) functionality. Pipelined execution of the reducer without writing intermediate results to disk, vectorized query execution, and other novel approaches to query execution (some of which are discussed later in this chapter) fall outside the scope and capabilities of the pure MapReduce framework.
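The features listed above are typically controlled by table storage format and session settings. The following is a sketch only: the table page_views_orc and its columns are hypothetical, the property names are taken from the Hive configuration reference, and their defaults vary by Hive version. LLAP is left out because it depends on separately provisioned daemons.

```sql
-- Hypothetical table, used only for illustration, stored in the ORC columnar format.
CREATE TABLE page_views_orc (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- Process rows in batches rather than one at a time (vectorized execution).
SET hive.vectorized.execution.enabled=true;

-- Let the cost-based optimizer (CBO) choose the query plan.
SET hive.cbo.enable=true;

-- The CBO relies on statistics, so gather them for the table and its columns.
ANALYZE TABLE page_views_orc COMPUTE STATISTICS;
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS user_id, url;
```

Without up-to-date statistics the cost-based optimizer has little to work with, which is why the ANALYZE statements accompany the CBO setting here.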
Startup costs of Java virtual machines (JVMs) have always been a key bottleneck in Hive, where the JVM started for each mapper or reducer can often take up to 100 ms or so.