Big Data SQL:SQL Enhancements and Impala Shortcomings

Let us look into why Impala and Hive show different performance characteristics in the response time to the same data. According to Cloudera, the following are the two main reasons for Impalas good performance:
1. Impala reduces CPU load by multiple optimization, as LLVM and by working on special chipsets, as compared to Hive, and, hence, can increase its I/O bandwidth, which is why it has
better performance than Hive for pure I/O-bound workloads.
2. For long and complex queries, Hive has a multistage MR pipeline that results in multiple stages of reads and writes to disks, resulting in slowdowns. Impala avoids this by having a
totally different engine that does not rely on MR but pipelines the data between the nodes, resulting in efficient usage of the intra-cluster network bandwidth.
Impala version 2.3 has built-in support for querying complex types in Parquet format: ARRAY , MAP , and STRUCT s. SQL queries can work directly on the nested data sets without the need to flatten the data before querying. Impala supports queries on complex data types, using join syntax rather than explode() , as in Hive. Though the latest version of Impala has very good
overage of SQL queries in terms of its compatibility to SQL-2003 standards, Impala does still lag behind Hive in support of SQL support. Impala needs the Hive metastore to function. This can be a problem for certain deployments that do not need or use Hive. Most HiveQL SELECT and INSERT statements run unmodified with Impala. Hive functionality related to TRANSFORM,
JSON, XML, and SerDe is not available in Impala. Some of the aggregate functions in Hive, such as covar* and percentile* are named differently in Impala.
Impala and Hive share the same metastore database, and their tables are often used interchangeably. Impala s SQL syntax follows the SQL-92 standard and includes many industry extensions in areas such as built-in functions. UDFs in Impala are written in C++, which makes them even faster. Impala does support scalar UDFs and UDAFs but currently does not support UDTFs (table functions). Impala only supports single-column distinct count queries. Impala can query data residing in HDFS or HBase. Impalas support for querying from JSON files is very new.