BigData SQL: Code Generation

Impala makes extensive use of code generation to improve CPU utilization and reduce latency by taking advantage of the capabilities of modern CPUs. Run Impala on newer systems with more disks, because Impala can use the full bandwidth of the available disks to improve I/O throughput. Nodes with large amounts of memory are also recommended, because Impala benefits from working with data in memory, which often lowers the latency of SQL queries.
Code generation can dramatically improve CPU efficiency and query execution time. Query execution engines typically incur significant overhead in the following areas:
Virtual function calls : Every expression in a SQL query is evaluated again for each row of data the engine processes. Even if the expression itself is simple to evaluate, a general-purpose implementation evaluates it through virtual function calls. Each call adds CPU overhead: the call target is resolved indirectly, the current state is saved on the stack, and the compiler cannot inline the callee. When code generation inlines these calls, the overhead disappears, which can shave valuable time off query execution.
Switch statements : Branch instructions in the query-processing code hinder effective instruction pipelining and instruction-level parallelism, which makes the CPU less efficient. The branch predictor can help in these circumstances, but code generation yields better speedups.
Propagating constant literals : Any constant value used in the query can result in memory lookups by the engine while it executes the query. This adds latency, which can be eliminated if the engine folds these constants into the generated code.
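
The three overheads above can be illustrated with a small sketch. It is written in Python rather than Impala's own C++/LLVM stack, and every name in it (Expr, Col, Const, BinOp, fold, generate) is hypothetical and chosen only for illustration. The interpreted path dispatches a method call per node per row and branches on the operator each time; the generated path folds constants once and emits a single straight-line expression.

```python
from dataclasses import dataclass


class Expr:
    """Base node of a hypothetical expression tree."""

    def eval(self, row):   # interpreted path: dynamic dispatch per node, per row
        raise NotImplementedError

    def gen(self):         # code-generation path: emit source text once
        raise NotImplementedError


@dataclass
class Col(Expr):
    index: int

    def eval(self, row):
        return row[self.index]

    def gen(self):
        return f"row[{self.index}]"


@dataclass
class Const(Expr):
    value: int

    def eval(self, row):
        return self.value

    def gen(self):
        return repr(self.value)


@dataclass
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr

    def eval(self, row):
        lhs, rhs = self.left.eval(row), self.right.eval(row)
        # Branch on the operator for every row, like a switch statement.
        if self.op == "+":
            return lhs + rhs
        if self.op == "*":
            return lhs * rhs
        if self.op == ">":
            return lhs > rhs
        raise ValueError(self.op)

    def gen(self):
        return f"({self.left.gen()} {self.op} {self.right.gen()})"


def fold(e):
    """Replace constant-only subtrees with a single Const node."""
    if isinstance(e, BinOp):
        lhs, rhs = fold(e.left), fold(e.right)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(BinOp(e.op, lhs, rhs).eval(None))
        return BinOp(e.op, lhs, rhs)
    return e


def generate(e):
    """Emit and compile one function for the whole expression."""
    source = f"lambda row: {fold(e).gen()}"
    return eval(compile(source, "<generated>", "eval")), source


# Hypothetical predicate: col0 * (2 + 3) > 40
predicate = BinOp(">", BinOp("*", Col(0), BinOp("+", Const(2), Const(3))), Const(40))
generated, source = generate(predicate)
rows = [(5,), (9,), (12,)]
print(source)                       # lambda row: ((row[0] * 5) > 40)
print([generated(r) for r in rows]) # [False, True, True]
```

Both paths return the same results; the difference is that the generated function contains no per-node dispatch, no operator branch, and no lookup for the constants 2 and 3, which were folded into 5 before any row was processed.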
LLVM : Impala uses LLVM (Low Level Virtual Machine) to speed up queries and remove the overheads described above. You can think of LLVM as analogous to the JVM bytecode generator. LLVM generates optimized code for the program you have written, according to the compiler flags you specify. This kind of optimized code generation is routinely done for code written in C/C++, based on the hardware platform on which it is deployed. The only difference in this case is that LLVM generates optimized code for the SQL query to be executed. LLVM includes a set of libraries that serve as the building blocks of a compiler, and these are used to generate code from the SQL query syntax tree. LLVM provides an API for code generation.
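
To give a feel for what generating code from a syntax tree through an API looks like, here is a loose analogy that uses Python's built-in ast module as a stand-in for LLVM's code generation API. It is not LLVM and not Impala's implementation. The tuple-based tree format and the names OPS, lower, and codegen are assumptions made up for this sketch, and it assumes Python 3.9 or later.

```python
import ast

# Hypothetical syntax tree format for a SQL expression:
#   ("col", name) | ("lit", value) | (operator, left, right)
OPS = {"+": ast.Add, "-": ast.Sub, "*": ast.Mult}


def lower(node):
    """Translate one syntax-tree node into a Python AST expression node."""
    kind = node[0]
    if kind == "col":
        # A column reference becomes a lookup in the `row` mapping.
        return ast.Subscript(
            value=ast.Name(id="row", ctx=ast.Load()),
            slice=ast.Constant(value=node[1]),
            ctx=ast.Load(),
        )
    if kind == "lit":
        return ast.Constant(value=node[1])
    return ast.BinOp(left=lower(node[1]), op=OPS[kind](), right=lower(node[2]))


def codegen(tree):
    """Build `lambda row: <expr>` node by node and compile it."""
    fn = ast.Lambda(
        args=ast.arguments(
            posonlyargs=[],
            args=[ast.arg(arg="row")],
            kwonlyargs=[],
            kw_defaults=[],
            defaults=[],
        ),
        body=lower(tree),
    )
    module = ast.Expression(body=fn)
    ast.fix_missing_locations(module)  # fill in line/column info required by compile()
    return eval(compile(module, "<sql-codegen>", "eval"))


# Hypothetical query expression: price * qty + 10
tree = ("+", ("*", ("col", "price"), ("col", "qty")), ("lit", 10))
total = codegen(tree)
print(total({"price": 3, "qty": 4}))  # 22
```

The shape is the same as in a real code generator: walk the query's syntax tree once, call builder functions to emit the equivalent low-level nodes, and hand the result to a compiler. With LLVM the emitted nodes are IR instructions and the output is optimized machine code for the target CPU.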