Impala uses its unique distributed query engine to minimize response time. This distributed query engine is installed on all data nodes in the cluster.
Impala was built on two architectural tenets. One was to make Impala extremely fast, to support the low-latency query requirement for SQL on large data sets. The other was to ensure its linear scalability. Impala optimizes CPU usage to return query results in the shortest possible time. The Impala documentation recommends servers with large memory, preferably 96GB+, and modern chipsets, such as Intel Sandy Bridge, because Impala heavily leverages modern CPU instructions for fast code generation and for speeding up queries on large data sets.
In this section, we discuss the main components that make Impala a low-latency SQL query engine.
Impala comprises three main daemons (long-running processes) that provide the functionality Impala needs to process a query. The Impala daemons are installed on each of the data nodes during the installation phase. Impala daemons run on the same nodes where the data to be queried resides, preserving data locality. Each Impala daemon is an isolated process in a shared-nothing architecture.
The Impala node to which the clients (e.g., impala-shell) are connected plays the role of query planner/coordinator, while the other nodes act as query execution engines. In other words, one Impala daemon acts as the leader, while the Impala daemons running on the other Hadoop data nodes act as execution engines. The main components of the Impala engine are outlined below.
impalad : The query engine, the brain behind Impala. Its function is to process queries using all the optimization rules built into the engine, so that the data is accessed and processed in the most efficient manner. Impala relies heavily on the data distribution and block placement capabilities of HDFS to ensure data locality for each impalad. The impalad process has three components: the Query Planner, the Query Coordinator, and the Query Executor. The Query Planner syntactically and semantically validates the query, transforms it into a logical and a physical plan, and, finally, compiles the plan into a distributed physical query plan made up of query fragments, which the Query Coordinator hands out for execution. impalad processes the data blocks on the data node where it is running and reads the data directly from the local disk. This minimizes network load and benefits from the file cache on the data nodes.
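The planner's decomposition of a query into per-node fragments can be sketched as follows. This is a simplified illustration under stated assumptions, not Impala's actual planner code: the `PlanFragment` structure, node names, and block placement are invented for the example.

```python
# Toy sketch of how a validated query might be split into plan
# fragments, one per data node holding blocks of the scanned table.
# All names and structures here are illustrative, not Impala internals.

from dataclasses import dataclass

@dataclass
class PlanFragment:
    node: str          # data node that will execute this fragment
    operator: str      # root operator of the fragment
    scan_range: tuple  # (first_block, last_block) to read locally

def plan_query(table_blocks: dict) -> list:
    """Produce one scan fragment per data node, preserving data
    locality: each fragment reads only the blocks stored on its node."""
    return [
        PlanFragment(node=node, operator="SCAN",
                     scan_range=(blocks[0], blocks[-1]))
        for node, blocks in sorted(table_blocks.items())
    ]

# Hypothetical HDFS block placement for one table.
placement = {"node1": [0, 1], "node2": [2, 3], "node3": [4, 5]}
fragments = plan_query(placement)
for f in fragments:
    print(f.node, f.operator, f.scan_range)
```

The point of the sketch is the locality rule: a fragment is shipped to the node that already stores the blocks it scans, so no block crosses the network during the scan.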
statestored : This daemon maintains the status of the other Impala daemons running on the data nodes, including information about the health of each node. It monitors the health of impalad on all the nodes in a cluster. If an impalad daemon becomes unresponsive, the statestore daemon notifies the other nodes in the cluster, and subsequent queries do not involve the unresponsive Impala node. Only one instance of this daemon runs in the cluster.
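The liveness tracking statestored performs can be sketched as heartbeat bookkeeping. This is a minimal conceptual model with invented timing values and method names, not Impala's actual statestore protocol:

```python
# Sketch of statestored-style liveness tracking: nodes that miss
# heartbeats beyond a timeout are treated as unresponsive and
# excluded from subsequent query scheduling.

class StateStore:
    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s
        self.last_beat = {}  # node -> timestamp of last heartbeat

    def heartbeat(self, node: str, now: float):
        self.last_beat[node] = now

    def live_nodes(self, now: float) -> list:
        """Nodes eligible for scheduling: heartbeat within timeout."""
        return sorted(n for n, t in self.last_beat.items()
                      if now - t <= self.timeout_s)

store = StateStore(timeout_s=3.0)
store.heartbeat("node1", now=100.0)
store.heartbeat("node2", now=100.0)
store.heartbeat("node3", now=95.0)   # stale: stopped responding
print(store.live_nodes(now=101.0))   # node3 is excluded
```

The consequence in the text follows directly: once a node drops out of the live set, the coordinator simply stops assigning fragments to it.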
catalog server : This daemon synchronizes table metadata with the impalad processes, guaranteeing that the metadata of every impalad stays in sync. There is only one instance of this daemon running in the cluster.
Queries can be submitted to Impala through either Impala Shell or JDBC/ODBC drivers. Once a query is submitted, the query planner turns the query request into a collection of plan fragments, and the coordinator then initiates execution on remote impalad processes. Intermediate results are streamed between impalad processes before the final results are streamed back to the client.
The coordinator orchestrates interactions between the impalad processes across all the data nodes and aggregates the results produced by each data node. impalad also exposes a remote procedure call (RPC) interface that other impalad processes use to connect and exchange data. This interface also allows the coordinator to assign work to each impalad.
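The coordinator's role can be sketched as a scatter/gather loop. In this hedged sketch the "RPC" is just a local function call, and a distributed COUNT is used as the example aggregate; the data and function names are invented for illustration:

```python
# Sketch of coordinator-style scatter/gather: the coordinator fans a
# fragment out to each impalad and aggregates the partial results.
# In Impala the fan-out happens over the impalad RPC interface; here
# it is a plain function call for illustration.

def execute_fragment(local_rows, predicate):
    """What a remote impalad would do: scan its local rows and
    return a partial aggregate (here, a count of matching rows)."""
    return sum(1 for r in local_rows if predicate(r))

def coordinate(node_data, predicate):
    """Coordinator role: one fragment per node, then combine the
    partial counts into the final answer."""
    partials = [execute_fragment(rows, predicate)
                for rows in node_data.values()]
    return sum(partials)

# Rows stored on each data node (illustrative data).
node_data = {
    "node1": [3, 8, 15],
    "node2": [22, 4],
    "node3": [9, 30, 1],
}
# Distributed COUNT(*) of rows with value > 5.
total = coordinate(node_data, lambda v: v > 5)
print(total)
```

Only the small partial counts cross node boundaries, which is why streaming intermediate results between impalad processes is cheap relative to shipping raw data.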
After the client submits a query, the usual steps of query validation, syntax analysis, and semantic analysis are performed before the query is optimized by the query engine. Syntactic validation ensures that there are no errors in the user's query; semantic validation ensures that the query makes sense. The metadata for the query comes from the Hive metastore. After this step, query planning occurs, whereby Impala works out the best way to solve the query. The EXPLAIN statement dumps this reasoning: it provides an outline of the steps that impalad will perform and the relevant details of how the workload will be distributed among the nodes.
Optimization involves generating the best physical plan for the query, both in terms of execution cost and in terms of code generation for faster execution on the hardware. The optimized query is submitted to the coordinator process, which orchestrates the query execution. When a query executes, the coordinator orchestrates interactions between the Impala nodes and, once the result is available from each impalad process, aggregates the results. The coordinator is part of the impalad process and resides on every node, so any node can act as the query coordinator. The coordinator assigns work units to the impalad processes. Once the work is assigned, each impalad process works with the storage engine and applies optimizations, for example, constant folding and predicate pushdown, so as to extract only the portions of the data that are really needed to satisfy the query.
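The two optimizations named above can be illustrated in miniature. These are generic query-optimization techniques shown in a deliberately simplified form; the functions below are invented for the example and are not Impala code:

```python
# Constant folding: evaluate constant subexpressions once, at plan
# time, instead of per row. E.g. "price > 10 * 2" becomes "price > 20".
def fold_constants(expr):
    """expr is a constant subexpression as ("op", a, b); fold it to
    a single literal value."""
    op, a, b = expr
    return a * b if op == "*" else a + b

# Predicate pushdown: apply the filter at the scan itself, so only
# matching rows ever leave the storage layer.
def scan_with_pushdown(rows, threshold):
    return [r for r in rows if r > threshold]

threshold = fold_constants(("*", 10, 2))   # folded once, at plan time
rows = [5, 25, 20, 40, 19]                 # rows on one data node
result = scan_with_pushdown(rows, threshold)
print(threshold, result)
```

Both serve the same goal stated in the text: do the cheap work once during planning, and filter as close to the data as possible so upstream fragments see only relevant rows.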
The execution engine (executor process) in the impalad daemon executes the optimized query by reading from the data source at high speed. It leverages all the disks and their controllers to read at optimal speed and executes the query fragments that have been optimized by the code optimizer, which includes LLVM-based code generation. An Impala executor serves hundreds of plan fragments at any given time. Impala does nothing special for failover: because HDFS provides failover through replication, if the impalad daemon is installed on the nodes holding the replicas, Impala will seamlessly start using the impalad on those nodes.