Bigdata SQL: Apache Hive CBO Optimizers

When Hive originally came into the ecosystem, it had a simple rule-based optimizer to translate SQL queries into MapReduce code. Hive relied on some complex hints from the end user embedded in the SQL to address the shortcomings of the rule-based optimizer. These hints would be used to drive selections between a map-side vs. reduce-side join.
Relational databases engines reflect more than 30-plus years of research and development work in query optimization. Starting from version 0.13, Hive introduced cost-based optimization using Apache Calcite. Calcite is an open source, enterprise-grade CBO and query execution framework.
CBO generates efficient execution plans by examining the tables and conditions (filter, join) in the query to reduce latency and resource utilization. Calcite uses a plan pruner to select the cheapest query plan. All SQL queries are converted by Hive to a physical operator tree, which is then converted to MapReduce jobs. This conversion includes SQL parsing and transforming and operator-tree optimization.
Inside the CBO, an SQL query goes through the following four phases of evaluation:
1. Parse and validate query. It generates an AST (abstract syntax tree) for a valid query that contains the operations/operators the code must execute.
2. Generate exhaustive execution plans. It generates logically equivalent plans that yield the same result set.
3. Perform cost evaluation. For each of the preceding plans generated, it calculates the execution cost, based on heuristic statistics (cardinality, selectivity).
4. Choose the best low-cost plan. Two of the biggest challenges for a query optimizer are how to leverage the join ordering and table sizes. Let us look briefly at each of them.