Bigdata SQL: Recommendations to Speed Up Hive

In the last section of this chapter, we have outlined some of the very high-level approaches one can take to improve Hive performance. The following is by no means an exhaustive list, however.
Here are some recommendations to speed up Hive:
• Use the Tez or Spark engine and not MapReduce. Set the configuration to hive.execution.engine=tez or hive. execution.engine=spark , to use a different engine from the default MapReduce.
• Use Partitions and Bucketing wherever possible, if your data sets allow it and your data processing pipelines make it feasible. This drastically reduces the join time, because, in such cases, Hive only looks at a small fragment of the data sets when it does the join.
• Always enable statistics on Hive tables. This can help the CBO optimizer to build the best plan, based on actual data rather than plain heuristics.
• If you are on the latest version of Hive, make sure hive. vectorized.execution.enabled = true is turned on. Older versions of Hive had some problems with vectorization that are now resolved.
• Use ORC. Hive is optimized to work best with the ORC format, and, wherever possible, convert non-ORC tables to the ORC format before doing analytics or running SQL queries on the data. Using ORC and vectorization provides a lot of speedup with very little effort.
• When you have complex SQL queries that involve subqueries, make sure the subqueries are rewritten to temporary tables before the actual higher level query executes. It has been seen that this pattern always yields better performance than letting Hive work with subqueries directly.