BigData SQL: Recommendations to Make Impala Queries Faster

The following best practices and empirical rules can help make Impala queries run faster:
• Use numeric types (not strings) when possible, because using string data types can result in higher memory consumption, more storage, and slower processing.
• Opt for a decimal data type rather than a float/double type.
• Identify query access patterns from the different use cases and design the partitioning strategy accordingly. For example:
• Table columns used in WHERE clauses are possible choices for partition keys.
• Dates or spatial boundaries or geography can be good choices for partition keys.
• Keep the total number of partitions under roughly 100K; too many partitions slows metadata loading and query planning.
• If possible, limit the number of columns to fewer than 2K; very wide tables can degrade Hive metastore performance.
• Configure Impala to use Parquet with Snappy compression for the best read performance. If your data receives frequent updates, opt for Avro instead.
• Block size selection is a trade-off: larger blocks yield better throughput but lower parallelism, while smaller blocks do the opposite.
• Tune your memory requirements after gathering query statistics with the EXPLAIN query plan feature. Look at the peak memory usage in the query profile to get better estimates.
• Favor machines with 128GB of RAM and 10GbE network interconnects.
• Use tools such as EXPLAIN, SUMMARY, and PROFILE. EXPLAIN returns the plan fragments without executing the query; SUMMARY gives an overview of the runtime statistics; and PROFILE gives an exhaustive listing of runtime statistics after query execution, for example, the number of rows processed and the amount of memory consumed to run the query.
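As a sketch of the partitioning advice above, a table partitioned by date parts might look like the following (the table and column names are hypothetical):

```sql
-- Hypothetical sales table partitioned by date components, since date
-- columns used in WHERE clauses make good partition keys.
CREATE TABLE sales (
  order_id BIGINT,
  amount   DECIMAL(12,2)   -- DECIMAL preferred over FLOAT/DOUBLE
)
PARTITIONED BY (year SMALLINT, month TINYINT)
STORED AS PARQUET;

-- A query that filters on the partition keys scans only the matching
-- partitions instead of the whole table:
SELECT SUM(amount) FROM sales WHERE year = 2023 AND month = 6;
```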
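The Parquet-plus-Snappy recommendation can be applied per session when writing data. A minimal sketch, assuming a source table named `events_raw`:

```sql
-- Snappy is Impala's default Parquet codec; setting it explicitly here
-- via the standard compression_codec query option.
SET compression_codec=snappy;

-- Materialize the data as compressed Parquet:
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT * FROM events_raw;   -- events_raw is a hypothetical source table
```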
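In impala-shell, the three diagnostic tools mentioned above are used roughly as follows (the query itself is illustrative; SUMMARY and PROFILE are shell commands issued after the query runs):

```sql
-- Show the query plan without executing the query:
EXPLAIN SELECT COUNT(*) FROM sales WHERE year = 2023;

-- Run the query, then inspect its runtime statistics:
SELECT COUNT(*) FROM sales WHERE year = 2023;
SUMMARY;   -- brief per-operator overview of runtime statistics
PROFILE;   -- exhaustive runtime details, including peak memory usage
```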