Big Data SQL: Apache Hive partitioning and bucketing

In most cases, your workloads will be bottlenecked by I/O or the network, so it is wise to choose a format that reduces the amount of data transferred over the wire; CPU cores often sit idle waiting for data to arrive before processing can begin. Depending on your workload pattern and the complexity of your SQL analytic queries, it behooves you to choose the right data format, along with the right compression algorithm, based on whether that algorithm is CPU-bound or
space-bound, that is, whether it favors speed (for example, Snappy) or compression ratio (for example, gzip). Always incorporate partitioning into your data ingestion pipelines; it is the best way to leverage distributed systems and improve throughput. If partitioning would produce too many small partitions, consider bucketing instead, which hashes rows into a fixed number of buckets and so caps the file count. Partitions also help in use cases involving bad data, data-quality issues, or change-data-capture scenarios. Performance tuning is a whole topic by itself, and performance tuning on distributed systems is an even more involved one; much of it also depends on the hardware, the I/O and networking specification, and memory sizing and bandwidth.
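As a minimal sketch of the points above (the table and column names are hypothetical, not from the text), a Hive table can combine a columnar format, compression, partitioning, and bucketing in one DDL statement:

```sql
-- Hypothetical example: ORC storage compressed with Snappy
-- (CPU-light; 'ZLIB' would favor space over CPU instead),
-- partitioned by ingestion date, and bucketed on a high-cardinality
-- key so a busy date does not explode into many tiny files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- A filter on the partition column lets Hive read only the matching
-- partition directories (partition pruning):
SELECT COUNT(*) FROM page_views WHERE dt = '2024-01-15';
```

Because rows are hashed into a fixed number of buckets per partition, bucketing also enables bucketed map-side joins on `user_id` between tables that share the same bucketing scheme.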