Big Data SQL: Apache Hive Bucketing

Bucketing is another technique for clustering data sets into more manageable parts in order to optimize query performance. It differs from partitioning in that each bucket corresponds to a segment of files in HDFS rather than a directory. Overly granular partitioning can produce many small partitions and directories, which increases the number of files, slows down the NameNode, and increases its overhead.

Bucketing can be used on tables with or without partitions. The value of the bucket column is hashed into a user-defined number of buckets, so records with the same bucket-column value are stored in the same bucket (segment of files). Buckets help Hive perform sampling and map-side joins efficiently. Buckets are essentially files in the leaf-level directories, each holding the records whose bucket-column values hash to the same bucket. Hive users can specify the number of buckets per partition or per table.
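As a minimal sketch in HiveQL, a table can be both partitioned and bucketed; the table and column names below (page_views, user_id) are hypothetical examples, not from the original text:

```sql
-- Each date partition is split into 32 buckets by hashing user_id;
-- rows whose user_id hashes to the same value land in the same
-- bucket file within the partition directory.
CREATE TABLE page_views (
    user_id BIGINT,
    url     STRING,
    ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Because rows are placed deterministically, Hive can sample a single bucket instead of scanning the whole table, for example with TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id).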
When tables have been bucketed, it is important that the number of reduce tasks equals the number of buckets. This can be done by setting the number of reduce tasks explicitly for every job, or by letting Hive bucket the data automatically by setting the hive.enforce.bucketing property to true.
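The two approaches above can be sketched as the following HiveQL session; it assumes the hypothetical 32-bucket page_views table and a hypothetical staging table raw_page_views, and the reducer property name varies by Hive/Hadoop version (mapred.reduce.tasks in older releases):

```sql
-- Option 1: set the reducer count explicitly to match the bucket count.
SET mapred.reduce.tasks = 32;

-- Option 2: let Hive pick the reducer count to match the buckets.
SET hive.enforce.bucketing = true;

-- Load one partition; with enforcement on, Hive hashes user_id
-- into 32 bucket files during this insert.
INSERT OVERWRITE TABLE page_views PARTITION (dt = '2024-01-01')
SELECT user_id, url, ts
FROM raw_page_views
WHERE dt = '2024-01-01';
```

Note that in Hive 2.0 and later, bucketing enforcement is always on and the hive.enforce.bucketing property was removed, so the explicit SET is only needed on older versions.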