Bigdata SQL: Partitioning

Partititioning, especially in the context of Hive, allows for division of the data in a table by one or more columns, based on the values of the  partitioned columns. This segregates the input records into different files/directories, based on the column values. A simple query in Hive scans the whole table. This is a slow process, which could be speeded up by creating Hive partitions. When the table is queried, if the partitioned clause is used in the WHERE clause, only the required partitions (directory) is  scanned, thereby reducing the I/O costs, by avoiding reading of data that is known not to satisfy the query, based on the partitioned column.

• As the data is sliced in partitions across directories, a query is faster to process the partitioned part of the data, instead of doing
a full scan.

•Having too many partitions creates large numbers of files and directories in HDFS, which is an overhead to Name Node, because it must keep all metadata for the file system in memory.

• Partitions optimize queries based on WHERE clauses but are less useful for other queries not involving the partition column in the WHERE clause.