What is Bucketing and Clustering in Hive? How is it different from partitioning?
Bucketing (also called clustering) is the process in Hive of decomposing a table's data into more manageable parts. A row's bucket is computed as hash_function(bucketing column) mod (number of buckets). The number of buckets is specified when the bucketed table is created.
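To make the formula concrete, here is a minimal Python sketch of bucket assignment. It assumes a Java-style string hash with the sign bit cleared before the modulus; the helper names (java_string_hash, bucket_for) are made up for illustration, and Hive's real hash depends on the column type and Hive version.

```python
def java_string_hash(s: str) -> int:
    """Java-style String.hashCode(), returned as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the running value in 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value: str, num_buckets: int) -> int:
    # Clear the sign bit so the result is never negative, then take the modulus.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


# Hypothetical sample values for a "state" bucketing column.
states = ["Texas", "Ohio", "Utah", "Texas"]
assignments = [(s, bucket_for(s, 4)) for s in states]

for state, bucket in assignments:
    print(f"{state} -> bucket {bucket}")
```

The key property is that the same value always maps to the same bucket, which is why both "Texas" rows end up together.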
In Hive, a table can be divided into partitions, and each partition can be further subdivided into more manageable parts known as buckets (or clusters). Records with the same value in the bucketing column are always stored in the same bucket. The "clustered by" clause is used to divide the table into buckets. Each bucket is saved as a file under the table directory (or under the partition directory, if the table is partitioned). Bucketing can be used on Hive tables with or without partitioning. When the bucketing column's values are spread evenly, the resulting bucket files are almost equal in size. We can also sort the records in each bucket by one or more columns using "sorted by". Map-side joins between tables bucketed on the join column can be faster, since matching keys are already grouped into corresponding, similarly sized files.
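Setting the property hive.enforce.bucketing = true makes Hive enforce bucketing when data is inserted into the table: it sets the number of reducers equal to the number of buckets specified, so each reducer writes one bucket file. In Hive 2.x and later this property was removed and bucketing is always enforced.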
Below is an example that creates a bucketed table.
Eg:
create table bucketed_table (ID int, name varchar(64), state varchar(64), city varchar(64))
partitioned by (country varchar(64))
clustered by (state) sorted by (city) into 4 buckets
row format delimited fields terminated by ',';
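Here, within each country partition, records are clustered by state: all rows for a given state go to the same bucket file (with only 4 buckets, several states may share one), and rows inside each bucket are sorted by city.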
Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging when the original data sets are very large.
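A rough Python sketch of why sampling is cheap on a bucketed table: the rows are already split into bucket "files", so taking one bucket out of four means reading a single file instead of scanning everything. The row data and names here are invented for illustration, and the sketch assumes an integer bucketing column hashes to its own value. In Hive itself this is done with the TABLESAMPLE clause, for example TABLESAMPLE(BUCKET 1 OUT OF 4 ON id).

```python
NUM_BUCKETS = 4

# Hypothetical table of 1000 rows, bucketed on the integer "id" column.
rows = [{"id": i, "name": f"user{i}"} for i in range(1, 1001)]


def bucket_of(row_id: int) -> int:
    # Assumption: an int column hashes to its own value.
    return row_id % NUM_BUCKETS


# Writing the table: each row is routed to exactly one bucket "file".
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_of(row["id"])].append(row)

# Sampling: read one bucket only, roughly a quarter of the data.
sample = buckets[0]
print(len(sample), "of", len(rows), "rows sampled")
```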
How is it different from partitioning?
In partitioning, a table is divided by creating a directory for each partition, whereas in bucketing, the data is divided into files. Partitions are usually not equal in size, because their sizes depend on how many rows share each partition value. Buckets, on the other hand, are almost equally sized, and the number of buckets is fixed when the bucketed table is created.
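The size difference can be illustrated with a small Python sketch. The data is made up (1000 rows with a deliberately skewed country column), and the sketch assumes an integer id hashes to its own value, so bucket = id mod 4.

```python
from collections import Counter

# Hypothetical rows: (id, country), with most rows belonging to one country.
rows = (
    [(i, "US") for i in range(1, 701)]
    + [(i, "IN") for i in range(701, 951)]
    + [(i, "FR") for i in range(951, 1001)]
)

# Partitioning by country: one directory per distinct value, sizes follow the data.
partition_sizes = Counter(country for _, country in rows)

# Bucketing by id into 4 buckets: sizes follow the hash, not the country skew.
bucket_sizes = Counter(row_id % 4 for row_id, _ in rows)

print("partitions:", dict(partition_sizes))
print("buckets:   ", dict(bucket_sizes))
```

The buckets come out even here only because sequential ids spread evenly over the hash; a bucketing column with heavily repeated values would still produce lopsided buckets.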
For more details, see: Apache Hive - In Depth Hive Tutorial for Beginners - DataFlair
I also recommend taking ESS 440 - Apache Hive Essentials | MapR.