The previous explanations are missing a few details. To better understand how partitioning and bucketing work, you should look at how data is stored in Hive. Let's say you have a table
CREATE TABLE mytable (
    name STRING,
    city STRING,
    employee_id INT
)
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS;
then Hive will store the data in a directory hierarchy with one level per partition column, like
/user/hive/warehouse/mytable/year=2015/month=12/day=02
So you have to be careful when partitioning: if, for example, you partition by employee_id and you have millions of employees, you will end up with millions of directories in your file system. The term "cardinality" refers to the number of distinct values a field can take. For example, a "country" field has a cardinality of roughly 200, since that is about how many countries there are in the world. For a field like timestamp_ms, which changes every millisecond, the cardinality can be in the billions. In general, the field you choose for partitioning should not have high cardinality, because otherwise you will end up with far too many directories in your file system.
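To make the directory growth concrete, here is a minimal Python sketch. It builds a partition path using Hive's column=value directory naming and computes an upper bound on the number of leaf directories as the product of the number of distinct values of each partition column. The helper names partition_dir and max_leaf_dirs are hypothetical, and the figure of 5,000,000 employees is an assumption standing in for "millions".

```python
from math import prod

WAREHOUSE = "/user/hive/warehouse"  # warehouse root, taken from the example path


def partition_dir(table, spec):
    """Build the directory of one partition.

    spec is an ordered list of (partition_column, value) pairs.
    """
    levels = "/".join(f"{col}={val}" for col, val in spec)
    return f"{WAREHOUSE}/{table}/{levels}"


def max_leaf_dirs(distinct_values_per_column):
    """Upper bound on leaf directories: product of the per-column distinct counts."""
    return prod(distinct_values_per_column)


one_day = partition_dir("mytable", [("year", "2015"), ("month", "12"), ("day", "02")])
print(one_day)  # /user/hive/warehouse/mytable/year=2015/month=12/day=02

# One year of data partitioned by year/month/day: at most 1 * 12 * 31 directories.
per_year = max_leaf_dirs([1, 12, 31])
print(per_year)  # 372

# Partitioning by employee_id instead (assumed 5,000,000 employees).
by_employee = max_leaf_dirs([5_000_000])
print(by_employee)  # 5000000
```

A date-based scheme grows by a few hundred directories per year, while a high-cardinality key creates one directory per distinct value from the start.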
Clustering, also known as bucketing, on the other hand results in a fixed number of files, because you specify the number of buckets. Hive takes the field, calculates a hash, and assigns the record to the corresponding bucket. But what happens if you use, say, 256 buckets and the field you are bucketing on has low cardinality (for example, a US state, so there can only be 50 different values)? You will have at most 50 buckets with data and at least 206 buckets with no data.
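The hash-then-assign step can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 is a stand-in hash chosen because it is deterministic across runs, not the hash function Hive actually uses, and the state names and the 100,000 employee ids are made-up sample data.

```python
import zlib

NUM_BUCKETS = 256


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Hash the value, then take the remainder to get a bucket number.

    zlib.crc32 is a stand-in hash, NOT Hive's own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def non_empty_buckets(values, num_buckets=NUM_BUCKETS):
    """Count how many distinct buckets receive at least one value."""
    return len({bucket_for(v, num_buckets) for v in values})


states = [f"state_{i:02d}" for i in range(50)]  # 50 distinct placeholder values
employee_ids = range(100_000)                   # assumed high-cardinality key

low = non_empty_buckets(states)         # can never exceed 50
high = non_empty_buckets(employee_ids)  # spreads over far more buckets

print(f"buckets used by 50 states:     {low} of {NUM_BUCKETS}")
print(f"buckets used by employee ids:  {high} of {NUM_BUCKETS}")
```

Whatever the hash, equal values always land in the same bucket, so a field with 50 distinct values can fill at most 50 buckets, and fewer if two values collide.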
Someone already mentioned how partitions can dramatically cut the amount of data you are querying. In my example table, if you want to query only from a certain date forward, partitioning by year/month/day will dramatically reduce the amount of I/O. I think someone also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing. In my example, if you join two tables on the same employee_id, Hive can do the join bucket by bucket. This is even better if the buckets are already sorted by employee_id, because the join then only has to merge parts that are already sorted, which works in linear time, i.e. O(n).
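Both effects can be mimicked in plain Python. The sketch below is a simplified model, not Hive's implementation: partitions are represented as (year, month, day) integer tuples, bucket numbers are computed as key % 4 for readability, and each employee_id is assumed to appear at most once per table. All function names and sample rows are hypothetical.

```python
from datetime import date

NUM_BUCKETS = 4  # small number so the example is easy to follow


def prune_partitions(partitions, start):
    """Keep only the (year, month, day) partitions on or after `start`.

    The dropped partitions are never read, which is where the I/O saving comes from.
    """
    return [p for p in partitions if date(*p) >= start]


def to_buckets(rows):
    """Split (key, payload) rows into buckets, each bucket sorted by key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in sorted(rows):
        buckets[row[0] % NUM_BUCKETS].append(row)
    return buckets


def merge_join(left, right):
    """Join two key-sorted lists in one forward pass: O(len(left) + len(right)).

    Assumes each key appears at most once per side.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out


def bucketed_join(left_buckets, right_buckets):
    """Join bucket k of one table only against bucket k of the other."""
    joined = []
    for lb, rb in zip(left_buckets, right_buckets):
        joined.extend(merge_join(lb, rb))
    return joined


partitions = [(2015, 11, 30), (2015, 12, 1), (2015, 12, 2)]
kept = prune_partitions(partitions, date(2015, 12, 1))
print(kept)  # [(2015, 12, 1), (2015, 12, 2)]

employees = [(7, "Ann"), (2, "Bob"), (5, "Cy"), (8, "Di")]
salaries = [(5, 100), (8, 200), (3, 300), (7, 400)]
result = bucketed_join(to_buckets(employees), to_buckets(salaries))
print(sorted(result))  # [(5, 'Cy', 100), (7, 'Ann', 400), (8, 'Di', 200)]
```

Because matching keys always hash to the same bucket number in both tables, no bucket ever needs to be compared with a differently numbered bucket of the other table.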
So bucketing works well when the field has high cardinality and the data is evenly distributed among the buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
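In addition, you can partition on several fields, in order (year/month/day is a good example), and each field adds a directory level. Bucketing has no such hierarchy: a table gets a single hash key from its CLUSTERED BY clause, and if that clause lists more than one column, they are hashed together into one bucket number.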