Is there a proven performance difference in Hive between joins on INT/BIGINT columns and joins on VARCHAR columns?

For many years I have read and heard about the "performance benefits" of database joins on BIGINT columns over joins on (VAR)CHAR columns.

Unfortunately, when searching for real answers or measurements, the similar questions I found share two problems:

  • The examples are set in a "traditional" DBMS context such as MySQL, Oracle, or SQL Server. Take for example this question or this example.
  • The answers are quite old, and the measured runtime difference is not that big. Again, see this example.

I have not seen an example for Hive (preferably version 1.2.1 or higher) in which a large (big-data-ish) data set (say, 500+ million rows) is joined to a data set of similar size on:

  • a BIGINT column
  • VERSUS a (VAR)CHAR(32) column
  • VERSUS a (VAR)CHAR(255) column

I chose size 32 because it is the length of an MD5 hash rendered as characters, and 255 because it is in the range of the largest natural key I have ever seen.

To keep the comparison relevant, I would expect Hive:

  • to run on the Tez execution engine
  • to use a compressed file format such as ORC with ZLib or Snappy
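For reference, here is a minimal sketch of how such a benchmark could be set up. All table and column names are hypothetical, and the `md5()` built-in is an assumption (it only ships with more recent Hive builds; older versions would need a UDF):

```sql
-- Hypothetical benchmark setup: the same data set keyed three ways,
-- stored as ORC with Snappy compression, run under Tez.
SET hive.execution.engine=tez;

CREATE TABLE fact_bigint (join_key BIGINT, payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE fact_char32 (join_key CHAR(32), payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE fact_varchar255 (join_key VARCHAR(255), payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Derive the CHAR(32) key from the BIGINT key, e.g. via an MD5 hash
-- (md5() assumed available; otherwise substitute a hashing UDF):
-- INSERT INTO fact_char32
-- SELECT md5(CAST(join_key AS STRING)), payload FROM fact_bigint;

-- Then compare plans and runtimes of the same join per key type:
EXPLAIN
SELECT count(*)
FROM fact_bigint a
JOIN fact_bigint b
  ON a.join_key = b.join_key;
```

Repeating the `EXPLAIN` and timed run against each of the three tables, with identical row counts and payloads, would isolate the effect of the key type on the shuffle and join stages.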

Does anyone know of such an example, backed by evidence: Hive EXPLAIN plans, CPU usage, file and network I/O, and query execution times?
