Is there a proven performance difference in Hive between joins on INT/BIGINT columns and joins on VARCHAR columns?

For many years I have read and heard about the "performance benefits" of database joins on BIGINT columns over joins on (VAR)CHAR columns.

Unfortunately, when searching for real answers or measurements, the similar questions I found share two problems:

  • The examples are set in a "traditional" DBMS context such as MySQL, Oracle, or SQL Server. Take for example this question or this example.
  • The answers are quite old, and the measured runtime difference is not that big. Again, see this example.

I have not seen an example for Hive (preferably version 1.2.1 or higher) in which a large (big-data-ish) data set (say, 500+ million rows) is joined to a data set of similar size on:

  • a BIGINT column
  • VERSUS a (VAR)CHAR(32) column
  • VERSUS a (VAR)CHAR(255) column

I chose size 32 because it is the length of an MD5 hash rendered as characters, and 255 because it is in the range of the largest natural key I have ever seen.

To keep the comparison relevant, I would expect Hive:

  • to run on the Tez execution engine
  • to use a compressed file format such as ORC with ZLib or Snappy
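For reference, here is a minimal sketch of how such a benchmark could be set up. All table and column names are hypothetical, and the `md5()` built-in is an assumption (it only ships with more recent Hive builds; older versions would need a UDF):

```sql
-- Hypothetical benchmark setup: the same data set keyed three ways,
-- stored as ORC with Snappy compression, run under Tez.
SET hive.execution.engine=tez;

CREATE TABLE fact_bigint (join_key BIGINT, payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE fact_char32 (join_key CHAR(32), payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE fact_varchar255 (join_key VARCHAR(255), payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Derive the CHAR(32) key from the BIGINT key, e.g. via an MD5 hash
-- (md5() assumed available; otherwise substitute a hashing UDF):
-- INSERT INTO fact_char32
-- SELECT md5(CAST(join_key AS STRING)), payload FROM fact_bigint;

-- Then compare plans and runtimes of the same join per key type:
EXPLAIN
SELECT count(*)
FROM fact_bigint a
JOIN fact_bigint b
  ON a.join_key = b.join_key;
```

Repeating the `EXPLAIN` and timed run against each of the three tables, with identical row counts and payloads, would isolate the effect of the key type on the shuffle and join stages.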

Does anyone know of such an example, backed by evidence: Hive EXPLAIN plans, CPU usage, file and network I/O, and query execution times?
