Is VARCHAR different from STRING in Hive?

Since version 0.12 Hive supports the VARCHAR data type.

Will VARCHAR provide better performance than STRING in a typical analytic query in Hive?

+7
hadoop hive
2 answers

In Hive, STRING is by default mapped to VARCHAR(32762), which means:

  • if a value exceeds 32762, it is truncated
  • if the data does not need the maximum VARCHAR length (for example, if a column never exceeds 100 characters), unnecessary resources are still allocated to process that column

In short, the default behavior for the STRING data type is to map it to the SQL VARCHAR(32762) type, and this default can lead to performance problems (see the sketch at the end of this answer).

This explanation is based on IBM Big SQL, which uses Hive implicitly.

Link to an IBM BigInsights document
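
To make the length behavior concrete, here is a minimal Java sketch; the class name VarcharLengthDemo is made up for illustration, and it assumes Hive's hive-common jar is on the classpath. It shows that HiveVarchar carries an explicit maximum length and truncates values on assignment, and that Hive caps VARCHAR at 65,535 characters:

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class VarcharLengthDemo {
    public static void main(String[] args) {
        // Hive's hard upper bound for VARCHAR lengths (65535).
        System.out.println(HiveVarchar.MAX_VARCHAR_LENGTH);

        // Constructing a HiveVarchar with an explicit length truncates the
        // value, so a VARCHAR(10) column never stores more than 10 characters.
        HiveVarchar v = new HiveVarchar("a value longer than ten characters", 10);
        System.out.println(v.getValue()); // prints "a value lo"
    }
}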

+3

The VARCHAR data type is also stored internally as a String. The only difference I see is that STRING is unbounded, with a maximum value of 32,767 bytes, while VARCHAR is bounded, with a maximum value of 65,535 bytes. I do not think we will see a performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see the extra processing Hive does to truncate VARCHAR values. Here is the code (from org.apache.hadoop.hive.common.type.HiveVarchar):

public static String enforceMaxLength(String val, int maxLength) {
    String value = val;
    if (maxLength > 0) {
        int valLength = val.codePointCount(0, val.length());
        if (valLength > maxLength) {
            // Truncate the excess chars to fit the character length.
            // Also make sure we take supplementary chars into account.
            value = val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
    }
    return value;
}
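
A small usage sketch of the method above (the demo class name is made up; it assumes the Hive jars are on the classpath and that enforceMaxLength is reachable as a static method of HiveVarchar, as in the source quoted here). It shows that truncation counts code points, so a supplementary character is kept whole instead of being split in half:

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class EnforceMaxLengthDemo {
    public static void main(String[] args) {
        // U+1F600 is a supplementary character: one code point, two Java chars.
        String input = "abc\uD83D\uDE00def"; // 7 code points, 8 Java chars

        // Truncating to 4 counts code points, so the emoji survives intact
        // rather than being cut in half at the char level.
        String truncated = HiveVarchar.enforceMaxLength(input, 4);

        System.out.println(truncated.codePointCount(0, truncated.length())); // 4
        System.out.println(truncated.length()); // 5 Java chars
    }
}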

If anyone has done a performance analysis or benchmarking, please share it.

0
