Is VARCHAR different from STRING in Hive?

Since version 0.12 Hive supports the VARCHAR data type.

Will VARCHAR provide better performance than STRING in a typical analytic query in Hive?

+7
hadoop hive
2 answers

In Hive, STRING is by default mapped to VARCHAR(32762), which means:

  • if a value exceeds 32762, it is truncated
  • if the data does not need the maximum VARCHAR length (for example, if a column never exceeds 100 characters), unnecessary resources are still allocated to process that column

In short, the default behavior for the STRING data type is to map it to the SQL VARCHAR(32762) type, and this default can lead to performance problems (see the sketch at the end of this answer).

This explanation is based on IBM Big SQL, which uses Hive implicitly.

Link to an IBM BigInsights document
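
To make the length behavior concrete, here is a minimal Java sketch; the class name VarcharLengthDemo is made up for illustration, and it assumes Hive's hive-common jar is on the classpath. It shows that HiveVarchar carries an explicit maximum length and truncates values on assignment, and that Hive caps VARCHAR at 65,535 characters:

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class VarcharLengthDemo {
    public static void main(String[] args) {
        // Hive's hard upper bound for VARCHAR lengths (65535).
        System.out.println(HiveVarchar.MAX_VARCHAR_LENGTH);

        // Constructing a HiveVarchar with an explicit length truncates the
        // value, so a VARCHAR(10) column never stores more than 10 characters.
        HiveVarchar v = new HiveVarchar("a value longer than ten characters", 10);
        System.out.println(v.getValue()); // prints "a value lo"
    }
}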

+3

The VARCHAR data type is also stored internally as a String. The only difference I see is that STRING is unbounded, with a maximum value of 32,767 bytes, while VARCHAR is bounded, with a maximum value of 65,535 bytes. I do not think we will see a performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see the extra processing Hive does to truncate VARCHAR values. Here is the code (from org.apache.hadoop.hive.common.type.HiveVarchar):

public static String enforceMaxLength(String val, int maxLength) {
    String value = val;
    if (maxLength > 0) {
        int valLength = val.codePointCount(0, val.length());
        if (valLength > maxLength) {
            // Truncate the excess chars to fit the character length.
            // Also make sure we take supplementary chars into account.
            value = val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
    }
    return value;
}
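
A small usage sketch of the method above (the demo class name is made up; it assumes the Hive jars are on the classpath and that enforceMaxLength is reachable as a static method of HiveVarchar, as in the source quoted here). It shows that truncation counts code points, so a supplementary character is kept whole instead of being split in half:

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class EnforceMaxLengthDemo {
    public static void main(String[] args) {
        // U+1F600 is a supplementary character: one code point, two Java chars.
        String input = "abc\uD83D\uDE00def"; // 7 code points, 8 Java chars

        // Truncating to 4 counts code points, so the emoji survives intact
        // rather than being cut in half at the char level.
        String truncated = HiveVarchar.enforceMaxLength(input, 4);

        System.out.println(truncated.codePointCount(0, truncated.length())); // 4
        System.out.println(truncated.length()); // 5 Java chars
    }
}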

If anyone has done a performance analysis or benchmarking, please share it.

0
