The varchar data type is also stored internally as a String. The only difference I see is that String is unbounded, with a maximum value of 32,767 bytes, while Varchar is bounded, with a maximum length of 65,535 characters. I do not think there will be any performance gain, because the internal implementation is String in both cases. I don't know much about Hive internals, but I can see the extra processing Hive does to truncate varchar values. The following is the code (org.apache.hadoop.hive.common.type.HiveVarchar):
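    public static String enforceMaxLength(String val, int maxLength) {
        String value = val;
        if (maxLength > 0) {
            // Length is measured in Unicode code points, not Java chars.
            int valLength = val.codePointCount(0, val.length());
            if (valLength > maxLength) {
                // Cut at a code-point boundary so supplementary characters are not split.
                value = val.substring(0, val.offsetByCodePoints(0, maxLength));
            }
        }
        return value;
    }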
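To make the extra work concrete, here is a minimal standalone sketch of the same truncation idea. It is not Hive's class: `VarcharTruncationDemo` is a hypothetical name, and it assumes the over-length branch simply cuts the string at a code-point boundary. It shows that every varchar value pays for a code-point count, and over-length values also pay for a substring copy.

```java
public class VarcharTruncationDemo {

    // Hypothetical stand-in for the Hive method quoted above (an assumption,
    // not the Hive source): count code points and, if the value is too long,
    // cut it at a code-point boundary.
    public static String enforceMaxLength(String val, int maxLength) {
        if (maxLength > 0 && val.codePointCount(0, val.length()) > maxLength) {
            return val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
        return val;
    }

    public static void main(String[] args) {
        String plain = "hello world";
        // U+1F600 lies outside the BMP, so it occupies two Java chars.
        String emoji = "ab\uD83D\uDE00cd";

        // Longer than the limit: truncated to the first 5 code points.
        System.out.println(enforceMaxLength(plain, 5));

        // Shorter than the limit: returned unchanged.
        System.out.println(enforceMaxLength(plain, 50));

        // Three code points are kept, which is four Java chars here,
        // because the surrogate pair is not split.
        String cut = enforceMaxLength(emoji, 3);
        System.out.println(cut.length() + " chars, "
                + cut.codePointCount(0, cut.length()) + " code points");
    }
}
```

Whether this per-value check is measurable in a real query is a separate question that only a benchmark can answer.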
If anyone has done a performance analysis or benchmark, please share it.
Abhi