We have a Hive data warehouse, and we want to use Spark for various jobs (mainly classification), from time to time writing results back as a Hive table. For example, we wrote the following Python code to compute, for each key in the first column of original_table, the sum and count of the values in the second column. The code works, but we are worried that it is inefficient, especially the map steps that convert the rows to key-value pairs and then to dictionaries. The functions combiner, mergeValue, and mergeCombiner are defined elsewhere, but they work fine.
from pyspark.sql import HiveContext

hc = HiveContext(sc)
rdd = hc.sql('from original_table select *')

# convert rows to (key, value) pairs
key_value_rdd = rdd.map(lambda x: (x[0], int(x[1])))

# aggregate into rows of (key, (sum, count))
combined = key_value_rdd.combineByKey(combiner, mergeValue, mergeCombiner)

# convert to dictionary values in order to create a SchemaRDD
dict_rdd = combined.map(lambda x: {'k1': x[0], 'v1': x[1][0], 'v2': x[1][1]})

# infer the schema and save as a Hive table
schema_rdd = hc.inferSchema(dict_rdd)
schema_rdd.saveAsTable('new_table_name')
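For completeness, the three helper functions referenced above are presumably along these lines (a minimal sketch of (sum, count) accumulators for combineByKey; our real definitions live elsewhere):

# Sketch of the (sum, count) helpers used by combineByKey above;
# the actual definitions are elsewhere but should behave the same way.
def combiner(value):
    # First value seen for a key: start the running (sum, count).
    return (value, 1)

def mergeValue(acc, value):
    # Fold another value for the same key into the running (sum, count).
    return (acc[0] + value, acc[1] + 1)

def mergeCombiner(acc1, acc2):
    # Merge partial (sum, count) pairs coming from different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])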
Are there more efficient ways to do the same?
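For reference, the kind of alternative we are wondering about is pushing the whole aggregation into a single SQL statement and skipping the intermediate RDD transformations entirely. A rough sketch, assuming the first two columns of original_table are named k1 and v1 (the column names here are illustrative, not the real schema):

from pyspark.sql import HiveContext

hc = HiveContext(sc)

# Assumption: original_table has columns k1 (the key) and v1 (the value).
# Let Hive/Spark SQL compute the sum and count per key in one pass.
aggregated = hc.sql(
    'SELECT k1, SUM(v1) AS v1, COUNT(v1) AS v2 '
    'FROM original_table GROUP BY k1'
)

# The result already carries a schema, so it can be saved directly.
aggregated.saveAsTable('new_table_name')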