Spark-redshift takes a long time to write to Redshift

I am tuning a Spark Streaming job that uses Kinesis and Redshift. I read data from Kinesis every 10 seconds, process it, and write it to Redshift using the spark-redshift library.

The problem is that it takes a very long time to write just 300 rows.

This is what it shows me in the console:

[Stage 56:====================================================> (193 + 1) / 200]

Looking at my logs, the df.write.format call is what triggers this stage.

I have Spark installed on an Amazon EC2 machine with 4 GB of RAM and 2 cores, running with --master local[*].

This is how I create the stream:

kinesisStream = KinesisUtils.createStream(ssc, APPLICATION_NAME, STREAM_NAME, ENDPOINT, REGION_NAME, INITIAL_POS, CHECKPOINT_INTERVAL, awsAccessKeyId=AWSACCESSID, awsSecretKey=AWSSECRETKEY, storageLevel=STORAGE_LEVEL)
CHECKPOINT_INTERVAL = 60
storageLevel = memory

kinesisStream.foreachRDD(writeTotable)
def WriteToTable(df, type):
    if type in REDSHIFT_PAGEVIEW_TBL:
        df = df.groupby([COL_STARTTIME, COL_ENDTIME, COL_CUSTOMERID, COL_PROJECTID, COL_FONTTYPE, COL_DOMAINNAME, COL_USERAGENT]).count()
        df = df.withColumnRenamed('count', COL_PAGEVIEWCOUNT)

        # Write back to a table

        url = ("jdbc:redshift://" + REDSHIFT_HOSTNAME + ":" + REDSHIFT_PORT + "/" +   REDSHIFT_DATABASE + "?user=" + REDSHIFT_USERNAME + "&password="+ REDSHIFT_PASSWORD)

        s3Dir = 's3n://' + AWSACCESSID + ':' + AWSSECRETKEY + '@' + BUCKET + '/' + FOLDER

        print 'Start writing to redshift'
        df.write.format("com.databricks.spark.redshift").option("url", url).option("dbtable", REDSHIFT_PAGEVIEW_TBL).option('tempdir', s3Dir).mode('Append').save()

        print 'Finished writing to redshift'

Please let me know why this write takes so long.

+4

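Writing to Redshift from Spark is slow in general. spark-redshift first writes the data to S3 and then loads it into Redshift from there, so every write pays for an S3 upload plus a Redshift load.

In your console output the stage has 200 tasks (193 + 1 of 200). The groupby causes a shuffle, and spark.sql.shuffle.partitions defaults to 200 in Spark.

So even with only 300 rows, the DataFrame ends up with 200 partitions. Each partition is written as a separate file, which means 200 small files are written to S3, and that is slow.

To avoid this, reduce the number of partitions before writing, for example: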

df = df.coalesce(4).withColumnRenamed('count', COL_PAGEVIEWCOUNT)

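This reduces the 200 partitions to 4, so only 4 files are written to S3, which should be much faster. You could also lower spark.sql.shuffle.partitions, but that setting applies to every shuffle in the job, so it may not suit other, larger stages.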
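The coalesce step above can be wrapped in a small helper. This is only a sketch: the function name write_to_redshift and the max_partitions parameter are made up for illustration, and the url, table, and tempdir arguments are assumed to be built as in the question. It coalesces only when the DataFrame has more partitions than the target, then writes with the same spark-redshift options the question uses.

```python
def write_to_redshift(df, url, table, tempdir, max_partitions=4):
    """Append df to a Redshift table, staging at most max_partitions files in S3.

    Hypothetical helper: the name and max_partitions are illustrative only.
    """
    # After a groupBy the DataFrame normally has spark.sql.shuffle.partitions
    # partitions (200 by default), however few rows it holds.
    if df.rdd.getNumPartitions() > max_partitions:
        # coalesce merges existing partitions without a full shuffle.
        df = df.coalesce(max_partitions)

    (df.write
       .format("com.databricks.spark.redshift")
       .option("url", url)
       .option("dbtable", table)
       .option("tempdir", tempdir)
       .mode("append")
       .save())
    return df
```

Inside WriteToTable it would be called as write_to_redshift(df, url, REDSHIFT_PAGEVIEW_TBL, s3Dir). A target of 4 is a guess for a 2-core machine; the right value depends on data volume and available cores.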

+6

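You are using the Databricks API (spark-redshift). It does not write to Redshift directly: the Databricks API first writes the data to S3 as Avro files and then loads those Avro files into Redshift, so the extra time is spent on that intermediate step on the AWS side.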

0
