I have a CSV file on Amazon S3 with a size of 62 MB (114,000 lines). I load it into a Spark Dataset and take the first 500 rows. The code is as follows:
DataFrameReader df = spark.read().format("csv").option("header", true);
Dataset<Row> set = df.load("s3n://" + this.accessId.replace("\"", "") + ":" + this.accessToken.replace("\"", "") + "@" + this.bucketName.replace("\"", "") + "/" + this.filePath.replace("\"", ""));
set.take(500);
The whole operation takes from 20 to 30 seconds.
Now I am trying to do the same, but instead of the CSV I am reading from a MySQL table with 119,000 rows. The MySQL server is located on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
Dataset<Row> set = sc
.read()
.option("url", url)
.option("dbtable", this.tableName)
.option("driver","com.mysql.jdbc.Driver")
.format("jdbc")
.load();
set.take(500);
This takes 5 to 10 minutes. I am running Spark locally inside the JVM, using the same configuration in both cases.
I know I could use partitionColumn, numPartitions, etc., but I don't have a numeric column, and there is another problem: I don't know the table's layout in advance.
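For illustration, this is roughly how I understand a partitioned JDBC read would look if a numeric key existed or could be derived on the MySQL side (the id column, the CRC32 trick, and the bounds below are made up, not my actual schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical: derive a numeric partition key from a string key "id"
// via a subquery, so partitionColumn/numPartitions can be used.
String partitioned = "(SELECT t.*, CRC32(t.id) AS part_key FROM " + this.tableName + " t) AS tmp";

Dataset<Row> set = spark.read()
    .format("jdbc")
    .option("url", url)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", partitioned)
    .option("partitionColumn", "part_key")   // derived numeric column
    .option("lowerBound", "0")
    .option("upperBound", "4294967295")      // CRC32 value range
    .option("numPartitions", "8")
    .option("fetchsize", "10000")            // larger fetch size, fewer round trips
    .load();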
Why is reading from MySQL so much slower than reading the CSV from S3, and how can I speed it up?