How to read a CSV file in SparkR 1.4?

With the release of the new version of Spark (1.4) came a nice interface to Spark from R, in the form of a package called SparkR. The R documentation page for Spark shows a command that lets you read JSON files into DataFrame objects:

 people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json") 
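This part works; for reference, the result is a SparkR DataFrame that can be inspected with SparkR's printSchema and head (just a quick check, nothing more):

 printSchema(people)  # show the schema Spark inferred from the JSON
 head(people)         # fetch the first rows back as a local R data.frame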

I am trying to read data from a .csv file as described in this post on the Revolutions blog:

 # Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
 # Launch SparkR using
 # ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
 # The SparkSQL context should already be created for you as sqlContext
 sqlContext
 # Java ref type org.apache.spark.sql.SQLContext id 1
 # Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
 flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

The note says that to enable this operation I need the spark-csv package (its source lives in this GitHub repo), so I fetched it by launching the shell with the --packages flag:

 $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 

But then I ran into this error while trying to read the .csv file:

 > flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
 15/07/03 12:52:41 ERROR RBackendHandler: load on 1 failed
 java.lang.reflect.InvocationTargetException
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
     at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
     at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
     at scala.sys.package$.error(package.scala:27)
     at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
     at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
     at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
     ... 25 more
 Error: returnStatus == 0 is not TRUE

Any idea what this error means and how to solve it?

Of course, I could read the .csv file in the standard R way instead, for example:

 flights <- read.csv("data.csv")

and then convert the R data.frame to a Spark DataFrame as follows:

 flightsDF <- createDataFrame(sqlContext, flights) 
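Put together, the whole workaround is a short round trip (a sketch; data.csv stands in for the actual flights file):

 flights <- read.csv("data.csv")                     # plain single-machine R read
 flightsDF <- createDataFrame(sqlContext, flights)   # push the local data.frame into Spark
 head(flightsDF)                                     # quick sanity check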

But that's not how I'd like to do it, and it takes a long time, since the whole file is first read into a local R data.frame before being shipped to Spark.

3 answers

The --packages flag only applies to the shell it is passed to, so loading spark-csv in spark-shell does not make it available to a separate SparkR session. You must start the SparkR console with the package each time, as follows:

 sparkR --packages com.databricks:spark-csv_2.10:1.0.3 
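For example, a minimal end-to-end session (assuming nycflights13.csv sits in the directory you launch from):

 $ sparkR --packages com.databricks:spark-csv_2.10:1.0.3
 > flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header = "true")
 > head(flights)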

If you are using RStudio:

 library(SparkR)
 # Set the submit args *before* initializing the context, or the package is not picked up.
 Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
 sc <- sparkR.init()  # create the Spark context that sparkRSQL.init needs
 sqlContext <- sparkRSQL.init(sc)

does the trick. Make sure the version you specify for spark-csv matches the one you downloaded.
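With the context initialized this way, the read.df call from the question should then work unchanged:

 flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header = "true")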


Make sure you install SparkR from the copy bundled inside your Spark distribution, using:

 install.packages("C:/spark/R/lib/sparkr.zip", repos = NULL) 

not the package from GitHub.

That is what solved it for me.
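After installing from the local path (C:/spark here is that machine's Spark home; adjust it to yours), initialization proceeds as usual:

 library(SparkR)
 sc <- sparkR.init()
 sqlContext <- sparkRSQL.init(sc)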

