Reading a text file in SparkR 1.4.0

Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages for this?

+4
3 answers

Spark 1.6+

You can use the text input format to read a text file into a DataFrame:

read.df(sqlContext=sqlContext, source="text", path="README.md")
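
The resulting DataFrame holds one string column (named value in Spark 1.6), so the usual DataFrame operations apply. A minimal sketch, assuming sqlContext was created with sparkRSQL.init:

df <- read.df(sqlContext, source = "text", path = "README.md")
printSchema(df)   # a single string column: value
count(df)         # number of lines in the file
head(df)          # first few lines as a local data.frame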

Spark <= 1.5

The short answer is no. SparkR 1.4 is almost completely devoid of the low-level API, leaving only a limited subset of DataFrame operations. As you can read on the old SparkR webpage:

As of 2015, SparkR has been officially merged into Apache Spark and will ship in an upcoming release (1.4). (...) Initial support for Spark in R will focus on high-level operations instead of low-level ETL.

Probably the closest equivalent is to load the data with spark-csv:

> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
|                  C0|
+--------------------+
|      # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+

RDD operations such as map, flatMap, reduce and filter are gone as well, which, all things considered, is probably for the best.
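
For simple per-line work that subset is often enough. A rough sketch, assuming df is the spark-csv DataFrame created above with its single column C0, and that the like column method from the 1.4 API is available:

withSpark <- filter(df, like(df$C0, "%Spark%"))  # keep lines mentioning Spark
count(withSpark)                                 # how many such lines there are
showDF(limit(withSpark, 3))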

The low-level API is still there underneath, though, so you can always reach the internal SparkR functions with the ::: operator. As the ?`:::` man page puts it:

": ,       , ,       . ,       ,      .

Even if you are willing to ignore that warning, these internals can change without notice. Moreover, since the 1.4 DataFrame API is built on top of Catalyst expressions, there is no guarantee that the internal functions behave the same way as the pre-1.4 API.

> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)

[[1]]
[1] 14

[[2]]
[1] 0

[[3]]
[1] 78

That said, since spark-csv already covers the textFile use case, there is little reason to do this.
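
If the file is small, you can also just collect the lines into a local data.frame and fall back on base R; a sketch, again assuming the spark-csv DataFrame df from above:

localLines <- collect(df)        # pull all rows to the driver
head(nchar(localLines$C0), 3)    # the same character counts, computed in base R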

+3

I hope this gets you started: http://ampcamp.berkeley.edu/5/exercises/sparkr.html

You can simply use:

 textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
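
Note that the AMP Camp material targets the old standalone SparkR-pkg; with SparkR 1.4 the equivalent call only works through the internal namespace, roughly as below (a sketch, assuming sc comes from sparkR.init()):

sc <- sparkR.init()
rdd <- SparkR:::textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
SparkR:::take(rdd, 2)   # textFile is internal in SparkR 1.4, hence :::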

If you look at the SparkR source, context.R defines a textFile function that calls the textFile method of the SparkContext API and wraps the result in an RDD of strings:

# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
#  value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
#  sc <- sparkR.init()
#  lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")

  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}

https://github.com/apache/spark/blob/master/R/pkg/R/context.R

https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R

0

In fact, you can use the databricks/spark-csv package to process TSV files as well.

For instance,

data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")

There are many more options; see the Features section of the databricks/spark-csv README.
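
To give one concrete case, a TSV file with a header row could be read with column types inferred; a sketch, assuming the spark-csv header and inferSchema options are available in the package version you use:

data <- read.df(sqlContext, "<path_to_tsv_file>",
                source = "com.databricks.spark.csv",
                delimiter = "\t", header = "true", inferSchema = "true")
printSchema(data)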

0