Error when trying to load a file with sc.textFile whose name contains colons

I am using pyspark (Spark 1.0.1) in IPython to load a gzipped TSV file whose name contains colons. I can load the file fine when I rename it, but otherwise I get an error.

Commands:

inputFile = '/vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()

And I get the following traceback:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-69-3f1537c7b8bd> in <module>()
----> 1 input.count()

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in count(self)
    706         3
    707         """
--> 708         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    709
    710     def stats(self):

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in sum(self)
    697         6.0
    698         """
--> 699         return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    700
    701     def count(self):

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in reduce(self, f)
    617             if acc is not None:
    618                 yield acc
--> 619         vals = self.mapPartitions(func).collect()
    620         return reduce(f, vals)
    621

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in collect(self)
    581         """
    582         with _JavaStackTrace(self.context) as st:
--> 583           bytesInJava = self._jrdd.collect().iterator()
    584         return list(self._collect_iterator_through_file(bytesInJava))
    585

/vol/code/spark/spark-1.0.1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

and the error:

    Py4JJavaError: An error occurred while calling o206.collect.
    : java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz
    at org.apache.hadoop.fs.Path.initialize(Path.java:148)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1038)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)

How can I load this file without renaming it? I cannot rename all of the files that I need to process with Spark.

1 answer

The Hadoop path parser interprets : as a protocol (scheme) separator. The solution is to specify the protocol explicitly. If it is a local file:

inputFile = 'file:///vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()
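
To see why the bare path trips the parser, here is a rough Python approximation of Hadoop's scheme-detection rule (a simplified sketch, not the actual Hadoop source). The stack trace suggests the failure happens on the bare file-name component inside globStatus, where the first colon comes before any slash:

# Simplified sketch of how Hadoop's Path(String) decides a path
# string carries a scheme (illustration only, not the real code).
def detect_scheme(path_string):
    colon = path_string.find(':')
    slash = path_string.find('/')
    if colon != -1 and (slash == -1 or colon < slash):
        # Everything before the first colon is taken as the scheme.
        return path_string[:colon]
    return None

bare = 'standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
print(detect_scheme(bare))                        # -> 'standard_feed' (bogus scheme)
print(detect_scheme('file:///vol/data/' + bare))  # -> 'file' (explicit protocol wins)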

Alternatively, try a wildcard pattern, so that the colon-bearing name never appears literally in the path string: /vol/data/standard_feed*
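
A minimal sketch combining both suggestions, the explicit scheme and the wildcard (loadRecord is the parsing helper from the question):

# The wildcard matches the colon-containing files without putting
# their names in the literal path that Spark has to parse.
inputFiles = 'file:///vol/data/standard_feed*'
records = sc.textFile(inputFiles).map(loadRecord)
print(records.count())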

Spark works with other file systems as well (HDFS, S3, GCS, etc.). For example, s3n://my-bucket/some-directory/* reads from S3.
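
A sketch with a hypothetical bucket name, assuming your AWS credentials are already configured (e.g. fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in the Hadoop configuration):

# Read every object under the (hypothetical) prefix from S3.
s3Files = sc.textFile('s3n://my-bucket/some-directory/*')
print(s3Files.count())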

