I have a file in HDFS inside my VirtualBox HortonWorks HDP 2.3_1 virtual machine.
If I go to the Spark shell on the guest and refer to the file, it works fine:
val words = sc.textFile("hdfs:///tmp/people.txt")
words.count
However, if I try to access it from a local Spark application on my Windows host, it does not work
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
val words = sc.textFile("hdfs://localhost:8020/tmp/people.txt")
words.count
This fails with:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098 file=/tmp/people.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:626)
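For context on what the exception is saying: the client evidently did reach the NameNode (the path resolved and block locations came back, including the block pool ID carrying the VM-internal address 10.0.2.15), but it then failed to read the block from any DataNode. To take Spark out of the picture, the same read can be attempted with the plain Hadoop FileSystem API; a minimal sketch, reusing the URI and path from the failing job (everything else is stock Hadoop client code):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Talks only to the NameNode: resolves the path and fetches block locations.
val fs = FileSystem.get(new URI("hdfs://localhost:8020"), new Configuration())
val in = fs.open(new Path("/tmp/people.txt"))

// Reading bytes requires contacting a DataNode; if this throws the same
// BlockMissingException, the problem is DataNode reachability, not Spark.
try {
  Source.fromInputStream(in).getLines().take(5).foreach(println)
} finally {
  in.close()
}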
Port 8020 is open, and if I use a wrong file name, it tells me:

Input path does not exist: hdfs:...
localhost:8020 should be correct, since the guest HDP VM has NAT port forwarding to the Windows host.
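One thing worth noting about the NAT setup: if only the NameNode port (8020) is forwarded, the metadata calls succeed, but the NameNode replies with the DataNode's VM-internal address (10.0.2.15, the same address embedded in the block pool ID above), which the Windows host cannot reach, and the read fails exactly like this. A commonly suggested mitigation is to make the HDFS client connect to DataNodes by hostname instead of by the reported IP, via the standard dfs.client.use.datanode.hostname property. A sketch, assuming the sandbox hostname resolves on the host (e.g. through the Windows hosts file) and that the DataNode port (50010 by default) is forwarded as well:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)

// Standard HDFS client property: connect to DataNodes by hostname, not by the
// IP the NameNode reports. This only helps if that hostname resolves from Windows.
sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")

val words = sc.textFile("hdfs://localhost:8020/tmp/people.txt")
println(words.count())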
And as I said, if I give it a wrong file name, I get the corresponding exception.
My pom.xml has:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.4.1</version>
    <scope>provided</scope>
</dependency>
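Possibly relevant: spark-core 1.4.1 pulls in its own Hadoop client (2.2.x by default, if I remember right), which is older than what the HDP 2.3 VM runs (Hadoop 2.7.x). A sketch of pinning the Hadoop client explicitly; the exact version is an assumption that should be checked against the sandbox:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- assumed: HDP 2.3 is based on Hadoop 2.7.x; verify on the VM -->
    <version>2.7.1</version>
</dependency>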
Am I doing something wrong? And what is a BlockMissingException trying to tell me?