How can I use the JSR-203 file system provider with Apache Spark?

We want to use the HDFS NIO.2 file system provider in a Spark job. However, we ran into a classpath issue with file system providers: they must be visible to the system class loader, which is what the Paths.get(URI) API uses to locate them. As a result, the provider was not found, even though it was in the JAR files supplied to spark-submit.

Here's the spark-submit command:

    spark-submit --master "local[*]" \
        --jars target/dependency/jimfs-1.1.jar,target/dependency/guava-16.0.1.jar \
        --class com.basistech.tc.SparkFsTc \
        target/spark-fs-tc-0.0.1-SNAPSHOT.jar

And here is the job class, which fails with a "file system not found" error.

    import com.google.common.jimfs.Jimfs;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Collections;

    public final class SparkFsTc {
        private SparkFsTc() {
            //
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("File System Test Case");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> logData = sc.parallelize(Collections.singletonList("foo"));
            System.out.println(logData.getNumPartitions());
            logData.mapPartitions(itr -> {
                FileSystem fs = Jimfs.newFileSystem();
                Path path = fs.getPath("/root");
                URI uri = path.toUri();
                Paths.get(uri); // expect this to go splat.
                return null;    // never reached; Paths.get throws first
            }).collect();
        }
    }

Is there any mechanism to convince Spark to add the file system provider to the appropriate class loader?
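A sketch of one possible approach (not verified here): since --jars adds entries to a per-application class loader rather than to the JVM's own classpath, the provider JARs could instead be placed on the driver and executor class paths, which the system class loader does see. This reuses the JAR paths from the command above:

    spark-submit --master "local[*]" \
        --driver-class-path target/dependency/jimfs-1.1.jar:target/dependency/guava-16.0.1.jar \
        --conf spark.executor.extraClassPath=target/dependency/jimfs-1.1.jar:target/dependency/guava-16.0.1.jar \
        --class com.basistech.tc.SparkFsTc \
        target/spark-fs-tc-0.0.1-SNAPSHOT.jar

In local mode the executor setting is redundant, since everything runs in the driver JVM; on a real cluster the JARs would also need to exist at those paths on every worker.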

Readers should be aware that file system providers are special. If you read the code in the JRE, you will see:

    ServiceLoader<FileSystemProvider> sl = ServiceLoader
        .load(FileSystemProvider.class, ClassLoader.getSystemClassLoader());

They must be visible to the system class loader; a provider that sits only on a child class loader (which is where JARs passed via --jars end up) is not found.
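A quick way to verify what that lookup can actually see is to print the installed providers from inside the JVM; FileSystemProvider.installedProviders() goes through the same system-class-loader ServiceLoader scan, so a provider missing from this list will also be missing for Paths.get(URI). A minimal diagnostic sketch:

    import java.nio.file.spi.FileSystemProvider;

    public final class ProviderCheck {
        public static void main(String[] args) {
            // installedProviders() uses the same ServiceLoader-over-
            // system-class-loader lookup shown above, so a provider
            // absent here will also be absent for Paths.get(URI).
            for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
                System.out.println(p.getScheme() + " -> " + p.getClass().getName());
            }
        }
    }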

This works fine if I acquire a FileSystem object reference myself instead of going through Paths.get(URI).
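For completeness, here is a minimal sketch of that workaround, using the same Jimfs file system as the test case: hold on to the FileSystem reference and resolve paths through fs.getPath(...), which never consults the installed-provider list.

    import com.google.common.jimfs.Jimfs;

    import java.io.IOException;
    import java.nio.file.FileSystem;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class DirectFsAccess {
        public static void main(String[] args) throws IOException {
            // Obtain the FileSystem directly instead of resolving a URI.
            // fs.getPath(...) bypasses the ServiceLoader lookup entirely,
            // so the provider's class-loader placement no longer matters.
            try (FileSystem fs = Jimfs.newFileSystem()) {
                Path root = fs.getPath("/root");
                Files.createDirectories(root);
                System.out.println(Files.exists(root)); // true
            }
        }
    }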
