Use this:
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter])
Here is my code, TmpFileFilter.scala, which filters out files ending in .tmp:
import org.apache.hadoop.fs.{Path, PathFilter}
class TmpFileFilter extends PathFilter {
  // Accept every path except those whose file name ends in ".tmp"
  override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
}
You can define your own PathFilter.
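As a rough sketch of what a custom filter could look like, the example below skips several suffixes instead of just one. The class name DataFileFilter and the suffix list are hypothetical, chosen only for illustration. The class keeps a no-argument constructor because Hadoop instantiates the filter by reflection from the configured class name.

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical example filter (not part of Hadoop or Spark):
// rejects files whose names end in any of the listed suffixes.
class DataFileFilter extends PathFilter {
  // Assumed suffix list; adjust it to the files you want to skip.
  private val excludedSuffixes = Seq(".tmp", ".crc")

  override def accept(path: Path): Boolean = {
    val name = path.getName
    !excludedSuffixes.exists(suffix => name.endsWith(suffix))
  }
}
```

It would be registered the same way as above, assuming sc is your SparkContext:
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[DataFileFilter], classOf[PathFilter])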