Is it possible to read pdf / audio / video files (unstructured data) using Apache Spark?

For example, I have thousands of PDF invoices, and I want to read the data from them and run some analytics. What steps should be taken to process unstructured data like this?

1 answer

Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, and then use map to convert the values into some other format, for example by parsing the binary content with Apache Tika or Apache POI.

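Pseudo code:

val rawFiles = sparkContext.binaryFiles(...)  // RDD of (file path, PortableDataStream) pairs
val ready = rawFiles.map { case (path, stream) =>
  // parse stream.open() here with another framework, e.g. Apache Tika or Apache POI
}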

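Note that map does not hand you parsed content: each file arrives as a stream (a PortableDataStream, whose open() method returns an InputStream), so the parsing itself is up to the other framework.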
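To make the pseudo code a bit more concrete, here is a minimal sketch of how it might look with Apache Tika doing the parsing. It assumes Spark (spark-sql) plus tika-core and tika-parsers are on the classpath; the object name, the extractText helper, and taking the input directory from args(0) are my own illustrative choices, not something Spark or Tika prescribes.

```scala
import java.io.InputStream

import org.apache.spark.sql.SparkSession
import org.apache.tika.Tika

object PdfTextExtraction {

  // Hypothetical helper: turns one document stream into plain text.
  // Tika's facade detects the file type itself. By default parseToString
  // truncates the result at 100,000 characters (see Tika.setMaxStringLength).
  def extractText(in: InputStream): String =
    try new Tika().parseToString(in)
    finally in.close() // harmless if Tika has already closed the stream

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text-extraction").getOrCreate()

    // Assumption: the first argument is a directory (local, HDFS, S3, ...) of PDFs.
    val inputDir = args(0)

    // One (path, PortableDataStream) pair per file; nothing is read yet.
    val rawFiles = spark.sparkContext.binaryFiles(inputDir)

    // The stream is opened and parsed on the executors, one file at a time.
    val texts = rawFiles.map { case (path, stream) =>
      (path, extractText(stream.open()))
    }

    // Small sanity check: print how much text came out of the first few files.
    texts.take(5).foreach { case (path, text) =>
      println(s"$path: ${text.length} characters")
    }

    spark.stop()
  }
}
```

From here, texts is an ordinary RDD of (path, text) pairs, so the analytics part (regex extraction of invoice fields, conversion to a DataFrame, and so on) is regular Spark code. Each file is parsed as a whole on a single executor, so this approach suits many small-to-medium files better than a few huge ones.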

