XmlInputFormat for Apache Flink

Is there something similar to Mahout's XmlInputFormat for Flink?

I have a large XML file and I want to extract certain elements from it. In my case it is a Wikipedia dump, and I need to get all the <page> elements.

That is, if I have a file like this:

<mediawiki>
  <siteinfo>...</siteinfo>
  <page>...</page>
  <page>...</page>
  <page>...</page>
</mediawiki>

I want all 3 <page>...</page> entries to be passed to the mappers. Ideally, each entry should be valid XML, the same as what the XPath query /mediawiki/page would return.

1 answer

Mahout's XmlInputFormat extends Hadoop's TextInputFormat. Flink provides generic wrappers for Hadoop InputFormats, so XmlInputFormat should be supported as well.

Reading a Hadoop InputFormat in Flink looks like this (shown here with TextInputFormat):

DataSet<Tuple2<LongWritable, Text>> input =
  env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
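
To make it clearer what each record would contain, here is a minimal plain-Java sketch of tag-delimited record extraction, which is roughly the idea behind an XML input format driven by a start tag and an end tag. It is not Mahout's or Flink's actual code: the class TagRecordSketch and the method extractRecords are hypothetical names used only for illustration, and the sketch works on an in-memory String rather than on file splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Mahout, Hadoop, or Flink.
class TagRecordSketch {

    // Returns every substring that begins with startTag and ends with the
    // next endTag. Text outside such spans (e.g. <siteinfo>) is skipped.
    static List<String> extractRecords(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            int stop = end + endTag.length();
            records.add(input.substring(start, stop));
            from = stop; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>\n"
                + "  <siteinfo>...</siteinfo>\n"
                + "  <page>...</page>\n"
                + "  <page>...</page>\n"
                + "  <page>...</page>\n"
                + "</mediawiki>";
        // Each printed line is one record that a mapper would receive.
        for (String record : extractRecords(xml, "<page>", "</page>")) {
            System.out.println(record);
        }
    }
}
```

Note that this matches the literal string <page>, so a start tag carrying attributes would not be found, and nested elements of the same name are not handled. Each extracted record is a well-formed fragment only if the input itself is well-formed.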
