If you need plain text, use the WikiClean library: https://github.com/lintool/wikiclean
I had the same problem, and this was the only solution that worked for me in Java.
There are two cases:
1) You have wiki text without the XML wrapper. If you processed the XML file earlier and now hold only the article content with no XML structure, add the XML tags the cleaner needs: prepend xmlStartTag and append xmlEndTag, as in the code below, then clean the result.
    String xmlStartTag = "<text xml:space=\"preserve\">";
    String xmlEndTag = "</text>";
    String articleWithXml = xmlStartTag + article.getText() + xmlEndTag;
    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(articleWithXml);
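If the same code path sometimes receives text that is already wrapped, the wrapping step can be made safe to repeat. The helper below is only a sketch using the standard library: WikiTextWrapper and wrap are hypothetical names of mine, not part of WikiClean, and the check assumes that the start tag shown above is all the cleaner looks for.

```java
// Hypothetical helper, not part of WikiClean: adds the <text> wrapper only when it is missing.
final class WikiTextWrapper {
    // Same tags as in the snippet above.
    private static final String XML_START_TAG = "<text xml:space=\"preserve\">";
    private static final String XML_END_TAG = "</text>";

    private WikiTextWrapper() {}

    /** Returns the markup wrapped in the text element; already-wrapped input is returned unchanged. */
    static String wrap(String wikiMarkup) {
        if (wikiMarkup == null) {
            throw new IllegalArgumentException("wikiMarkup must not be null");
        }
        // Assumption: an existing start tag means the caller passed XML from the dump.
        if (wikiMarkup.contains(XML_START_TAG)) {
            return wikiMarkup;
        }
        return XML_START_TAG + wikiMarkup + XML_END_TAG;
    }
}
```

You would then call cleaner.clean(WikiTextWrapper.wrap(article.getText())) instead of concatenating the tags by hand.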
2) You read the Wikipedia dump file (an XML file) directly. In this case the content already carries the XML tags, so you pass it to the cleaner as is.
    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(XMLFileContents);
Sh. Sina