Wikipedia: Java library for removing wiki markup from Wikipedia text

I downloaded a Wikipedia dump and now want to strip the wiki markup from the contents of each page. I tried writing regular expressions, but there are too many markup constructs to handle that way. I found a Python library, but I need a Java library because I want to integrate it into my own code.

Thanks.

+7
java parsing wikipedia wiki
4 answers

Do this in two steps:

  • Let an existing tool convert the MediaWiki markup to plain HTML.
  • Convert the plain HTML to text.

The following demo shows both steps:

    import net.java.textilej.parser.MarkupParser;
    import net.java.textilej.parser.builder.HtmlDocumentBuilder;
    import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;

    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    import java.io.StringReader;
    import java.io.StringWriter;

    public class Test {

        public static void main(String[] args) throws Exception {

            String markup = "This is ''italic'' and '''that''' is bold. \n" +
                    "=Header 1=\n" +
                    "a list: \n* item A \n* item B \n* item C";

            // Step 1: convert the MediaWiki markup to an HTML fragment.
            StringWriter writer = new StringWriter();
            HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
            builder.setEmitAsDocument(false);

            MarkupParser parser = new MarkupParser(new MediaWikiDialect());
            parser.setBuilder(builder);
            parser.parse(markup);

            final String html = writer.toString();

            // Step 2: collect only the text nodes of the HTML.
            final StringBuilder cleaned = new StringBuilder();
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleText(char[] data, int pos) {
                    cleaned.append(new String(data)).append(' ');
                }
            };
            new ParserDelegator().parse(new StringReader(html), callback, false);

            System.out.println(markup);
            System.out.println("---------------------------");
            System.out.println(html);
            System.out.println("---------------------------");
            System.out.println(cleaned);
        }
    }

gives:

    This is ''italic'' and '''that''' is bold.
    =Header 1=
    a list:
    * item A
    * item B
    * item C
    ---------------------------
    <p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
    ---------------------------
    This is italic and that is bold. Header 1 a list: item A item B item C
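The second step (HTML to text) can also be pulled out into a small reusable helper that needs only the JDK. The sketch below is one possible way to do that, not part of the demo above: the class name HtmlToText and the method htmlToText are made up for illustration. It passes true as the last argument so that the parser ignores any charset declaration in the HTML, and it collapses runs of whitespace so the chunks do not end up separated by double spaces.

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; the names are illustrative only.
final class HtmlToText {

    // Collects every text chunk reported by the JDK's HTML parser and
    // normalizes the whitespace between the chunks.
    static String htmlToText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore any charset declaration found in the HTML
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse whitespace runs and drop the trailing separator.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>This is <i>italic</i> and <b>that</b> is bold</p>"
                + "<ul><li>item A</li><li>item B</li></ul>";
        System.out.println(htmlToText(html));
    }
}
```

Because a space is appended after every chunk, punctuation that directly follows an inline tag (for example "<b>that</b>.") ends up separated from the word before it, so this is only suitable where such spacing does not matter.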
+9

If you need plain text, use the WikiClean library: https://github.com/lintool/wikiclean

I had the same problem, and this was the only solution that worked well for me in Java.

There are two possibilities:

1) If the text is no longer wrapped in XML, add the XML tags that the cleaner needs. For example, if you parsed the dump's XML earlier and now hold only the page content without any XML structure, wrap it in xmlStartTag and xmlEndTag as in the code below, then clean it.

    String xmlStartTag = "<text xml:space=\"preserve\">";
    String xmlEndTag = "</text>";
    String articleWithXml = xmlStartTag + article.getText() + xmlEndTag;
    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(articleWithXml);

2) If you read the Wikipedia dump file (an XML file) directly, pass its contents to the cleaner unchanged.

    WikiClean cleaner = new WikiClean.Builder().build();
    String plainWikiText = cleaner.clean(XMLFileContents);
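Both cases assume you can already get at the content of an individual page. As a rough sketch of one way to do that with only the JDK, the code below walks a dump with the StAX API and collects the raw markup of each page. The class DumpPages and the method readPages are hypothetical names and not part of WikiClean. It relies on the title and text elements of the MediaWiki export format and, for brevity, keeps everything in a map, which would be too memory-hungry for a full dump.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of WikiClean.
final class DumpPages {

    // Returns page title -> raw wiki markup, in document order.
    // If a page holds several revisions, the last one wins.
    static Map<String, String> readPages(Reader dump) {
        Map<String, String> pages = new LinkedHashMap<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(dump);
            String title = null;
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = xml.getLocalName();
                if (name.equals("title")) {
                    title = xml.getElementText();
                } else if (name.equals("text")) {
                    // getElementText() returns the content with XML entities decoded.
                    pages.put(title, xml.getElementText());
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed dump XML", e);
        }
        return pages;
    }

    public static void main(String[] args) {
        String dump = "<mediawiki><page><title>Example</title><revision>"
                + "<text xml:space=\"preserve\">'''Bold''' text</text>"
                + "</revision></page></mediawiki>";
        readPages(new StringReader(dump))
                .forEach((title, text) -> System.out.println(title + ": " + text));
    }
}
```

For a real dump you would probably process each page inside the loop instead of collecting them, and pass a reader over the dump file rather than a string.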
+2

Mylyn WikiText can convert various wiki syntaxes to HTML and other formats. It also supports the MediaWiki syntax that Wikipedia uses. Although Mylyn WikiText is mainly an Eclipse plugin, it is also available as a standalone library.

+1

Try the MediaWiki-to-plain-text approach. You will probably need to extend the PlainTextConverter class for your needs. Combined with the example for converting Wikipedia texts to HTML, you can also transclude template contents.

+1
