Open Source Parser Code for MediaWiki Markup

I'm interested in parsing the MediaWiki markup found in XML dumps to generate a customized HTML page, one that uses only a subset of the HTML produced by MediaWiki's own PHP rendering engine.

I want this for BzReader, an offline reader for compressed MediaWiki dumps written in C#. So a C# parser would be ideal, but any good code would help.

Of course, if no one has done this before, I think it's time to start a project that maintains a free, standalone MediaWiki parser, based on MediaWiki's own parser but less tightly coupled to MediaWiki itself.

So, does anyone know of a code base I could start from that would be better than hacking on the MediaWiki PHP code?

c# php parsing open-source mediawiki
3 answers

There is a list of parsers at http://www.mediawiki.org/wiki/Alternative_parsers, but no C# parser is listed there.


Update
On second thought, ScrewTurn does not stick to the MediaWiki syntax; it uses its own slightly altered version.

The MediaWiki syntax is not suited to LALR (or even LL(*)) parsing, since its definition contains many ambiguities and it also allows embedded HTML. This is discussed in this question: you are essentially stuck with writing your own tokenizer and parser by hand, rather than just writing a BNF grammar file and feeding it to ANTLR, GOLD or Irony.

Roadkill Wiki uses a Creole parser for its MediaWiki parsing, but with limited support.
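To make the "write your own tokenizer" point concrete, here is a minimal sketch of a hand-written scanner for a tiny subset of the markup (bold, italic and internal link brackets). The names WikiTokenKind, WikiToken and WikiTokenizer are made up for this illustration and do not come from MediaWiki, ScrewTurn or BzReader. It simply tries the longest marker first, which is an assumption of the sketch, not how MediaWiki itself resolves runs of apostrophes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical token kinds for a tiny subset of wiki markup.
public enum WikiTokenKind { Text, BoldToggle, ItalicToggle, LinkOpen, LinkClose }

public sealed class WikiToken
{
    public WikiTokenKind Kind { get; }
    public string Value { get; }
    public WikiToken(WikiTokenKind kind, string value) { Kind = kind; Value = value; }
}

public static class WikiTokenizer
{
    // Longest markers come first so that ''' wins over ''.
    private static readonly KeyValuePair<string, WikiTokenKind>[] Markers =
    {
        new KeyValuePair<string, WikiTokenKind>("'''", WikiTokenKind.BoldToggle),
        new KeyValuePair<string, WikiTokenKind>("''", WikiTokenKind.ItalicToggle),
        new KeyValuePair<string, WikiTokenKind>("[[", WikiTokenKind.LinkOpen),
        new KeyValuePair<string, WikiTokenKind>("]]", WikiTokenKind.LinkClose),
    };

    public static List<WikiToken> Tokenize(string input)
    {
        var tokens = new List<WikiToken>();
        var text = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            bool matched = false;
            foreach (var entry in Markers)
            {
                string marker = entry.Key;
                // Ordinal comparison of the marker against the input at position i.
                if (string.CompareOrdinal(input, i, marker, 0, marker.Length) == 0)
                {
                    FlushText(text, tokens);
                    tokens.Add(new WikiToken(entry.Value, marker));
                    i += marker.Length;
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                // Anything that is not a marker is accumulated as plain text.
                text.Append(input[i]);
                i++;
            }
        }
        FlushText(text, tokens);
        return tokens;
    }

    private static void FlushText(StringBuilder text, List<WikiToken> tokens)
    {
        if (text.Length == 0) return;
        tokens.Add(new WikiToken(WikiTokenKind.Text, text.ToString()));
        text.Clear();
    }
}
```

For the input '''bold''' and [[Link]] this yields seven tokens: BoldToggle, Text("bold"), BoldToggle, Text(" and "), LinkOpen, Text("Link"), LinkClose. A run of five apostrophes is where the ambiguity shows up: the sketch greedily emits BoldToggle followed by ItalicToggle, and deciding what that run really means needs context that a plain grammar file cannot express.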


ScrewTurn has a C# parser, licensed under the GPL:

The class you're after is Core.Formatter, which uses a lot of regexes to do its job:

public static class Formatter { } 

It's not the prettiest code, "but it works."
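To give a rough idea of what a regex-driven formatter looks like, here is a minimal sketch. It is not ScrewTurn's code: the class name MiniFormatter, its patterns and the order of the passes are all assumptions made for illustration. It handles only level-2 headings, bold and italic, escapes all HTML in the input (so embedded HTML is not supported), and assumes \n line endings.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical regex-based formatter for a very small subset of wiki markup.
public static class MiniFormatter
{
    // "== Title ==" on its own line; the lookarounds reject "===" headings.
    private static readonly Regex Heading =
        new Regex(@"^==(?!=)[ \t]*(.+?)[ \t]*(?<!=)==[ \t]*$", RegexOptions.Multiline);

    // '''bold''' must be processed before ''italic''.
    private static readonly Regex Bold = new Regex(@"'''(.+?)'''");
    private static readonly Regex Italic = new Regex(@"''(.+?)''");

    public static string Format(string wikiText)
    {
        // Escape HTML first so the only tags in the output are the ones added below.
        string html = wikiText
            .Replace("&", "&amp;")
            .Replace("<", "&lt;")
            .Replace(">", "&gt;");

        html = Heading.Replace(html, "<h2>$1</h2>");
        html = Bold.Replace(html, "<b>$1</b>");
        html = Italic.Replace(html, "<i>$1</i>");
        return html;
    }
}
```

Calling MiniFormatter.Format("== Title ==\n'''bold''' and ''italic''") returns "<h2>Title</h2>\n<b>bold</b> and <i>italic</i>". The order of the passes matters, which is one reason this style of formatter gets hard to maintain as more of the syntax is covered.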


I had a few words to say about MediaWiki templates here. Interestingly, there is now a list of alternative parsers; I will have to look into it.
