Wikipedia (MediaWiki) URI Encoding Scheme

How does Wikipedia (or MediaWiki in general) encode page names in a URI? It is not normal URI encoding: spaces are replaced with underscores, double quotes are not encoded, and so on.

2 answers

http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you will find a description of what their engine does to article titles.

They should have something like this in their LocalSettings.php: $wgArticlePath = '/wiki/$1';

together with a matching URI rewrite configuration on the server. They seem to be using Apache (judging by the HTTP headers), so it is probably mod_rewrite. See http://www.mediawiki.org/wiki/Manual:Short_URL

You can also reach a Wikipedia article through index.php: request http://en.wikipedia.org/w/index.php?title=Foo%20bar and the engine redirects you to http://en.wikipedia.org/wiki/Foo_bar . Behind the scenes, mod_rewrite translates that into /index.php?title=Foo_bar. To the MediaWiki engine, it is the same as if you had visited http://en.wikipedia.org/w/index.php?title=Foo_bar - that page does not redirect you.
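As a rough illustration of the mapping described above, here is a minimal Python sketch that turns a page title into a /wiki/ style URL. The function name wiki_url and the choice of which characters to leave unescaped are my own assumptions, not MediaWiki's actual rules; the engine's own encoder treats some characters (double quotes, for instance) differently.

```python
from urllib.parse import quote


def wiki_url(title, base="http://en.wikipedia.org/wiki/"):
    """Build a short-URL style link for a page title (illustrative only)."""
    # Spaces become underscores instead of %20.
    path = title.strip().replace(" ", "_")
    # Assumption: keep "/" and ":" readable and percent-encode the rest.
    # This is a simplification, not MediaWiki's exact character list.
    return base + quote(path, safe="/:")


print(wiki_url("Foo bar"))
print(wiki_url("Wikipedia:Naming conventions (technical restrictions)"))
```

The second call reproduces the %28 and %29 escapes seen in the naming conventions link earlier in this answer.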


The process is fairly complicated and not exactly pretty. You need to look at the Title class found in includes/Title.php . Start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.

Note that (as with much of MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you will need to extract the logic rather than just reuse the class.

The logic looks something like this:

  • Decode character references (e.g. &eacute;)
  • Convert spaces to underscores
  • Check whether the title has a namespace or interwiki prefix
  • Remove hash fragments (e.g. Apple#Name)
  • Remove forbidden characters
  • Forbid subdirectory links (e.g. ../directory/page )
  • Forbid triple tilde sequences ( ~~~ ) (for some reason)
  • Limit the size to 255 bytes
  • Capitalize the first letter
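
The steps above could be sketched in Python roughly as follows. This is not a port of secureAndSplit: the name normalize_title, the FORBIDDEN character set, and the choice to raise ValueError are assumptions made for illustration, and the namespace/interwiki check is left out entirely.

```python
import html
import re

MAX_TITLE_BYTES = 255
# Assumption: an illustrative set of forbidden characters,
# not the list MediaWiki actually uses.
FORBIDDEN = re.compile(r"[\[\]{}|<>]")


def normalize_title(text):
    """Approximate the title clean-up steps listed above."""
    # Decode character references such as &eacute;
    title = html.unescape(text)
    # Convert spaces to underscores, collapsing runs of either
    title = re.sub(r"[ _]+", "_", title)
    # (The namespace / interwiki prefix check is omitted in this sketch.)
    # Remove the hash fragment: Apple#Name -> Apple
    title = title.split("#", 1)[0]
    # Remove forbidden characters, then trim stray underscores
    title = FORBIDDEN.sub("", title).strip("_")
    if not title:
        raise ValueError("empty title")
    # Forbid subdirectory links such as ../directory/page
    parts = title.split("/")
    if "." in parts or ".." in parts:
        raise ValueError("relative path components are not allowed")
    # Forbid triple tilde sequences
    if "~~~" in title:
        raise ValueError("triple tilde sequences are not allowed")
    # Limit the size to 255 bytes (bytes, not characters)
    if len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        raise ValueError("title is longer than 255 bytes")
    # Capitalize the first letter only
    return title[0].upper() + title[1:]


print(normalize_title("foo bar"))
print(normalize_title("apple#Name"))
```

The length check works on the UTF-8 encoding because the limit is in bytes, so a title with many non-ASCII characters hits it sooner than its character count suggests.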

In addition, I believe I am right in saying that quotation marks do not need to be encoded by the user - browsers handle this transparently.

I hope this helps!
