Extract paragraphs from Wikipedia API using PHP cURL

Question


I am trying to use the Wikipedia (MediaWiki) API at http://en.wikipedia.org/w/api.php to do the following:

1. Send a GET request to http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=[keyword] to get a list of suggested pages for the keyword
2. Loop through each suggested page with a GET request to http://en.wikipedia.org/w/api.php?format=json&action=query&export&titles=[page title]
3. Extract all paragraphs found on the page into an array
4. Do something with the array

I'm stuck on step 3. I see a bunch of JSON data that includes "\n\n" between paragraphs, but for some reason the PHP explode() function is not working.

Essentially, I just want to grab the "meat" of every Wikipedia page (not the headings or any formatting, just the content) and split it into an array of paragraphs.

Any ideas? Thanks!


php curl parsing wikipedia-api mediawiki

Kane May 21 '10 at 6:25


1 answer

Emil Vikström · Accepted Answer · 2010-05-21T07:13:56+0000

The \n\n in the data is literally those four characters (backslash, n, backslash, n), not line breaks. Make sure you use single quotes around the delimiter string in explode:

$parts = explode('\n\n', $text);

If you decide to use double quotes, you will have to escape the backslashes, for example:

$parts = explode("\\n\\n", $text);
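To see the quoting difference in isolation, here is a minimal sketch. The sample string is made up and only stands in for raw JSON text as cURL might return it. It also shows, as a side note, that once the text has been run through json_decode the escapes turn into real newlines, so the delimiter changes accordingly.

```php
<?php
// Made-up sample of raw JSON text. Inside single quotes PHP keeps \n as
// two characters (backslash + n), which is how it appears in undecoded JSON.
$raw = '{"text":"First paragraph.\n\nSecond paragraph."}';

// Single-quoted delimiter: matches the four literal characters \n\n.
$literal = explode('\n\n', $raw);

// Unescaped double-quoted delimiter: two real newlines, which the raw
// text does not contain, so nothing is split.
$noMatch = explode("\n\n", $raw);

// After json_decode the escape sequences become real newlines,
// so here the double-quoted "\n\n" is the right delimiter.
$data = json_decode($raw, true);
$paragraphs = explode("\n\n", $data['text']);

var_dump(count($literal));  // number of pieces from the raw text
var_dump(count($noMatch));  // stays a single piece
print_r($paragraphs);
```

Which variant applies depends on whether you split the raw response body or the decoded value.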

On another note: why are you fetching the data in two different formats? Why not stick to just JSON or just XML?
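As a rough illustration of sticking to one format, the sketch below requests JSON for both steps. The helper name wikiApiUrl is hypothetical, the sample response is made up, and it assumes the opensearch JSON reply is an array whose second element is the list of suggested titles. The actual HTTP call is left out so the snippet runs offline.

```php
<?php
// Hypothetical helper: build an API URL that always asks for JSON.
function wikiApiUrl(array $params): string
{
    $params['format'] = 'json';
    return 'http://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Made-up sample of an opensearch reply in JSON: [query, [titles], ...].
$sampleResponse = '["php",["PHP","PHP-GTK"]]';

// Decode to a PHP array and take the titles list (assumed to be index 1).
$decoded = json_decode($sampleResponse, true);
$titles = $decoded[1] ?? [];

// Build the follow-up query URL for each suggested title.
// 'export' is sent with an empty value here; that is an assumption
// about how the flag from the question can be passed.
foreach ($titles as $title) {
    echo wikiApiUrl(['action' => 'query', 'export' => '', 'titles' => $title]), "\n";
}
```

With a single format, one json_decode call covers both responses and no XML parser is needed.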
