How do I get the text content of a Wikipedia page (via the Wikipedia API)?

I just want the content (no links, no categories, no images ... text only).

+7
1 answer

You cannot get "text only" from the Wikipedia API. You can fetch the rendered HTML of the page (if you do this through index.php rather than api.php, use action=render to avoid loading the whole skin) or the wikitext (which you can get via the API or by passing action=raw to index.php); either way, you will have to parse it yourself to strip out the bits you do not want to keep.

In the HTML output, MediaWiki is generally good about adding classes to the various interface elements you might want to filter out; user-created templates and the like are perhaps less consistent (for example, a hack for sorting tables simply puts some text in a span with display:none, rather than marking it with a class).
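As a rough illustration of that filtering, here is a minimal sketch using only Python's standard library. The class names in `SKIP_CLASSES` are illustrative assumptions rather than a complete list, and the names `TextExtractor` and `html_to_text` are made up for this example. It also assumes reasonably well-formed HTML (every non-void element inside a skipped region has a closing tag), which may not hold for every page.

```python
from html.parser import HTMLParser

# Illustrative assumptions: adjust these to whatever you want to drop.
SKIP_CLASSES = {"mw-editsection", "reference", "noprint"}
SKIP_TAGS = {"script", "style"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collects text, skipping elements matched by class or inline display:none."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a filtered element

    def _should_skip(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return (tag in SKIP_TAGS
                or bool(classes & SKIP_CLASSES)
                or "display:none" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a skipped element, count every nested element too.
        if self.skip_depth or self._should_skip(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Checking the inline style as well as the class is what catches cases like the sort-key hack, where the hidden text carries no class at all.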

To get the wikitext via the API, use action=query with prop=revisions. To get the rendered HTML, use action=parse.
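For reference, the request URLs mentioned in this answer can be put together roughly as follows. This is only a sketch: the English Wikipedia endpoints, the helper names, and the extra parameters (`rvprop=content`, `rvslots=main`, `prop=text`, `format=json`) are assumptions added for illustration, so check them against the API documentation for your wiki.

```python
from urllib.parse import urlencode

# Assumed endpoints for English Wikipedia; other wikis use a different host.
API_PHP = "https://en.wikipedia.org/w/api.php"
INDEX_PHP = "https://en.wikipedia.org/w/index.php"


def wikitext_api_url(title):
    """URL asking api.php for the wikitext of the latest revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",  # assumed: ask for the revision text itself
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return API_PHP + "?" + urlencode(params)


def html_api_url(title):
    """URL asking api.php for the rendered HTML of the page."""
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    return API_PHP + "?" + urlencode(params)


def index_url(title, action):
    """URL for index.php; action is "render" (HTML, no skin) or "raw" (wikitext)."""
    return INDEX_PHP + "?" + urlencode({"title": title, "action": action})


if __name__ == "__main__":
    print(wikitext_api_url("Main_Page"))
    print(html_api_url("Main_Page"))
    print(index_url("Main_Page", "render"))
    print(index_url("Main_Page", "raw"))
```

The script only builds the URLs; fetch them with whatever HTTP client you prefer, and send a descriptive User-Agent header when you do.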

+10