Copy data from Wikipedia

I'm trying to find or build a web scraper that can go through and find every state / national park in the USA, along with its GPS coordinates and land area. I looked at some frameworks, such as Scrapy, and I also see there are some tools designed specifically for Wikipedia, such as http://wiki.dbpedia.org/About . Does one of them have a particular advantage, or will one work better than the other for loading the information into an online database?

+4
3 answers

Suppose you want to parse pages such as this page on Wikipedia (the list of national parks). The following code should work.

    // Requires the HtmlAgilityPack library and System.Web (for HttpUtility).
    var doc = new HtmlDocument();
    doc = .. // Load the document here. See doc.Load(..), doc.LoadHtml(..), etc.

    // We get all the rows from the sortable table (except the header).
    var rows = doc.DocumentNode
                  .SelectNodes("//table[contains(@class, 'sortable')]//tr")
                  .Skip(1);

    foreach (var row in rows)
    {
        // First cell: the park name link.
        var name = HttpUtility.HtmlDecode(
            row.SelectSingleNode("./*[1]/a[@href and @title]").InnerText);

        // Decimal coordinates from the geo-dec span.
        var loc = HttpUtility.HtmlDecode(
            row.SelectSingleNode(".//span[@class='geo-dec']").InnerText);

        // Fifth cell: the land area; the first child node (likely a hidden sort key) is skipped.
        var areaNodes = row.SelectSingleNode("./*[5]").ChildNodes.Skip(1);
        string area = "";
        foreach (var a in areaNodes)
        {
            area += HttpUtility.HtmlDecode(a.InnerText);
        }

        Console.WriteLine("{0,-30} {1,-20} {2,-10}", name, loc, area);
    }

I tested it, and it produced the following output:

    Acadia                         44.35°N 68.21°W      47,389.67 acres (191.8 km2)
    American Samoa                 14.25°S 170.68°W     9,000.00 acres (36.4 km2)
    Arches                         38.68°N 109.57°W     76,518.98 acres (309.7 km2)
    Badlands                       43.75°N 102.50°W     242,755.94 acres (982.4 km2)
    Big Bend                       29.25°N 103.25°W     801,163.21 acres (3,242.2 km2)
    Biscayne                       25.65°N 80.08°W      172,924.07 acres (699.8 km2)
    Black Canyon of the Gunnison   38.57°N 107.72°W     32,950.03 acres (133.3 km2)
    Bryce Canyon                   37.57°N 112.18°W     35,835.08 acres (145.0 km2)
    Canyonlands                    38.2°N 109.93°W      337,597.83 acres (1,366.2 km2)
    Capitol Reef                   38.20°N 111.17°W     241,904.26 acres (979.0 km2)
    Carlsbad Caverns               32.17°N 104.44°W     46,766.45 acres (189.3 km2)
    Channel Islands                34.01°N 119.42°W     249,561.00 acres (1,009.9 km2)
    Congaree                       33.78°N 80.78°W      26,545.86 acres (107.4 km2)
    Crater Lake                    42.94°N 122.1°W      183,224.05 acres (741.5 km2)
    Cuyahoga Valley                41.24°N 81.55°W      32,860.73 acres (133.0 km2)
    Death Valley                   36.24°N 116.82°W     3,372,401.96 acres (13,647.6 km2)
    Denali                         63.33°N 150.50°W     4,740,911.72 acres (19,185.8 km2)
    Dry Tortugas                   24.63°N 82.87°W      64,701.22 acres (261.8 km2)
    Everglades                     25.32°N 80.93°W      1,508,537.90 acres (6,104.8 km2)
    Gates of the Arctic            67.78°N 153.30°W     7,523,897.74 acres (30,448.1 km2)
    Glacier                        48.80°N 114.00°W     1,013,572.41 acres (4,101.8 km2)
    (...)

This is just a starting point. If the code fails on some page, check whether that page's layout is different, etc.

Of course, you will also need to find a way to get all the links you want to parse.
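For example, here is a minimal sketch of collecting those links from the same list page, again with HtmlAgilityPack. The page URL and the assumption that the name link sits in the first cell of each row are mine, so adjust them to whatever page you actually parse:

    // Sketch only: collect links to the individual park articles from the
    // same sortable table. Assumes the name link is in the first cell of each row.
    using System;
    using System.Linq;
    using HtmlAgilityPack;

    class LinkCollector
    {
        static void Main()
        {
            var web = new HtmlWeb();
            // Example list-page URL; substitute the page you are parsing.
            var doc = web.Load("https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States");

            var rows = doc.DocumentNode
                          .SelectNodes("//table[contains(@class, 'sortable')]//tr")
                          .Skip(1);

            foreach (var row in rows)
            {
                var link = row.SelectSingleNode("./*[1]/a[@href and @title]");
                if (link == null) continue; // row does not match the expected layout

                // Wikipedia hrefs are relative, e.g. /wiki/Acadia_National_Park
                var href = link.GetAttributeValue("href", "");
                Console.WriteLine("https://en.wikipedia.org" + href);
            }
        }
    }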

One important thing: do you know whether scraping Wikipedia is allowed? I have no idea, but you should make sure of that before you do it... ;)

+10

Although the question is a bit old, another alternative available now is to avoid scraping altogether and get the raw data from protectedplanet.net - it contains data from the World Database on Protected Areas and the UN List of Protected Areas. (Disclosure: I worked at UNEP-WCMC, the organization that created and maintains the database and the website.)

It's free for non-commercial use, but you need to register to download. For example, this page lets you download the 22,600 protected areas in the US as KMZ, CSV and SHP (containing lat, lng, boundaries, IUCN category and a bunch of other metadata).
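If you take the CSV route, a rough sketch of pulling out the fields the question asks about might look like this. The column names "NAME" and "REP_AREA" are my assumption about the export's header, so check the first line of the file you actually download:

    // Sketch only: read a protectedplanet.net CSV export and print name + area.
    using System;
    using System.IO;
    using System.Linq;

    class WdpaCsvSketch
    {
        static void Main()
        {
            var lines = File.ReadAllLines("protected_areas_usa.csv"); // hypothetical file name

            // Resolve column indexes from the header so reordered columns still work.
            var header = lines[0].Split(',');
            int nameCol = Array.IndexOf(header, "NAME");     // assumed header name
            int areaCol = Array.IndexOf(header, "REP_AREA"); // assumed header name (area in km2)

            foreach (var line in lines.Skip(1))
            {
                // A real CSV parser (e.g. CsvHelper) is safer here, because
                // names may contain quoted commas.
                var cells = line.Split(',');
                Console.WriteLine("{0,-40} {1,10} km2", cells[nameCol], cells[areaCol]);
                // From here you could INSERT the values into your online database.
            }
        }
    }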

+4

I would not scrape Wikipedia for this; it's not the best approach.

My idea would be to use the API from openstreetmap.org (or any other geo API that you can query) and request the necessary data from it. National parks should be fairly easy to find there. You could get the names from a source such as Wikipedia and then ask the geo API for the information you need.
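For instance, here is a rough sketch of querying OpenStreetMap's Overpass API for a park by name and printing the response. The boundary=national_park tag is an OSM tagging convention and real-world tagging varies, so treat this as a starting point rather than a finished solution:

    // Sketch only: query the Overpass API (an OpenStreetMap query service) for a
    // park by name and dump the JSON response, which includes a centre lat/lon.
    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading.Tasks;

    class OverpassSketch
    {
        static async Task Main()
        {
            var query = @"[out:json];
                relation[""boundary""=""national_park""][""name""=""Acadia National Park""];
                out center;";

            using var client = new HttpClient();
            var content = new FormUrlEncodedContent(new[]
            {
                new KeyValuePair<string, string>("data", query)
            });

            var response = await client.PostAsync("https://overpass-api.de/api/interpreter", content);

            // Parse the JSON (e.g. with System.Text.Json) to pull out the centre
            // coordinates of each matching relation.
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }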

By the way, what happened to Wikipedia's List of National Parks?

+1
