Scraping hierarchical data

Question


I am trying to scrape the list of department stores by continent and country from Wikipedia's global list of department stores. I run the following code to get the continents first. As the output shows, the XML hierarchy is such that the countries belonging to each continent are not child nodes of that continent's node.

    > url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
    > doc = htmlTreeParse(url, useInternalNodes = T)
    > nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']")
    > # For Africa
    > xmlChildren(nodeNames[[1]])
    $a
    <a href="/wiki/Africa" title="Africa">Africa</a>

    attr(,"class")
    [1] "XMLInternalNodeList" "XMLNodeList"
    > xmlSize(nodeNames[[1]])
    [1] 1

I know I could get the countries with a separate getNodeSet call, but I wanted to make sure I am not missing something. Is there a smarter way to get all the data for every continent, and then for every country, at once?


xml r xml-parsing xpath web-scraping

user1848018 Feb 01 '13 at 18:04


1 answer

agstudy · Accepted Answer · 2013-02-01T20:18:12+0000

Using XPath, several paths can be combined with the | separator. So I use it to get the countries and the shops in the same list. Then I get a second list containing only the countries, and I use that second list to split the first one.

    url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
    library(XML)
    xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

    ## Here I use the combined xpath
    cont.shops <- xpathApply(xmltext,
                             '//*[@id="mw-content-text"]/ul/li| //*[@id="mw-content-text"]/h3',
                             xmlValue)
    cont.shops <- do.call(rbind, cont.shops)  ## from list to one-column matrix
    head(cont.shops)  ## first element is country followed by shops
         [,1]
    [1,] "[edit] Tunisia"
    [2,] "Magasin Général"
    [3,] "Mercure Market"
    [4,] "Promogro"
    [5,] "Geant"
    [6,] "Carrefour"

    ## I get all the countries in one list
    contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
    contries <- do.call(rbind, contries)  ## from list to one-column matrix
    head(contries)
         [,1]
    [1,] "[edit] Tunisia"
    [2,] "[edit] Morocco"
    [3,] "[edit] Ghana"
    [4,] "[edit] Kenya"
    [5,] "[edit] Nigeria"
    [6,] "[edit] South Africa"
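To see what the combined path does without hitting Wikipedia, here is a minimal sketch on a made-up inline document. The HTML string and the names `html`, `doc` and `mixed` are my own assumptions for illustration, not part of the real page. It relies on the same behaviour the output above shows: the XML package (via libxml2) returns the union of the two paths in document order, so each heading is followed by its list items.

```r
library(XML)

# Hypothetical miniature of the page layout: headings and lists are siblings,
# so the list items are NOT children of the heading they belong to.
html <- '<html><body><div id="mw-content-text">
  <h3>Tunisia</h3>
  <ul><li>Monoprix</li><li>Carrefour</li></ul>
  <h3>Morocco</h3>
  <ul><li>Alpha 55</li></ul>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# The union operator | collects the nodes matched by either path.
mixed <- xpathSApply(
  doc,
  '//*[@id="mw-content-text"]/ul/li | //*[@id="mw-content-text"]/h3',
  xmlValue
)
print(mixed)
```

Because headings and items come back interleaved, the only remaining job is to work out which heading each item follows.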

Now I do some processing to split cont.shops by country, using contries.

    dd <- which(cont.shops %in% contries)  ## get the index of contries
    freq <- c(diff(dd), length(cont.shops) - tail(dd, 1) + 1)  ## use diff to get Frequencies
    contries.f <- rep(contries, freq)  ## create the factor splitter
    ll <- split(cont.shops, contries.f)

I can check the result:

    > ll[[contries[1]]]
    [1] "[edit] Tunisia"  "Magasin Général" "Mercure Market"  "Promogro"        "Geant"
    [6] "Carrefour"       "Monoprix"
    > ll[[contries[2]]]
    [1] "[edit] Morocco"
    [2] "Alpha 55, one 6-story store in Casablanca"
    [3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"
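The diff/rep bookkeeping used for the split could also be written with cumsum, which some may find easier to read. This is only a sketch on toy vectors: `items`, `headers`, `is_header`, `group_id` and `stores` are hypothetical names standing in for cont.shops and contries. It assumes the first element is a heading, that the heading names are unique, and that `headers` is in the same order as the headings appear in `items`.

```r
# Toy stand-ins for the scraped vectors (hypothetical values)
items   <- c("Tunisia", "Monoprix", "Carrefour", "Morocco", "Alpha 55")
headers <- c("Tunisia", "Morocco")

is_header <- items %in% headers

# Running count of headings seen so far: every item gets the number
# of the heading it follows.
group_id <- cumsum(is_header)

# Drop the heading rows and split the remaining items by their heading.
# Fixing the factor levels keeps the groups in page order rather than
# alphabetical order.
stores <- split(
  items[!is_header],
  factor(headers[group_id[!is_header]], levels = headers)
)
print(stores)
```

Unlike the version above, this leaves the heading itself out of each group, so `stores$Tunisia` holds only shop names.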
