How to scrape this Squawka page?

I am trying to extract the following information:

On the page

http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches

pressing the red "full statistics" button opens a menu that includes (on the left side) a "Crosses" button. Clicking it displays, on the right side of the screen, an image of a football pitch with 19 arrows on it; these are Stoke's crosses in the Stoke vs Arsenal match. The arrows are color coded: red = not completed, green = completed, yellow = key pass. When you click on an arrow, it tells you who made the pass and in which minute of the game. In addition, the arrows show where the player stood when he made the pass and where the receiving player was positioned.

I would like to scrape this page to get a table with the columns:

club; sender name; location of the sender; location of the receiver; minute; arrow color

This is the set of crosses made by Stoke; I would also like to repeat this automatically for Arsenal (hence the "club" column in the table above).

Although I have scraped web pages in the past, they were all static and fairly simple pages, and I am completely overwhelmed by how to scrape the information from this one. I would be very grateful for help in scraping the data I have just described. I am well versed in R, so I would especially appreciate code that achieves this in R, but I would also be grateful for help that uses a different language or software.

Thanks, Peter

1 answer

Peter, as others have indicated, you can do this with Selenium. I also like to use the excellent selectr package. The idea is to interact with the site briefly and then do the rest of the processing elsewhere. squawkData should contain everything you need.

    # RSelenium::startServer() # if needed
    require(RSelenium)
    require(XML)      # provides xmlParse(), xmlValue() and xmlAttrs()
    require(selectr)

    remDr <- remoteDriver()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)
    remDr$navigate("http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches")

    # serialize the page's XML data object to a string and return it to R
    squawkData <- remDr$executeScript("return new XMLSerializer().serializeToString(squawkaDp.xml);", list())

    # select every time slice inside the crosses section
    example <- querySelectorAll(xmlParse(squawkData[[1]]), "crosses time_slice")
    example[[1]]

    <time_slice name="0 - 5" id="1">
      <event player_id="531" mins="4" secs="39" minsec="279" team="44" type="Failed">
        <start>73.1,87.1</start>
        <end>97.9,49.1</end>
      </event>
    </time_slice>

DISCLAIMER: I am the author of the RSelenium package. Basic vignettes on its operation can be found at "RSelenium basics" and "RSelenium: Testing Shiny apps".

Additional information can be easily accessed with selectr:

    > xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "players #531 name")[[1]])
    [1] "Charlie Adam"
    > xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "game team#44 long_name")[[1]])
    [1] "Stoke City"

UPDATE:
To turn example into a data frame, you can do something like

    out <- lapply(example, function(x){
      # handle each event
      if(length(x['event']) > 0){
        res <- lapply(x['event'], function(y){
          matchAttrs <- as.list(xmlAttrs(y))
          matchAttrs$start <- xmlValue(y['start']$start)
          matchAttrs$end <- xmlValue(y['end']$end)
          matchAttrs
        })
        return(do.call(rbind.data.frame, res))
      }
    })

    > head(do.call(rbind, out))
            player_id mins secs minsec team   type     start       end
    event         531    4   39    279   44 Failed 73.1,87.1 97.9,49.1
    event5        311    6   33    393   31 Failed 92.3,13.1 93.0,31.0
    event1        376    8   57    537   31 Failed  97.7,6.1 96.7,16.4
    event6        311   13   50    830   31 Failed  99.5,0.5 94.9,42.6
    event11       311   14   11    851   31 Failed  99.5,0.5 93.1,51.0
    event7        311   17   41   1061   31 Failed 99.5,99.5 92.6,50.1
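To get from that data frame to the table asked for in the question, one possible follow-up step is sketched below in base R. It is only an illustration under stated assumptions: the sample rows are copied from the output above, the lookup vectors player_names, team_names and arrow_colours are hypothetical helpers (only player 531 = "Charlie Adam" and team 44 = "Stoke City" are confirmed by the selectr queries), and mapping the type "Failed" to red is an assumption based on the colour coding described in the question.

```r
# Sample rows copied from the output above (first three events only).
crosses <- data.frame(
  player_id = c("531", "311", "376"),
  mins      = c("4", "6", "8"),
  team      = c("44", "31", "31"),
  type      = c("Failed", "Failed", "Failed"),
  start     = c("73.1,87.1", "92.3,13.1", "97.7,6.1"),
  end       = c("97.9,49.1", "93.0,31.0", "96.7,16.4"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup tables. Only these two entries are confirmed by the
# selectr queries shown earlier; unknown ids simply come out as NA.
player_names <- c("531" = "Charlie Adam")
team_names   <- c("44" = "Stoke City")

# Assumption: "Failed" corresponds to the red (not completed) arrows.
# Other type values do not appear in the output above, so they map to NA.
arrow_colours <- c("Failed" = "red")

# Split a column of "x,y" strings into a two-column numeric matrix.
split_xy <- function(xy) {
  parts <- strsplit(as.character(xy), ",", fixed = TRUE)
  matrix(as.numeric(unlist(parts)), ncol = 2, byrow = TRUE)
}

start_xy <- split_xy(crosses$start)
end_xy   <- split_xy(crosses$end)

# as.character() guards against factor columns, which rbind.data.frame
# can produce in older versions of R.
cross_table <- data.frame(
  club         = unname(team_names[as.character(crosses$team)]),
  sender       = unname(player_names[as.character(crosses$player_id)]),
  start_x      = start_xy[, 1],
  start_y      = start_xy[, 2],
  end_x        = end_xy[, 1],
  end_y        = end_xy[, 2],
  minute       = as.integer(as.character(crosses$mins)),
  arrow_colour = unname(arrow_colours[as.character(crosses$type)]),
  stringsAsFactors = FALSE
)

print(cross_table)
```

With the real data you would replace crosses with do.call(rbind, out) and build the two name lookups from squawkData using the same selectr queries shown earlier, once per distinct player and team id.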