How to build a web scraper in R using readLines and grep?

I am completely new to R. I want to compile a corpus of newspaper articles about one million words in size. So I am trying to write a web scraper to extract newspaper articles from, for example, the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs .

The scraper should start on one page, extract the main text of the article, strip all HTML tags and save the result to a text file. It should then follow the links on that page to the next article, extract that one too, and so on, until the file contains about one million words.

Unfortunately, I have not gotten very far with my scraper.

I used readLines() to fetch the page source, and now I would like to extract the relevant lines from it.

The relevant section of the Guardian page uses this id to mark the body of the article:

<div id="article-body-blocks">         
  <p>
    <a href="http://www.guardian.co.uk/politics/boris"
       title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
       the...a different approach."
  </p>
</div>

I tried to extract this section using various grep expressions with lookbehind, trying to grab the lines following this id, but I don't think that works across multiple lines. At least I can't get it to work.

Can anyone help? It would be great if someone could provide me with some code that I could build on!

Thanks.

1 answer

Parsing HTML with grep and readLines alone is an uphill battle, but it can be done.

First, read in the page source:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

Then str_extract from the stringr package can pull out the article div:

library(stringr)
body <- str_extract(paste(html, collapse='\n'), '(?s)<div id="article-body-blocks">.*</div>')

At this point body still contains <p> tags, links and other markup. You can strip those with gsub (it is ugly, but it works). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)
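To illustrate what this kind of regex clean-up does (and why it is fragile), here is a minimal, self-contained sketch on an invented HTML fragment; the generic pattern <[^>]+> is a simplification for illustration, not the pattern from the answer above:

```r
# An invented fragment standing in for the extracted article div
snippet <- '<div id="article-body-blocks"><p><a href="x">Boris Johnson</a>, the mayor.</p></div>'

# Strip every tag with one generic pattern; this breaks as soon as a
# literal '<' or '>' occurs inside the text itself
text <- gsub('<[^>]+>', '', snippet)
text
# "Boris Johnson, the mayor."
```

The one-pattern approach is simpler than listing each tag, but both share the same weakness: regular expressions do not understand nesting, attributes spanning lines, or escaped characters.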

As @Andrie suggested, you are much better off using a proper HTML parser for this. For example:

library(XML)
library(RCurl)
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)
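You can experiment with the same XPath call offline on a small hand-written document before pointing it at the live site; the HTML string below is invented for illustration:

```r
library(XML)

# A tiny stand-in for the Guardian page structure
doc <- '<html><body><div id="article-body-blocks"><p>First paragraph.</p><p>Second <a href="x">paragraph</a>.</p></div></body></html>'

# asText = TRUE tells htmlTreeParse the argument is the HTML itself,
# not a file name or URL
tree <- htmlTreeParse(doc, useInternalNodes = TRUE, asText = TRUE)

# xmlValue drops the markup and keeps only the text of each <p>
xpathSApply(tree, "//div[@id='article-body-blocks']/p", xmlValue)
# "First paragraph."  "Second paragraph."
```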

body then contains the text of the article paragraphs:

> str(body)
 chr [1:33] "The deputy prime minister, Nick Clegg, has said the government regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...

Update: the whole thing can be condensed into a one-liner (thanks to @Martin Morgan):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
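For the original goal of a roughly one-million-word corpus, the extraction above would sit inside a loop that follows article links and keeps a running word count. This is only a sketch under assumptions: the urls vector, the count_words helper and the whitespace-based word count are all invented for illustration, and harvesting further links from each page is left out:

```r
library(XML)
library(RCurl)

# Invented helper: crude word count by splitting on runs of whitespace
count_words <- function(txt) {
  length(unlist(strsplit(paste(txt, collapse = ' '), '\\s+')))
}

# Invented starting list; in practice you would also harvest new links
# from each page, e.g. with xpathSApply(tree, "//a/@href")
urls <- c('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

corpus <- character(0)
wordcount <- 0

for (u in urls) {
  tree  <- htmlTreeParse(getURL(u), useInternalNodes = TRUE, encoding = 'UTF-8')
  paras <- xpathSApply(tree, "//div[@id='article-body-blocks']/p", xmlValue)
  corpus    <- c(corpus, paras)
  wordcount <- wordcount + count_words(paras)
  if (wordcount >= 1e6) break   # stop once about a million words are collected
}

writeLines(corpus, 'corpus.txt')
```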
