Displaying a network of links between blogs using R?

I would like to get advice on how to create and visualize a link map between blogs to reflect the “social network” between them.

Here is how I am thinking of doing it:

  1. Start with one (or more) blog home pages and collect all the links on those pages.
  2. Remove all internal links (that is, if I start with www.website.com, I want to remove all links of the form "www.website.com/***"), but keep all external links.
  3. Go to each of those links (if I haven't visited them yet) and repeat step 1.
  4. Continue until (say) X hops from the first page.
  5. Plot the data collected.

I believe that in order to do this in R, one would need to use RCurl/XML (thanks to Shane for your answer here) in combination with something like igraph.

But since I have no experience working with either of them, is there someone here who could correct me if I have skipped an important step, or share a useful piece of code for getting started on this problem?

PS: My motivation for this question is that in a week I will be giving a talk at useR 2010 on “Blogging and R”, and I thought this could be a nice way both to entertain the audience and to motivate them to try something like this themselves.

Many thanks!

Tal

+5
2 answers

NB: this is a very basic way of getting the links, so it will probably need tweaking to be more robust. :)

I don't know how useful this code will be, but hopefully it gives you an idea of the direction to go in (just copy and paste it into R; it's a self-contained example once you've installed the RCurl and XML packages):

library(RCurl)
library(XML)

get.links.on.page <- function(u) {
  doc <- getURL(u)                                     # download the page
  html <- htmlTreeParse(doc, useInternalNodes = TRUE)  # parse it into an XML tree
  nodes <- getNodeSet(html, "//html//body//a[@href]")  # every <a> with an href
  urls <- sapply(nodes, function(x) xmlGetAttr(x, "href"))
  urls <- sort(urls)
  return(urls)
}

# a naive way of doing it; Python has 'urlparse', which is supposed to be rather good at this
get.root.domain <- function(u) {
  # for "http://host/path", splitting on "/" puts the host in the third piece
  root <- unlist(strsplit(u, "/"))[3]
  return(root)
}

# a naive method to filter out duplicated, invalid and self-referencing urls
filter.links <- function(seed, urls) {
  urls <- unique(urls)
  urls <- urls[grepl("^http", urls)]   # keep only absolute http(s) links
  seed.root <- get.root.domain(seed)
  # drop links back to the seed's own domain (grepl avoids the empty-result
  # problem that negative indexing with grep() has when nothing matches)
  urls <- urls[!grepl(seed.root, urls, fixed = TRUE)]
  return(urls)
}

# pass each url to this function
main.fn <- function(seed) {
  raw.urls <- get.links.on.page(seed)
  filtered.urls <- filter.links(seed, raw.urls)
  return(filtered.urls)
}

### example  ###
seed <- "http://www.r-bloggers.com/blogs-list/"
urls <- main.fn(seed)

# crawl first 3 links and get urls for each, put in a list 
x <- lapply(as.list(urls[1:3]), main.fn)
names(x) <- urls[1:3]
x

If you paste this into R and then look at x, I think it will make sense.
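To cover the plotting part of the question, here is a rough, untested sketch of how x could be turned into an igraph object; the edge-list flattening and the plot parameters are just illustrative choices, not something I have run against live pages:

library(igraph)

# flatten the named list x into a two-column (from, to) edge list
edges <- do.call(rbind, lapply(names(x), function(from) {
  if (length(x[[from]]) == 0) return(NULL)   # skip pages with no external links
  cbind(from = from, to = x[[from]])
}))

# build a directed graph from the edge list and draw it
g <- graph.edgelist(edges, directed = TRUE)
plot(g, vertex.size = 3, vertex.label = NA, edge.arrow.size = 0.3)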

Hope that helps!

+7

Tal,

This kind of data collection is referred to as a k-snowball search in network theory, and it should be fairly straightforward in R. As you note, the easiest way to accomplish this is with the XML package and the htmlTreeParse command, which parses a blog's HTML into a tree and makes the link extraction you are interested in much easier.

Also, igraph is perfectly capable of representing such graphs, and it has a useful function, graph.compose, for taking two graphs and returning their edge set composition. You will need this to keep combining data as you "snowball" outward. The basic steps would be (a rough sketch follows the list):

  1. Find some seed blog.
  2. Collect all the blogs the seed links to, and build the seed's ego-net (a star graph with the seed at the centre).
  3. For each of the seed's neighbours, build their ego-nets and iteratively compose those graphs with the seed's graph.
  4. Repeat for as many degrees out as you wish to collect.
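A very rough, untested sketch of what steps 1 to 3 might look like, just to make the idea concrete: it reuses main.fn() from the answer above to fetch a blog's outgoing links, and merges the ego-nets with graph.union (graph.compose, mentioned above, is the composition operation; union is the simpler merge for this illustration):

library(igraph)

# a blog's "ego-net": a star with the blog at the centre and one directed
# edge to each external link found on its front page
ego.net <- function(blog) {
  links <- tryCatch(main.fn(blog), error = function(e) character(0))
  if (length(links) == 0)                       # unreachable page or no external links:
    return(graph.empty(directed = TRUE) + vertices(blog))
  graph.edgelist(cbind(blog, links), directed = TRUE)
}

seed     <- "http://www.r-bloggers.com/blogs-list/"
seed.net <- ego.net(seed)

# one degree further out: merge a few neighbours' ego-nets into the seed graph
for (b in main.fn(seed)[1:3]) {
  seed.net <- graph.union(seed.net, ego.net(b))
}

plot(seed.net, vertex.size = 3, vertex.label = NA, edge.arrow.size = 0.3)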

I don't have more complete R code for this, but I do have code that performs a very similar process in Python using Google's SocialGraph API.

Good luck!

+4
