In R, how do I parse a specific frame inside a web page?

Question

Greetings to all

Is there a way to read HTML code only from a specific frame on a web page?

For example, if I submit a URL to Google Translate, is there a way to parse only the translated frame of the page? Whenever I try, I can only access the top frame of the page, not the translated frame. Here is my self-contained sample code:

    library(XML)
    url <- "http://www.baidu.com/s?wd=r+project"
    url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
    htmlTreeParse(url.google.translate, useInternalNodes = FALSE)

The above code requests this URL:

    $file
    [1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"

However, the output only covers the top frame of the page, not the main frame, which is the one I am interested in.

Hope this made sense and thanks in advance for any help.

Tony

UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to arrive at the following self-contained solution:

    > # Load R packages
    > library(RCurl)
    > library(XML)
    >
    > # STAGE 1 - find forward url in relevant frame
    > ( url <- "http://www.baidu.com/s?wd=r+project" )
    [1] "http://www.baidu.com/s?wd=r+project"
    > gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
    > gt.doc <- getURL(gt.url)
    > gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
    > nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
    > gt.parameters <- sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
    > gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
    >
    > # STAGE 2 - find forward url to translated page
    > doc <- getURL(gt.url, followlocation = TRUE)
    > html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
    > url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
    > url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
    > url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
    > url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
    >
    > # STAGE 3 - load translated page
    > url.trans
    [1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
    > #getURL(url.trans)

If anyone knows of a simpler solution to what I gave above, please feel free to let me know! :)

r google-translate

Tony Breyal Nov 23 '10 at 16:15

2 answers

For your specific translation needs, you might be better off accessing the Google Translate API via its REST interface rather than screen scraping:

http://code.google.com/apis/language/translate/overview.html

Spacedman Nov 23 '10 at 17:46

kwantam · Accepted Answer · 2010-11-23T16:37:47+0000

Most of this answer is specific to the Google Translate case. In general, you just need to parse the <frameset> and pull out whichever frame you are looking for, although it may not be immediately obvious from the HTML which frame is the main one (the relative sizes of the frames may be a hint).
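As a rough illustration of the general case, here is a minimal sketch using the XML package. The frameset markup, the frame names and the src values are made up for the example (inlined so it runs without network access); a real page would be fetched first, e.g. with RCurl's getURL as in the question.

```r
library(XML)

# Hypothetical frameset page, inlined so the example runs offline.
page <- '<html><frameset rows="100,*">
  <frame name="top" src="/translate_n?hl=en">
  <frame name="c" src="/translate_p?hl=en&amp;u=http://example.com/">
</frameset></html>'

doc <- htmlParse(page, asText = TRUE)

# Collect the name and src attributes of every <frame> in the frameset.
frame_names <- xpathSApply(doc, "//frameset//frame", xmlGetAttr, "name")
frame_srcs  <- xpathSApply(doc, "//frameset//frame", xmlGetAttr, "src")
frames <- setNames(frame_srcs, frame_names)

# Pick the frame of interest by name. The parser has already decoded
# &amp; to & inside the attribute value.
content_src <- frames[["c"]]
```

The src is usually relative, so it still has to be joined to the site's base URL before it can be requested.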

It looks like you will need to follow a couple of redirects to get to the actual content. In particular, when you fetch the URL mentioned in the question, you will see something like

    *snip*
    <noframes>
    <script>
    <!--document.location="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf";-->
    </script>
    <a href="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf">Translate </a>
    </noframes>
    *snip*

If you follow the link here (don't forget to unescape the "&amp;" first), you will get another small piece of HTML that includes

 <meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf">

Again, unescape the '&amp;' and follow the refresh, and you will have the translated page you are looking for.
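For what it's worth, the unescape-and-extract step can be sketched in base R along these lines. The helper name refresh_target is hypothetical, and the input is just the meta tag quoted above with its placeholder usg value; the regular expressions assume the attribute is written as content="<delay>;URL=<target>".

```r
# The meta tag from above, with the placeholder usg value kept as-is.
meta <- '<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf">'

# Hypothetical helper: return the refresh target with ampersands unescaped.
refresh_target <- function(tag) {
  # Keep only the value of the content attribute.
  content <- sub('.*content="([^"]*)".*', "\\1", tag)
  # Drop the leading "<delay>;URL=" part.
  target <- sub("^[^;]*;[[:space:]]*URL=", "", content, ignore.case = TRUE)
  # Undo the HTML escaping of ampersands.
  gsub("&amp;", "&", target, fixed = TRUE)
}

next_url <- refresh_target(meta)
```

If the page is parsed with the XML package instead, the attribute value returned by xmlGetAttr should already have the ampersands decoded, so only the "0;URL=" prefix needs stripping.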

Play around with wget or curl on these URLs, and you should get a feel for what you need to do.
