Greetings to all
Is there a way to read HTML code only from a specific frame on a web page?
For example, if I submit a URL to Google Translate, is there a way to parse only the translated frame of the page? Whenever I try, I can access only the top frame of the page, not the translated frame. Here is my self-contained sample code:
    library(XML)
    url <- "http://www.baidu.com/s?wd=r+project"
    url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
    htmlTreeParse(url.google.translate, useInternalNodes = FALSE)
The above code refers to this URL:
    $file
    [1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"
However, the output covers only the top frame of the page, not the main (translated) frame, which is the one I am interested in.
Hope this made sense and thanks in advance for any help.
Tony
UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to work out the following solution, which runs without any manual steps:
    > # Load R packages
    > library(RCurl)
    > library(XML)
    >
    > # STAGE 1 - find forward url in relevant frame
    > ( url <- "http://www.baidu.com/s?wd=r+project" )
    [1] "http://www.baidu.com/s?wd=r+project"
    > gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
    > gt.doc <- getURL(gt.url)
    > gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
    > nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
    > gt.parameters <- sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
    > gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
    >
    > # STAGE 2 - find forward url to translated page
    > doc <- getURL(gt.url, followlocation = TRUE)
    > html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
    > url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
    > url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
    > url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
    > url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
    >
    > # STAGE 3 - load translated page
    > url.trans
    [1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
    > #getURL(url.trans)
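One possible simplification of the two extraction steps, sketched offline: rather than capturing the printed node and splitting strings, the attribute values can probably be read directly with `xpathSApply()` and `xmlGetAttr()` from the XML package. The two HTML snippets below are made-up stand-ins for the pages Google returns (the `src` and `content` values and the `example.com` address are hypothetical), so this only illustrates the extraction and does not reproduce the live requests.

```r
library(XML)

# Hypothetical stand-in for the top-level frameset page (stage 1)
frameset_html <- '<html><frameset rows="100,*">
  <frame name="t" src="/translate_n?u=example">
  <frame name="c" src="/translate_p?u=example">
</frameset></html>'

top <- htmlParse(frameset_html, asText = TRUE)

# Read the src attribute of the frame named "c" by name, not by position
frame_src <- xpathSApply(top, '//frame[@name="c"]', xmlGetAttr, "src")
frame_url <- paste("http://translate.google.com", frame_src, sep = "")

# Hypothetical stand-in for the intermediate page with a meta refresh (stage 2)
refresh_html <- '<html><head>
  <meta http-equiv="refresh" content="0;URL=http://example.com/translated?u=example">
</head><body></body></html>'

page <- htmlParse(refresh_html, asText = TRUE)

# The content attribute holds "<delay>;URL=<target>"; keep what follows "URL="
refresh <- xpathSApply(page, '//meta[@http-equiv="refresh"]', xmlGetAttr, "content")
url_trans <- sub("^.*?URL=", "", refresh, perl = TRUE)

print(frame_url)
print(url_trans)
```

Because `xmlGetAttr()` returns the attribute value after the parser has decoded any entities, the extra `htmlParse()`/`xmlValue()` round trip may no longer be needed, although I have only checked this against the toy input above.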
If anyone knows of a simpler solution to what I gave above, please feel free to let me know! :)