Create a Corpus from many HTML files in R

I would like to create a Corpus for a collection of downloaded HTML files and then read them into R for future text mining.

Essentially, this is what I want to do:

  • Create a corpus from several HTML files.

I tried using DirSource:

library(tm)
a <- DirSource("C:/test")
b <- Corpus(DirSource(a), readerControl = list(language = "eng", reader = readPlain))

but it returns "invalid directory parameters"

  • Read the contents of all the HTML files in the corpus at once. Not sure how to do this.

  • Parse them, convert them to plain text, and remove the tags. Many people suggested using XML, but I have not found a way to process multiple files; all the examples are for a single file.

Thank you very much.


That should do it. Here I have a folder of HTML files on my computer (a random sampling from SO); I created a corpus from them, then a document-term matrix, and then carried out a few trivial text-mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern = "\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)

# get some code from github to convert HTML to text
writeChar(con = "htmlToText.R",
          getURL(ssl.verifypeer = FALSE,
                 "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R"))
source("htmlToText.R")

# convert HTML to text
html2txt <- lapply(html, htmlToText)

# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub = ""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, PlainTextDocument)
a <- tm_map(a, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3, 10)))

newstopwords <- findFreqTerms(a.dtm1, lowfreq = 10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords, ]
inspect(a.dtm2)

# carry on with typical things that can now be done, i.e. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse = 0.7)
a.dtm.df <- as.data.frame(inspect(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit)

[dendrogram produced by plot(fit) from the cluster analysis above]

# just for fun...
library(wordcloud)
library(RColorBrewer)

m <- as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs <- sort(colSums(m), decreasing = TRUE)
# create a data frame with words and their frequencies
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

[word cloud of term frequencies]


This will fix the error.

b <- Corpus(a,  # pass 'a' directly instead of DirSource(a)
            readerControl = list(language = "eng", reader = readPlain))

But I think you need to use an XML reader to read your HTML. Something like:

r <- Corpus(DirSource("c:/test"), readerControl = list(reader = readXML), spec)

But you need to provide a spec argument, which depends on your file structure. See, for example, readReut21578XML, which is a good example of an XML/HTML reader.
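If writing a spec feels like overkill, a simpler route (my own sketch, not part of this answer, assuming the XML package and the question's C:/test folder) is to parse each file with htmlParse() and keep only the text nodes:

library(tm)
library(XML)

# list the HTML files and extract the visible text from each one
files <- list.files("C:/test", pattern = "\\.html?$", full.names = TRUE)
texts <- sapply(files, function(f) {
  doc <- htmlParse(f, encoding = "UTF-8")
  paste(xpathSApply(doc, "//body//text()", xmlValue), collapse = " ")
})

# build the corpus from the extracted plain text
corpus <- Corpus(VectorSource(texts))

Because the extraction is wrapped in sapply(), the same call handles any number of files in the folder.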


To read all the HTML files into an R object, you can use:

# Set variables
folder <- 'C:/test'
extension <- '.htm'

# Get the names of the *.htm files in the folder
files <- list.files(path = folder, pattern = extension)

# Read all the files into a list
htmls <- lapply(X = files, FUN = function(file) {
  .con <- file(description = paste(folder, file, sep = '/'))
  .html <- readLines(.con)
  close(.con)
  names(.html) <- file
  .html
})

This will give you a list in which each element is the HTML content of one file.

I'll post later about parsing them; I'm in a hurry right now.
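In the meantime, a quick-and-dirty sketch (my addition, not part of this answer) that strips the tags from that htmls list with a regex and feeds the result to tm; a regex is no substitute for a real HTML parser, but it is often enough for rough text mining:

library(tm)

# collapse each file's lines into one string and crudely remove the tags
plain <- sapply(htmls, function(x) gsub("<[^>]+>", " ", paste(x, collapse = " ")))

corpus <- Corpus(VectorSource(plain))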


I found the boilerpipeR package especially useful for extracting only the "main" text of an HTML page.
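A minimal sketch of how that might look, assuming boilerpipeR's ArticleExtractor() (the package wraps the boilerpipe Java library, so rJava is required) and the htmls list from the previous answer:

library(boilerpipeR)  # needs rJava
library(tm)

# pull out just the "main" text of each page, then build a corpus from it
main_text <- sapply(htmls, function(x) ArticleExtractor(paste(x, collapse = "\n")))
corpus <- Corpus(VectorSource(main_text))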

