Create a Corpus from many HTML files in R

I would like to create a Corpus for a collection of downloaded HTML files and then read them into R for future text mining.

Essentially, this is what I want to do:

  • Create a corpus from several HTML files.

I tried using DirSource:

library(tm)
a <- DirSource("C:/test")
b <- Corpus(DirSource(a), readerControl = list(language = "eng", reader = readPlain))

but it returns "invalid directory parameters"

  • Read the contents of all the HTML files in the corpus at once. Not sure how to do this.

  • Parse them, convert them to plain text, and remove the tags. Many people suggested using XML, but I have not found a way to process multiple files; all the examples are for a single file.

Thank you very much.


That should do it. Here I have a folder of HTML files on my computer (a random sampling from SO); I created a corpus from them, then a document-term matrix, and then carried out a few trivial text-mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern = "\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)

# get some code from github to convert HTML to text
writeChar(con = "htmlToText.R",
          getURL(ssl.verifypeer = FALSE,
                 "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R"))
source("htmlToText.R")

# convert HTML to text
html2txt <- lapply(html, htmlToText)

# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub = ""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, PlainTextDocument)
a <- tm_map(a, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3, 10)))

newstopwords <- findFreqTerms(a.dtm1, lowfreq = 10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords, ]
inspect(a.dtm2)

# carry on with typical things that can now be done, i.e. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse = 0.7)
a.dtm.df <- as.data.frame(inspect(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit)

[dendrogram produced by plot(fit) from the cluster analysis above]

# just for fun...
library(wordcloud)
library(RColorBrewer)

m <- as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs <- sort(colSums(m), decreasing = TRUE)
# create a data frame with words and their frequencies
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

[word cloud of term frequencies]


This will fix the error.

b <- Corpus(a,  # pass 'a' directly instead of DirSource(a)
            readerControl = list(language = "eng", reader = readPlain))

But I think you need to use an XML reader to read your HTML. Something like:

r <- Corpus(DirSource("c:/test"), readerControl = list(reader = readXML), spec)

But you need to provide a spec argument, which depends on your file structure. See, for example, readReut21578XML, which is a good example of an XML/HTML reader.
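If writing a spec feels like overkill, a simpler route (my own sketch, not part of this answer, assuming the XML package and the question's C:/test folder) is to parse each file with htmlParse() and keep only the text nodes:

library(tm)
library(XML)

# list the HTML files and extract the visible text from each one
files <- list.files("C:/test", pattern = "\\.html?$", full.names = TRUE)
texts <- sapply(files, function(f) {
  doc <- htmlParse(f, encoding = "UTF-8")
  paste(xpathSApply(doc, "//body//text()", xmlValue), collapse = " ")
})

# build the corpus from the extracted plain text
corpus <- Corpus(VectorSource(texts))

Because the extraction is wrapped in sapply(), the same call handles any number of files in the folder.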


To read all the HTML files into an R object, you can use:

# Set variables
folder <- 'C:/test'
extension <- '.htm'

# Get the names of the *.htm files in the folder
files <- list.files(path = folder, pattern = extension)

# Read all the files into a list
htmls <- lapply(X = files, FUN = function(file) {
  .con <- file(description = paste(folder, file, sep = '/'))
  .html <- readLines(.con)
  close(.con)
  names(.html) <- file
  .html
})

This will give you a list in which each element is the HTML content of one file.

I'll post later about parsing them; I'm in a hurry right now.
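In the meantime, a quick-and-dirty sketch (my addition, not part of this answer) that strips the tags from that htmls list with a regex and feeds the result to tm; a regex is no substitute for a real HTML parser, but it is often enough for rough text mining:

library(tm)

# collapse each file's lines into one string and crudely remove the tags
plain <- sapply(htmls, function(x) gsub("<[^>]+>", " ", paste(x, collapse = " ")))

corpus <- Corpus(VectorSource(plain))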


I found the boilerpipeR package especially useful for extracting only the "main" text of an HTML page.
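A minimal sketch of how that might look, assuming boilerpipeR's ArticleExtractor() (the package wraps the boilerpipe Java library, so rJava is required) and the htmls list from the previous answer:

library(boilerpipeR)  # needs rJava
library(tm)

# pull out just the "main" text of each page, then build a corpus from it
main_text <- sapply(htmls, function(x) ArticleExtractor(paste(x, collapse = "\n")))
corpus <- Corpus(VectorSource(main_text))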

