Extract unique HTML tags from a document

Question

I have an HTML document in R and I want to extract a list of unique tags from this document with a count of their frequency of occurrence.

I could loop over a predefined list of tags as follows, but I was hoping for a solution that does not require knowing the tags in advance:

library('XML')
url <- 'http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array'
doc <- htmlParse(url)
all_tags <- c('//p', '//a', '//b', '//u', '//i')
counts <- sapply(all_tags, function(x) length(xpathSApply(doc, x)))
free(doc)

+4

xml r web-scraping

Zach Aug 18 '15 at 18:17

2 answers

Hadleyverse version (with a base R fallback if needed):

 library(xml2)
 library(dplyr)

 url <- 'http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array'
 doc <- read_html(url)
 tags <- xml_name(xml_find_all(doc, "//*"))

 # base version
 sort(table(tags))
 ## tags
 ##      body     form       h1     head     html    title      sub       h3        i noscript
 ##         1        1        1        1        1        1        2        3        3        3
 ##        h4       h2       th     link       hr       ol       ul       em    input        b
 ##         4        5        5        7        8       10       11       12       12       14
 ##    script     meta      img       br      pre   strong    tbody    table     code       li
 ##        16       17       26       27       41       43       55       79      104      115
 ##        tr        p       td      div        a     span
 ##       127      150      268      358      371      423

 # hadleyverse
 arrange(count(data_frame(tag=tags), tag), desc(n))
 ## Source: local data frame [36 x 2]
 ##
 ##      tag   n
 ## 1   span 423
 ## 2      a 371
 ## 3    div 358
 ## 4     td 268
 ## 5      p 150
 ## 6     tr 127
 ## 7     li 115
 ## 8   code 104
 ## 9  table  79
 ## 10 tbody  55
 ## ..   ... ...
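For anyone who wants to try the base version without hitting the network, here is a minimal sketch of the same idea on a small inline document. The `html` string is made up for illustration and is not from the question; the sketch also assumes `read_html()` is given literal markup rather than a URL.

```r
library(xml2)

# Hypothetical inline document, used only so the example runs offline
html <- "<html><head><title>t</title></head><body><p>one <b>bold</b></p><p>two <a href='#'>link</a></p></body></html>"
doc <- read_html(html)

# "//*" matches every element node; xml_name() returns each node's tag name
tags <- xml_name(xml_find_all(doc, "//*"))

# Frequency table of tag names, most frequent first
counts <- sort(table(tags), decreasing = TRUE)
print(counts)
```

Because the XPath `//*` selects every element, no predefined tag list is needed; `table()` does the counting.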

+2

hrbrmstr Aug 18 '15 at 19:02

lukeA · Accepted Answer · 2015-08-18T19:06:23+0000

A classic version using the XML package might look like this:

 tab <- table(xpathSApply(doc, "//*", xmlName))
 tab[c('p', 'a', 'b', 'u', 'i')]
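One thing to watch with the subsetting step: a tag that never occurs in the document (such as `u` in the page above) has no entry in the table, so indexing by its name should come back as `NA` rather than 0. A possible variation, sketched here on a made-up inline document, is to fix the factor levels before tabulating so absent tags are reported as zero. The `html` string and the `wanted` vector are assumptions for illustration only.

```r
library(XML)

# Hypothetical inline document, parsed from text instead of a URL
html <- "<html><body><p>one <b>bold</b></p><p>two <a href='#'>link</a></p></body></html>"
doc <- htmlParse(html, asText = TRUE)

# Collect the name of every element node
tag_names <- xpathSApply(doc, "//*", xmlName)

# Fixing the factor levels makes table() report 0 for tags that are absent;
# tags outside `wanted` (html, body here) are dropped from the result
wanted <- c("p", "a", "b", "u", "i")
tab <- table(factor(tag_names, levels = wanted))
print(tab)

free(doc)
```

This only changes how the selected tags are reported; the unrestricted `table(...)` call is still the way to get every tag in the document.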
