Best way to extract an HTML tag whose format varies

I am trying to extract some HTML from different blogs and noticed that different providers use the same tag in different ways.

For example, here are two major providers that use the meta generator tag in different ways:

  • Blogger: <meta content='blogger' name='generator'/> (content first, name second, and yes, single quotes!)
  • WordPress: <meta name="generator" content="WordPress.com" /> (name first, content second)

Is there a way to extract the content value in all cases (single or double quotes, content attribute first or last in the tag)?

P.S. Although I use Java, an answer based on general regular expressions would probably help more people.

+5

: .

. SGML XML, XML (, ). , . , .

+14

, , - HTML-, node (, , node) DOM . - - , , http://java-source.net/open-source/html-parsers

+3

XHTML.

, - .

, , .

"" XML- API, Infoset. API DOM SAX .

( RegEx), , , .

+2

: ( , ) HTML W3C, :

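By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotes (ASCII decimal 39)...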

, , .

+2

, Java HTMLEditorKit. , , , .
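The HTMLEditorKit mentioned above ships with the JDK (in the Swing packages), so a parser-based approach needs no extra library. The following is only a minimal sketch of how its callback parser could be used for this question; the class name `SwingMetaParser` and the method `generator` are made up for illustration, and the Swing parser targets fairly old HTML, so behavior on modern pages is not guaranteed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original answer.
public class SwingMetaParser {

    // Returns the content of the first <meta name="generator"> tag, or null if none is found.
    static String generator(String html) {
        final String[] found = {null};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META || found[0] != null) {
                    return;
                }
                // The parser has already dealt with quote style and attribute order.
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "generator".equalsIgnoreCase(name.toString())) {
                    found[0] = content.toString();
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found[0];
    }
}
```

Because the attributes arrive as a set, the caller never has to care whether `content` came before `name` or which quotes were used.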

+1

, -, REGEX, /<meta\s.*content=.*>/, , . REGEX, , http://www.codehouse.com/webmaster_tools/regex/ .

0

, , :

content\s*=\s*['"].*?['"]

content = "blogger"

content='WordPress.com'

. , , regexpal.

Once you have the match, you can take everything between the quotes; whether you do that with another regular expression (which is just overkill at this point) or by iterating over the characters is up to you.
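For the Java case, the idea above could be sketched roughly as follows. This is an illustration, not a robust HTML parser: the class name `MetaGenerator` and the method `generator` are made up here, and the code assumes that no attribute value contains a `>` character. Unlike the pattern shown above, it uses a backreference (`\2`) so that a value opened with a single quote must also be closed with a single quote, and it reads the value from a capture group, so no second pass over the match is needed.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class MetaGenerator {

    // Matches one whole <meta ...> tag; assumes no '>' inside attribute values.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches one quoted attribute: group 1 = name, group 2 = quote char, group 3 = value.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(['\"])(.*?)\\2");

    // Returns the content of the first <meta name="generator"> tag, if any.
    static Optional<String> generator(String html) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            // Collect both attributes first, so their order in the tag does not matter.
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1);
                if (key.equalsIgnoreCase("name")) {
                    name = attr.group(3);
                } else if (key.equalsIgnoreCase("content")) {
                    content = attr.group(3);
                }
            }
            if ("generator".equalsIgnoreCase(name) && content != null) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }
}
```

Splitting the work into "find the tag" and "read its attributes" keeps each pattern short and avoids writing one expression per attribute order.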

0

If you use Java, you can look at TagSoup, which is a SAX-compatible parser for "[parsing] HTML as it occurs in the wild."

0