Best way to get HTML variable tag

Question

I am trying to extract some HTML from different blogs and noticed that different providers use the same tag in different ways.

For example, here are two major providers that write the meta generator tag in different ways:

Blogger: <meta content='blogger' name='generator'/> (content first, name later, and yes, single quotes!)
WordPress: <meta name="generator" content="WordPress.com" /> (name first, content later)

Is there a way to extract the content value in all cases (single or double quotes, attribute first or last in the tag)?

P.S. Although I am using Java, the answer would probably help more people if it were for general regular expressions.

+5

language-agnostic html regex

pek Aug 28 '08 at 2:23


8

Use an HTML parser and read the node from the DOM. A list of HTML parsers for Java is at http://java-source.net/open-source/html-parsers

+3

martinatime Aug 28 '08 at 2:30

XHTML.

, - .

, , .

"" XML- API, Infoset. API DOM SAX .

( RegEx), , , .

+2

Sergio Acosta Aug 28 '08 at 2:28

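According to the HTML specification from the W3C:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39)...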

, , .

+2

Grey Panther Aug 28 '08 at 2:56

In Java, you can use HTMLEditorKit.
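A minimal sketch of the HTMLEditorKit route, assuming the parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) is acceptable for the pages in question. The class name GeneratorMetaParser and the method generator are made up for illustration; they are not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: returns the content of the first
// <meta name="generator"> tag, or null if there is none.
final class GeneratorMetaParser {

    static String generator(String html) {
        final String[] result = new String[1];

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no end tag, so it is reported as a simple tag.
                if (tag != HTML.Tag.META || result[0] != null) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "generator".equalsIgnoreCase(name.toString())) {
                    result[0] = content.toString();
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(generator("<meta content='blogger' name='generator'/>"));
        System.out.println(generator("<meta name=\"generator\" content=\"WordPress.com\" />"));
    }
}
```

Because the parser does the tokenizing, the quote style and the attribute order no longer matter to the calling code.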

+1

Preston Aug 28 '08 at 3:24

If you do use a REGEX, you could start with something like /<meta\s.*content=.*>/. There is a REGEX tester at http://www.codehouse.com/webmaster_tools/regex/ .

0

martinatime Aug 28 '08 at 3:20

This pattern:

content\s*=\s*['"].*?['"]

matches both of these:

content = "blogger"

content='WordPress.com'

You can try it out in regexpal.

Once you have the match, you can pull out everything between the quotes. Whether you do that with another regular expression (which feels excessive at this point) or by simply iterating over the characters is up to you.
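A sketch of how the quote-tolerant pattern above could be used from Java, under a few assumptions: attribute values are always quoted, a value never contains ">", and the closing quote must be the same character as the opening one (enforced here with a backreference, which the pattern above does not do). The class name GeneratorMeta and its method are invented for the example.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds <meta name="generator"> with a regex,
// whatever the quote style or attribute order.
final class GeneratorMeta {

    // One whole <meta ...> tag. Assumes no '>' inside attribute values.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // One quoted attribute: group 1 = name, group 2 = quote, group 3 = value.
    // \2 requires the closing quote to match the opening quote.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z][\\w-]*)\\s*=\\s*(['\"])(.*?)\\2");

    static Optional<String> generator(String html) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            // Collect the attributes first, so their order does not matter.
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                if (key.equals("name")) {
                    name = attr.group(3);
                } else if (key.equals("content")) {
                    content = attr.group(3);
                }
            }
            if ("generator".equalsIgnoreCase(name) && content != null) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(generator("<meta content='blogger' name='generator'/>"));
        System.out.println(generator("<meta name=\"generator\" content=\"WordPress.com\" />"));
    }
}
```

Splitting the work into "find the tag" and "read its attributes" keeps each expression small. It still breaks on unquoted values or markup that violates the stated assumptions, which is where a real HTML parser is the safer choice.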

0

dwestbrook Aug 28 '08 at 3:38


If you use Java, you can look at TagSoup, which is a SAX-compatible parser for "[parsing] HTML as it occurs in the wild."

0

Peter Stuifzand Aug 28 '08 at 12:53


Brad Wilson · Accepted Answer · 2008-08-28T02:31:40+0000

: .

. SGML XML, XML (, ). , . , .
