Regexp that matches all text content of HTML input

Question

I have articles on my site that I would like to correct and translate automatically. But I need to get content without HTML tags.

The idea is to have a regular expression that captures all the content between the tags (and, if possible, also the content found in tag attributes, for example <img alt='Little house'>). The problem is that I really don't know how to write such a regular expression. Any ideas?

+1

html c# regex .net

Tigroumeow Dec 6 '09 at 15:02

4 answers

A regular expression may not be the best choice for this job (I will spare you the obligatory tirade).

I would recommend looking into an HTML parsing library to help you here, something like the Html Agility Pack.
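To make the parser suggestion concrete, here is a minimal sketch of collecting the visible text and the image alt attributes with the Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HapTextSketch and the method ExtractText are made up for illustration, and skipping script and style content is my own assumption about what counts as article text.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

// Hypothetical helper, not part of the library.
public static class HapTextSketch
{
    // Returns the non-empty text nodes, followed by the alt text of every <img>.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var pieces = new List<string>();

        // Text nodes outside <script> and <style>. SelectNodes returns null
        // when nothing matches, so both results are checked before use.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script) and not(ancestor::style)]");
        if (textNodes != null)
        {
            foreach (HtmlNode node in textNodes)
            {
                // Decode entities such as &amp; and drop whitespace-only nodes.
                string text = WebUtility.HtmlDecode(node.InnerText).Trim();
                if (text.Length > 0)
                {
                    pieces.Add(text);
                }
            }
        }

        // Attribute content, as asked in the question: <img alt='...'>.
        // The value is taken exactly as the library returns it.
        var images = doc.DocumentNode.SelectNodes("//img[@alt]");
        if (images != null)
        {
            foreach (HtmlNode img in images)
            {
                pieces.Add(img.GetAttributeValue("alt", ""));
            }
        }

        return pieces;
    }

    public static void Main()
    {
        string html = "<p>Hello <b>world</b></p><img alt='Little house'>";
        foreach (string piece in ExtractText(html))
        {
            Console.WriteLine(piece);
        }
    }
}
```

Returning the pieces as a list rather than one joined string keeps each fragment separate, which may be convenient if every fragment is sent to a translator and then written back into its node.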

+1

Andrew Hare Dec 6 '09 at 15:06

As others have said, a regular expression is not the recommended way to do this, but if you decide to use one anyway, this should get you started:

 string pattern = @"(<(/?[^>]+)>)";
 string strippedString = Regex.Replace(str, pattern, string.Empty);
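For a fuller picture, the snippet above could be wrapped up roughly as follows. This is only a sketch under a few assumptions: the class HtmlTextSketch, its method names, and the alt pattern are hypothetical additions of mine, and tags are replaced with a space instead of string.Empty so that words from adjacent elements do not run together. It relies only on System.Text.RegularExpressions and System.Net.WebUtility, and it shares the usual weaknesses of regex-based HTML handling (comments, scripts, and a ">" inside an attribute value will confuse it).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper built around the tag pattern from the answer.
public static class HtmlTextSketch
{
    // Same tag pattern as in the answer.
    private static readonly Regex TagPattern = new Regex(@"(<(/?[^>]+)>)");

    // Assumed pattern for alt='...' or alt="..." inside a tag. It is
    // deliberately simple and finds at most one alt attribute per tag.
    private static readonly Regex AltPattern = new Regex(
        @"<[^>]*\balt\s*=\s*(['""])(.*?)\1[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes tags, decodes entities, and collapses runs of whitespace.
    public static string StripTags(string html)
    {
        string text = TagPattern.Replace(html, " ");
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Collects the values of alt attributes, e.g. from <img alt='Little house'>.
    public static List<string> AltTexts(string html)
    {
        var result = new List<string>();
        foreach (Match m in AltPattern.Matches(html))
        {
            result.Add(WebUtility.HtmlDecode(m.Groups[2].Value));
        }
        return result;
    }

    public static void Main()
    {
        string html = "<p>Hello <b>world</b></p><img alt='Little house'>";
        Console.WriteLine(StripTags(html));
        Console.WriteLine(string.Join(", ", AltTexts(html)));
    }
}
```

Replacing tags with a space is a trade-off: it keeps "one</p><p>two" readable, but it also splits text around inline tags in the middle of a word, which string.Empty would not.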

+1

Elad Dec 6 '09 at 15:12

I’m not sure if this helps, but my site lets readers translate articles into their preferred language. I did this with the Bing translation widget, so I don’t have to deal with the HTML myself; it is all done for me.

0

user156862 Dec 6 '09 at 15:17

jheddings · Accepted Answer · 2009-12-06T15:08:53+0000

I would recommend using an HTML parser instead of relying on a regular expression. Parsing HTML with a regex is usually a no-no, and it is almost impossible to get right for all cases. There are many questions on SO that come to the same conclusion.

EDIT: Looks like we had the same idea ... Also, here is a question that discusses more parsers.
