I have articles on my site that I would like to correct and translate automatically. But I need to get content without HTML tags.
The idea is to have a regular expression that extracts all the content between the tags (and, if possible, also the content found in tag attributes, for example <img alt='Little house'>). The problem is that I really don't know how to write such a regular expression. Any ideas?
I would recommend using an HTML parser instead of relying on a regular expression. Parsing HTML with regex is generally a no-no, and it is almost impossible to get right for all cases. There are many questions on SO that come to the same conclusion.
EDIT: Looks like we had the same idea... Also, here is a question that discusses more parsers.
A regular expression may not be the best choice for this job (I'll point you to the obligatory tirade).
I would recommend you look into an HTML parsing library to help you here, something like the Html Agility Pack.
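As a rough sketch of what the parser approach could look like with the Html Agility Pack: the code below walks the parsed document and collects both the text between tags and the alt text of img elements. It assumes the HtmlAgilityPack NuGet package is referenced; the TextExtractor class and ExtractText method are names made up for this illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextExtractor // hypothetical helper, not part of the library
{
    // Returns text fragments and img alt texts in document order.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip script and style bodies, which are not readable content.
                string parent = node.ParentNode?.Name ?? string.Empty;
                if (parent == "script" || parent == "style") continue;

                // Decode entities such as &amp; and drop whitespace-only nodes.
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0) result.Add(text);
            }
            else if (node.Name == "img")
            {
                // Attribute text the question also asks for.
                string alt = node.GetAttributeValue("alt", string.Empty).Trim();
                if (alt.Length > 0) result.Add(alt);
            }
        }
        return result;
    }
}
```

Calling TextExtractor.ExtractText("<p>Hello <b>world</b></p><img alt='Little house'>") should give the three fragments "Hello", "world" and "Little house". These can be translated one by one and written back to the same nodes.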
As others have said, a regular expression is not the recommended way, but if you decide to use one anyway, you could start with this (it needs using System.Text.RegularExpressions):
string pattern = @"(<(/?[^>]+)>)";
string strippedString = Regex.Replace(str, pattern, string.Empty);
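For anyone who does go the regex route, here is a self-contained sketch that wraps the tag-stripping pattern and also pulls out alt attribute values, since the question asks for those too. The alt pattern and the names TagStripper, StripTags and AltTexts are my own additions for illustration. Both patterns are naive: they can break on a > inside an attribute value, on comments, and on script blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagStripper // hypothetical helper for illustration
{
    // Tag pattern: anything from '<' up to the next '>'.
    private static readonly Regex Tag = new Regex(@"<(/?[^>]+)>");

    // Assumed pattern: alt='...' or alt="..." inside any tag.
    // Group 1 is the quote character, group 2 the attribute value.
    private static readonly Regex Alt = new Regex(
        @"<[^>]*\balt\s*=\s*(['""])(.*?)\1[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag and keeps the text between them.
    public static string StripTags(string html)
    {
        return Tag.Replace(html, string.Empty);
    }

    // Collects the values of all alt attributes, in order of appearance.
    public static List<string> AltTexts(string html)
    {
        var texts = new List<string>();
        foreach (Match m in Alt.Matches(html))
        {
            texts.Add(m.Groups[2].Value);
        }
        return texts;
    }
}
```

With this, TagStripper.StripTags("<p>Hello <b>world</b></p>") yields "Hello world", and TagStripper.AltTexts("<img alt='Little house'>") yields a list containing "Little house". Note that HTML entities such as &amp; are left undecoded.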
I'm not sure if this helps, but I give readers the option to translate the articles on my site into their preferred language. I did this using the Bing translation widget, so I don't have to deal with the HTML myself; all of that is done for me.