Here's how you do it using HtmlAgilityPack.
First, your HTML sample:
var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";
Load it (from a string in this case):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
If you are fetching it from the web instead, it is similar:
var web = new HtmlWeb();
var doc = web.Load(url);
Now select only the text nodes that contain something other than whitespace, and trim them:
var text = doc.DocumentNode.Descendants()
    .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
    .Select(x => x.InnerText.Trim());
You can get this as one concatenated string if you want:
String.Join(" ", text)
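Putting the steps above together, a minimal console sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `Program` is only for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Same sample HTML as above (the markup is deliberately sloppy).
        var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";

        // Parse the string into a DOM.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Keep only text nodes that contain something other than whitespace,
        // and trim the surrounding whitespace from each one.
        var text = doc.DocumentNode.Descendants()
            .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
            .Select(x => x.InnerText.Trim());

        // Join the pieces into a single space-separated string.
        Console.WriteLine(string.Join(" ", text));
    }
}
```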
Of course, this will only work well on simple web pages. Anything more complex will also return nodes containing data that you clearly don't want, such as JavaScript functions.
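One way to reduce that noise is to skip text nodes whose parent is a `script` or `style` element. The sketch below is only a heuristic, not a complete solution: `TextExtractor` and `VisibleText` are hypothetical names introduced here, and the list of skipped tags is an assumption you would likely need to extend for real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TextExtractor
{
    // Tags whose text content is code or styling rather than readable text.
    // (Assumed list; adjust for the pages you are processing.)
    private static readonly HashSet<string> Skipped =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    // Hypothetical helper: returns trimmed, non-empty text nodes,
    // ignoring anything directly inside <script> or <style>.
    public static IEnumerable<string> VisibleText(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants()
            .Where(x => x.NodeType == HtmlNodeType.Text
                        && x.ParentNode != null
                        && !Skipped.Contains(x.ParentNode.Name)
                        && x.InnerText.Trim().Length > 0)
            .Select(x => x.InnerText.Trim());
    }
}
```

You would call it as `String.Join(" ", TextExtractor.VisibleText(doc))` in place of the query above.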
yamen