C#: parsing HTML for general use?

What is the best way to take an HTML string and turn it into something useful?

Essentially, if I take a URL and fetch its HTML in .NET, I get a response, but it only looks like a file, a stream, or a string.

What if I want an actual document, or something I could query the way I would an XmlDocument object?

I have some thoughts and an already implemented solution, but I am interested to know what the community thinks about it.

+5
6 answers

I use the mshtml API.

Just reference the mshtml assembly, then add the namespace.

You then have an HTMLDocument and its API to work with, for example from a small utility.
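A minimal sketch of the mshtml approach, assuming a Windows machine and a COM reference to the "Microsoft HTML Object Library" (which provides the mshtml interop assembly). The class and method names (MshtmlParser, GetTagNames) are made up for illustration. The write/close pattern is the one commonly used to feed a string to mshtml; treat it as a starting point rather than a tested recipe.

```csharp
using System.Collections.Generic;
using mshtml; // COM interop assembly: "Microsoft HTML Object Library"

static class MshtmlParser
{
    // Hypothetical helper: parses an HTML string with mshtml and
    // returns the tag name of every element in the resulting DOM.
    public static List<string> GetTagNames(string html)
    {
        // HTMLDocument is the COM coclass; IHTMLDocument2 exposes write/close/all.
        var document = (IHTMLDocument2)new HTMLDocument();
        document.write(html); // write takes a params object[] of markup fragments
        document.close();     // signals that no more markup will be written

        var tags = new List<string>();
        foreach (IHTMLElement element in document.all)
        {
            tags.Add(element.tagName);
        }
        return tags;
    }
}
```

Call it from an STA thread, for example `MshtmlParser.GetTagNames("<p>Hello <a href='/x'>x</a></p>")`.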

+3

HTML is not XML unless it is XHTML, so you cannot count on an XML parser to read it.
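To illustrate why an XML parser is not enough for ordinary HTML, here is a small sketch using only System.Xml. The helper name TryLoadAsXml is made up for this example. An unclosed `<br>` is fine in HTML but makes the markup ill-formed as XML, so XmlDocument rejects it.

```csharp
using System.Xml;

static class XmlVsHtml
{
    // Hypothetical helper: returns true only if the markup is well-formed XML.
    public static bool TryLoadAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical HTML (unclosed tags, unquoted attributes) ends up here.
            return false;
        }
    }
}
```

With this, `XmlVsHtml.TryLoadAsXml("<p>one<br>two</p>")` fails, while the XHTML-style `"<p>one<br/>two</p>"` loads.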

Use the HTML Agility Pack. It is a .NET library that parses real-world HTML and builds a DOM from it.
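A short sketch of what working with the HTML Agility Pack looks like, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (LinkExtractor, GetLinks) are invented for the example; LoadHtml, DocumentNode, SelectNodes and GetAttributeValue are part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> element in the markup.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

In a Windows Forms project, qualify the type as HtmlAgilityPack.HtmlDocument to avoid a clash with System.Windows.Forms.HtmlDocument.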

+7

I use Tidy.net to clean up the HTML and convert it to XHTML. The result can then be loaded into an XmlDocument, like this:

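// Requires: using System.IO; using System.Text; using System.Xml;
// and using TidyNet; (the Tidy.net namespace, as far as I recall)
Tidy document = new Tidy();
TidyMessageCollection messageCollection = new TidyMessageCollection();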

document.Options.DocType = DocType.Omit;
document.Options.Xhtml = true;
document.Options.CharEncoding = CharEncoding.UTF8;
document.Options.LogicalEmphasis = true;

document.Options.MakeClean = false;
document.Options.QuoteNbsp = false;
document.Options.SmartIndent = false;
document.Options.IndentContent = false;
document.Options.TidyMark = false;

document.Options.DropFontTags = false;
document.Options.QuoteAmpersand = true;
document.Options.DropEmptyParas = true;

MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
// xmlResult is the raw HTML string to clean up (defined elsewhere).
byte[] array = Encoding.UTF8.GetBytes(xmlResult);
input.Write(array, 0, array.Length);
input.Position = 0;

document.Parse(input, output, messageCollection);

string tidyXhtml = Encoding.UTF8.GetString(output.ToArray());

XmlDocument outputXml = new XmlDocument();
outputXml.LoadXml(tidyXhtml);

+3

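var browser = new System.Windows.Forms.WebBrowser();
// Navigate is asynchronous: Document is only populated once the page has loaded.
browser.Navigate(new System.Uri("http://example.com"));
// Read this after loading has finished, e.g. in a DocumentCompleted handler.
var doc = browser.Document;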

See the HtmlDocument members.

For example, doc.All is an HtmlElementCollection, which is an ICollection of HtmlElement objects.

HtmlElement.DomElement gives you the underlying mshtml object if you need it.
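Because Navigate returns before the page has loaded, reading Document straight away usually yields nothing useful. The sketch below shows one way to wait for the DocumentCompleted event and then walk the document's All collection, which in Windows Forms is an HtmlElementCollection of HtmlElement objects. The class and method names (WebBrowserDom, GetTagNames) are made up for illustration, and the dedicated STA thread with its own message loop is just one possible arrangement for use outside a Forms application.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Windows.Forms;

static class WebBrowserDom
{
    // Hypothetical helper: loads a page in an off-screen WebBrowser and
    // returns the tag name of every element in the loaded document.
    public static List<string> GetTagNames(string url)
    {
        var tags = new List<string>();
        var worker = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event can fire once per frame; wait for the whole page.
                    if (browser.ReadyState != WebBrowserReadyState.Complete) return;

                    foreach (HtmlElement element in browser.Document.All)
                    {
                        tags.Add(element.TagName);
                    }
                    Application.ExitThread(); // stop the message loop started below
                };

                browser.Navigate(new Uri(url));
                Application.Run(); // pump messages until ExitThread is called
            }
        });
        worker.SetApartmentState(ApartmentState.STA); // WebBrowser requires an STA thread
        worker.Start();
        worker.Join();
        return tags;
    }
}
```

Inside an existing Forms application the extra thread is unnecessary: subscribe to DocumentCompleted on the UI thread and read browser.Document in the handler.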

+1

The closest thing is System.Windows.Forms.HtmlDocument. It gives you a DOM.

, HTTP, , HTML ( ), , , , .

HTTP , , , , , . , , HTTPWebResponse, .

+1

Besides the HTML Agility Pack, there is also HtmlMonkey (an HTML parser) on GitHub.

. , HTML-, DOM, .

0
