What is the best way to take an HTML string and incorporate it into something useful?
Essentially, if I take a URL and fetch the HTML from it in .NET, I get a response, but it comes back as a file, stream, or string.
What if I want an actual document object, something I could traverse the way I would an XmlDocument?
I have some thoughts and an already implemented solution, but I am interested to know what the community thinks about it.
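For reference, the "response as a stream or string" starting point might look like the sketch below. It assumes the page is UTF-8 encoded (a real page may declare something else), and the names HtmlFetcher, ReadAll, and Fetch are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class HtmlFetcher
{
    // Reads an entire stream into a string using the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the body at the given URL as text.
    // Assumption: the body is UTF-8; the Content-Type header is not inspected.
    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var body = response.GetResponseStream())
        {
            return ReadAll(body, Encoding.UTF8);
        }
    }
}
```

At this point the HTML is still only a string; turning it into a navigable document is the part the question is about.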
I am using the mshtml API.
Just reference the mshtml assembly, then add the namespace.
HTMLDocument, , , API , util , .
If the HTML is well-formed XML, that is, XHTML, you can simply use an XML parser.
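When the markup really is well-formed XHTML, something like the following sketch should work with nothing beyond System.Xml. The class and method names are hypothetical, and LoadXml will throw on ordinary tag-soup HTML, so this only covers the well-formed case.

```csharp
using System;
using System.Xml;

static class XhtmlExample
{
    // Returns the href of every <a> element in a well-formed XHTML string.
    public static string[] LinkTargets(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // GetElementsByTagName matches on the qualified name, so it also
        // finds elements that live in the default XHTML namespace.
        XmlNodeList anchors = doc.GetElementsByTagName("a");
        var hrefs = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
        {
            hrefs[i] = ((XmlElement)anchors[i]).GetAttribute("href");
        }
        return hrefs;
    }
}
```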
Use the HTML Agility Pack. It is a .NET library that parses HTML into a DOM you can query.
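A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. AgilityExample and LinkTargets are illustrative names, not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class AgilityExample
{
    // Collects the href of every anchor in an HTML string, even if the
    // markup is not well-formed XML.
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                hrefs.Add(a.GetAttributeValue("href", string.Empty));
            }
        }
        return hrefs;
    }
}
```

Unlike XmlDocument, this does not require the input to be cleaned up first, which is the main reason to reach for it.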
You can use Tidy.NET to clean up the HTML and convert it to XHTML. The result can then be loaded into an XmlDocument, like this:
// Requires: using System.IO; using System.Text; using System.Xml; using TidyNet;
Tidy document = new Tidy();
TidyMessageCollection messageCollection = new TidyMessageCollection();

// Ask Tidy for XHTML output without a DOCTYPE, encoded as UTF-8.
document.Options.DocType = DocType.Omit;
document.Options.Xhtml = true;
document.Options.CharEncoding = CharEncoding.UTF8;
document.Options.LogicalEmphasis = true;
document.Options.MakeClean = false;
document.Options.QuoteNbsp = false;
document.Options.SmartIndent = false;
document.Options.IndentContent = false;
document.Options.TidyMark = false;
document.Options.DropFontTags = false;
document.Options.QuoteAmpersand = true;
document.Options.DropEmptyParas = true;

// Copy the HTML string (xmlResult) into a stream that Tidy can read.
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] array = Encoding.UTF8.GetBytes(xmlResult);
input.Write(array, 0, array.Length);
input.Position = 0;

// Tidy the markup and decode the result back into a string.
document.Parse(input, output, messageCollection);
string tidyXhtml = Encoding.UTF8.GetString(output.ToArray());

// Load the tidied XHTML into an XmlDocument.
XmlDocument outputXml = new XmlDocument();
outputXml.LoadXml(tidyXhtml);
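One thing that tends to trip people up afterwards: XHTML elements normally sit in the http://www.w3.org/1999/xhtml namespace, so plain XPath such as //title finds nothing. The sketch below assumes the tidied output declares that namespace. XhtmlQuery and FirstTitle are hypothetical names, and the prefix "x" is arbitrary.

```csharp
using System.Xml;

static class XhtmlQuery
{
    // Returns the text of the first <title> element in the XHTML namespace,
    // or null if there is none.
    public static string FirstTitle(XmlDocument xhtml)
    {
        // XPath has no notion of a default namespace, so bind a prefix to it.
        var ns = new XmlNamespaceManager(xhtml.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNode title = xhtml.SelectSingleNode("//x:title", ns);
        return title == null ? null : title.InnerText;
    }
}
```

With the snippet above, that would be called as XhtmlQuery.FirstTitle(outputXml).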
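var browser = new System.Windows.Forms.WebBrowser();
// Navigate is asynchronous: the document is only complete once DocumentCompleted fires.
browser.DocumentCompleted += (sender, e) =>
{
    var doc = browser.Document;
};
browser.Navigate(new System.Uri("http://example.com"));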
See the HtmlDocument members for what you can do with it. For example, doc.All is an HtmlElementCollection, which implements ICollection and contains HtmlElement objects.
HtmlElement.DomElement gives you the underlying mshtml object, if you need it.
The result is a System.Windows.Forms.HtmlDocument, which exposes the page as a DOM.
, HTTP, , HTML ( ), , , , .
HTTP , , , , , . , , HTTPWebResponse, .
Besides the HTML Agility Pack, there is also HtmlMonkey (an HTML parser) on GitHub.
. , HTML-, DOM, .