Scrambling a webpage using C # and HTMLAgility

I read that HTMLAgility 1.4 is a great solution for cleaning a web page. As a new programmer, I hope that I can contribute to this project. I do this as a form of C # application. The page I'm working with is pretty straight forward. The information I need is stuck between the two tags and, My goal is to pull out the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, Last Modified. From the page and send data to the sql table. One of them is that there is also a small png pic, which also needs to be captured from src = "/ partcode / number.

I have no code to be completed. I thought this piece of code would tell me I'm going in the right direction. Even entering into debugging, I do not see that he is doing anything. Maybe someone can point me in the right direction. The more detailed, the better, as it is obvious that I have something to learn. Thank you, I would really appreciate it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;

namespace Stats
{
    class PartParser
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml("http://localhost");//my understanding this reads the entire page in?
            var tables = doc.DocumentNode.SelectNodes("//table");// I assume that this sets up the search for words containing table

        }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();    
            }
        }
    }
}

Web Code:

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td><img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td><img src="/partcode/manu/00721408" alt="00721408" /></td></tr>    
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>    
<tr><td>Last Modified</td><td></td><td>26 Jan 2009,  8:08 PM</td></tr>    
<tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
</table>
<p>
</body>
</html>
+5
source share
3 answers

Check out this article about 4GuysFromRolla

http://www.4guysfromrolla.com/articles/011211-1.aspx

, HTML Agility Pack, . , , .

+6

:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");   

LoadHtml(html) html , , - :

HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc  = htmlWeb.Load("http://stackoverflow.com");
+6

, HTML. , ( rows, cells case). 127.0.0.1, . Main .

HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");    

var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
    var cells = row.SelectNodes("./td");
    string title = cells[0].InnerText;
    var valueRow = cells[2];
    switch (title)
    {
        case "Part-Num":
            string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Part-Num:\t" + partNum);
            break;
        case "Manu-Number":
            string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Manu-Num:\t" + manuNumber);
            break;
        case "Description":
            string description = valueRow.InnerText;
            Console.WriteLine("Description:\t" + description);
            break;
        case "Manu-Country":
            string manuCountry = valueRow.InnerText;
            Console.WriteLine("Manu-Country:\t" + manuCountry);
            break;
        case "Last Modified":
            string lastModified = valueRow.InnerText;
            Console.WriteLine("Last Modified:\t" + lastModified);
            break;
        case "Last Modified By":
            string lastModifiedBy = valueRow.InnerText;
            Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
            break;
    }
}
+4

All Articles