Simple web crawler in C#

I created a simple web crawler, but I want to add recursion so that every opened page yields the URLs found on it, and I have no idea how to do this. I would also like to use threads to make it faster. Here is my code:

    namespace Crawler
    {
        public partial class Form1 : Form
        {
            String Rstring;

            public Form1()
            {
                InitializeComponent();
            }

            private void button1_Click(object sender, EventArgs e)
            {
                String URL = textBox1.Text;
                WebRequest myWebRequest = WebRequest.Create(URL);
                WebResponse myWebResponse = myWebRequest.GetResponse(); // returns a response from an Internet resource
                Stream streamResponse = myWebResponse.GetResponseStream(); // returns the data stream from the response
                StreamReader sreader = new StreamReader(streamResponse); // reads the data stream
                Rstring = sreader.ReadToEnd(); // reads it to the end
                String Links = GetContent(Rstring); // extracts the links only
                textBox2.Text = Rstring;
                textBox3.Text = Links;
                streamResponse.Close();
                sreader.Close();
                myWebResponse.Close();
            }

            private String GetContent(String Rstring)
            {
                String sString = "";
                HTMLDocument d = new HTMLDocument();
                IHTMLDocument2 doc = (IHTMLDocument2)d;
                doc.write(Rstring);
                IHTMLElementCollection L = doc.links;
                foreach (IHTMLElement links in L)
                {
                    sString += links.getAttribute("href", 0);
                    sString += "\n";
                }
                return sString;
            }
        }
    }
c# web-crawler
4 answers

I rewrote your GetContent method as follows to collect the new links from a crawled page:

    public ISet<string> GetNewLinks(string content)
    {
        Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
        ISet<string> newLinks = new HashSet<string>();
        foreach (var match in regexLink.Matches(content))
        {
            if (!newLinks.Contains(match.ToString()))
                newLinks.Add(match.ToString());
        }
        return newLinks;
    }

Update

Bugfix: the regex variable should be regexLink. Thanks to @shashlearner for pointing this out (my mistake).
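To make the crawl recursive, one option is a sketch like the following, which assumes the GetNewLinks method above plus a visited set and a depth limit (both of which are additions of mine, not part of the original code):

```csharp
using System.Collections.Generic;
using System.Net;

public class RecursiveCrawler
{
    // Tracks pages already fetched so we never crawl the same URL twice.
    private readonly ISet<string> visited = new HashSet<string>();

    public void Crawl(string url, int depth, int maxDepth)
    {
        // Stop at the depth limit, or if this page was already seen.
        if (depth > maxDepth || !visited.Add(url))
            return;

        string content;
        using (var client = new WebClient())
        {
            try
            {
                content = client.DownloadString(url);
            }
            catch (WebException)
            {
                return; // skip pages that fail to download
            }
        }

        // Recurse into every link found on this page.
        foreach (string link in GetNewLinks(content))
            Crawl(link, depth + 1, maxDepth);
    }
}
```

Note that GetNewLinks can return relative URLs, so in practice you would want to resolve each link against the current page (for example with `new Uri(baseUri, link)`) before recursing.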


I created something similar using Reactive Extensions (Rx):

https://github.com/Misterhex/WebCrawler

Hope it helps you.

    Crawler crawler = new Crawler();
    IObservable observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
    observable.Subscribe(
        onNext: Console.WriteLine,
        onCompleted: () => Console.WriteLine("Crawling completed"));

Below is an answer/recommendation.

I believe you should use a DataGridView instead of a TextBox, since it is much easier to see the found links (URLs) in the GUI.

You can change:

 textBox3.Text = Links; 

to

  dataGridView.DataSource = Links; 
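Note that `DataSource` expects a collection rather than a single string, so in practice you would first split the `Links` string into a list of row objects. A sketch (the `LinkRow` wrapper type is my own illustration, since a DataGridView binds columns to public properties):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// DataGridView binds one column per public property,
// so wrap each URL in a small row object.
public class LinkRow
{
    public string Url { get; set; }
}

public static class LinkBinding
{
    // Split the newline-separated Links string into bindable rows.
    public static List<LinkRow> ToRows(string links)
    {
        return links
            .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(u => new LinkRow { Url = u })
            .ToList();
    }
}

// Usage in the form:
// dataGridView.DataSource = LinkBinding.ToRows(Links);
```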

Now for something you did not include in the question: your `using System...` directives. It would help to show which ones were used, since the code cannot be compiled or fully understood without them.


In terms of design: I have written several web crawlers. Basically, you want to implement a depth-first search using a stack data structure rather than recursive calls, which can run into stack memory issues on deep crawls. You can also use breadth-first search by swapping the stack for a queue. Good luck.
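An iterative depth-first crawl along those lines can be sketched as follows (the `fetchPage` and `extractLinks` delegates are placeholders standing in for whatever download and link-extraction code you use):

```csharp
using System;
using System.Collections.Generic;

public static class CrawlLoop
{
    // Depth-first crawl with an explicit stack; replacing the Stack
    // with a Queue (Push/Pop -> Enqueue/Dequeue) makes it breadth-first.
    public static ISet<string> CrawlDepthFirst(
        string startUrl,
        Func<string, string> fetchPage,
        Func<string, IEnumerable<string>> extractLinks)
    {
        var toVisit = new Stack<string>();
        var visited = new HashSet<string>();
        toVisit.Push(startUrl);

        while (toVisit.Count > 0)
        {
            string url = toVisit.Pop();
            if (!visited.Add(url))
                continue; // already crawled this page

            string content = fetchPage(url);
            foreach (string link in extractLinks(content))
                if (!visited.Contains(link))
                    toVisit.Push(link);
        }
        return visited;
    }
}
```

Because the pending URLs live on the heap-allocated stack object instead of the call stack, this version handles deep link chains that would overflow a recursive implementation.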

