Why is there a limit on the number of simultaneous downloads?

I am trying to create my own simple web crawler. I want to download files with specific extensions from a URL. I have the following code:

  private void button1_Click(object sender, RoutedEventArgs e)
  {
      if (bw.IsBusy) return;
      bw.DoWork += new DoWorkEventHandler(bw_DoWork);
      bw.RunWorkerAsync(new string[] { URL.Text, SavePath.Text, Filter.Text });
  }
  //--------------------------------------------------------------------------------------------
  void bw_DoWork(object sender, DoWorkEventArgs e)
  {
      try
      {
          ThreadPool.SetMaxThreads(4, 4);
          string[] strs = e.Argument as string[];
          Regex reg = new Regex("<a(\\s*[^>]*?){0,1}\\s*href\\s*\\=\\s*\\\"([^>]*?)\\\"\\s*[^>]*>(.*?)</a>",
              RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
          int i = 0;
          string domainS = strs[0];
          string Extensions = strs[2];
          string OutDir = strs[1];
          var domain = new Uri(domainS);
          string[] Filters = Extensions.Split(new char[] { ';', ',', ' ' }, StringSplitOptions.RemoveEmptyEntries);
          string outPath = System.IO.Path.Combine(OutDir, string.Format("File_{0}.html", i));

          WebClient webClient = new WebClient();
          string str = webClient.DownloadString(domainS);
          str = str.Replace("\r\n", " ").Replace('\n', ' ');

          MatchCollection mc = reg.Matches(str);
          int NumOfThreads = mc.Count;

          Parallel.ForEach(mc.Cast<Match>(),
              new ParallelOptions { MaxDegreeOfParallelism = 2, },
              mat =>
              {
                  string val = mat.Groups[2].Value;
                  var link = new Uri(domain, val);
                  foreach (string ext in Filters)
                      if (val.EndsWith("." + ext))
                      {
                          Download((object)new object[] { OutDir, link });
                          break;
                      }
              });

          throw new Exception("Finished !");
      }
      catch (System.Exception ex)
      {
          ReportException(ex);
      }
      finally
      {
      }
  }
  //--------------------------------------------------------------------------------------------
  private static void Download(object o)
  {
      try
      {
          object[] objs = o as object[];
          Uri link = (Uri)objs[1];
          string outPath = System.IO.Path.Combine((string)objs[0], System.IO.Path.GetFileName(link.ToString()));
          if (!File.Exists(outPath))
          {
              //WebClient webClient = new WebClient();
              //webClient.DownloadFile(link, outPath);
              DownloadFile(link.ToString(), outPath);
          }
      }
      catch (System.Exception ex)
      {
          ReportException(ex);
      }
  }
  //--------------------------------------------------------------------------------------------
  private static bool DownloadFile(string url, string filePath)
  {
      try
      {
          HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
          request.UserAgent = "Web Crawler";
          request.Timeout = 40000;
          WebResponse response = request.GetResponse();
          Stream stream = response.GetResponseStream();
          using (FileStream fs = new FileStream(filePath, FileMode.CreateNew))
          {
              const int siz = 1000;
              byte[] bytes = new byte[siz];
              for (; ; )
              {
                  int count = stream.Read(bytes, 0, siz);
                  fs.Write(bytes, 0, count);
                  if (count == 0) break;
              }
              fs.Flush();
              fs.Close();
          }
      }
      catch (System.Exception ex)
      {
          ReportException(ex);
          return false;
      }
      finally
      {
      }
      return true;
  }

The problem is that although it works fine for two concurrent downloads:

  new ParallelOptions { MaxDegreeOfParallelism = 2, } 

... it does not work for higher degrees of parallelism, such as:

  new ParallelOptions { MaxDegreeOfParallelism = 5, } 

... and I get connection timeout exceptions.

At first I thought it was due to WebClient:

  //WebClient webClient = new WebClient();
  //webClient.DownloadFile(link, outPath);

... but when I replaced it with the DownloadFile function that uses HttpWebRequest, I still got the error.

I tested it on many web pages and nothing changed. I also confirmed with the Chrome extension "Download Master" that these web servers allow multiple concurrent downloads. Does anyone know why I get a timeout exception when trying to download many files in parallel?

2 answers

You need to set ServicePointManager.DefaultConnectionLimit. By default, the limit on concurrent connections to the same host is 2. Also see the related SO post on configuring connectionManagement in web.config.
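
A minimal sketch, assuming you set it once before the first request is issued (the value 10 is illustrative; pick something at least as large as your MaxDegreeOfParallelism):

  // Raise the per-host connection limit (default is 2) before any requests are made.
  // 10 is an example value; match it to the parallelism you actually need.
  System.Net.ServicePointManager.DefaultConnectionLimit = 10;

With the default of 2, every request beyond the two active connections to that host just queues up inside the ServicePoint, which is why higher degrees of parallelism end in timeouts.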


As far as I know, IIS does limit the total number of inbound and outbound connections, but that limit is on the order of 10^3, not ~5.

Is it possible that you are testing against the same URL each time? Many web servers limit the number of concurrent connections from a single client. For example, are you testing by trying to download 10 copies of http://www.google.com?

If so, you can try a list of different sites, for example:
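
Something along these lines (the URLs are placeholders; any set of distinct hosts will do):

  // Hypothetical test URLs on different hosts, so no single server's
  // per-client connection limit comes into play.
  string[] testUrls =
  {
      "http://www.example.com/",
      "http://www.example.org/",
      "http://www.example.net/"
  };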

