How to get a web page title without loading the entire page source

Question

I am looking for a method that will allow me to get the title of a webpage and save it as a string.

However, all the solutions that I have found so far include downloading the source code for the page, which is not very practical for a large number of web pages.

The only alternatives I can see are to limit the length of the string, load only a given number of characters, or stop once the title tag has been reached. But won't that still be quite large?

thanks

+4

c#

quotidian Jul 25 '12 at 15:09

2 answers

The easiest way to handle this is to download the page and then parse it:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // Returns Task rather than void so that callers can await it and observe exceptions.
    private async Task getSite(string url)
    {
        HttpClient hc = new HttpClient();
        HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute));
        string source = await response.Content.ReadAsStringAsync();
        // process the source here
    }

To process the source, you can use the method described in the article "Retrieving Content from HTML Tags".
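For the processing step, a minimal sketch might look like the following. The `TitleParser` class and `ExtractTitle` method are hypothetical names and are not taken from the article mentioned above. A regular expression is only a rough approximation of real HTML parsing, so treat this as an illustration rather than a robust parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // Matches <title ...>text</title>, ignoring case and allowing line breaks
    // inside the title text.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(.*?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the title with HTML entities decoded,
    // or an empty string when no title element is found.
    public static string ExtractTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value) : "";
    }
}
```

Inside the method above, this could be called as `string title = TitleParser.ExtractTitle(source);`.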

+1

user151243 Oct 4 '12 at 1:43

newfurniturey · Accepted Answer · 2012-07-25T15:29:19+0000

Since the <title> tag is in the HTML itself, there is no way to avoid downloading the file and still find "just the title". You should be able to download part of the file until you have read in the <title> tag or the </head> tag, and then stop, but you still need to download (at least part of) the file.
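One way to ask for only part of the file is an HTTP Range request. The sketch below is an illustration under assumptions and is not part of the original answer: the `PartialDownload` class, its method names, and the 8 KB default are hypothetical. A server is free to ignore the `Range` header and send the whole document, in which case this code reads the full body.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class PartialDownload
{
    private static readonly HttpClient Client = new HttpClient();

    // Builds a GET request that asks the server for only the first maxBytes bytes.
    public static HttpRequestMessage BuildRangeRequest(string url, int maxBytes)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, new Uri(url, UriKind.Absolute));
        request.Headers.Range = new RangeHeaderValue(0, maxBytes - 1);
        return request;
    }

    // Returns the start of the page as text. A server that honors the header
    // answers with 206 Partial Content; one that ignores it answers with 200
    // and the complete body.
    public static async Task<string> FetchPrefixAsync(string url, int maxBytes = 8192)
    {
        using (HttpRequestMessage request = BuildRangeRequest(url, maxBytes))
        using (HttpResponseMessage response = await Client.SendAsync(
            request, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Because support for ranges varies between servers, reading the response stream and stopping early remains the more dependable option.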

This can be achieved by using HttpWebRequest / HttpWebResponse and reading data from the response stream until we have read in either a <title></title> block or the </head> tag. I added the check for </head> because, in valid HTML, the title block must appear inside the head block, so with this check we never parse the entire file (unless, of course, there is no head block).

The following should be able to complete this task:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    // url is the address of the page to inspect
    string title = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
        using (HttpWebResponse response = (request.GetResponse() as HttpWebResponse))
        using (Stream stream = response.GetResponseStream())
        {
            // compiled regex to check for <title></title> block;
            // Singleline lets the title text span line breaks
            Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>",
                RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
            int bytesToRead = 8092;
            byte[] buffer = new byte[bytesToRead];
            string contents = "";
            int length = 0;
            while ((length = stream.Read(buffer, 0, bytesToRead)) > 0)
            {
                // convert the byte-array to a string and add it to the rest of the
                // contents that have been downloaded so far
                contents += Encoding.UTF8.GetString(buffer, 0, length);

                Match m = titleCheck.Match(contents);
                if (m.Success)
                {
                    // we found a <title></title> match =]
                    title = m.Groups[1].Value;
                    break;
                }
                else if (contents.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    // reached end of head-block; no title found =[
                    break;
                }
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }

UPDATE: The original source example has been updated to use a compiled Regex and a using statement for the Stream, for better efficiency and maintainability.
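The same early-exit idea can be separated from the network code, which makes it easier to try out against any stream. The following is a sketch under stated assumptions, not the answer's own code: `TitleReader` and `ReadTitle` are hypothetical names, the page is assumed to be UTF-8, and a `StreamReader` replaces the manual `Encoding.UTF8.GetString` call so that a multi-byte character split across two reads is still decoded correctly.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleReader
{
    private static readonly Regex TitleCheck = new Regex(
        @"<title\b[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads the stream chunk by chunk and stops at the first complete
    // <title> block or at </head>, whichever comes first.
    // Note: disposing the StreamReader also closes the stream passed in.
    public static string ReadTitle(Stream stream, int chunkSize = 4096)
    {
        var contents = new StringBuilder();
        char[] buffer = new char[chunkSize];
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            int length;
            while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                contents.Append(buffer, 0, length);
                string text = contents.ToString();

                Match m = TitleCheck.Match(text);
                if (m.Success)
                {
                    return m.Groups[1].Value;
                }
                if (text.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    break; // end of head block reached without a title
                }
            }
        }
        return "";
    }
}
```

With a web response this could be used as `title = TitleReader.ReadTitle(response.GetResponseStream());`, and a `MemoryStream` works the same way for local experiments.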
