The fastest way to get a list of <title> values from all pages on the localhost website

Question

Essentially, I want to crawl my local site and build a list of every page's title and URL, as in:

    http://localhost/mySite/Default.aspx My Home Page
    http://localhost/mySite/Preferences.aspx My Preferences
    http://localhost/mySite/Messages.aspx Messages

I am running Windows. I am open to anything that works: a C# console application, PowerShell, some existing tool, etc. We can assume that the <title> tag exists in every document.

Note: I need to actually crawl the pages rather than read the source files, since the title may be set in code rather than in markup.

+4

web-crawler screen-scraping

Larsenal Dec 02 '08 at 20:01

5 answers

I think a script like the one Adam Rosenfield suggested is what you want, but if you need the actual URLs, try wget. With suitable options it will print a list of all the pages on your site (it will also download them, which you can probably suppress with --spider). The wget program is available through the regular Cygwin installer.
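For readers who want URLs and titles in one pass, the recursive crawl described above can be sketched in Python using only the standard library. This is a minimal sketch, not what wget does internally: the names PageParser, fetch and crawl_titles are made up for this example, the crawl only follows <a href> links on the same host, and it assumes every linked resource is an HTML page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the <title> text and every <a href> found in one page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also arrives as "title".
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def fetch(url):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_titles(start_url, fetch=fetch, max_pages=500):
    """Breadth-first crawl of same-host pages, returning (url, title) pairs."""
    host = urlparse(start_url).netloc
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:  # URLError and HTTPError are subclasses of OSError
            continue
        parser = PageParser()
        parser.feed(html)
        # Collapse whitespace so a title split across lines prints on one line.
        results.append((url, " ".join("".join(parser.title_parts).split())))
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Calling crawl_titles("http://localhost/mySite/Default.aspx") and printing each pair separated by a tab should give a list in the format the question asks for. Because the pages are requested over HTTP, titles set in code are included; pages that no link points to are not found.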

+3

rmeador Dec 02 '08 at 20:37

OK, I'm not familiar with Windows, but to point you in the right direction: use an XSLT transform with

<xsl:value-of select="/html/head/title" /> to return the title or, if you prefer, use the XPath '/html/head/title' directly.
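As a rough illustration of the XPath route, here is a sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset. It assumes the page is well-formed XHTML; ordinary HTML with unclosed tags, or entities such as &nbsp; that the XML parser does not know, will raise a parse error. The function name title_from_xhtml and the sample document are invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def title_from_xhtml(markup: str) -> Optional[str]:
    """Return the <title> text of a well-formed XHTML document, or None."""
    root = ET.fromstring(markup)  # the root element is <html>
    # Relative to <html>, this is the /html/head/title path. The {*} prefix
    # matches any namespace, including the usual XHTML one (Python 3.8+).
    node = root.find("./{*}head/{*}title")
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split())


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>My Home Page</title></head>
  <body><p>Hello</p></body>
</html>"""
print(title_from_xhtml(sample))  # My Home Page
```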

0

Roalt Dec 02 '08 at 20:23

I would use wget as described above. Just make sure you don't have any spider traps on your site.

0

Chris nava Dec 02 '08 at 21:58

You should consider using the Scrapy shell.

See the tutorial:

http://doc.scrapy.org/intro/tutorial.html

In the console, enter something like this:

    hxs.x('/html/head/title/text()').extract()

If you need all the titles, you have to write a spider, which is really easy.

Also consider switching to Linux :P

0

llazzaro Jul 02 '09 at 3:34

Adam Rosenfield · Accepted Answer · 2008-12-02T20:29:37+0000

A quick and dirty Cygwin Bash script that does the job:

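    #!/bin/bash
    # Print "<path><TAB><title>" for every .aspx file under $WWWROOT.
    find "$WWWROOT" -iname '*.aspx' | while IFS= read -r file; do
        printf '%s\t' "$file"
        # Join the lines so a <title> split across lines still matches.
        # Note: no -i here; sed -i edits files in place and cannot read a pipe.
        tr '\n' ' ' < "$file" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p'
        echo
    done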

Explanation: this finds every .aspx file under the $WWWROOT directory, replaces all newlines with spaces so that no line break falls between <title> and </title>, and then extracts the text between those tags.
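For readers on Windows without Cygwin, the same file-scanning idea can be sketched in Python. This is only an illustration of the approach in this answer, with an invented function name (titles_under). Like the script, it reads the source files, so as the question notes it cannot see titles that are set in code.

```python
import re
from pathlib import Path

# DOTALL lets "." cross line breaks, which replaces the tr '\n' ' ' step.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def titles_under(root):
    """Yield (path, title) for each .aspx file below root that has a <title>."""
    # rglob matching follows the platform's case rules; find -iname
    # in the shell script is always case-insensitive.
    for path in sorted(Path(root).rglob("*.aspx")):
        text = path.read_text(encoding="utf-8", errors="replace")
        match = TITLE_RE.search(text)
        if match:
            # Collapse runs of whitespace, including newlines, to one space.
            yield path, " ".join(match.group(1).split())
```

Looping over titles_under(r"C:\inetpub\wwwroot\mySite") and printing each pair would list one file per line; that path is a placeholder, so substitute your own site root.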
