Anyone got C# code to parse robots.txt and evaluate URLs against it?

Short question:

Does anyone have C# code to parse robots.txt and then evaluate URLs against it, to see if they would be excluded or not?

Longer question:

I am creating a sitemap for a new site that has not yet been released to Google. The sitemap has two modes: a user mode (i.e., a traditional sitemap) and an "admin" mode.

Admin mode will display all possible URLs on the site, including custom entry-point URLs and URLs for a specific external partner - for example, example.com/oprah for anyone who sees our site on Oprah. I want to be able to track published links somewhere other than an Excel spreadsheet.

It occurred to me that someone could post the /oprah link on their blog or somewhere else. We don't actually want this "mini-site" indexed, because that would let people who aren't Oprah viewers find the special Oprah offers.

So at the same time I was creating the sitemap, I was also adding URLs like /oprah to be excluded from our robots.txt file.

Then (and this is the actual question) I thought: "Wouldn't it be nice to show on the sitemap whether each URL is excluded from indexing by robots.txt or not?" It should be quite simple - just parse robots.txt and then evaluate each link against it.

However, this is a "bonus feature", and I certainly don't have time to go off and write it (even though I suspect it's probably not that hard), so I was wondering if anyone has already written any code to parse robots.txt?
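For the "bonus feature" itself, here is a minimal, untested sketch of what such a parser could look like. The `SimpleRobotsTxt` class and its member names are mine, not from any existing library; it handles only the `User-agent: *` group and plain prefix `Disallow` rules, with no support for `Allow`, wildcards, or per-bot groups:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal robots.txt evaluator (a sketch, not a full implementation):
// collects Disallow rules from the "User-agent: *" group and checks
// whether a given path falls under any of them.
public class SimpleRobotsTxt
{
    private readonly List<string> _disallowed = new List<string>();

    public SimpleRobotsTxt(string robotsTxtContent)
    {
        bool inStarGroup = false;
        foreach (var rawLine in robotsTxtContent.Split('\n'))
        {
            // Strip comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.Length == 0) continue;

            var parts = line.Split(new[] { ':' }, 2);
            if (parts.Length != 2) continue;

            var field = parts[0].Trim().ToLowerInvariant();
            var value = parts[1].Trim();

            if (field == "user-agent")
            {
                inStarGroup = (value == "*");
            }
            else if (field == "disallow" && inStarGroup && value.Length > 0)
            {
                _disallowed.Add(value);
            }
        }
    }

    // True if the path is blocked. Simple prefix matching is how the
    // original robots.txt convention defines Disallow rules.
    public bool IsExcluded(string path)
    {
        return _disallowed.Any(rule =>
            path.StartsWith(rule, StringComparison.Ordinal));
    }
}

// Example usage:
// var robots = new SimpleRobotsTxt("User-agent: *\nDisallow: /oprah");
// robots.IsExcluded("/oprah/offer");  // true
// robots.IsExcluded("/about");        // false
```

For a sitemap annotation feature this prefix-only matching is probably enough; a real crawler would also need `Allow` rules and the longest-match precedence that major search engines use.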

Tags: c#, robots.txt
3 answers

I hate to say it, but just Google C# robots.txt parser and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a Searcharoo.Indexer.RobotsTxt class, described as:

  • Check for, and if present, download and parse the robots.txt file on the site
  • Provide an interface for the Spider to check each URL against the robots.txt rules

I like the code and the tests at http://code.google.com/p/robotstxt/ and would recommend it as a starting point.


A little self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I wrote my own:

http://nrobots.codeplex.com/

I'd welcome any feedback.

