I have the same problem. Nutch keeps fetching only the old URLs, even though they are no longer in seed.txt.
The first time I start nutch, I do the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes).
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt add a new line:
# accept anything else
^http://(.[a-z0-9]*)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt add a new line:
# accept anything else
^http://(.[a-z0-9]*)*www.domain01.com/sport/
... and everything worked correctly.
Then I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com.
Remove the www.domain01.com line from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://(.[a-z0-9]*)*www.domain02.com/sport/
^http://(.[a-z0-9]*)*www.domain03.com/sport/
Remove the www.domain01.com line from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://(.[a-z0-9]*)*www.domain02.com/sport/
^http://(.[a-z0-9]*)*www.domain03.com/sport/
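The patterns can be sanity-checked outside Nutch before running a crawl. A quick sketch with grep -E (my own check, not part of Nutch; grep uses POSIX ERE while Nutch uses Java regexes, but for patterns this simple the behavior is the same, and the test URLs below are made up for illustration):

```shell
# A domain02 sport URL should be accepted by the new pattern:
echo "http://www.domain02.com/sport/index.html" \
  | grep -qE '^http://(.[a-z0-9]*)*www.domain02.com/sport/' \
  && echo "domain02 accepted"

# The old domain01 URL should no longer match the new pattern:
echo "http://www.domain01.com/sport/index.html" \
  | grep -qE '^http://(.[a-z0-9]*)*www.domain02.com/sport/' \
  || echo "domain01 rejected"
```

If the first command prints "domain02 accepted" and the second prints "domain01 rejected", the regexes themselves are not the reason the old domain is still being crawled.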
Next, I execute the following commands:
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls the site www.domain01.com. I do not know why.
I am using Nutch 2.1 on Debian 6.0.5 (x64), which runs in a virtual machine on Windows 7 (x64).
Dragan menoski