How to get Nutch to skip already-crawled URLs and crawl new ones

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites, and Nutch successfully crawled them and saved the data in MySQL. I am using Solr 4.0.0 for search.

Now my problem is that when I try to crawl some other site, for example trailer.apple.com or any other site, it always crawls the last crawled URLs. I even deleted the last crawled URLs from the seeds.txt file and entered new URLs, but Nutch does not crawl the new URLs.

Can someone tell me what I am actually doing wrong?

Also, please suggest any Nutch plugin that can help with crawling video and movie sites.

Any help would be really appreciated.

web-crawler nutch
3 answers

I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.

The first time I started Nutch, I did the following:

  • Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without the quotes).

  • In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt add a new line:

    # accept anything else
    ^http://(.[a-z0-9]*)*www.domain01.com/sport/

  • In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt add a new line:

    # accept anything else
    ^http://(.[a-z0-9]*)*www.domain01.com/sport/

... and everything was in order.
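One detail worth double-checking: in a stock regex-urlfilter.txt each rule starts with `+` (accept) or `-` (reject), so an accept rule for the sport section would normally be written like the sketch below (the pattern itself is only illustrative):

    # accept only the sport section of the domain
    +^http://([a-z0-9]*\.)*www.domain01.com/sport/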

Then I made the following changes:

  • Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com.

  • Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:

    # accept anything else
    ^http://(.[a-z0-9]*)www.domain02.com/sport/
    ^http://(.[a-z0-9])*www.domain03.com/sport/

  • Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:

    # accept anything else
    ^http://(.[a-z0-9]*)www.domain02.com/sport/
    ^http://(.[a-z0-9])*www.domain03.com/sport/

Next, I execute the following commands:

    bin/nutch inject urls
    bin/nutch generate urls
    bin/nutch updatedb
    bin/nutch crawl urls -depth 3
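For reference, one crawl cycle run step by step usually looks something like the sketch below (the `-topN` value is just an example, and `bin/nutch crawl` already wraps these steps). Note that `inject` only adds URLs to the storage backend; removing lines from seed.txt afterwards does not remove the URLs that were already injected into MySQL.

    # one crawl cycle in Nutch 2.x (sketch)
    bin/nutch inject urls          # add seed URLs to the storage backend
    bin/nutch generate -topN 100   # select a batch of URLs due for fetching
    bin/nutch fetch -all           # fetch the generated batch
    bin/nutch parse -all           # parse the fetched pages
    bin/nutch updatedb             # fold the results back into the crawl database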

And Nutch still crawls the site www.domain01.com.

I do not know why.

I am using Nutch 2.1 on Debian Linux 6.0.5 (x64), and Linux runs in a virtual machine on Windows 7 (x64).


This post is a bit dated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ . Perhaps the pages that were crawled last are the ones that change most often. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl in nutch-site.xml. Also, the seed.txt file is only used to inject URLs; once they have been injected, Nutch no longer uses it (unless you manually inject again).
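For example, a minimal sketch of the relevant nutch-site.xml properties (the property names come from the stock Nutch configuration; the interval value is just an illustration):

    <!-- re-fetch pages after one day instead of the default 30 days -->
    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
    </property>
    <!-- let Nutch adapt the interval to how often a page actually changes -->
    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>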

Another configuration that might help is regex-urlfilter.txt, if you want to restrict the crawl to specific locations or exclude specific domains/pages, etc.
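For instance, a small sketch of such a filter (the domain names are placeholders):

    # reject a domain you no longer want to crawl
    -^http://(www\.)?domain01\.com/
    # accept only the sport section of another domain
    +^http://(www\.)?domain02\.com/sport/
    # reject everything else
    -.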

Cheers.


Just add the following property to your nutch-site.xml. It works for me; check it out:

    <property>
      <name>file.crawl.parent</name>
      <value>false</value>
    </property>

And change regex-urlfilter.txt to:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):

    # accept anything else
    +.

After that, delete the index directory, either manually or with a command such as:

    rm -r $NUTCH_HOME/indexdir

Then run your crawl command again.
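For reference, a typical one-shot crawl command in Nutch 2.1 looks something like this (the Solr URL, depth, and topN are placeholders for your own values):

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50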

