Java crawler library - recursive HTTP subtree download by parsing directory listings

My application currently reads data by copying a file system tree from a remote machine over a shared drive, so from the application's point of view it behaves like a deep copy of the file system.

This solution is somewhat restrictive, and I also want to support a second option - copying the subtree via HTTP.

The library should do something like wget --recursive: parse the directory listings and use them to walk down the tree.
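To make the requirement concrete, here is roughly the wheel I don't want to reinvent - a minimal sketch using jsoup that walks an Apache-style "Index of" page; the start URL and the link-filtering heuristics are placeholders and would need hardening for real use:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/**
 * Minimal sketch: recursively mirror an Apache-style "Index of" directory
 * listing over HTTP. No retries, no URL decoding, no cycle protection
 * beyond skipping parent/absolute links -- illustration only.
 */
public class HttpTreeMirror {

    public static void mirror(String dirUrl, Path target) throws Exception {
        Files.createDirectories(target);
        Document listing = Jsoup.connect(dirUrl).get();
        for (Element link : listing.select("a[href]")) {
            String href = link.attr("href");
            // Heuristic: only follow plain relative links, so we stay
            // inside the subtree (skips "Parent Directory", sort links).
            if (href.isEmpty() || href.startsWith("?")
                    || href.startsWith("/") || href.startsWith("..")
                    || href.contains("://")) {
                continue;
            }
            String childUrl = link.absUrl("href");
            if (href.endsWith("/")) {
                // Subdirectory: recurse into it.
                mirror(childUrl, target.resolve(href.substring(0, href.length() - 1)));
            } else {
                // Plain file: stream it to disk.
                try (InputStream in = new URL(childUrl).openStream()) {
                    Files.copy(in, target.resolve(href), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL -- point it at the root of the HTTP export.
        mirror("http://example.com/exports/data/", Paths.get("local-copy"));
    }
}
```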

I could not find any Java library for this.

I could implement this functionality myself (using NekoHTML or something similar), but I don't want to reinvent the wheel.

Is there such a library that I can easily use in my application?

Ideally:

  • published to the Maven Central repository, since I use Maven to build
  • depending on as few other libraries as possible
  • no need to support robot exclusion - it will only run against a limited set of internal servers

Thanks.

Note: please post pointers to the home pages of libraries you have personally used.

1 answer

Norconex HTTP Collector crawls tree-like sites from one or more start URLs. It can be embedded as a Java library in your application or run as a command-line application, and you can decide what to do with each document it crawls. Being a full-blown web crawler, it probably does more than you need, but you can configure it to suit your needs.
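For illustration, a rough sketch of embedding it follows. This is based on my recollection of the 2.x Java API, so the exact class and method names (HttpCollectorConfig, setStartURLs, setIgnoreRobotsTxt, and so on) are assumptions to verify against the product documentation, and the start URL is a placeholder:

```java
import com.norconex.collector.http.HttpCollector;
import com.norconex.collector.http.HttpCollectorConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class CrawlSubtree {
    public static void main(String[] args) {
        // Crawler-level settings: where to start and how to behave.
        // Method names here follow the 2.x API as I remember it -- check the docs.
        HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
        crawlerConfig.setId("subtree-crawler");
        crawlerConfig.setStartURLs("http://example.com/exports/data/"); // placeholder
        crawlerConfig.setIgnoreRobotsTxt(true); // matches your "no robot exclusion" requirement
        crawlerConfig.setMaxDepth(-1);          // unlimited depth

        // Collector-level settings: one collector can run several crawlers.
        HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
        collectorConfig.setId("subtree-collector");
        collectorConfig.setCrawlerConfigs(crawlerConfig);

        // false = fresh run rather than resuming a previous incomplete one.
        new HttpCollector(collectorConfig).start(false);
    }
}
```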

For example, by default it extracts the text found in your documents and lets you decide what to do with that text by plugging in a "Committer" (that is, the place where the extracted content is "committed"). In your case, I think you want only the raw documents and can ignore the text-extraction part. You can do this by plugging in your own document processor, then filtering out the documents so that they stop being processed once you have handled them your own way.

The project is open source, hosted on GitHub, and fully configurable. It supports robots.txt, but you can disable that if you want. The only drawback for you is that it has several dependencies, but since you are using Maven, they should be resolved automatically without effort. You will find the Maven repository information on the product website.

