How do I write a web crawler?

I am working on a small project to analyze the content of some sites that interest me. This is a real DIY project that I'm doing for my own entertainment/enlightenment, so I'd like to code as much of it myself as possible.

Obviously I will need data to feed my application, and I thought I would write a little crawler that would fetch maybe 20 thousand pages of HTML and write them to text files on my hard drive. However, when I looked on SO and other sites, I couldn't find any information on how to do this. Is it possible? There seem to be open-source options available (WebSPHINX?), but I would like to write this myself if possible.

Scheme is the only language I know well, but I thought I would use this project to learn some Java, so I was wondering if there are any Racket or Java libraries that would be useful for this.

So, to summarize my question: what are some good resources to get started on this? How can I get my crawler to request information from other servers? Should I write a simple parser for this, or is that overkill given that I just want to take each whole HTML file and save it as txt?

+5
6 answers

This is entirely possible, and you can definitely do it with Racket. Take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:

http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil

...is probably the place to start. You should be able to fetch a web page and parse it in just a line or two of code.
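Something along those lines might look like this (untested sketch; it assumes HtmlPrag is installed from PLaneT, and the helper name is made up):

(define (fetch-page url-string)   ; #lang racket, (require net/url (planet neil/htmlprag))
  ;; Fetch a page over HTTP and parse it into an SHTML (SXML-style) tree.
  ;; Adjust the planet require to however HtmlPrag is installed on your system.
  (define in (get-pure-port (string->url url-string)))
  (define tree (html->shtml in))
  (close-input-port in)
  tree)

;; Example: (fetch-page "http://example.com/")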

Post a follow-up if you get stuck at any point in the process.

+5

Since you already know Racket, I would start with that.

"Unix tools":

  • Use curl to do the work of downloading each page (you can run it from Racket with system) and save the output to files.
  • Use Racket to extract the URIs from the <a> tags.
    • You can "cheat" and do a simple string or regular-expression match (see the sketch after this list).
    • Or do it "the right way" with a real HTML parser, as the HtmlPrag answer above describes.
    • Consider doing the cheat first, then circling back later to do it the right way.
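For the "cheat" version, a regular-expression sketch might look like this (the pattern is an assumption: it only catches double-quoted href attributes and does nothing about relative vs. absolute URLs):

;; Crude link extraction: pull out double-quoted href values.
;; Good enough for a first pass; a real HTML parser is the right fix later.
(define (extract-hrefs html)
  (regexp-match* #px"href=\"([^\"]+)\"" html #:match-select cadr))

;; Example:
;; (extract-hrefs "<a href=\"http://example.com/\">x</a> <a href=\"/about\">y</a>")
;; => '("http://example.com/" "/about")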

At that point you could stop, or you could go back and replace curl with your own download code using Racket's net/url module.
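If you do swap curl out later, a minimal net/url sketch for the download step might look like this (the function name and the lack of redirect/cookie handling are my own simplifications):

;; #lang racket, (require net/url racket/port)
;; Download a page over HTTP and save the body to a file.
;; No redirects, cookies, or retries -- exactly the things curl gives you for free.
(define (download url-string out-file)
  (call/input-url (string->url url-string)
                  get-pure-port
                  (lambda (in)
                    (call-with-output-file out-file
                      (lambda (out) (copy-port in out))
                      #:exists 'replace))))

;; Example: (download "http://example.com/" "example.html")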

The reason I suggest starting with curl, even though this is a Racket question, is that it takes care of the fiddly HTTP details for you right away:

  • Do you want to follow 30x redirects?
  • Do you want to accept/store/send cookies (the site may behave differently without them)?
  • Do you want to use HTTP keep-alive?
  • And so on.

With curl, handling all of that is as simple as, for example:

;; Options shared by every curl invocation: fail quietly but report errors,
;; follow redirects, time out, keep a cookie jar, and identify ourselves.
(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff " ))

;; HEAD request: fetch only the response headers into out-file.
(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

;; GET request: fetch the response body into out-file.
(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))
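A quick usage sketch (the URLs are placeholders, and the tmp directory for the cookie jar is assumed to exist already):

;; Fetch a couple of pages into local files.
(curl/get "http://example.com/" (build-path 'same "tmp" "page-0001.html"))
(curl/get "http://example.com/about" (build-path 'same "tmp" "page-0002.html"))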

Everything curl does here could eventually be written in pure Racket; starting with curl just gets you crawling sooner.

Note: this is just a quick script-style starting point. Expect to adapt it and add error handling as you go.

+1

If you want to use Java, take a look at crawler4j.

It's open source and simple to get started with.

0

If you want to learn Java, have you considered Clojure?

That way you can use your Lisp knowledge and still take advantage of the Java HTML-parsing libraries* that are out there. Then, if you decide you really want to learn Java itself, you can write pieces in Java and call them from Clojure.

Good luck!

* There are several SO questions about Java HTML parsers.

0

If I were you, I wouldn't write a crawler at all. I would use one of the existing tools that download whole websites locally (for example, http://www.httrack.com/), which already handle following links, throttling, etc., and have it save the pages to disk.

Then concentrate on the interesting part: parsing the downloaded HTML.

There are plenty of HTML parsers; a good Java one that copes with malformed HTML (which, in practice, is most of what you'll meet) is Jericho: http://jericho.htmlparser.net/docs/index.html

Edit: if you really do want to write the downloading part yourself, use a library like Commons HttpClient for the HTTP requests and Jericho for extracting the links.

0

I would probably do this in Perl myself (but that's just me).

I suggest you read the wget documentation and play with the tool for inspiration. wget is the netcat of web crawling; its feature set will inspire you.

Your program should accept a list of URLs to start with and add the URLs it finds to that list of URLs to try. Then you need to decide whether to collect every URL, or only those from the domains (and subdomains?) that appear in the original list.

Here is a fairly decent starting point in Scheme:

;; Needs SRFI-1 (car+cdr) and SRFI-8 (receive).
(define (crawl . urls)
  ;; I would use regular expressions for this unless you have a special module for it.
  ;; Hint: URLs tend to hide in comments, referral tags, cookies... not just links.
  ;; Should return the list of URLs found on the page.
  (define (parse url) ...)
  ;; For this I would convert URL strings to a standard form, then compare with string=?
  (define (url= x y) ...)
  ;; Use whatever DNS lookup mechanism your implementation provides.
  (define (get-dom url) ...)
  ;; The rest should work fine on its own unless you need to modify anything.
  (if (null? urls)
      (error "No URLs!")
      (let ([doms (map get-dom urls)])
        (let crawl ([urls urls] [done '()])
          (if (null? urls)
              done
              (receive (url rest) (car+cdr urls)
                (if (or (member url done url=)
                        (not (member (get-dom url) doms url=)))
                    (crawl rest done)
                    (begin (display url) (newline)
                           (crawl (append rest (parse url))
                                  (cons url done))))))))))
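A usage sketch (example.com is a placeholder, and parse, url=, and get-dom must be filled in before this will actually run):

;; Crawl starting from two seed URLs, restricted to their domains;
;; prints each newly visited URL and returns the list of visited URLs.
(crawl "http://example.com/" "http://example.com/about")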
0
