Scrapy & captcha

I use Scrapy to submit the form at https://www.barefootstudent.com/jobs (on any job page, e.g. http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021).

My Scrapy bot registers successfully, but I cannot get past the captcha. To submit the form I use scrapy.FormRequest.from_response:

frq = scrapy.FormRequest.from_response(
    response,
    formdata={'message': 'itttttttt', 'security': captcha, 'name': 'fx',
              'category_id': '2', 'email': 'ololo%40gmail.com',
              'item_id': '216640_2', 'location': '18',
              'send_message': 'Send%20Message'},
    callback=self.afterForm)
yield frq

I want to download the captcha image from this page and type it in manually while the script is running, e.g.:

captcha = raw_input("put captcha in manually>")  

I tried:

 urllib.urlretrieve(captcha, "./captcha.jpg")

But this fetches the wrong captcha (the site rejects my input). When I call urllib.urlretrieve several times in one run of the script, it returns a different captcha each time :(

I also tried ImagePipeline, but the image is only downloaded after the function has finished, even though I use yield.

item = BfsItem()
item['image_urls'] = [captcha]
yield item
captcha = raw_input("put captcha in manually>")
frq = scrapy.FormRequest.from_response(
    response,
    formdata={'message': 'itttttttt', 'security': captcha, 'name': 'fx',
              'category_id': '2', 'email': 'ololo%40gmail.com',
              'item_id': '216640_2', 'location': '18',
              'send_message': 'Send%20Message'},
    callback=self.afterForm)
yield frq

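At the moment the script asks for input, the image has not been downloaded yet!

How can I change my script so that the FormRequest is sent after I have entered the captcha manually?

Thanks!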

+4
3

Here is the approach I would try (untested against this particular site):

1 - Request the captcha URL (and keep the form response for later use)

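def parse_page_with_captcha(self, response):
    # the XPath for the captcha image is left elided; extract_first() returns the URL string
    captcha_url = response.urljoin(response.xpath(...).extract_first())
    data_for_later = {'captcha_form': response}  # store the response for later use
    return scrapy.Request(captcha_url, callback=self.parse_captcha_download, meta=data_for_later)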

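2 - Scrapy now downloads the captcha itself, in the same session as the other Scrapy requests. Solve it in the callback and build the form request: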

from io import BytesIO
from PIL import Image

def parse_captcha_download(self, response):
    captcha_target_filename = 'filename.png'
    # save the image for processing (response.body is bytes, hence BytesIO)
    i = Image.open(BytesIO(response.body))
    i.save(captcha_target_filename)

    # process the captcha (OCR, or sending it to a decaptcha service, etc ...)
    captcha_text = solve_captcha(captcha_target_filename)

    # and now we have all the data we need for building the form request
    captcha_form = response.meta['captcha_form']

    # FormRequest URL-encodes the values itself, so they are passed unencoded here
    return scrapy.FormRequest.from_response(
        captcha_form,
        formdata={'message': 'itttttttt', 'security': captcha_text, 'name': 'fx',
                  'category_id': '2', 'email': 'ololo@gmail.com',
                  'item_id': '216640_2', 'location': '18',
                  'send_message': 'Send Message'},
        callback=self.afterForm)
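
The solve_captcha helper used above is not defined in the answer. Since the question asks for manual input at runtime, a minimal sketch of a manual version could look like the following. Both function names, the prompt parameter and the file name are assumptions made for illustration, and the sketch uses Python 3's input() where the question uses raw_input().

```python
import os
import tempfile


def save_captcha(image_bytes, path):
    """Write the raw image bytes (e.g. response.body) to disk."""
    with open(path, "wb") as f:
        f.write(image_bytes)
    return path


def solve_captcha(path, prompt=input):
    """Ask a human to open the saved image and type what it shows.

    `prompt` defaults to input() and can be replaced by a stub in tests.
    """
    answer = prompt("Open %s and type the captcha> " % path)
    return answer.strip()


if __name__ == "__main__":
    # Demo with placeholder bytes and a stubbed prompt, so nothing blocks.
    target = os.path.join(tempfile.gettempdir(), "captcha_demo.png")
    save_captcha(b"placeholder bytes, not a real image", target)
    print(solve_captcha(target, prompt=lambda message: " ab12 "))
```

Waiting on input() inside a callback blocks the whole crawl until you answer, which is usually acceptable for a script that submits a single form.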

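The details depend on how the site's form/captcha works. What matters is that the captcha is tied to the session cookie, so the image and the form have to be requested in the same session.

Does this work for you, Verz1Lka?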

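urllib.urlretrieve makes its request outside Scrapy. The site has no way to match that request to your spider's session: the session cookie is not sent, so the site generates a fresh captcha that does not belong to the form you submit.

If you fetch the captcha with a Scrapy request instead, it goes through the same session as the form submission.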
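
To make the session argument concrete, here is a toy model of a site that stores one expected captcha per session. The class and its methods are entirely hypothetical and only imitate the behaviour described above; this is not how barefootstudent.com is known to work.

```python
import itertools


class ToyCaptchaSite:
    """Hypothetical server that remembers one expected captcha per session."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._expected = {}

    def new_session(self):
        return "session-%d" % next(self._ids)

    def get_captcha(self, session_id=None):
        # A request without a cookie starts a brand-new session.
        if session_id is None:
            session_id = self.new_session()
        text = "text-%d" % next(self._ids)
        self._expected[session_id] = text
        return session_id, text

    def submit(self, session_id, text):
        # The form is accepted only if the text matches this session's captcha.
        return self._expected.get(session_id) == text


site = ToyCaptchaSite()
spider_session = site.new_session()

# Cookie-less downloads (the urlretrieve case): a new captcha every time,
# and none of them is registered for the spider's session.
_, first = site.get_captcha()
_, second = site.get_captcha()
print(first != second, site.submit(spider_session, first))

# Download inside the spider's session: the answer matches on submit.
_, inside = site.get_captcha(spider_session)
print(site.submit(spider_session, inside))
```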

+1

You can also solve the captcha inside the script.

For captchas you can try OCR, for example the ocr.space API:

Another option is Kantu, which also comes with OCR.

0

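To get the right captcha, the download has to use the correct cookie and URL. Scrapy can download images itself through its media pipeline, so there is no need to go outside Scrapy: https://doc.scrapy.org/en/latest/topics/media-pipeline.html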
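
For reference, enabling Scrapy's built-in images pipeline takes roughly the settings below. This is a sketch only: the storage directory is an assumed value, the item also needs image_urls and images fields, and the pipeline requires Pillow. As the question reports, files fetched this way arrive only after the callback has yielded the item, so on its own this does not make the image available before the prompt.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Assumed directory; downloaded images are stored below it.
IMAGES_STORE = "./captchas"
```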

0
