Running scrapy spider in background in Flask app

I am building an application that uses Flask and Scrapy. When the root URL of my application is accessed, it processes some data and displays it. In addition, I also want to (re)launch my spider if it is not already running. Since my spider takes about 1.5 hours to complete, I run it in the background using threading. Here is a minimal example (you will also need testspiders):

    import os
    import threading

    from flask import Flask, render_template
    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from testspiders.spiders.followall import FollowAllSpider

    def crawl():
        spider = FollowAllSpider(domain='scrapinghub.com')
        crawler = Crawler(Settings())
        crawler.configure()
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()
        log.start()
        reactor.run()

    app = Flask(__name__)

    @app.route('/')
    def main():
        run_in_bg = threading.Thread(target=crawl, name='crawler')
        thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
        if 'crawler' not in thread_names:
            run_in_bg.start()
        return 'hello world'

    if __name__ == "__main__":
        port = int(os.environ.get('PORT', 5000))
        app.run(host='0.0.0.0', port=port)

As a side note, the following lines are my ad-hoc way of checking whether my crawler thread is already running. If there is a more idiomatic approach, I would appreciate some recommendations.

    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
    if 'crawler' not in thread_names:
        run_in_bg.start()
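One variant I considered (just a sketch, not what I am actually running) keeps a single module-level reference to the thread and checks is_alive() instead of scanning thread names:

    # Sketch: track the crawler thread in one module-level variable and
    # test is_alive() rather than searching threading.enumerate() by name.
    crawler_thread = None

    def start_crawl_if_idle():
        global crawler_thread
        if crawler_thread is None or not crawler_thread.is_alive():
            crawler_thread = threading.Thread(target=crawl, name='crawler')
            crawler_thread.daemon = True  # do not block interpreter shutdown
            crawler_thread.start()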

On to the actual problem: if I save the above script as crawler.py, run python crawler.py and go to localhost:5000, I get the following error (the Scrapy HtmlXPathSelector deprecation message can be ignored):

 exceptions.ValueError: signal only works in main thread 

The spider runs, but it never stops, because the signals.spider_closed handler only works in the main thread (according to this error). As expected, subsequent requests to the root URL produce plenty of further errors.
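As far as I understand it, the underlying restriction is that reactor.run() installs OS signal handlers by default, and CPython only allows signal.signal() to be called from the main thread; it can be reproduced without Scrapy at all (a small standalone demo, not part of my app):

    # Standalone demo of the restriction behind the error above:
    # signal.signal() may only be called from the main thread, so anything
    # that installs signal handlers (like reactor.run() with its defaults)
    # blows up when invoked from a worker thread.
    import signal
    import threading

    def install_handler():
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    t = threading.Thread(target=install_handler)
    t.start()
    t.join()
    # The worker thread's traceback ends with:
    # ValueError: signal only works in main thread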

How can I get my application to launch the spider if it is not already crawling, while immediately returning control to the application (i.e. I do not want to wait for the crawl to complete) so it can do other things?

1 answer

It is not a good idea for Flask to start long-running threads like this.

I would recommend using a queuing system such as Celery or RabbitMQ. Your Flask application can put the tasks you want to run in the background onto the queue and then return immediately.

You can then have workers outside your main application pick up those tasks and run all of your scrapers.
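A rough sketch of what that could look like with Celery (the broker URL, file names, and task name here are illustrative assumptions, not something from the question):

    # tasks.py -- illustrative sketch only; broker URL and names are assumptions.
    from celery import Celery

    celery_app = Celery('tasks', broker='amqp://guest@localhost//')

    @celery_app.task
    def run_crawl(domain):
        # The crawl runs inside the Celery worker process, whose main thread
        # is free to run the Twisted reactor and install its signal handlers.
        from twisted.internet import reactor
        from scrapy import signals
        from scrapy.crawler import Crawler
        from scrapy.settings import Settings
        from testspiders.spiders.followall import FollowAllSpider

        spider = FollowAllSpider(domain=domain)
        crawler = Crawler(Settings())
        crawler.configure()
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()
        reactor.run()  # note: a Twisted reactor cannot be restarted in-process


    # app.py -- the Flask view only enqueues the task and returns immediately.
    from flask import Flask
    from tasks import run_crawl

    flask_app = Flask(__name__)

    @flask_app.route('/')
    def main():
        run_crawl.delay('scrapinghub.com')
        return 'hello world'

The important point is that the Flask process never touches Twisted or Scrapy; the worker owns the crawl and can safely block for the full 1.5 hours.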

