I am building an application that uses Flask and Scrapy. When the root URL of my application is accessed, it processes some data and displays it. In addition, I also want to (re)launch my spider if it is not already running. Since my spider takes about 1.5 hours to complete, I run it as a background process using threading. Here is a minimal example (you will also need testspiders):
    import os
    from flask import Flask, render_template
    import threading
    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from testspiders.spiders.followall import FollowAllSpider

    def crawl():
        spider = FollowAllSpider(domain='scrapinghub.com')
        crawler = Crawler(Settings())
        crawler.configure()
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()
        log.start()
        reactor.run()

    app = Flask(__name__)

    @app.route('/')
    def main():
        run_in_bg = threading.Thread(target=crawl, name='crawler')
        thread_names = [t.name for t in threading.enumerate()
                        if isinstance(t, threading.Thread)]
        if 'crawler' not in thread_names:
            run_in_bg.start()
        return 'hello world'

    if __name__ == "__main__":
        port = int(os.environ.get('PORT', 5000))
        app.run(host='0.0.0.0', port=port)
As a side note, the following lines are my own ad-hoc approach to determining whether my crawler thread is still running. If there is a more idiomatic approach, I would appreciate some recommendations.
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate()
                    if isinstance(t, threading.Thread)]
    if 'crawler' not in thread_names:
        run_in_bg.start()
Getting to the problem: if I save the above script as crawler.py, run python crawler.py, and visit localhost:5000, I get the following error (ignore the scrapy HtmlXPathSelector warnings):
    exceptions.ValueError: signal only works in main thread
Although the spider runs, it does not stop, because the signals.spider_closed handler only works in the main thread (according to this error). As expected, subsequent requests to the root URL result in plenty of errors.
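As far as I can tell, the limitation comes from Python's signal module rather than Scrapy itself: Twisted's reactor.run() installs OS signal handlers by default, and signal.signal() may only be called from the main thread. A minimal reproduction, independent of Twisted, looks like this:

```python
import signal
import threading

result = {}

def install_handler():
    # signal.signal() may only be called from the main thread;
    # reactor.run() installs signal handlers the same way, which is
    # why starting the reactor from a worker thread raises ValueError.
    try:
        signal.signal(signal.SIGTERM, lambda signum, frame: None)
    except ValueError as exc:
        result["error"] = str(exc)

t = threading.Thread(target=install_handler)
t.start()
t.join()
```

Here result["error"] ends up containing the same "signal only works in main thread" message that my crawler thread produces.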
How can I get my application to launch the spider if it is not already crawling, while immediately returning control to my application (i.e. I do not want to wait for the crawler to complete) so it can do other things?