Items versus Item Loaders in Scrapy

I am new to Scrapy. I know that items are used to hold the scraped data, but I cannot understand the difference between items and item loaders. I tried to read some example code; it used item loaders for storage instead of items, and I cannot understand why. The Scrapy documentation was not clear enough for me. Can someone give a simple explanation (better yet, with an example) of when item loaders are used and what additional features they provide over items?

+7
python web-scraping scrapy scrapy-spider
1 answer

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container for the scraped data, while Item Loaders provide the mechanism for populating that container.

That last sentence should answer your question.
Item Loaders are great because they let you define processing shortcuts and reuse a bunch of code, keeping everything neat, clean and readable.
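
To make the "container" part concrete, here is a minimal sketch of the dictionary-like API the docs mention (PersonItem and its field are just illustrative names):

    import scrapy

    class PersonItem(scrapy.Item):
        name = scrapy.Field()

    item = PersonItem()
    item['name'] = 'John Snow'  # populate the container by hand
    print(item['name'])         # read it back like a dict entry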

Here's a comparison example. Let's say we want to scrape this item:

    from scrapy import Item, Field

    class MyItem(Item):
        full_name = Field()
        bio = Field()
        age = Field()
        weight = Field()
        height = Field()

The plain-item approach would look something like this:

    def parse(self, response):
        item = MyItem()
        # .extract() returns ugly lists like ['John\n', '\n\t ', ' Snow']
        full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
        item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
        bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
        item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
        age = response.xpath("//div[@class='age']/text()").extract_first(0)
        item['age'] = int(age)
        weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
        item['weight'] = int(weight)
        height = response.xpath("//div[@class='height']/text()").extract_first(0)
        item['height'] = int(height)
        return item

vs Item Loaders:

    # define once in items.py
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

    clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
    to_int = Compose(TakeFirst(), int)

    class MyItemLoader(ItemLoader):
        default_item_class = MyItem
        full_name_out = clean_text
        bio_out = clean_text
        age_out = to_int
        weight_out = to_int
        height_out = to_int

    # parse in as many places, as many times, as you want
    def parse(self, response):
        loader = MyItemLoader(selector=response)
        loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
        loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
        loader.add_xpath('age', "//div[@class='age']/text()")
        loader.add_xpath('weight', "//div[@class='weight']/text()")
        loader.add_xpath('height', "//div[@class='height']/text()")
        return loader.load_item()
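
A side note that may help: the processors are plain callables, so you can experiment with them outside a spider. A small sketch with invented input values (on recent Scrapy versions the same processors also live in the itemloaders package):

    from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

    clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
    to_int = Compose(TakeFirst(), int)

    # extracted values always arrive as lists of strings;
    # the output processor turns them into the final field value
    print(clean_text(['John\n', ' Snow']))  # -> 'John Snow'
    print(to_int(['32 ']))                  # -> 32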

As you can see, the Item Loader version is much cleaner and scales much more easily. Say you had 20 more fields, many of which shared the same processing logic; it would be suicide to do that without Item Loaders. Item Loaders are awesome and you should use them!
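
To illustrate the scaling point, here is a sketch of what adding one more field costs in the loader version: three one-liners, with the shared processor reused for free. (The eye_color field and its XPath are hypothetical; clean_text is the same processor defined above.)

    from scrapy import Item, Field
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import Compose, MapCompose, Join

    clean_text = Compose(MapCompose(lambda v: v.strip()), Join())

    class MyItem(Item):
        full_name = Field()
        eye_color = Field()            # 1) declare the new field

    class MyItemLoader(ItemLoader):
        default_item_class = MyItem
        full_name_out = clean_text
        eye_color_out = clean_text     # 2) reuse the shared processor

    def parse(self, response):
        loader = MyItemLoader(selector=response)
        loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
        loader.add_xpath('eye_color', "//div[@class='eye-color']/text()")  # 3) one extra line
        return loader.load_item()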

+8
