Item: Scrapy Framework, Part 5 (Python, itemadapter)

Article link: https://blog.csdn.net/gaogzhen/article/details/123664872

1、Overview

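The main goal of an Item is to hold structured data extracted from a data source, typically a web page. Spiders can return the extracted data as items, which are Python objects that define key-value pairs.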

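Scrapy supports several item types. When you create an item, you can use whichever type you prefer. When you write code that receives items, that code should work with any item type.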

2、Item Types

Through the itemadapter library, Scrapy supports the following item types: dictionaries, Item objects, dataclass objects, and attrs objects.

2.1、Dictionaries

Python dicts are already familiar, so they are not covered in detail here.

2.1、Item objects

Item objects provide a dict-like API plus additional features that make them the most feature-complete item type.

class scrapy.item.Item
class scrapy.Item

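API overview (Item objects replicate the standard dict API, which is not covered here):

Field names can be declared, so that:
- KeyError is raised when an undeclared field name is used
Field metadata can be declared and used to customize serialization
Item.copy(): returns a shallow copy
Item.deepcopy(): returns a deep copy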

Example:

from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

2.2、Dataclass objects

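New in version 2.2.

They let you define the type and default value of each field.
They let you define field metadata through dataclasses.field(), which can be used to customize serialization.

They work natively in Python 3.7+, and in Python 3.6 through the dataclasses backport.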

Example:

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int
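
The example above declares only field types. Default values and field metadata can be sketched as follows, using only the standard library; the class name `CustomItemWithDefaults` and the `serializer` metadata key are assumptions chosen for this illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class CustomItemWithDefaults:  # hypothetical name for this sketch
    # A typed field with a default value
    one_field: Optional[str] = field(default=None)
    # A typed field with a default value and custom metadata
    another_field: Optional[int] = field(default=None, metadata={"serializer": str})


item = CustomItemWithDefaults(one_field="value")

# Collect the metadata declared on each field
meta = {f.name: dict(f.metadata) for f in fields(CustomItemWithDefaults)}

print(item.another_field)                   # None
print(meta["one_field"])                    # {}
print(meta["another_field"]["serializer"])  # <class 'str'>
```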

2.3、attr.s objects

Their features are similar to those of dataclass objects. To use this item type, the attrs package must be installed.

Example:

import attr

@attr.s
class CustomItem:
    one_field = attr.ib()
    another_field = attr.ib()
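
Since attrs objects are described as similar to dataclass objects, defaults and metadata can be declared in much the same way. This is a minimal sketch, assuming the attrs package is installed; `CustomItemWithMeta` and the `serializer` metadata key are hypothetical choices for this illustration.

```python
import attr


@attr.s
class CustomItemWithMeta:  # hypothetical name for this sketch
    # A field with a default value
    one_field = attr.ib(default=None)
    # A field with a default value and custom metadata
    another_field = attr.ib(default=0, metadata={"serializer": str})


item = CustomItemWithMeta(one_field="value")

# Collect the metadata declared on each attribute
field_meta = {a.name: dict(a.metadata) for a in attr.fields(CustomItemWithMeta)}

print(item.another_field)                         # 0
print(field_meta["another_field"]["serializer"])  # <class 'str'>
```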

3、Item Objects in Detail

3.1、Declaring Item Subclasses

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

3.2、Creating Item Objects

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

3.3、Getting Field Values

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
'not set'
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product  # is name field populated?
True
>>> 'last_updated' in product  # is last_updated populated?
False
>>> 'last_updated' in product.fields  # is last_updated a declared field?
True
>>> 'lala' in product.fields  # is lala a declared field?
False

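Reading a field with item[...] raises KeyError when the field is undeclared, and also when it is declared but has no value yet. Use get() to supply a default instead.

3.4、Setting Field Values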

>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

3.5、Accessing All Fields or Values

# Get the names of all populated fields
>>> product.keys()
['price', 'name']
# Get all populated key-value pairs
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

3.6、Converting Between Items and Dicts

Item to dict

>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Dict to Item

>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

The API descriptions and interactive examples above are taken from the official Scrapy documentation.

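4、Case Study

We continue the earlier crawl of qianmu.org (迁木网). Previously the scraped data was only printed to the console and was not collected into a structure for later storage, so we now build a corresponding Item.

Building the Item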

import scrapy


class UniversityItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    rank = scrapy.Field()
    country = scrapy.Field()
    state = scrapy.Field()
    city = scrapy.Field()
    undergraduate_count = scrapy.Field()
    postgraduate_count = scrapy.Field()
    website = scrapy.Field()

Modify the spider in usnews.py to return this Item

import scrapy

from qianmu.items import UniversityItem


class UsnewsSpider(scrapy.Spider):
    name = 'usnews'
    # Domains the spider is allowed to crawl
    allowed_domains = ['www.qianmu.org']
    # URL where the crawl starts
    start_urls = ['http://www.qianmu.org/ranking/1528.htm']

    # Scrapy calls this method automatically once a start_urls request succeeds
    def parse(self, response):
        # Extract the university links from the ranking table
        links = response.xpath('//div[@class="rankItem"]/table//tr[position()>1]/td[2]/a/@href').getall()

        # Follow each university link and parse its detail page
        for link in links:
            yield response.follow(link, self.parse_university)

    def parse_university(self, response):
        """Parse a university page and extract its details."""
        item = UniversityItem()
        data = {}
        # Extract the university name
        item['name'] = response.xpath('//div[@id="wikiContent"]/h1/text()').get()
        # Locate the infobox table
        table = response.xpath('//div[@id="wikiContent"]/div[@class="infobox"]/table')
        if table:
            table = table[0]
            # First column: the row labels
            keys = table.xpath('.//td[1]/p/text()').getall()
            # Second column: the values; text from multiple <p> elements is joined
            cols = table.xpath('.//td[2]')
            values = [''.join((col.xpath('.//text()').getall())).replace('\t', '') for col in cols]
            # Only map the table when labels and values line up one-to-one
            if len(keys) == len(values):
                data.update(zip(keys, values))
                item['rank'] = data.get('排名')
                item['country'] = data.get('国家')
                item['state'] = data.get('州省')
                item['city'] = data.get('城市')
                item['undergraduate_count'] = data.get('本科生人数')
                item['postgraduate_count'] = data.get('研究生人数')
                item['website'] = data.get('网址')
        yield item
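
The label-to-field mapping inside parse_university can be tried out without running a crawl. The sketch below isolates that step in plain Python; `FIELD_LABELS`, `map_table_to_fields`, and the sample rows are hypothetical names and placeholder values introduced for this illustration, not part of the spider above.

```python
# Maps item field names to the row labels used in the page's infobox table
FIELD_LABELS = {
    "rank": "排名",
    "country": "国家",
    "state": "州省",
    "city": "城市",
    "undergraduate_count": "本科生人数",
    "postgraduate_count": "研究生人数",
    "website": "网址",
}


def map_table_to_fields(keys, values):
    """Pair row labels with values and return a dict keyed by item field name."""
    # Mirror the spider's guard: skip the table if the columns do not line up
    if len(keys) != len(values):
        return {}
    data = dict(zip(keys, values))
    # Labels missing from the table map to None, like data.get() in the spider
    return {name: data.get(label) for name, label in FIELD_LABELS.items()}


# Placeholder rows, not real scraped data
sample_keys = ["排名", "国家", "城市"]
sample_values = ["1", "示例国", "示例市"]

fields = map_table_to_fields(sample_keys, sample_values)
print(fields["rank"], fields["state"])  # 1 None
```

Keeping the mapping in one dict means a new field needs only one new entry, rather than another `item[...] = data.get(...)` line.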

The Item is now complete. The next step is to store the extracted data in a database, which requires an Item Pipeline.