Scrapy odds and ends

world_in_world

Last modified: 2024-03-13 10:24:15

Category: python爬虫 (Python crawlers). Tags: scrapy

First published: 2023-09-05 16:57:19

Article link: https://blog.csdn.net/world_in_world/article/details/131176159

Parsing XML data with Scrapy

# Site: http://www.huangshi.gov.cn/xxxgk/2020_zc/zcjd/

from copy import deepcopy

import scrapy
from scrapy.selector import Selector

class Spider(scrapy.Spider):
    ……    ……    ……
    ……    ……    ……

    def parse(self, response, **kwargs):
        msg = response.meta['msg']
        text = response.text.replace('<![CDATA[', '').replace(']]>', '')
        selector = Selector(text=text)
        itme_list = selector.xpath('//itme')
        for itme in itme_list:
            new_msg = deepcopy(msg)
            new_msg['title'] = itme.xpath('./filename/a/text()').extract_first()
            new_msg['publish_time'] = itme.xpath('./docreltime/text()').extract_first()
            new_msg['detail_url'] = 'http://www.huangshi.gov.cn/xxxgk/2020_zc/zcjd' + itme.xpath('./filename/a/@href').extract_first().strip('.')
            new_msg['req_type'] = 'detail'
            yield self.make_request(new_msg)

    ……    ……    ……
    ……    ……    ……
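
To make the effect of removing the CDATA wrappers easier to see, here is a minimal standalone sketch. It is not the spider's code: it swaps in the standard library's `xml.etree.ElementTree` for Scrapy's `Selector`, uses `urljoin` instead of string concatenation, and runs on an invented `SAMPLE` document whose layout is only assumed to resemble the feed. `parse_items` and `BASE_URL` are hypothetical names, and the approach only works when the embedded markup is well-formed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Invented sample; the real feed's layout is an assumption.
SAMPLE = """<root>
<itme>
<filename><![CDATA[<a href="./202403/t20240301_1.html">Sample title</a>]]></filename>
<docreltime><![CDATA[2024-03-01]]></docreltime>
</itme>
</root>"""

BASE_URL = "http://www.huangshi.gov.cn/xxxgk/2020_zc/zcjd/"


def parse_items(xml_text, base_url):
    """Return one dict per <itme> node, with an absolute detail URL."""
    # Without the CDATA wrappers, the embedded <a> markup becomes a real child element.
    cleaned = xml_text.replace("<![CDATA[", "").replace("]]>", "")
    root = ET.fromstring(cleaned)
    rows = []
    for node in root.iter("itme"):
        link = node.find("./filename/a")
        if link is None:
            # Skip entries that carry no link instead of failing on None.
            continue
        rows.append({
            "title": (link.text or "").strip(),
            "publish_time": (node.findtext("./docreltime") or "").strip(),
            # urljoin resolves the leading "./" against the listing URL.
            "detail_url": urljoin(base_url, link.get("href", "")),
        })
    return rows


rows = parse_items(SAMPLE, BASE_URL)
```

Resolving the link with `urljoin` avoids the manual `strip('.')` step and also copes with absolute links, should the feed ever contain them.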

Getting the HTML string of an element located with response.xpath in Scrapy

match = response.xpath('//div[@class="xlcontainer"]').get()

Differences between get(), getall(), extract_first(), and extract() in Scrapy

'''
Source code (parsel's SelectorList)
'''
def getall(self):
    """
    Call the ``.get()`` method for each element is this list and return
    their results flattened, as a list of unicode strings.
    """
    return [x.get() for x in self]
extract = getall

def get(self, default=None):
    """
    Return the result of ``.get()`` for the first element in this list.
    If the list is empty, return the default value.
    """
    for x in self:
        return x.get()
    return default
extract_first = get

Running a Scrapy spider

if __name__ == '__main__':
    from scrapy.cmdline import execute

    # Equivalent: execute(["scrapy", "crawl", Spider.name])
    execute('scrapy crawl spidername'.split())

Common Scrapy settings

class Spider(scrapy.Spider):
    name = 'x--xxx'

    custom_settings = {
        "CONCURRENT_REQUESTS": 50,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 50,
        "DOWNLOAD_TIMEOUT": 30,
        'REDIRECT_ENABLED': False,
        'HTTPERROR_ALLOWED_CODES': [301, 302, 403, 404, 503],
        'RETRY_HTTP_CODES': [301, 302, 403, 404, 503],
        'RETRY_TIMES': 3,
        'LOG_LEVEL': 'DEBUG'
    }

    def __init__(self, *args, **kwargs):
        super(Spider, self).__init__(*args, **kwargs)
        self.headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/json;charset=utf-8",
            "Pragma": "no-cache",
            "X-Requested-With": "XMLHttpRequest"
        }

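name = 'x--xxx': names the spider x--xxx. Scrapy uses this name to locate the spider (for example in scrapy crawl x--xxx) and to identify it in logs and stats reports.
custom_settings = { ... }: a dict of per-spider settings. For this spider, these values override the ones in settings.py.
"CONCURRENT_REQUESTS": 50: the maximum number of requests the downloader performs concurrently, across all domains. Here the spider has at most 50 requests in flight at once.
"CONCURRENT_REQUESTS_PER_DOMAIN": 50: the maximum number of concurrent requests to any single domain. Because it equals CONCURRENT_REQUESTS here, all 50 concurrent requests may go to the same site; choose a lower value to limit the load on one site.
"DOWNLOAD_TIMEOUT": 30: the download timeout, in seconds. A request that has not completed within 30 seconds is treated as timed out and may be retried.
'REDIRECT_ENABLED': False: turns off redirect handling, so Scrapy does not automatically follow redirect responses (such as 301 and 302) returned by the server.
'HTTPERROR_ALLOWED_CODES': [301, 302, 403, 404, 503]: the non-2xx status codes the spider is willing to handle. Responses with these codes are passed to the spider's callback instead of being filtered out as HTTP errors.
'RETRY_HTTP_CODES': [301, 302, 403, 404, 503]: the status codes for which Scrapy retries the request. Unlike 'HTTPERROR_ALLOWED_CODES', this setting only controls retry behavior.
'LOG_LEVEL': 'DEBUG': the spider's log level. At "DEBUG", Scrapy prints detailed debugging output, which is useful while developing and debugging a spider.
'RETRY_TIMES': 3: the maximum number of retries for a request that hits one of the retry status codes. In this example, each failed request is retried at most 3 times.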