Scrapy crawler framework (4): using CrawlSpider (automatic URL extraction)

Latest recommended article published on 2022-09-26 10:50:15

梦森(:

Category: Crawlers. Tags: crawler, python

Article link: https://blog.csdn.net/mengsenzhimeng/article/details/120438258

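After creating the project, cd into the project folder and run the following command in a terminal to generate a spider based on the CrawlSpider class: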

scrapy genspider -t crawl itcast (spider name) itcast.cn (domain)

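from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CfSpider(CrawlSpider):
    name = 'cf'
    allowed_domains = ['circ.gov.cn']
    start_urls = ['http://www.circ.gov.cn/web/site0/tab5240/module14430/page1.htm']
    '''
    The extraction rules are defined here.
    allow: a regular expression; URLs that match it are extracted
    callback: the callback function, given as the name of a spider method
    follow: if True, the responses for the extracted URLs are run through rules again to extract more URLs; usually True when allow targets pagination links and False for detail pages

    '''
    rules = (
        Rule(LinkExtractor(allow=r'web/site0/tab5240/info\d+\.htm'), callback='parse_item', follow=False),  # detail pages
        # The dot before "htm" must be escaped; follow=True keeps the crawl moving through the pages
        Rule(LinkExtractor(allow=r'web/site0/tab5240/module14430/page\d+\.htm'), follow=True),  # next page
    )

    def parse_item(self, response):
        # Called for every detail page; fill the item with fields extracted via XPath
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item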
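
LinkExtractor tests each allow pattern against the full URL with a regular-expression search, so patterns of this kind can be sanity-checked with Python's re module before a crawl is started. The following is only a minimal sketch: the helper name matches and the two sample URLs (in particular info123456.htm) are made up for illustration.

```python
import re

# Patterns for detail pages and pagination pages. The dot is escaped
# so that ".htm" is matched literally.
DETAIL = re.compile(r'web/site0/tab5240/info\d+\.htm')
PAGING = re.compile(r'web/site0/tab5240/module14430/page\d+\.htm')


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Hypothetical sample URLs, for illustration only.
detail_url = 'http://www.circ.gov.cn/web/site0/tab5240/info123456.htm'
paging_url = 'http://www.circ.gov.cn/web/site0/tab5240/module14430/page2.htm'

print(matches(DETAIL, detail_url))  # True
print(matches(PAGING, paging_url))  # True
print(matches(DETAIL, paging_url))  # False
print(matches(PAGING, detail_url))  # False
```

Because the two patterns do not overlap, a detail page is only handled by the rule with the callback, and a pagination page only by the rule that follows links.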

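More common LinkExtractor parameters:

allow: URLs that match the given regular expression(s) are extracted; if empty, every URL matches.

deny: URLs that match the given regular expression(s) are never extracted (takes precedence over allow).

allow_domains: the domains from which links will be extracted.

deny_domains: the domains from which links will never be extracted.

restrict_xpaths: XPath expressions that work together with allow to filter links; only URLs found inside the regions selected by the XPath are extracted.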
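
How these filters combine can be sketched with the standard library alone. The code below is a simplified model written for this article, not Scrapy's actual implementation: the function names keep_link and host_in_domains, the example.com URLs, and the patterns are all assumptions, and restrict_xpaths is left out because it operates on the page markup rather than on the URL.

```python
import re
from urllib.parse import urlparse


def host_in_domains(url, domains):
    """True if the URL's host equals one of the domains or is a subdomain of it."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in domains)


def keep_link(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of the URL filters; deny always wins over allow."""
    # An empty allow list matches everything.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    # An empty allow_domains list accepts every domain.
    if allow_domains and not host_in_domains(url, allow_domains):
        return False
    if host_in_domains(url, deny_domains):
        return False
    return True


# Hypothetical URLs, for illustration only.
urls = [
    'http://example.com/news/info1.htm',      # kept
    'http://example.com/news/info999.htm',    # matches allow, but rejected by deny
    'http://example.com/about.htm',           # does not match allow
    'http://other.org/news/info2.htm',        # domain not in allow_domains
    'http://sub.example.com/news/info3.htm',  # kept, subdomain of example.com
]

kept = [u for u in urls
        if keep_link(u, allow=[r'/info\d+\.htm'], deny=[r'info999'],
                     allow_domains=['example.com'])]
print(kept)
```

Only the first and last URLs survive: the second matches allow but is removed by deny, which illustrates why deny is described as having the higher priority.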

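Common spiders.Rule parameters:

link_extractor: a Link Extractor object that defines which links are to be extracted.

callback: the function that is called for each link obtained from link_extractor; the value of this parameter names that callback.

follow: a boolean that specifies whether the links extracted from a response by this rule should be followed.

If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: specifies which function of the spider is called each time a list of links is obtained from link_extractor.

It is mainly used to filter URLs.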
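
A process_links function receives the list of extracted links and returns the list that should actually be scheduled. The following minimal sketch shows the shape of such a filter; the function name drop_query_links and the example.com URLs are hypothetical, and SimpleNamespace objects stand in for Scrapy's link objects, of which only the url attribute is used here.

```python
from types import SimpleNamespace


def drop_query_links(links):
    """Keep only the links whose URL carries no query string."""
    return [link for link in links if '?' not in link.url]


# Stand-ins for the link objects produced by a link extractor
# (hypothetical URLs, for illustration only).
links = [
    SimpleNamespace(url='http://example.com/page1.htm'),
    SimpleNamespace(url='http://example.com/page1.htm?from=share'),
    SimpleNamespace(url='http://example.com/page2.htm'),
]

filtered = drop_query_links(links)
print([link.url for link in filtered])
```

In a real spider the same logic would typically be a method taking (self, links), referenced from the rule by name, for example process_links='drop_query_links'.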