Example 1: link_scrapy

徐雄辉

Categories: python, scrapy

Article link: https://blog.csdn.net/qq_24651739/article/details/80625086


#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: A spider that crawls links
Desc : 
"""
import logging
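# parse_item below instantiates HuxiuItem, so that is the class imported here.
# Assumption: coolscrapy/items.py defines HuxiuItem with the fields
# title, link and published.
from coolscrapy.items import HuxiuItem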
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkSpider(CrawlSpider):
    name = "link"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    rules = (
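        # Extract links matching the regex '/group\?f=index_group' (but not 'deny.html')
        # and follow them recursively (with no callback defined, follow defaults to True).
        # The '?' is escaped so that it matches a literal question mark in the URL.
        Rule(LinkExtractor(allow=(r'/group\?f=index_group', ), deny=(r'deny\.html', ))),
        # Extract links matching '/article/\d+/\d+\.html' and parse the downloaded
        # pages with parse_item; these links are not followed any further.
        Rule(LinkExtractor(allow=(r'/article/\d+/\d+\.html', )), callback='parse_item'),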
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuxiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['published'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
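        # logging treats its first argument as a format string, so pass a
        # template with one placeholder per value.
        logging.info('%s %s %s', item['title'], item['link'], item['published'])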
        yield item
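
The allow patterns in the rules above are regular expressions that get searched against each extracted URL, so an unescaped '?' is read as a quantifier rather than as a literal question mark. The following is a minimal sketch using only the standard re module to show the difference; the two URLs and the helper name matches are made up for illustration and are not taken from huxiu.com.

```python
import re

# Patterns in the form they would be passed to LinkExtractor(allow=...).
unescaped = re.compile(r'/group?f=index_group')  # '?' makes the 'p' optional
escaped = re.compile(r'/group\?f=index_group')   # '\?' matches a literal '?'
article = re.compile(r'/article/\d+/\d+\.html')

# Hypothetical URLs, used only for illustration.
group_url = 'http://www.huxiu.com/group?f=index_group'
article_url = 'http://www.huxiu.com/article/123/1.html'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


print(matches(unescaped, group_url))  # False: the literal '?' is never matched
print(matches(escaped, group_url))    # True
print(matches(article, article_url))  # True
print(matches(article, group_url))    # False
```

Checking patterns this way before putting them into a Rule is a cheap way to confirm which links a spider will pick up.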
