Scrapy in Practice: Crawling Jianshu

In this section we use Scrapy to crawl the Jianshu site. On an article detail page, much of the content (comment count, like count, and so on) is loaded asynchronously via Ajax, so the response returned by a plain request does not contain the data we want. To render this dynamic content we use Selenium together with ChromeDriver.

1. From the Taobao mirror https://npm.taobao.org/mirrors/chromedriver, download the chromedriver build that matches your Chrome version, unzip it, and put chromedriver.exe into the Chrome installation directory.
2. Install Selenium: pip install selenium (a quick check of the setup is sketched below).
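Before wiring Selenium into Scrapy, it is worth confirming that the driver launches at all. A minimal sketch, assuming chromedriver.exe was placed in the Chrome installation directory as in step 1 (adjust the path for your machine):

from selenium import webdriver

# Path follows the install step above; change it to wherever your chromedriver.exe lives
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.get('https://www.jianshu.com/')
print(driver.title)  # should print the title of the Jianshu homepage
driver.quit()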


Overall execution flow of the crawler

  • Scrapy first takes the request URLs from start_urls
  • The downloaded response passes through the downloader middleware, where Selenium renders the dynamic data
  • The processed response is handed back to the spider, which extracts the data
  • The extracted items are passed to the pipeline for storage

Creating the crawler project

1. Activate the virtual environment and create the project:
scrapy startproject jianshu
2. cd into the project directory (jianshu) and create the spider. Since we are crawling the whole site, we use the crawl template so that we can let its Rules do the link following:
scrapy genspider -t crawl js jianshu.com
A new file js.py now appears under the spiders folder.
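The generated js.py looks roughly like the skeleton below (the exact placeholder comments vary between Scrapy versions); we will replace the placeholder rule and callback in a moment:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        # Placeholder rule generated by the template; to be replaced with our own
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item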


Defining the fields in items.py

import scrapy

class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    subjects = scrapy.Field()
    comment_count = scrapy.Field()

Defining the downloader middleware

Every request and every response passes through the downloader middleware, so this is where we plug Selenium into Scrapy. After defining the middleware, remember to enable it in settings.py.

# middlewares.py
from scrapy import signals
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse

class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(2)
        try:
            while True:
                # Keep clicking the "show more" button until it disappears;
                # find_element_by_class_name raises NoSuchElementException once it is gone,
                # which drops us into the except below.
                show_more = self.driver.find_element_by_class_name('show-more')
                show_more.click()
                time.sleep(0.5)
        except Exception:
            pass  # all dynamically loaded content (comments, likes, etc.) has been rendered
        source = self.driver.page_source
        # Wrap the rendered page in an HtmlResponse so the spider parses it like a normal response
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response
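The middleware above never shuts the browser down. A possible addition (an assumption, not part of the original code) is to hook Scrapy's spider_closed signal inside the same class so that the ChromeDriver process is quit when the crawl finishes; the signals import at the top of middlewares.py is already available for this:

    # Hypothetical addition: close the browser once the spider finishes.
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()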

 

Designing the spider

Article detail pages on Jianshu follow a fixed URL pattern:
https://www.jianshu.com/p/d65909d2173a
that is, the domain followed by /p/ and a 12-character article id, which lets us define a Rule for the crawl template to follow.
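As a quick standalone sanity check (not part of the project code), the regular expression used in the Rule below matches article detail URLs and nothing else:

import re

pattern = r'.*/p/[0-9a-z]{12}.*'
print(re.match(pattern, 'https://www.jianshu.com/p/d65909d2173a') is not None)  # True: an article detail page
print(re.match(pattern, 'https://www.jianshu.com/u/1b0a9f1e1b5c') is not None)  # False: a (made-up) user profile URL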

# js.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jianshu.items import JianshuItem


class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*","")
        url = response.url
        url1 = url.split("?")[0]
        article_id = url1.split('/')[-1]
        content = response.xpath("//div[@class='show-content']").get()
        word_count_list = response.xpath("//span[@class='wordage']/text()").get().split(' ')  # word count, e.g. "字数 10000"
        word_count = int(word_count_list[-1])
        comment_count_list = response.xpath("//span[@class='comments-count']/text()").get().split(' ')  # comments, e.g. "评论 427"
        comment_count = int(comment_count_list[-1])
        read_count_list = response.xpath("//span[@class='views-count']/text()").get().split(' ')  # views, e.g. "阅读 427"
        read_count = int(read_count_list[-1])
        like_count_list = response.xpath("//span[@class='likes-count']/text()").get().split(' ')  # likes, e.g. "喜欢 3"
        like_count = int(like_count_list[-1])
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())

        item = JianshuItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
            subjects=subjects,
            word_count=word_count,
            comment_count=comment_count,
            read_count=read_count,
            like_count=like_count,
        )
        print('y' * 100)
        yield item
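One caveat about parse_detail: each .get() call returns None when its XPath matches nothing (Jianshu's markup changes from time to time), so the chained .split() / int() calls will raise on such pages. A small hypothetical helper (not in the original spider) could guard the numeric fields:

def parse_count(response, xpath):
    # Returns 0 when the node is missing instead of raising on None
    text = response.xpath(xpath).get()
    if not text:
        return 0
    return int(text.split(' ')[-1])

# usage: word_count = parse_count(response, "//span[@class='wordage']/text()")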

Designing the pipeline

Here the items yielded by the spider (js.py) are saved to the database. The database operations in the code below are synchronous; an asynchronous variant is sketched after it.

import pymysql

class JianshuPipeline(object):
    def __init__(self):
        dbparams = {
            'host':'127.0.0.1',
            'port':3306,
            'user':'root',
            'password':'123456',
            'database':'jianshu',
            'charset':'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self,item,spider):
        self.cursor.execute(self.sql,(item['title'],item['content'],
                                      item['author'],item['avatar'],
                                      item['pub_time'],item['origin_url'],
                                      item['article_id'],item['read_count'],
                                      item['like_count'],item['word_count'],
                                      item['subjects'],item['comment_count']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
            insert into js(id,title,content,author,avatar,pub_time,
            origin_url,article_id,read_count,like_count,word_count,subjects,comment_count) values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            """
        return self._sql
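The pipeline above blocks the crawl on every insert. A possible asynchronous variant (an assumption, not part of the original tutorial) uses Twisted's adbapi connection pool, which runs the inserts in a thread pool on the same reactor Scrapy already uses:

import pymysql
from twisted.enterprise import adbapi

class JianshuTwistedPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'charset': 'utf8',
            'cursorclass': pymysql.cursors.DictCursor,
        }
        # adbapi forwards these keyword arguments to pymysql.connect
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)

    def process_item(self, item, spider):
        # runInteraction schedules insert_item in a worker thread and returns immediately
        self.dbpool.runInteraction(self.insert_item, item)
        return item

    def insert_item(self, cursor, item):
        sql = """
        insert into js(id,title,content,author,avatar,pub_time,
        origin_url,article_id,read_count,like_count,word_count,subjects,comment_count) values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(sql, (item['title'], item['content'], item['author'],
                             item['avatar'], item['pub_time'], item['origin_url'],
                             item['article_id'], item['read_count'], item['like_count'],
                             item['word_count'], item['subjects'], item['comment_count']))

To use it instead of the synchronous pipeline, point ITEM_PIPELINES at jianshu.pipelines.JianshuTwistedPipeline.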

settings.py

In order for the middleware defined above to take effect, it must be enabled in settings.py. In the same file we also:

  • Set the User-Agent
  • Disable the robots.txt check
  • Set a reasonable download delay, otherwise the server will ban us
  • Enable the downloader middleware
  • Enable the item pipeline

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
}
DOWNLOADER_MIDDLEWARES = {
    'jianshu.middlewares.SeleniumDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}
......
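With the middleware and pipeline enabled, the crawl can be started from the project root (the directory containing scrapy.cfg):

scrapy crawl js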

 

 
