Scrapy Notes and Application: Crawling the Top News Highlights (要闻精华) on Eastmoney (东方财富网)

Article link: https://blog.csdn.net/hjhtchh/article/details/79572340

1. A brief introduction to Scrapy

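As a crawling framework, Scrapy gives all kinds of users a common pattern for writing crawlers (at least, that is how I see it). I won't cover installation here, since guides are easy to find online. This walkthrough uses Scrapy 1.4.0 on Python 3.6. Once Scrapy is installed, typing scrapy in cmd prints its usage message with the list of available commands.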

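This means Scrapy is installed successfully. The next step is to create a crawler project. As the command list shows, running "scrapy startproject <project name>" is all it takes. The main directory structure of the project looks like this: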

dfcfnews/
    dfcfnews/
        spiders/
        items.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy.cfg

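A brief look at these files. scrapy.cfg is the project's configuration file and does not need to be changed for now. spiders is a folder that holds the spiders you write. items.py defines dict-like containers in which you declare the fields you want to scrape. pipelines.py holds the item pipelines, whose main job is to process items, including storing them. settings.py is the settings file, where many options can be configured, such as pipeline priorities. The project structure already suggests how Scrapy runs: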

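1. The spider fetches the pages at the given URLs and parses them.

2. The fields of interest are stored in an item, which behaves like a dict, and the item is passed to the pipeline.

3. The pipeline stores the received items as needed: in a database, in a txt file, in an Elasticsearch index, and so on.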
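
The three steps above can be sketched in plain Python, without Scrapy itself. This is only an illustration of the data flow: fake_spider, ListPipeline, run, and the sample pages are hypothetical names and data, not part of Scrapy's API.

```python
# A plain-Python sketch of the spider -> item -> pipeline flow.
# Nothing here is Scrapy API; all names and data are made up for illustration.

# Stand-in for downloaded pages: URL -> fields found on that page
FAKE_PAGES = {
    'http://example.com/news/1': {'title': ' First headline ', 'body': 'Body one'},
    'http://example.com/news/2': {'title': 'Second headline', 'body': 'Body two'},
}


def fake_spider(urls):
    """Steps 1 and 2: visit each URL, parse it, and yield a dict-like item."""
    for url in urls:
        page = FAKE_PAGES[url]
        yield {'news_title': page['title'].strip(), 'news_content': page['body']}


class ListPipeline:
    """Step 3: receive each item and store it (here, in an in-memory list)."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        self.stored.append(item)
        return item


def run(urls):
    """Feed every item the spider yields into the pipeline."""
    pipeline = ListPipeline()
    for item in fake_spider(urls):
        pipeline.process_item(item)
    return pipeline.stored


if __name__ == '__main__':
    for stored_item in run(sorted(FAKE_PAGES)):
        print(stored_item)
```

In a real project Scrapy does the wiring that run() does here: it calls the spider's callbacks and hands each yielded item to the pipelines enabled in settings.py.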

2. Crawling the Top News Highlights on Eastmoney

Without further ado, here is the code; I'll explain the approach alongside it.

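1. The first step in any crawl is to pin down the requirements. Here the task is to crawl the news in Eastmoney's Top News Highlights section, collecting each article's title, summary, publication time, and content. That tells us what to put in items.py.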

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DfcfnewsItem(scrapy.Item):
    # News title
    news_title = scrapy.Field()
    # News summary
    news_summary = scrapy.Field()
    # News content
    news_content = scrapy.Field()
    # News publication time
    news_pubtime = scrapy.Field()

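2. Study the site's URL pattern and page structure to decide on the crawl logic. The Top News Highlights section has 25 pages, and every page URL starts with "http://finance.eastmoney.com/news/cywjh"; only the page number that follows differs. Each page contains a list holding the URLs of the news articles, and from each article URL we can get the rest of that article's data. Elements are located with XPath: in Google Chrome, right-click the page and choose Inspect, then right-click the element in the Elements panel and choose Copy > Copy XPath.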

#coding:utf-8
"""
Main spider for crawling the Eastmoney Top News Highlights (25 pages in total)
"""
from scrapy.spiders import Spider
from ..items import DfcfnewsItem
from scrapy.selector import Selector
from scrapy import Request
import re
class YaowenSpider(Spider):
    name = 'dfcf_yaowen'
    start_urls = []

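    def start_requests(self):
        # Build the URLs of the Top News Highlights list pages (25 in total).
        # Page 1 is cywjh.html; pages 2-25 are assumed to follow the
        # cywjh_<n>.html pattern.
        url_head = 'http://finance.eastmoney.com/news/cywjh'
        urls = [url_head + '.html']
        urls += [url_head + '_' + str(i) + '.html' for i in range(2, 26)]
        for url in urls:
            # make_requests_from_url is deprecated as of Scrapy 1.4,
            # so build the Request directly
            yield Request(url, callback=self.parse)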

    # Parse each list page and collect the URLs of all news articles on it
    def parse(self, response):
        sel = Selector(response)
        main_path = sel.xpath('//*[@id="newsListContent"]/li')
        for path in main_path:
            link = path.xpath('./div[2]/p[1]/a/@href').extract_first()
            # Skip list entries without a link; Request(None) would raise
            if link:
                yield Request(link, callback=self.parse_link)

    # Parse each article page and extract its title, content, summary, etc.
    def parse_link(self, response):
        item = DfcfnewsItem()
        sel = Selector(response)
        # default='' keeps the code below from failing when an XPath matches nothing
        title = sel.xpath('/html/body/div[1]/div/div[3]/div[1]/div[2]/div[1]/h1/text()').extract_first(default='')
        item['news_title'] = title.strip()
        summary = sel.xpath('//*[@id="ContentBody"]/div[2]/text()').extract_first(default='')
        item['news_summary'] = summary.strip()
        time = sel.xpath('/html/body/div[1]/div/div[3]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/text()').extract_first(default='')
        item['news_pubtime'] = self.TransferTime(time)
        contents = sel.xpath('//*[@id="ContentBody"]/p[not(@class)]/text()').extract()
        # Join the paragraph texts, dropping surrounding whitespace and spaces
        content = ''.join(i.strip().replace(" ", "") for i in contents)
        item['news_content'] = content

        yield item

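    # Helper: use a regex to pull out the date in YYYY年MM月DD日 format
    @staticmethod
    def TransferTime(time_str):
        qq = re.compile(r'\d{4}年\d{2}月\d{2}日')
        out = re.findall(qq, time_str)
        # Return an empty string when no date in that format is found
        return out[0] if out else ''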
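
The list-page XPath can be tried offline before running the spider. The sketch below uses only the standard library's xml.etree.ElementTree, which supports a subset of XPath and requires well-formed markup, so it is a stand-in for Scrapy's selectors rather than a replacement. The sample markup and the extract_links helper are hypothetical; the real page is more complex.

```python
# Offline check of the list-page XPath using only the standard library.
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped like the structure the spider's
# XPath expects: each <li> has a second <div> whose first <p> holds the link.
SAMPLE = """
<html><body>
<ul id="newsListContent">
  <li>
    <div class="image">thumbnail</div>
    <div class="text">
      <p class="title"><a href="http://example.com/a/1.html">Headline one</a></p>
      <p class="info">Summary one</p>
    </div>
  </li>
  <li>
    <div class="image">thumbnail</div>
    <div class="text">
      <p class="title"><a href="http://example.com/a/2.html">Headline two</a></p>
      <p class="info">Summary two</p>
    </div>
  </li>
</ul>
</body></html>
"""


def extract_links(markup):
    """Return the href of the link in div[2]/p[1] of every list entry."""
    root = ET.fromstring(markup.strip())
    links = []
    for li in root.findall(".//*[@id='newsListContent']/li"):
        anchor = li.find("./div[2]/p[1]/a")
        # Skip entries that have no link, mirroring a None check in the spider
        if anchor is not None and anchor.get('href'):
            links.append(anchor.get('href'))
    return links


if __name__ == '__main__':
    print(extract_links(SAMPLE))
```

Absolute paths copied from Chrome, such as the /html/body/div[1]/... expressions in parse_link, break as soon as the page layout shifts, so anchoring on an id as the list XPath does is usually the more robust choice.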

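3. Once the news has been scraped it needs to be stored, so the next step is to write pipelines.py. I store each article's title, time, and summary in a database and write its content to a txt file. Here is the code.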

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import os
import codecs
class DfcfnewsPipeline(object):
    def __init__(self):
        # Connection parameters for the MySQL database
        self.dbparam =dict(
            host = 'localhost',
            port = 3306,
            db = 'demo',
            user = 'root',
            passwd = 'root',
            use_unicode = False,
            charset = 'utf8'
        )
        self.conn = pymysql.connect(**self.dbparam)
        self.cor = self.conn.cursor()

    def process_item(self, item, spider):
        # Directory where article bodies are stored as txt files
        path = 'D:/PycharmProjects/dfcfnews/data'
        title = item['news_title']
        # The characters \/:*?"<>| are not allowed in Windows file names,
        # so strip them (along with spaces) from the title
        title = title.replace(' ', '').replace('\\', '') \
            .replace('/', '').replace('*', '').replace('?', '') \
            .replace(':', '').replace('"', '').replace('<', '') \
            .replace('>', '').replace('|', '')
        summary = item['news_summary']
        content = item['news_content']
        time = item['news_pubtime']
        txt_path = path+'/'+title+'.txt'
        # An existing txt file means the article was stored on an earlier run
        if not os.path.isfile(txt_path):
            sql = "insert into dfcf_news (title,summary,pubtime) values (%s,%s,%s)"
            params = (title.encode('utf-8'),summary.encode('utf-8'),time.encode('utf-8'))
            self.cor.execute(sql, params)
            self.conn.commit()
            # The with block closes the file even if the write fails
            with codecs.open(txt_path, 'w', encoding='utf-8') as f:
                f.write(content)
        else:
            print("This news item has already been crawled!")
        return item
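
The chain of replace calls in process_item can also be written as a single regular-expression substitution. The sketch below is one possible alternative, not the code used above; sanitize_filename is a hypothetical helper name.

```python
import re

# The characters Windows forbids in file names (\ / : * ? " < > |),
# plus the space, which the pipeline strips as well.
_FORBIDDEN = re.compile(r'[\\/:*?"<>| ]')


def sanitize_filename(title):
    """Remove every character that the pipeline strips from a title."""
    return _FORBIDDEN.sub('', title)


if __name__ == '__main__':
    print(sanitize_filename('A/B: "C" <D>|E? *F\\G'))
```

Keeping the forbidden characters in one pattern makes it easier to see at a glance which ones are handled.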

4. Finally, don't forget to uncomment the pipeline entry in settings.py; otherwise the pipeline will not run. This is the block to uncomment:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'dfcfnews.pipelines.DfcfnewsPipeline': 300,
}
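
The integer next to each pipeline is its priority: items pass through the enabled pipelines in ascending order of these values, which are customarily chosen in the 0-1000 range. The fragment below is a sketch of how a second pipeline would be ordered; DedupPipeline is a hypothetical class that does not exist in this project.

```python
# settings.py fragment (sketch). DedupPipeline is hypothetical and appears
# only to show how the numbers order execution.
ITEM_PIPELINES = {
    'dfcfnews.pipelines.DfcfnewsPipeline': 300,
    'dfcfnews.pipelines.DedupPipeline': 100,
}

# Illustration only, not something to put in settings.py:
# sorting by value gives the order in which items visit the pipelines.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```

With these values an item would reach DedupPipeline first and DfcfnewsPipeline second.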

3. A few takeaways

1. Going to the command line every time to run Scrapy is tedious, so I wrote a small run.py; running that file starts the crawl directly.

#coding:utf-8
"""
Crawler launcher: starts the spider with the given name, mimicking the cmd invocation
"""
from scrapy import cmdline

crawl_name = 'dfcf_yaowen'
cmd = 'scrapy crawl {0}'.format(crawl_name)
cmdline.execute(cmd.split())

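2. Scrapy is an asynchronous crawling framework (it is not distributed out of the box). Each item is sent to the pipeline as soon as it is scraped, rather than after all items have been collected. Keep this in mind.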
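
The per-item behaviour can be illustrated with a plain Python generator. This is not Scrapy code, and Scrapy's real ordering also depends on its scheduler and on when responses arrive; crawl, process_item, and events are hypothetical names used only to show that processing starts before the crawl has finished.

```python
# Plain-Python illustration of per-item processing; not Scrapy code.
events = []


def crawl(n):
    """Hypothetical spider: yields n items one at a time."""
    for i in range(n):
        events.append('scraped item %d' % i)
        yield {'id': i}


def process_item(item):
    """Hypothetical pipeline step: record that the item was handled."""
    events.append('processed item %d' % item['id'])
    return item


# Each item is processed as soon as it is yielded, before the next is scraped
for item in crawl(3):
    process_item(item)

print(events)
```

One practical consequence is that a pipeline should not assume it can see the whole result set; anything that needs all items belongs in a step that runs when the spider closes.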

3. For how to use the many options in settings, I'll just give a link rather than go into detail here.

https://www.cnblogs.com/cnkai/p/7399573.html

4. That's all I can think of for now =.=. If you're interested in Scrapy, feel free to discuss it with me in the comments.