Python Crawler in Practice | (21) Scraping Sina Rolling News with Scrapy + Selenium

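In this post we connect Scrapy to Selenium to scrape Sina's rolling news. We scraped the same page with Selenium alone in an earlier post: it is rendered dynamically by JavaScript. Scrapy fetches pages much like the requests library does, by issuing HTTP requests directly, so on its own it cannot scrape JavaScript-rendered pages either. That is why we bring in Selenium.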

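There are two ways to scrape a JavaScript-rendered page:

1) Analyze the Ajax requests, find the interface behind them, and fetch that interface directly. Scrapy can do this too.

2) Drive a real browser with Selenium. You do not need to care about the requests made behind the page or analyze the rendering process; you only care about the final rendered page. What you can see, you can scrape.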
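As a rough illustration of the first approach, the sketch below parses the JSON body that an Ajax list interface might return and pulls out the article URLs. The payload shape (`result.data[].url`) and the sample values are assumptions made up for this example, not Sina's documented interface; the real request and response have to be read off the browser's network panel.

```python
import json


def extract_urls(payload: str) -> list:
    """Return article URLs from a JSON list response (hypothetical shape)."""
    data = json.loads(payload)
    # Assumed layout: {"result": {"data": [{"url": ...}, ...]}}
    entries = data.get("result", {}).get("data", [])
    return [entry["url"] for entry in entries if entry.get("url")]


# Made-up sample payload, used only to exercise the function.
sample = json.dumps({
    "result": {"data": [
        {"title": "a", "url": "https://example.com/a.shtml"},
        {"title": "b", "url": "https://example.com/b.shtml"},
        {"title": "c"},  # entry without a URL is skipped
    ]}
})
urls = extract_urls(sample)
```

The rest of this post takes the second approach and lets Selenium render the page.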

 

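  • Create the Scrapy project from the command line

First change into your PyCharm projects directory on the command line, then run scrapy startproject ScrapySinaRollNews to generate the crawler project. This automatically creates the project skeleton and a few files.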

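  • Create the Spider from the command line

A Spider is a class you define; Scrapy uses it to fetch content from web pages and parse the results. It must inherit from scrapy.Spider and define the spider's name, its start requests, and a method that parses the crawled responses.

Command: scrapy genspider <spider name> <site domain>

Example: scrapy genspider sinanews (plus the site domain as the last argument)

Change into the spiders directory generated earlier and run the command above.

A .py file named after the spider then appears in the spiders directory.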

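  • Create the Item

An Item is a container for scraped data. To create one, subclass scrapy.Item and declare fields of type scrapy.Field.

For each news article we want the link, title, time, source, and body text. The generated items.py contains only a skeleton, so we add the fields we need. items.py: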

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapysinarollnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    link = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()
    source = scrapy.Field()
    article = scrapy.Field()
    pass
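For comparison, here is a hedged sketch of the same five fields written as a plain dataclass, which Scrapy 2.2 and later also accept as items. `NewsItem` is a hypothetical name; the rest of this post keeps using `ScrapysinarollnewsItem`.

```python
from dataclasses import dataclass, asdict


@dataclass
class NewsItem:
    # Same fields as ScrapysinarollnewsItem; defaults keep every field optional.
    link: str = ""
    title: str = ""
    date: str = ""
    source: str = ""
    article: str = ""


item = NewsItem(
    link="https://news.sina.com.cn/c/2019-07-24/doc-ihytcitm4327565.shtml",
    title="example title",  # placeholder value for illustration
)
record = asdict(item)  # plain dict, ready for a database insert
```

With a dataclass item, code that reads `item['link']` or calls `dict(item)` would have to use attribute access and `asdict(item)` instead.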
  • Connect Selenium

Define a SeleniumDownloaderMiddleware class in middlewares.py:

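from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumDownloaderMiddleware():
    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')   # run Chrome without a window
        self.browser = webdriver.Chrome(options=chrome_options)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.quit()  # quit() ends the whole browser session, not just the current window

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            timeout=crawler.settings.get('SELENIUM_TIMEOUT')  # SELENIUM_TIMEOUT must be defined in settings.py
        )

    def process_request(self, request, spider):
        self.logger.debug('------------Chrome is starting-------------' + request.url)
        try:
            self.browser.get(request.url)
            # Two kinds of requests pass through here: the rolling-news list page,
            # from which we collect every article URL, and the article pages themselves.
            if 'https://news.sina.com.cn/roll' in request.url:  # the rolling-news list page
                news_list = ''   # all article URLs, joined with commas
                page = 0
                while page < 2:  # only crawl two pages
                    try:
                        page = page + 1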
                        # Structure of the list page (abridged; the real page repeats the <li> entry many times):
                        # <div class="d_list_txt" id="d_list" style="width:100%;">
                        #   <ul>
                        #     <li onmouseover="this.className='hover'" onmouseout="this.className=''" class="">
                        #       <span class="c_chl">[全部]</span>
                        #       <span class="c_tit"><a href="https://finance.sina.com.cn/money/bank/gsdt/2019-07-24/doc-ihytcerm5959531.shtml" target="_blank">招商银行:上半年实现净利润506.12亿 同比增13.08%</a></span>
                        #       <span class="c_time" s="1563959946">07-24 17:19</span>
                        #     </li>
                        #     ...
                        #   </ul>
                        #   ...
                        #   <div class="pagebox">
                        #     <span class="pagebox_pre"><a href="javascript:void(0)" onclick="newsList.page.pre();return false;">上一页</a></span>
                        #     <span class="pagebox_num"><a href="javascript:void(0)" onclick="newsList.page.goTo(1);return false;">1</a></span>
                        #     ...
                        #     <span class="pagebox_pre"><a href="javascript:void(0)" onclick="newsList.page.next();return false;">下一页</a></span>
                        #   </div>
                        # </div>
                        
                        self.wait.until(EC.presence_of_element_located(
                            (By.XPATH, '//div[@class="d_list_txt"]/ul/li/span/a')))
                        # find_elements_by_xpath was removed in Selenium 4; find_elements(By.XPATH, ...) works in 3 and 4
                        elements = self.browser.find_elements(By.XPATH, '//div[@class="d_list_txt"]/ul/li/span/a')
                        for i in elements:
                            news_list = news_list + ',' + i.get_attribute('href')  # join with commas
                        # <a href="javascript:void(0)" onclick="newsList.page.next();return false;">下一页</a>
                        # find and click the "next page" link
                        next_link = self.wait.until(EC.presence_of_element_located((By.XPATH, '//a[@onclick="newsList.page.next();return false;"]')))
                        next_link.click()
                        self.wait.until(EC.presence_of_element_located(
                            (By.XPATH, '//div[@class="d_list_txt"]/ul/li/span/a')))
                        print('------------Chrome is starting-------------' + self.browser.current_url)
                    except TimeoutException:
                        break
                return HtmlResponse(url=request.url, body=news_list, request=request, encoding='utf8', status=200)  # return all article URLs
            # any other page is an article page: return the rendered page source
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)
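The middleware above hands the collected URLs to the spider as one comma-joined string, which would break if a URL ever contained a comma. A minimal alternative sketch, assuming you are free to change both ends: serialize the list as JSON in the middleware and decode it in the spider. The helper names `encode_urls` and `decode_urls` are hypothetical, and the sample URLs are made up.

```python
import json


def encode_urls(urls):
    """Serialize URLs for use as a response body (drops blanks and duplicates, keeps order)."""
    unique = list(dict.fromkeys(u for u in urls if u))
    return json.dumps(unique)


def decode_urls(body):
    """Inverse of encode_urls: turn the response text back into a list."""
    return json.loads(body)


collected = [
    "https://example.com/news/a,b.shtml",   # a comma inside a URL survives the round trip
    "https://example.com/roll/x.shtml",
    "https://example.com/roll/x.shtml",     # duplicate, removed by encode_urls
]
body = encode_urls(collected)
restored = decode_urls(body)
```

In the middleware this would become `body=encode_urls(hrefs)` for a list of hrefs, and in the spider `decode_urls(response.text)`.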

The middleware has to be registered in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'ScrapySinaRollNews.middlewares.SeleniumDownloaderMiddleware':543,
}
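The middleware reads `SELENIUM_TIMEOUT` from the settings, so that name has to be defined in settings.py as well. A possible fragment; the value of 20 seconds is an assumption, tune it to your network:

```python
# settings.py (fragment)
# Seconds that WebDriverWait waits before raising TimeoutException (assumed value).
SELENIUM_TIMEOUT = 20
```

If the setting is missing, `crawler.settings.get('SELENIUM_TIMEOUT')` returns None and the wait has no usable timeout, so it is worth defining explicitly.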
  • Edit the spider's parse method (which parses the response)

Parse the contents of the response variable with CSS or XPath selectors and assign the results to the Item's fields.

# -*- coding: utf-8 -*-
import scrapy
from ScrapySinaRollNews.items import ScrapysinarollnewsItem

class SinanewsSpider(scrapy.Spider):
    name = 'sinanews'
    # The rolling-news page mixes several channels hosted on different domains; list them all
    allowed_domains = ['news.sina.com.cn', 'finance.sina.com.cn', 'sports.sina.com.cn', 'ent.sina.com.cn',
                       'mil.news.sina.com.cn', 'tech.sina.com.cn']
    start_urls = ['https://news.sina.com.cn/roll']  # note the /roll suffix on the start URL

    def parse(self, response):
        news_list = response.text.split(',')  # the middleware joined the URLs with commas
        print('---------------------' + str(len(news_list)))
        for news_url in news_list:
            if news_url:  # the first element is an empty string; skip it
                # request each article URL and parse the details in parse_news
                yield scrapy.Request(url=news_url, callback=self.parse_news)

    def parse_news(self, response):
        item = ScrapysinarollnewsItem()
        item['link'] = response.url

        # Title, e.g. <h1 class="main-title">招商银行:上半年实现净利润506.12亿 同比增13.08%</h1>
        title = response.css('.main-title::text')
        # a few articles use a different layout; fall back to an alternative selector
        if not title:
            title = response.css('#artibodyTitle::text')
        item['title'] = title.extract_first(default='')

        # Date, e.g. <span class="date">2019年07月24日 17:19</span>
        date = response.css('.date::text')
        if not date:
            date = response.css('#pub_date::text')
        item['date'] = date.extract_first(default='')

        # Source, e.g. <span class="source ent-source">新浪财经</span>
        source = response.css('.source::text')
        if not source:
            # <a href="http://tech.sina.com.cn/" target="_blank" data-sudaclick="media_name">新浪科技</a>
            source = response.css('[data-sudaclick="media_name"]::text')
        item['source'] = source.extract_first(default='')

        # Body text
        article = response.xpath('//div[@class="article"]/p//text()')
        if not article:
            article = response.xpath('//div[@id="artibody"]/p//text()')
        # extract() returns a list of text fragments (empty when nothing matched),
        # so the join is safe even for pages with an unknown layout
        item['article'] = ''.join(article.extract())

        yield item
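The parse_news method repeats one pattern several times: try a selector, and fall back to another if it matches nothing. Below is a hedged sketch of a helper that captures the pattern. `first_text` is a hypothetical name, and `FakeResponse` and `FakeSelection` are stand-ins that only imitate the `response.css(query).get()` calls so that the snippet runs without Scrapy.

```python
def first_text(response, *queries, default=""):
    """Return the first non-empty text matched by any of the CSS queries."""
    for query in queries:
        text = response.css(query).get()
        if text and text.strip():
            return text.strip()
    return default


class FakeSelection:
    """Stand-in for a Scrapy SelectorList; only get() is imitated."""
    def __init__(self, values):
        self.values = values

    def get(self):
        return self.values[0] if self.values else None


class FakeResponse:
    """Stand-in for a Scrapy response, backed by a query -> texts mapping."""
    def __init__(self, mapping):
        self.mapping = mapping

    def css(self, query):
        return FakeSelection(self.mapping.get(query, []))


page = FakeResponse({"#artibodyTitle::text": ["  Some headline "]})
title = first_text(page, ".main-title::text", "#artibodyTitle::text")
missing = first_text(page, ".date::text", "#pub_date::text")
```

Inside the spider, the title lookup would then shrink to something like `item['title'] = first_text(response, '.main-title::text', '#artibodyTitle::text')`.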





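  • Use an item pipeline

We store the data in a MongoDB database. pipelines.py:

Define a class that implements process_item(). The method must return a dict or Item object holding the data, or raise a DropItem exception. It takes two main parameters: item, the Item the Spider just produced, and spider, the Spider instance. Once the Item Pipeline is enabled, Scrapy calls process_item() automatically for every item.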
 

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

class ScrapysinarollnewsPipeline(object):
    def process_item(self, item, spider):
        return item

# Storage pipeline: writes items to a MongoDB database
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # Read mongo_uri and mongo_db from settings.py; both must be defined there
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    # Connect to the server and select the database
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # process_item is required and must accept item and spider; the other methods are optional
    def process_item(self, item, spider):
        name = item.__class__.__name__  # use the item class name as the collection name
        if not self.db[name].find_one({'link': item['link']}):  # skip links that are already stored
            # insert_one replaces insert(), which was deprecated and later removed in PyMongo 4
            self.db[name].insert_one(dict(item))  # convert the item to a dict of key-value pairs
        return item

    def close_spider(self, spider):
        self.client.close()
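The description above says process_item() may raise DropItem, but MongoPipeline never does. As an illustration, here is a small sketch of a pipeline that drops items whose link was already seen in the current run. `DuplicateLinkPipeline` is a hypothetical name, and the `try/except ImportError` fallback exists only so that the snippet runs where Scrapy is not installed. Unlike the find_one check, this set lives in memory and forgets everything between runs.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Fallback so the sketch runs without Scrapy."""


class DuplicateLinkPipeline:
    def __init__(self):
        self.seen_links = set()  # links already passed through in this run

    def process_item(self, item, spider):
        link = item["link"]
        if link in self.seen_links:
            # Raising DropItem stops the item from reaching later pipelines.
            raise DropItem("duplicate link: %s" % link)
        self.seen_links.add(link)
        return item


pipeline = DuplicateLinkPipeline()
first = pipeline.process_item({"link": "https://example.com/a"}, spider=None)
try:
    pipeline.process_item({"link": "https://example.com/a"}, spider=None)
    dropped = False
except DropItem:
    dropped = True
```

To run it before MongoPipeline, it would be registered in ITEM_PIPELINES with a smaller number than 400, since lower values run first.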

Remember to register the pipeline in settings.py so that Scrapy knows about it (add the following to settings.py):

ITEM_PIPELINES = {
    'ScrapySinaRollNews.pipelines.MongoPipeline': 400,
}
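MongoPipeline also reads `MONGO_URI` and `MONGO_DB` from the settings. A possible fragment for settings.py; both values are assumptions for a local MongoDB setup and should be adapted:

```python
# settings.py (fragment); both values are assumptions for a local setup
MONGO_URI = 'mongodb://localhost:27017'  # 27017 is MongoDB's default port
MONGO_DB = 'sinanews'                    # hypothetical database name
```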

Run the crawler with scrapy crawl sinanews.

Crawl results:

Complete project
