Python Web Scraping for Beginners: 3.9 Scrapy in Practice

Latest recommended article published on 2024-08-03 14:18:54

酸辣粉不要辣

Article link: https://blog.csdn.net/lpp5406813053/article/details/84585846

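Note: this material is adapted from the book "从零开始学Python网络爬虫" (Learning Python Web Scraping from Scratch) by 罗攀 and 蒋仟, published by 机械工业出版社, ISBN 9787111579991.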

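The previous section covered installing the Scrapy framework and its basics. In this section we start using Scrapy to crawl data from Zhihu.

First, create a project named zhihu from the command prompt. To do so, enter the following commands in the command prompt window: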

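F:    (the drive on which I want to create the project)
cd F:\soft_exercise\python    (the directory in which I want to create the project)
scrapy startproject zhihu    (use the scrapy startproject command to create a project named zhihu)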

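The result is as follows:

Once the project has been created, opening it in PyCharm shows the following files. Among them, zhihuspiders is a file I created myself.

Now let's analyze and implement the project.

1. First, clarify the goal and the technical approach. The goal is to crawl information about the Python topic on Zhihu and store it in a MongoDB database.

2. The information to crawl is: the Python question, the number of upvotes, the answering user, the user's information, and the answer content.

3. Writing the code

3.1 Writing the items.py file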

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhihuItem(scrapy.Item):
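    # define the fields for your item here like:
    # name = scrapy.Field()
    question = scrapy.Field()   # question title
    favour = scrapy.Field()     # number of upvotes
    user = scrapy.Field()       # name of the answering user
    user_info = scrapy.Field()  # short profile of the answering user
    content = scrapy.Field()    # text of the answer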

3.2 Writing the zhihuspiders.py file

# Import the required libraries
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from zhihu.items import ZhihuItem

class zhihu(CrawlSpider):
    # Unique name of the spider, used by "scrapy crawl zhihu"
    name = 'zhihu'
    # Scrapy reads its initial requests from start_urls (note the plural)
    start_urls = ['https://www.zhihu.com/topic/19552832/top-answers?page=1']

    def parse(self, response):
        selector = Selector(response)
        # Each child div of the feed list holds one question/answer entry
        infos = selector.xpath('//div[@class="zu-top-feed-list"]/div')
        for info in infos:
            try:
                question = info.xpath('div/div/h2/a/text()').extract()[0].strip()
                favour = info.xpath('div/div/div[1]/div[1]/a/text()').extract()[0]
                user = info.xpath('div/div/div[1]/div[3]/span/span[1]/a/text()').extract()[0]
                user_info = info.xpath('div/div/div[1]/div[3]/span/span[2]/text()').extract()[0].strip()
                content = info.xpath('div/div/div[1]/div[5]/div/text()').extract()[0].strip()
                # Create a fresh item per entry so that yielded items do not share state
                item = ZhihuItem()
                item['question'] = question
                item['favour'] = favour
                item['user'] = user
                item['user_info'] = user_info
                item['content'] = content
                yield item
            except IndexError:
                # extract() returned an empty list for some field: skip this entry
                pass
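
The extract-then-index pattern used in parse() can be tried without a network connection. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup in SAMPLE and the helper name parse_feed are hypothetical, not Zhihu's real page structure. It shows how indexing an empty result list raises IndexError and how catching it skips incomplete entries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for one page of results.
# The second entry has no <h2><a> title, so extraction fails for it.
SAMPLE = """
<div class="feed">
  <div><h2><a> What is a generator? </a></h2><span>128</span></div>
  <div><span>7</span></div>
  <div><h2><a>How do decorators work?</a></h2><span>56</span></div>
</div>
""".strip()


def parse_feed(markup):
    """Return one dict per entry, skipping entries with a missing field."""
    root = ET.fromstring(markup)
    results = []
    for entry in root.findall("div"):
        try:
            # findall() returns a list, much like extract(); indexing [0]
            # raises IndexError when nothing matched.
            question = entry.findall("h2/a")[0].text.strip()
            favour = entry.findall("span")[0].text.strip()
        except IndexError:
            continue  # incomplete entry: skip it, as the spider does
        # A new dict is built for every entry, so results never share state.
        results.append({"question": question, "favour": favour})
    return results


if __name__ == "__main__":
    for row in parse_feed(SAMPLE):
        print(row)
```

Silently skipping entries keeps the crawl running, but it also hides selector mistakes: if every entry is skipped, the XPath expressions probably no longer match the page.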

3.3 Writing the pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class ZhihuPipeline(object):
    def __init__(self):
        # Connect to the local MongoDB server
        client = pymongo.MongoClient('localhost', 27017)
        # Use the "test" database and its "zhihu" collection
        test = client['test']
        zhihu = test['zhihu']
        self.post = zhihu

    # Insert each item into the database
    def process_item(self, item, spider):
        info = dict(item)
        # insert_one() replaces the old insert(), which PyMongo 4 removed
        self.post.insert_one(info)
        return item
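
A pipeline can also open its database connection when the spider starts and close it when the spider finishes, using Scrapy's open_spider and close_spider hooks. The following is a minimal sketch of that variant, not the book's code: the class name MongoPipeline and its constructor parameters are assumptions, and pymongo is imported inside open_spider only so the class can be loaded and tried without a running MongoDB server.

```python
class MongoPipeline:
    """Sketch of a pipeline that opens and closes its MongoDB connection."""

    def __init__(self, uri="mongodb://localhost:27017", db_name="test",
                 collection_name="zhihu"):
        # Hypothetical defaults mirroring the database and collection above
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider is opened
        import pymongo  # local import: only needed when a crawl really runs
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider is closed
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy and pass the item on to later pipelines
        self.collection.insert_one(dict(item))
        return item
```

To use a class like this, it would have to be registered in ITEM_PIPELINES under its own path (for example 'zhihu.pipelines.MongoPipeline') in place of ZhihuPipeline.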

3.4 Writing the settings.py file

ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
DOWNLOAD_DELAY = 2  # wait 2 seconds between requests
ITEM_PIPELINES = {'zhihu.pipelines.ZhihuPipeline': 300}  # pipeline that processes the items

3.5 Writing the main.py file; main.py is a newly created file that starts the crawl from within the IDE

from scrapy import cmdline
cmdline.execute("scrapy crawl zhihu".split())
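
cmdline.execute expects the command as a list of arguments, which is why the string is split first. The short sketch below only shows what that list looks like. The second command, which exports items with the -o option to a file name containing a space, is an illustrative assumption and not part of the original project; shlex.split is used there because str.split would break the quoted file name apart.

```python
import shlex

# The command from main.py, split on whitespace into an argument list
command = "scrapy crawl zhihu"
argv = command.split()
print(argv)

# Hypothetical variant: export the items to a file whose name contains a space.
# shlex.split() honours the quotes and keeps the file name as one argument.
export_command = 'scrapy crawl zhihu -o "zhihu items.json"'
export_argv = shlex.split(export_command)
print(export_argv)
```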
