Web Scraping: Saving Data Crawled by Scrapy to a MongoDB Database

猿心不灭

Category: Python Spider. Tags: python, mongodb, web scraping

Article link: https://blog.csdn.net/weixin_46297209/article/details/110851293

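Goal: crawl every page of the Chinese and Western medicine category on 1药网 (111.com.cn) and extract each product's unit price, name, and number of comments.

The previous post covered the basics of Scrapy and how to configure each of its files. Compared with the example in that post, the only difference here is how the data is handled in pipelines.py.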

Create the spider file

scrapy genspider yiyaowang yiyaowang.com

In yiyaowang.py, write the callback function first and scrape a single page of data

# -*- coding: utf-8 -*-
import scrapy

class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yiyaowang.com']
    start_urls = ['https://www.111.com.cn/categories/953710']

    def parse(self, response):

        # Extract the data: one <li> per product
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem: this does not return None; it returns a blank string.
            # Analysis: a blank result rather than None means the XPath itself is fine;
            # the first element of the returned list is probably an empty string.
            # Fix: fetch everything with getall(), then pick the element we need.
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Number of comments
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()
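
The `getall()[1]` lookup above depends on the whitespace node always being in the first position. A slightly more defensive variant, sketched here in plain Python, picks the first text node that is not blank. The helper names `first_non_blank` and `clean_text` are hypothetical and not part of Scrapy; they only operate on the strings that `getall()` and `get()` return, and the sample list is made up for illustration.

```python
def first_non_blank(texts):
    """Return the first string that still has content after stripping, else None."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return None


def clean_text(value):
    """Strip a string, passing None through (get() returns None when nothing matches)."""
    return value.strip() if value is not None else None


# Simulated result of getall(): a whitespace-only node, then the real title.
nodes = ["\n        ", "  Example product title \n", "\n"]
print(first_non_blank(nodes))
print(clean_text(None))
```

In the spider this would be used as `first_non_blank(good.xpath(...).getall())`, which also avoids an `IndexError` when a product has fewer text nodes than expected.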

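Find the pattern in the page URLs and loop over all pages

Page 1: https://www.111.com.cn/categories/953710-j1.html
Page 2: https://www.111.com.cn/categories/953710-j2.html
...
Last page: https://www.111.com.cn/categories/953710-j50.html

Summary: there are 50 pages in total. The only part of the URL that changes is the number after j, and that number matches the page number.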
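
The pattern can be checked in plain Python before it goes into the spider. This is a minimal sketch: the function name `build_page_urls` is hypothetical, and the page count of 50 is the value observed above, which may differ on the live site.

```python
BASE_URL = "https://www.111.com.cn/categories/953710-j{}.html"
LAST_PAGE = 50  # page count observed above; an assumption about the live site


def build_page_urls(base_url=BASE_URL, last_page=LAST_PAGE):
    """Return the listing URLs for pages 1..last_page inclusive."""
    return [base_url.format(page) for page in range(1, last_page + 1)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Note the `last_page + 1`: `range` excludes its upper bound, so stopping at 50 would silently skip the last page.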

Extend the existing code accordingly

# -*- coding: utf-8 -*-
import scrapy

class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yiyaowang.com']
    # -------------------------------------------------------------------------
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page (1 to 50)
    for i in range(1, 51):
        start_urls.append(base_url.format(i))
    # -------------------------------------------------------------------------

    def parse(self, response):

        # Extract the data: one <li> per product
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')
        for good in good_list:
            # Unit price
            good_price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            good_title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Number of comments
            good_comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

At this point the data has been scraped. The next step is to process it.

Write the corresponding class in items.py

import scrapy

class YiYaoWang(scrapy.Item):
    # Product title
    title = scrapy.Field()
    # Unit price
    price = scrapy.Field()
    # Number of comments
    comment = scrapy.Field()

Put the data into an item so that the pipeline can consume it

import scrapy

# Import the item class defined in items.py
# ------------------------------------------------------------
from ..items import YiYaoWang
# ------------------------------------------------------------

class YiyaowangSpider(scrapy.Spider):
    name = 'yiyaowang'
    # allowed_domains = ['yiyaowang.com']
    start_urls = []
    base_url = "https://www.111.com.cn/categories/953710-j{}.html"
    # Build the URL of every page (1 to 50)
    for i in range(1, 51):
        start_urls.append(base_url.format(i))

    def parse(self, response):
        """Extract the product data from a listing page."""
        good_list = response.xpath('//ul[@id="itemSearchList"]/li')

        for good in good_list:
            # Instantiate a fresh item for each product, so that yielded
            # items do not all share one mutable object
            # ----------------------------------------------------------
            item = YiYaoWang()
            # ----------------------------------------------------------

            # Unit price
            price = good.xpath('.//p[@class="price"]//span/text()').get().strip()

            # Title
            # good_title = good.xpath('.//p[@class="titleBox"]//a/text()').get()
            # Problem: this does not return None; it returns a blank string.
            # Analysis: a blank result rather than None means the XPath itself is fine;
            # the first element of the returned list is probably an empty string.
            # Fix: fetch everything with getall(), then pick the element we need.
            title = good.xpath('.//p[@class="titleBox"]//a/text()').getall()[1].strip()

            # Number of comments
            comment = good.xpath('.//a[@id="pdlink3"]//em/text()').get()

            # ---------------------------------------------------------------
            # Store the values in the item
            item["title"] = title
            item["price"] = price
            item["comment"] = comment
            # ---------------------------------------------------------------

            yield item
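
In a parse loop like the one above, it matters where the item object is created. The sketch below uses plain dicts as stand-ins for Scrapy items (an assumption made so that it runs without Scrapy) and compares a generator that mutates one shared object with one that builds a new object per product. The function names are hypothetical.

```python
def parse_shared(rows):
    """Create the item once and mutate it on every iteration."""
    item = {}
    for title, price in rows:
        item["title"] = title
        item["price"] = price
        yield item  # always the same dict object


def parse_fresh(rows):
    """Create a new item for every product."""
    for title, price in rows:
        yield {"title": title, "price": price}


rows = [("A", "1.00"), ("B", "2.00")]
shared = list(parse_shared(rows))
fresh = list(parse_fresh(rows))
print(shared)  # both entries refer to the same dict, now holding product B
print(fresh)   # one distinct dict per product
```

Whether the shared version causes trouble in practice depends on when downstream code reads the item, but creating one item per product removes the risk at no cost.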

In the pipelines.py pipeline file, write the class that saves the data

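import pymongo

class YiYaoWangPipeline:
    def open_spider(self, spider):
        # Create the connection
        self.client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        # Select the database
        self.db = self.client["first_text"]
        # Select the collection
        self.col = self.db["yiyaowang"]

    def process_item(self, item, spider):
        # Insert one document; keys are 标题 (title), 单价 (unit price), 评论数 (number of comments).
        # insert_one replaces the old Collection.insert, which was removed in PyMongo 4.
        self.col.insert_one({"标题": item["title"], "单价": item["price"], "评论数": item["comment"]})
        return item

    def close_spider(self, spider):
        self.client.close()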
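
The item-to-document mapping in `process_item` can be exercised without a running MongoDB server. The sketch below is not part of the original project: `FakeCollection`, `item_to_document`, and `SketchPipeline` are hypothetical names, and the fake collection only mimics `insert_one`, the PyMongo method for inserting a single document.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


def item_to_document(item):
    """Map item fields to the document keys used by the pipeline above."""
    return {"标题": item["title"], "单价": item["price"], "评论数": item["comment"]}


class SketchPipeline:
    """Mirrors process_item of the pipeline above, with the collection injected."""

    def __init__(self, collection):
        self.col = collection

    def process_item(self, item, spider):
        self.col.insert_one(item_to_document(item))
        return item  # returning the item lets later pipelines see it


collection = FakeCollection()
pipeline = SketchPipeline(collection)
pipeline.process_item({"title": "A", "price": "1.00", "comment": "12"}, spider=None)
print(collection.documents)
```

Passing the collection in from outside is a design choice made only for this sketch; the real pipeline opens its own connection in `open_spider`.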

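Add the finished pipeline to the settings.py configuration file

The pipeline entries of earlier spiders must be commented out. Otherwise those pipelines also run for the current spider, and they raise errors when the fields they expect to save differ from the ones this spider produces.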

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'reptile.pipelines.ReptilePipeline': 300,
    # 'reptile.pipelines.HuPuPipeline': 300,
    'reptile.pipelines.YiYaoWangPipeline': 300,
}
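
An alternative to commenting entries in and out is to let each pipeline check which spider produced the item. The sketch below illustrates the idea in plain Python: `GuardedPipeline` is a hypothetical class, the list `saved` stands in for the MongoDB insert, and the spider name `hupu` is assumed only for the example. Scrapy also offers a per-spider `custom_settings` attribute that can override `ITEM_PIPELINES` for a single spider.

```python
from types import SimpleNamespace


class GuardedPipeline:
    """Hypothetical pipeline that only handles items from one spider."""

    target_spider = "yiyaowang"

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        if spider.name != self.target_spider:
            return item  # not ours: pass the item along untouched
        self.saved.append(dict(item))  # stand-in for the MongoDB insert
        return item


pipeline = GuardedPipeline()
# SimpleNamespace objects play the role of spiders; only .name is used.
pipeline.process_item({"title": "A"}, SimpleNamespace(name="yiyaowang"))
pipeline.process_item({"team": "X"}, SimpleNamespace(name="hupu"))
print(pipeline.saved)
```

With a guard like this, several pipelines can stay enabled in settings.py without interfering with one another.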

Run the spider
```
scrapy crawl yiyaowang
```
Open a command window and check in the MongoDB shell that the data was saved to the database
```
MongoDB Enterprise > show tables
yiyaowang
```
