Scraping the itcast (传智播客) instructor directory with the Scrapy crawler framework: a detailed beginner example (Part 2)

fallwind_of_july

Last modified 2024-04-15 15:11:55

Category: python  Tags: python, scrapy

First published 2019-07-26 15:09:16

Article link: https://blog.csdn.net/fallwind_of_july/article/details/97390118

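The previous article explained in detail how to use Scrapy, Python's crawler framework, to scrape data from a web page. Runnable source code is linked at the end of this post for anyone who wants to download it.
This post builds on that one with an improvement: by writing a pipelines file, the scraped data can be saved to a local file.
Previous article:
https://blog.csdn.net/fallwind_of_july/article/details/97246577

For the basics of the crawl, see the article linked above. This post only changes a few files and everything else stays the same. Only two files need to be modified: pipelines.py and test.py.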

I. Changes:

1. Contents of `pipelines.py`

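# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TeacherPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline object is created.
        # The explicit encoding (Python 3) keeps Chinese text writable on any platform.
        self.f = open("detail.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize the item to a JSON string; ensure_ascii=False writes
        # Chinese characters as-is instead of \uXXXX escapes.
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        # Return the item so that any later pipeline can still process it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()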

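Explanation of the pipeline file:
def __init__(self): opens the output file for writing. It runs only once, when the pipeline is created.

def process_item(self, item, spider): is called for every item. It converts the item to a dict and serializes it to a JSON string (ensure_ascii=False keeps Chinese characters readable instead of escaping them as \uXXXX), then writes that string to the file.

def close_spider(self,spider): closes the file once the spider has finished and all data has been written.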
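
To see the order of these calls without running a crawl, here is a minimal sketch that drives a pipeline by hand. It is not the author's code: JsonLinesPipeline, run_pipeline and read_back are hypothetical names, the sample record is made up, and two details are deliberately different from TeacherPipeline (the file is opened in open_spider rather than __init__, and each line is written without the trailing comma so that every line is valid JSON on its own).

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical stand-in for TeacherPipeline; the output path is a parameter."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line, Chinese characters written unescaped.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.f.close()


def run_pipeline(items, path):
    """Call the pipeline methods in the same order Scrapy would."""
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)


def read_back(path):
    """Parse the written file line by line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    sample = [{"name": "张老师", "title": "高级讲师", "info": "示例"}]  # made-up data
    out = os.path.join(tempfile.mkdtemp(), "detail.jsonl")
    run_pipeline(sample, out)
    print(read_back(out))
```

Because each line is a complete JSON document, the file can be read back with json.loads line by line, which is not possible with the ",\n" separator used in the listing above.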

2. Contents of `test.py`

# -*- coding: utf-8 -*-
import scrapy
from Teacher.items import TeacherItem


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        # Each <div class="li_txt"> block holds one teacher's details.
        node_list = response.xpath("//div[@class='li_txt']")

        #items=[]
        for node in node_list:
            item = TeacherItem()

            # extract() returns a list of matching strings.
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            #items.append(item)
            # Hand the item to the pipeline, then continue with the next node.
            yield item

        #return items

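Explanation:
yield hands each extracted item to the pipeline as soon as it is ready, and execution then resumes right after the yield to process the next node. This replaces the earlier approach of collecting every item in a list and returning the list at the end.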
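
The difference between the two styles can be seen in plain Python, without Scrapy. This is only an illustrative sketch: parse_with_return, parse_with_yield and the node names are made up for the example.

```python
def parse_with_return(nodes):
    """Collect everything first, then hand back the whole list at once."""
    items = []
    for node in nodes:
        items.append({"name": node})
    return items


def parse_with_yield(nodes):
    """Hand back one item at a time; the loop pauses at each yield."""
    for node in nodes:
        yield {"name": node}


nodes = ["teacher_a", "teacher_b", "teacher_c"]  # made-up placeholders

gen = parse_with_yield(nodes)
first = next(gen)  # runs the loop body once, then pauses at the yield
rest = list(gen)   # resumes after the yield and finishes the loop

print(first)
print(rest)
print(parse_with_return(nodes))
```

In both cases the same items are produced; with yield the consumer (in Scrapy, the pipeline) can start working on the first item before the last one has been extracted.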

II. Running:

In cmd, from the project directory, type: scrapy crawl test
then press Enter.
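
Scrapy only runs a pipeline that is enabled in the project's settings.py, as the comment at the top of pipelines.py points out. If the project from the previous article does not enable it already, an entry along these lines should be enough. The module path Teacher.pipelines.TeacherPipeline is an assumption inferred from the import in test.py and the pipeline class name, and 300 is just a conventional priority value (pipelines with lower numbers run first).

```python
# settings.py (fragment) -- module path assumed from "from Teacher.items import TeacherItem"
ITEM_PIPELINES = {
    "Teacher.pipelines.TeacherPipeline": 300,
}
```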

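III. Results:

The output file name is set in pipelines.py by self.f = open("detail.json","w"). Changing it to detail.csv makes the output file open as a CSV file in a spreadsheet program. Note that this only changes the file name: the pipeline still writes one JSON object per line, so producing true comma-separated values also requires changing what process_item writes.
Output saved as detail.csv:
[screenshot: detail.csv output]

Output saved as detail.json:
[screenshot: detail.json output]
With that, we have learned how to write a pipelines file.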
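
For a file that really contains comma-separated values, the pipeline can write rows with Python's csv module instead of json.dumps. The following is a minimal sketch under stated assumptions rather than part of the original project: CsvPipeline and FIELDS are hypothetical names, and the field list assumes the three fields the spider fills in (name, title, info).

```python
import csv
import os
import tempfile

FIELDS = ["name", "title", "info"]  # assumed: the fields set in test.py


class CsvPipeline:
    """Hypothetical pipeline that writes one CSV row per item."""

    def __init__(self, path="detail.csv"):
        self.path = path

    def open_spider(self, spider):
        # newline="" is what the csv module expects for files it writes;
        # utf-8-sig adds a BOM so spreadsheet programs detect the encoding.
        self.f = open(self.path, "w", newline="", encoding="utf-8-sig")
        self.writer = csv.DictWriter(self.f, fieldnames=FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.f.close()


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "detail.csv")
    pipeline = CsvPipeline(out)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"name": "张老师", "title": "讲师", "info": "示例, 含逗号"}, spider=None)  # made-up row
    pipeline.close_spider(spider=None)
    with open(out, newline="", encoding="utf-8-sig") as f:
        print(list(csv.DictReader(f)))
```

Scrapy's built-in feed exports offer the same result without a custom pipeline: running scrapy crawl test -o detail.csv writes the yielded items to a CSV file directly.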

Source code download (in the New folder):
https://github.com/AndyofJuly/scrapyDemo

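I recently noticed that this post had been automatically marked as a VIP article. That seemed to get in the way of people learning, so I have turned VIP-only viewing off.
If you found this useful, feel free to like or follow~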
