Scrapy Project Example with Explanation (1)

独角兽小马


Article link: https://blog.csdn.net/weixin_44457673/article/details/118709326

For the basics, see:
Scrapy Basics Explained
Scrapy Persistent Storage

=
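This example is the introductory Scrapy exercise from the 菜鸟教程 tutorial site.
It scrapes the instructor profiles of 传智教育 (http://www.itcast.cn/channel/teacher.shtml#aandroid).
The page has no anti-scraping measures, so I won't go through the page itself here.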

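First, create the project and the spider.

Create the project:
scrapy startproject itcastPro
Create the spider (www.xxx.com is only a placeholder domain; the real start URL is set in the spider file later):
scrapy genspider itcast www.xxx.com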

Start with the items file (items.py):

import scrapy


class ItcastproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

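Each of the three pieces of information to be scraped must be declared as a Field.
A Field object holds the metadata for one attribute, and there is no restriction on the values it accepts.
The Field class is just an alias for the built-in dict class and provides no extra methods or attributes. It exists to support the item declaration syntax, which is based on class attributes.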

=
Next, edit the spider file:

import scrapy

# Import the Item class whose fields were declared in items.py
from itcastPro.items import ItcastproItem

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    # allowed_domains = ['www.baidu.com']
    # Initial URL to crawl
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#aandroid']

    def parse(self, response):
        # Select one <li> per instructor with XPath
        li_list = response.xpath('/html/body/div[10]/div/div[2]/ul/li')
        for li in li_list:
            # extract_first() returns a single string, the first match in the
            # result list (or None if nothing matched).
            li_name = li.xpath('./div[3]/h2/text()').extract_first()
            li_job = li.xpath('./div[3]/h2/span/text()').extract_first()
            li_introduce = li.xpath('./div[3]/p/text()').extract_first()

            # print(li_name,li_job,li_introduce)
            # Create an instance of the imported Item class
            item = ItcastproItem()
            # Fill in the fields
            item['name'] = li_name
            item['title'] = li_job
            item['info'] = li_introduce
            # Hand the item back to the engine with yield
            yield item

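yield works much like return, except that a function containing yield produces a generator (a kind of iterator) instead of a single value.
In this program, after yield item hands back the first value, the function is suspended. When the next value is requested, execution resumes right after the yield, which means the for loop simply continues.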
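The suspend-and-resume behaviour can be observed without Scrapy. The sketch below uses a hypothetical function parse_like as a stand-in for parse(); the log list and the sample rows are only there to record the order in which the code runs.

```python
log = []  # records the order in which the generator body runs


def parse_like(rows):
    """Hypothetical stand-in for parse(): yields one dict per row."""
    for row in rows:
        log.append(f'before yield {row}')
        yield {'name': row}
        log.append(f'after yield {row}')


gen = parse_like(['a', 'b'])
log_after_call = list(log)    # calling the function runs none of its body

first = next(gen)             # runs up to the first yield, then pauses
log_after_first = list(log)

second = next(gen)            # resumes after the first yield, loops again
log_after_second = list(log)

rest = list(gen)              # runs to the end; nothing more is yielded

print(first, second, rest)
print(log)
```

Scrapy consumes the generator returned by parse() in the same way, pulling items one at a time and passing each to the item pipeline.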

=
Finally, the pipeline file (pipelines.py):

import json
# Write each item to a file as one line of JSON
class ItcastproPipeline:
    fp = None

    def open_spider(self, spider):
        self.fp = open('./itcast.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.fp.write(content)
        return item

    def close_spider(self, spider):
        self.fp.close()

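json.dumps serializes a Python data structure to a JSON string.
Because each item is written on its own line, the resulting file holds one JSON object per line rather than a single JSON document.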
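The effect of ensure_ascii=False, and the way a file written like this can be read back, can be sketched with the standard library alone. The record below is made-up sample data, and io.StringIO stands in for the real itcast.json file.

```python
import io
import json

# Made-up sample record with non-ASCII text, similar in shape to one item.
record = {'name': '张老师', 'title': '高级讲师'}

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # characters kept as-is

# Write two records the way the pipeline does: one JSON object per line.
fake_file = io.StringIO()
for rec in (record, {'name': '李老师', 'title': '讲师'}):
    fake_file.write(json.dumps(rec, ensure_ascii=False) + '\n')

# Read the records back line by line.
fake_file.seek(0)
restored = [json.loads(line) for line in fake_file if line.strip()]

print(escaped)
print(readable)
print(restored)
```

Reading line by line is needed because the file as a whole is not one JSON value, so calling json.load on the entire file would fail once it contains more than one record.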
Final result: (image)
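To reproduce this result, the pipeline normally also has to be enabled in the project's settings.py, a step not shown above. The fragment below is a sketch that assumes the default layout generated by scrapy startproject itcastPro, so the module path is an assumption; the priority value 300 is the conventional default and any integer from 0 to 1000 would do.

```python
# settings.py (fragment)
# Assumed module path: <project package>.pipelines.<pipeline class>
ITEM_PIPELINES = {
    'itcastPro.pipelines.ItcastproPipeline': 300,  # lower numbers run first
}
```

With that in place, running scrapy crawl itcast from the project directory starts the spider named 'itcast', and the pipeline writes itcast.json to the directory the command was run from.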
