Python Crawler 7-1: Scrapy Framework Case Study, Scraping Neihan Duanzi (内涵段子)


一杯海风


Category: 基础篇 (Basics)

Article link: https://blog.csdn.net/liyahui_3163/article/details/79082083


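Case analysis:

1. The site serves its data as JSON, so first use a packet-capture tool to find the URL that returns the JSON and use that URL as the crawl entry point. A JSON response can be filtered conveniently with re regular expressions. If the data were rendered directly in the page, XPath would be the better choice.

2. Work out the data model for items.py. Only the jokes themselves are scraped here, so there is a single field: content

3. Write the spider. You can first save the filtered data to a local spreadsheet file and inspect it: the values are strings of \u escape sequences.

4. Save the data to the database. The stored values also show up as \u escape sequences, so pay attention to encoding: when building the SQL insert statement, decode the value with decode('unicode_escape') before storing it, and the database will hold readable Chinese text. On converting \u-escaped strings to Chinese: in Python 2 you call .decode('unicode_escape') directly, while in Python 3 you first call encode('utf-8') and then .decode('unicode_escape').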
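
The extraction and decoding steps described above can be sketched in Python 3 as follows. This is a minimal illustration, not the article's spider: `sample_body` is a made-up stand-in for the site's JSON response, and `extract_contents` and `unescape` are hypothetical helper names. The encode-then-decode trick is only safe when the raw text contains nothing but ASCII characters and \u escapes.

```python
import re

# Hypothetical sample standing in for the JSON body returned by the site.
sample_body = '{"data": [{"content": "\\u5185\\u6db5\\u6bb5\\u5b50"}, {"content": "hello"}]}'


def extract_contents(body):
    """Pull every "content" value out of the raw JSON text with a regex."""
    return re.findall(r'"content": "(.*?)"', body)


def unescape(raw):
    """Turn literal \\uXXXX escapes into real characters (Python 3 form)."""
    # str has no .decode() in Python 3, so go through bytes first.
    return raw.encode('utf-8').decode('unicode_escape')


contents = [unescape(c) for c in extract_contents(sample_body)]
print(contents)
```

In Python 2 the `encode('utf-8')` step is unnecessary because the matched value is already a byte string.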
The complete code for this case follows:
Command to create the project: python2 -m scrapy startproject duanzi
I. Define the data model: the items.py file. Only one field, content, is defined here:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DuanziItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


# Define the item type that fixes which fields are scraped. It inherits from
# scrapy.Item and wraps the collected data.
class NeihanItem(scrapy.Item):
    # Define the fields
    content = scrapy.Field()
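II. Define the route and write the spider, neihan.py
# coding:utf-8
'''
Scrape Neihan Duanzi data with the Scrapy framework
'''
# Define a spider class for Neihan Duanzi and set its attributes
'''To use Scrapy's built-in crawling machinery and let Scrapy collect the data automatically, we need to define a spider class.
Define the Neihan class in the spiders/neihan.py module, inheriting from scrapy.Spider'''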
import scrapy
import re
from .. import items
class Neihan(scrapy.Spider):
    name = 'neihan'
    start_urls = ('https://neihanshequ.com/joke/?is_json=1&app_name=neihanshequ_web&max_time=1516151019.0',)
    allowed_domains = ['neihanshequ.com']

    # parse() receives the response fetched by the downloader
    def parse(self, response):
        # Keep a copy of the raw response; the file is named after the host
        filename = response.url.split('/')[-3] + '.txt'
        with open(filename, 'wb') as f:
            # response.body is already a byte string, so write it as-is
            f.write(response.body)
        # Python 2: response.body is a str, so a str pattern matches it
        reg = r'"content": "(.*?)"'
        duanzi_list = re.compile(reg).findall(response.body)
        duanzis = []
        for duanzi in duanzi_list:
            new_duanzi = items.NeihanItem()
            new_duanzi['content'] = duanzi
            duanzis.append(new_duanzi)
            yield new_duanzi
        # Running python2 -m scrapy crawl neihan -o neihan.csv on the command line
        # saves the items to a local neihan.csv file
        #return duanzis
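
As an alternative to the regular expression (a swapped-in technique, not what the article's spider does), the response could be parsed with the standard json module. The sketch below is Python 3 and makes no claim about the site's real layout: `sample_body` is invented, and the hypothetical helper `collect_contents` simply walks whatever structure it is given and gathers every string stored under a "content" key.

```python
import json


def collect_contents(node):
    """Recursively gather every string stored under a "content" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "content" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_contents(value))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_contents(child))
    return found


# Hypothetical response body; the nesting is an assumption for the demo.
sample_body = (
    '{"data": {"data": ['
    '{"group": {"content": "\\u5185\\u6db5"}}, '
    '{"group": {"content": "a \\"quoted\\" joke"}}'
    ']}}'
)

jokes = collect_contents(json.loads(sample_body))
print(jokes)
```

json.loads resolves the \u escapes itself, so no unicode_escape step is needed, and a joke containing an escaped double quote is kept whole, whereas the pattern "(.*?)" would stop at the first quote character.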

III. Save the data to the database: the pipelines.py file (the parts shown in red in the original post were added by hand):
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class DuanziPipeline(object):
    def process_item(self, item, spider):
        return item
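from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
import pymysql
pymysql.install_as_MySQLdb()

class NeihanPipeline(object):
    '''
    Pipeline class used to store the scraped data
    '''
    def __init__(self):
        '''
        Initializer, typically used to open files or connect to the database
        '''
        self.engine = create_engine('mysql://root:0@localhost/python_spider?charset=utf8')
        Session = sessionmaker(bind=self.engine)
        self.session = Session()

    def close_spider(self, spider):
        '''
        Called automatically when the spider closes; typically used to release resources, such as closing the database connection
        :param spider:
        :return:
        '''
        self.session.close()

    def process_item(self, item, spider):
        '''
        Core storage function, called automatically for each item the spider passes along
        :param item:
        :param spider:
        :return: the item, so any later pipeline still receives it
        '''
        print("Saving data")
        # The content is a string of \u escapes; .decode('unicode_escape') turns it into Chinese text (Python 2)
        content = item['content'].decode('unicode_escape')
        # Bind the value as a parameter so quotes inside a joke cannot break the statement
        sql = text("insert into neihan(id,content) values(null,:content)")
        self.session.execute(sql, {'content': content})
        self.session.commit()
        return item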
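
Jokes often contain quote characters, so passing the content as a bound parameter is safer than formatting it into the SQL string. The sketch below illustrates the idea in Python 3 under loud assumptions: an in-memory SQLite database stands in for the MySQL database, the table layout is guessed from the insert statement, and `save_joke` is a hypothetical helper, not part of the article's pipeline.

```python
import sqlite3


def save_joke(conn, content):
    """Insert one joke using a bound parameter instead of string formatting."""
    conn.execute("INSERT INTO neihan (content) VALUES (?)", (content,))
    conn.commit()


# In-memory SQLite stands in for the MySQL database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE neihan (id INTEGER PRIMARY KEY, content TEXT)")

# A joke containing single quotes would break "'%s'" % content formatting.
save_joke(conn, "He said: 'it's fine'")
save_joke(conn, "\u5185\u6db5\u6bb5\u5b50")

rows = [row[0] for row in conn.execute("SELECT content FROM neihan ORDER BY id")]
print(rows)
```

The driver escapes the value itself, so the statement text never changes with the data.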
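After defining the custom NeihanPipeline class, you must register it in settings.py. Only then is each item yielded by the spider passed to that class's core function, process_item.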
.....
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'duanzi.pipelines.DuanziPipeline': 300,
    'duanzi.pipelines.NeihanPipeline': 500,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
......

Command to run the program: python2 -m scrapy crawl neihan. This saves the data to the specified database as Chinese text.
