Learning the Scrapy Crawler Framework

莫空0000

Article link: https://blog.csdn.net/weixin_42462552/article/details/103597911

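I. Setting up the environment:

1. Install the packages:

(1) Twisted

pip normally installs Twisted automatically as a dependency of Scrapy, but in my experience that automatic install can be incomplete, so it is safer to install Twisted yourself first.

Download: pick the Twisted wheel (.whl) that matches your Python version and operating system.
Install: open a command prompt, change to the directory that contains the downloaded wheel, and run the following command, where FILENAME is the name of the downloaded file:
pip install FILENAME.whl

(2) Scrapy

Run cmd as administrator and execute:
pip install scrapy
If the output contains "Successfully installed", the installation succeeded.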
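
As a quick sanity check after both installs, the sketch below asks Python whether the two packages can be found in the current environment. It uses only the standard library; the helper name check_installed is made up for this illustration.

```python
from importlib.util import find_spec


def check_installed(packages=("twisted", "scrapy")):
    """Return {package_name: True/False} depending on whether it can be imported."""
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'installed' if ok else 'NOT installed'}")
```

If either line reports NOT installed, repeat the corresponding pip command in the same Python environment that runs this script.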

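II. Creating a project

scrapy startproject PROJECT_NAME

For example:

scrapy startproject SearchSpider

Then change into the project's spiders directory,
e.g. SearchSpider\SearchSpider\spiders,
and run:

scrapy genspider SPIDER_NAME "DOMAIN_TO_CRAWL"

For example:

scrapy genspider search "baidu.com"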

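III. Hands-on example

Let's scrape the new-share (IPO) subscription data from NetEase Finance: http://quotes.money.163.com/data/ipo/shengou.html

1. Create the project:

scrapy startproject ShengouSpider

2. Create the spider

Change into the ShengouSpider/ShengouSpider/spiders folder (ShengouSpider is the project name) and create the spider:

scrapy genspider shengou quotes.money.163.com

scrapy genspider is the command that creates a spider, shengou is the spider's name, and the last argument is the domain to crawl. Pass only the domain here, not the full page URL: the value is written into allowed_domains, which expects bare domain names. The page URL itself is set in start_urls when we edit the spider file.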
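
If you only have the page URL at hand, the domain part can be pulled out with the standard library. A minimal sketch; the helper name domain_of is hypothetical:

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name of a URL, e.g. for use in allowed_domains."""
    return urlsplit(url).hostname


print(domain_of("http://quotes.money.163.com/data/ipo/shengou.html"))
# prints: quotes.money.163.com
```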

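3. Open the project

Open the project in an editor (I use PyCharm here) to see its structure: the project folder contains scrapy.cfg and a ShengouSpider package with, among other files, items.py, pipelines.py, settings.py and the spiders folder.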

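4. Define the fields

Start with items.py, which is where the fields to be saved are defined.

The fields follow the columns of the table on the page:

import scrapy


class ShengouspiderItem(scrapy.Item):
    xh = scrapy.Field()  # row number
    sgdm = scrapy.Field()  # subscription code
    zqdm = scrapy.Field()  # security code
    name = scrapy.Field()  # security short name
    wsfxr = scrapy.Field()  # online issue date
    ssr = scrapy.Field()  # listing date
    fxl = scrapy.Field()  # total issue size
    wsfxl = scrapy.Field()  # online issue size
    sgsx = scrapy.Field()  # subscription cap
    fxj = scrapy.Field()  # issue price
    syl = scrapy.Field()  # price-to-earnings ratio
    zql = scrapy.Field()  # allotment (lottery win) rate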

5. Write the spider

With the fields settled, write the spider file shengou.py:

# -*- coding: utf-8 -*-
import scrapy

from ShengouSpider.items import ShengouspiderItem


class ShengouSpider(scrapy.Spider):
    name = 'shengou'  # spider name, used by "scrapy crawl"
    allowed_domains = ['quotes.money.163.com']  # domains only, no path
    start_urls = ['http://quotes.money.163.com/data/ipo/shengou.html']  # page to crawl

    def parse(self, response):
        # Each tr under the fn_rp_list table is one row of the table on the page.
        for table_primary in response.xpath('//div[@class="fn_rp_list"]/table/tr'):
            item = ShengouspiderItem()
            # extract() returns a list of strings for every field.
            item['xh'] = table_primary.xpath('./td[1]/text()').extract()
            item['sgdm'] = table_primary.xpath('./td[2]/text()').extract()
            item['zqdm'] = table_primary.xpath('./td[3]/text()').extract()
            item['name'] = table_primary.xpath('./td[4]/a/text()').extract()
            item['wsfxr'] = table_primary.xpath('./td[5]/text()').extract()
            item['ssr'] = table_primary.xpath('./td[6]/text()').extract()
            item['fxl'] = table_primary.xpath('./td[7]/text()').extract()
            item['wsfxl'] = table_primary.xpath('./td[8]/text()').extract()
            item['sgsx'] = table_primary.xpath('./td[9]/text()').extract()
            item['fxj'] = table_primary.xpath('./td[10]/text()').extract()
            item['syl'] = table_primary.xpath('./td[11]/text()').extract()
            item['zql'] = table_primary.xpath('./td[12]/text()').extract()
            yield item

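response is the response returned for the requested URL, i.e. the whole page. The for loop selects every row of the table and iterates over them, extracting the data of one row per iteration.
Looking at the page source, the data can be located with XPath. The expression below finds every div whose class is "fn_rp_list", then the table under that div, then the tr elements under that table; each tr is one row of the table shown on the page.
(An XPath syntax tutorial is worth reading at this point.)

for table_primary in response.xpath('//div[@class="fn_rp_list"]/table/tr'):

Next, the row's data is collected. item = ShengouspiderItem() creates an instance of the class defined in items.py; to use it, first import ShengouspiderItem from the project's items module:
from ShengouSpider.items import ShengouspiderItem

Now for locating a value within a row, taking the subscription code as an example.
The subscription code is in the second td under the tr, hence ./td[2]/text(): the dot stands for the current element (here the tr), td[2] selects its second td, and text() returns the text content.

item['sgdm'] = table_primary.xpath('./td[2]/text()').extract()

Once one field is clear, the others follow the same pattern.
Finally, yield hands the item back to Scrapy.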
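
To try the row-and-cell idea without running a crawl, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath (there is no text() step, so the .text attribute is read instead), and the markup below is invented sample data, not the real page. The helper name parse_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mimics the structure described above.
SAMPLE = """
<div class="fn_rp_list">
  <table>
    <tr><td>1</td><td>700001</td><td>600001</td><td><a href="#">Alpha</a></td></tr>
    <tr><td>2</td><td>700002</td><td>600002</td><td><a href="#">Beta</a></td></tr>
  </table>
</div>
"""


def parse_rows(markup):
    """Return (row number, subscription code, name) for every table row."""
    root = ET.fromstring(f"<body>{markup}</body>")
    rows = []
    # Same path as in the spider: div with the class, then table, then tr.
    for tr in root.findall(".//div[@class='fn_rp_list']/table/tr"):
        xh = tr.find("td[1]").text      # first cell
        sgdm = tr.find("td[2]").text    # second cell
        name = tr.find("td[4]/a").text  # link text inside the fourth cell
        rows.append((xh, sgdm, name))
    return rows


for row in parse_rows(SAMPLE):
    print(row)
```

Positions in XPath start at 1, which is why the second cell is td[2] rather than td[1].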

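6. Process the data

With the elements located, let's look at the result. Open pipelines.py, which is where the returned items are processed.
For now we simply print each item:

class ShengouspiderPipeline:
    def process_item(self, item, spider):
        print(item['xh'], '  ', item['sgdm'], '  ', item['zqdm'], '  ', item['name'], '  ', item['wsfxr'], '  ', item['ssr'],
              '  ', item['fxl'], '  ', item['wsfxl'], '  ', item['sgsx'], '  ', item['fxj'], '  ', item['syl'], '  ', item['zql'])
        return item  # hand the item on to any later pipeline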
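
Because extract() returns a list for every field, each printed value is a list rather than a plain string. A pipeline is a plain Python class, so a small cleaning step can be sketched and tried without Scrapy at all. The class name FirstValuePipeline is hypothetical and not part of the original project; it would also need its own entry in ITEM_PIPELINES to take effect.

```python
class FirstValuePipeline:
    """Replace each list-valued field with its first element, stripped."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, list):
                # An empty list (nothing matched) becomes None.
                item[key] = value[0].strip() if value else None
        return item  # always return the item so later pipelines receive it


# Plain dicts are valid Scrapy items, which makes the class easy to try out.
sample = {"xh": ["1"], "name": [" Alpha "], "ssr": []}
print(FirstValuePipeline().process_item(sample, spider=None))
# prints: {'xh': '1', 'name': 'Alpha', 'ssr': None}
```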

7. Change the settings

Don't run the program just yet: a few things still need to be set in settings.py. Some of these settings are already present in the generated file and only need to be uncommented.

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
}
ITEM_PIPELINES = {
    'ShengouSpider.pipelines.ShengouspiderPipeline': 300,
}
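
The number attached to each pipeline in ITEM_PIPELINES decides the order in which pipelines run: items pass through them from the lowest value to the highest, and values are conventionally chosen between 0 and 1000. The sketch below only illustrates that ordering in plain Python; the second pipeline path is invented and does not exist in this project.

```python
# Hypothetical settings: SomeEarlierPipeline is invented for illustration.
ITEM_PIPELINES = {
    'ShengouSpider.pipelines.ShengouspiderPipeline': 300,
    'ShengouSpider.pipelines.SomeEarlierPipeline': 100,
}

# Items pass through the pipelines in ascending order of this number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With a single pipeline, as in this project, the exact value does not matter.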

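8. Run the program

Now we can run the program and look at the result.
Open the Terminal panel at the bottom left of PyCharm and enter: scrapy crawl shengou
shengou is the name of our spider.
You can also run it from a cmd window, but you must first change to the project directory for the command to work.
The output shows that the spider has scraped the data.

There are 50 items in total, because this page holds only 50 rows. That is clearly not enough, so we need to follow the pagination and scrape every page.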

9. Pagination

Back in the spider file shengou.py, add the pagination code after the loop:

# -*- coding: utf-8 -*-
import scrapy

from ShengouSpider.items import ShengouspiderItem


class ShengouSpider(scrapy.Spider):
    name = 'shengou'  # spider name, used by "scrapy crawl"
    allowed_domains = ['quotes.money.163.com']  # domains only, no path
    start_urls = ['http://quotes.money.163.com/data/ipo/shengou.html']  # page to crawl

    def parse(self, response):
        for table_primary in response.xpath('//div[@class="fn_rp_list"]/table/tr'):
            item = ShengouspiderItem()
            item['xh'] = table_primary.xpath('./td[1]/text()').extract()
            item['sgdm'] = table_primary.xpath('./td[2]/text()').extract()
            item['zqdm'] = table_primary.xpath('./td[3]/text()').extract()
            item['name'] = table_primary.xpath('./td[4]/a/text()').extract()
            item['wsfxr'] = table_primary.xpath('./td[5]/text()').extract()
            item['ssr'] = table_primary.xpath('./td[6]/text()').extract()
            item['fxl'] = table_primary.xpath('./td[7]/text()').extract()
            item['wsfxl'] = table_primary.xpath('./td[8]/text()').extract()
            item['sgsx'] = table_primary.xpath('./td[9]/text()').extract()
            item['fxj'] = table_primary.xpath('./td[10]/text()').extract()
            item['syl'] = table_primary.xpath('./td[11]/text()').extract()
            item['zql'] = table_primary.xpath('./td[12]/text()').extract()
            yield item

        # Pagination: look for the "next page" link once per page, after all rows.
        # Matching on the class attribute did not work here:
        # new_links = response.xpath('//a[@class="pages_flip"]/@href').extract()
        new_links = response.xpath('//a[contains(text(), "下一页")]/@href').extract()

        if new_links:
            # Resolve the href against the current page URL and parse the next page.
            yield scrapy.Request(response.urljoin(new_links[0]), callback=self.parse)

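The "next page" button is located with XPath and its link is read: the button is an a element and the link to the next page is in its href attribute. The expression //a[contains(text(), "下一页")]/@href returns the href of every a element whose text contains "下一页" ("next page"). Usually one would match on the class attribute, but that did not work here, so the link text is matched instead.
Once the links are extracted, an if check on whether a link exists tells us whether we have reached the last page. If there is one, it is turned into an absolute URL, a new request is created with callback set to parse so that parse is called again on the next page, and the request is yielded.
Running the program again shows that all the data has been scraped: 201 items now.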
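
How a relative href becomes an absolute URL can be checked in isolation with urllib.parse.urljoin from the standard library, which follows the same resolution rules as Scrapy's response.urljoin. The href value below is a made-up example, not taken from the real page.

```python
from urllib.parse import urljoin

page_url = "http://quotes.money.163.com/data/ipo/shengou.html"
next_href = "/data/ipo/shengou.html?page=1"  # hypothetical href of the next-page link

print(urljoin(page_url, next_href))
# prints: http://quotes.money.163.com/data/ipo/shengou.html?page=1
```

Joining against the page URL works whether the href is relative or already absolute, which plain string concatenation does not.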

10. Save the data to MySQL

At this point all the data has been scraped, but it is only shown on the console, which is clearly not enough. Next we save it to MySQL.
Go back to pipelines.py and write the data to the database:

# -*- coding: utf-8 -*-
import mysql.connector


class ShengouspiderPipeline:
    # Open the MySQL connection and create a cursor when the pipeline is created.
    def __init__(self):
        try:
            self.conn = mysql.connector.connect(
                host='localhost', user='root', database='stocks', port=3306, password='123456',
                use_unicode=True)
            self.cur = self.conn.cursor()
        except Exception as e:
            print(e)

    # close_spider is called by Scrapy when the spider finishes;
    # the database resources are released here.
    def close_spider(self, spider):
        try:
            print('--------closing database resources----------')
            # Close the cursor
            self.cur.close()
            # Close the connection
            self.conn.close()
        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        # Passing the extracted lists directly fails with
        # "Python 'list' cannot be converted to a MySQL type",
        # so the first element of each field is used instead.
        try:
            values = [item['xh'][0], item['sgdm'][0], item['zqdm'][0], item['name'][0], item['wsfxr'][0],
                      item['ssr'][0], item['fxl'][0], item['wsfxl'][0], item['sgsx'][0], item['fxj'][0],
                      item['syl'][0], item['zql'][0]]
            self.cur.execute("insert into info values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", values)
            self.conn.commit()
            print('inserting row')
        except Exception as e:
            print(e)
        return item  # hand the item on to any later pipeline

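First, mysql.connector.connect opens the database connection and a cursor is created; then execute runs the insert statement and commit commits the transaction; finally, when the spider closes, the cursor and the connection are closed.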
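
The same connect, execute, commit, close sequence can be tried without a MySQL server by using the standard library's sqlite3 module as a stand-in. This is only a sketch under assumptions: the article does not show the schema of the info table, so the three-column table here is invented, the helper name save_rows is hypothetical, and sqlite3 uses ? placeholders where mysql.connector uses %s.

```python
import sqlite3


def save_rows(rows):
    """Insert rows into an in-memory table and return how many were stored."""
    conn = sqlite3.connect(":memory:")  # stand-in for mysql.connector.connect(...)
    try:
        cur = conn.cursor()
        # Hypothetical, shortened schema; the insert in the article has 12 values.
        cur.execute("create table info (xh text, sgdm text, name text)")
        # Parameterized insert: values are passed separately from the SQL text.
        cur.executemany("insert into info values (?, ?, ?)", rows)
        conn.commit()
        cur.execute("select count(*) from info")
        return cur.fetchone()[0]
    finally:
        conn.close()  # always release the connection, even on errors


print(save_rows([("1", "700001", "Alpha"), ("2", "700002", "Beta")]))
# prints: 2
```

Keeping the values out of the SQL string, as both versions do, lets the driver handle quoting and avoids SQL injection.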

That completes the whole project.
Full code: https://github.com/morkong/Crawlers.git
Recommended reading: specifying a pipeline for each of several spiders in the Scrapy framework
