Scrapy study notes

Latest recommended article published on 2022-04-10 00:28:45

京金


Category: python爬虫 (Python crawlers). Tags: 爬虫 (crawlers)


1. Creating a new Scrapy project:

scrapy startproject <project_name>
For example: scrapy startproject itcast

2. The items.py file

Scrapy's Item objects store the scraped data; an Item acts as a container for whatever the spider collects.

[root@VM_131_54_centos itcast]# cat items.py   # definitions
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # age = scrapy.Field()
    # email = scrapy.Field()  
    pass
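As you can see, defining structured data only requires instantiating Scrapy's Field class once per field.
We can try items out in a Python shell to get a better feel for how they behave.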

3. Writing (creating) the spider file

scrapy genspider <spider_name> <domain>
scrapy genspider testitcast itcast.cn

This automatically creates the following file:

[root@VM_131_54_centos spiders]# cat testitcast.py 
# -*- coding: utf-8 -*-
import scrapy


class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        pass

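Now start writing the spider class, TestitcastSpider. The class name can be changed, but the names of the other attributes and methods in the class must stay as they are.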


[root@VM_131_54_centos spiders]# vi testitcast.py
[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy


class TestitcastSpider(scrapy.Spider):
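    name = 'testitcast'  # spider name
    allowed_domains = ['itcast.cn']  # allowed domain
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # start URL; defaults to the domain, changed here

    def parse(self, response):
        # response.body is bytes, so the file is opened in binary mode
        with open("getteacher.html", 'wb') as f:
            f.write(response.body)  # the content of the response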

4. Running the spider:

scrapy crawl <spider_name>
scrapy crawl testitcast

5. extract() and XPath in Scrapy  # a JSON formatting site, for reference: http://www.json.cn/

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy
class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # spider name
    allowed_domains = ['itcast.cn']  # allowed domain
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # start URL; defaults to the domain, changed here

    def parse(self, response):
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's built-in XPath support
        for each in teacher_list:
            # name
            # xpath() always returns a list; here it is a list holding a single text element
            name = each.xpath("./h3/text()").extract()

            # title
            # (each.xpath("./h4") would instead return the list of all matching <h4> elements)
            title = each.xpath("./h4/text()").extract()

            # info
            # extract() converts the matched selector objects into unicode strings
            info = each.xpath("./p/text()").extract()

            # print(name[0])
            # print(title[0])
            # print(info[0])

6. Using the items.py file

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy
from itcast.items import ItcastItem  # import the item class defined in items.py

class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # spider name
    allowed_domains = ['itcast.cn']  # allowed domain
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # start URL; defaults to the domain, changed here

    def parse(self, response):
        dataset = []
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's built-in XPath support
        for each in teacher_list:
            # Instantiate the imported class: one new item per teacher.
            # name, title and info must be declared as scrapy.Field() in items.py.
            item = ItcastItem()

            # name: xpath() always returns a list; here it holds a single text element
            name = each.xpath("./h3/text()").extract()

            # title: each.xpath("./h4") would instead return the list of all matching <h4> elements
            title = each.xpath("./h4/text()").extract()

            # info: extract() converts the matched selector objects into unicode strings
            info = each.xpath("./p/text()").extract()

            # print(name[0])
            # print(title[0])
            # print(info[0])

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            dataset.append(item)

        return dataset
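One detail matters when items are collected in a list like this: the list stores references, so a single item object created before the loop and appended repeatedly ends up as the same object in every slot. The sketch below shows the effect with plain dicts standing in for ItcastItem (an assumption made to keep it runnable without Scrapy); the names are placeholders.

```python
names = ["Teacher A", "Teacher B"]   # placeholder data

# Problem pattern: one object created before the loop, appended each time.
shared = {}
buggy = []
for n in names:
    shared["name"] = n
    buggy.append(shared)    # appends a reference to the same dict

# Safer pattern: a fresh object per iteration.
fixed = []
for n in names:
    record = {}
    record["name"] = n
    fixed.append(record)

print(buggy)   # every entry shows the last value written
print(fixed)   # one distinct record per name
```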

Exporting to a JSON file:
scrapy crawl testitcast -o it.json  # other formats, such as CSV, can be exported too

7. Writing Scrapy spiders in PyCharm.

8. Using the pipeline file

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy
from itcast.items import ItcastItem  # import the item class defined in items.py

class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # spider name
    allowed_domains = ['itcast.cn']  # allowed domain
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # start URL; defaults to the domain, changed here

    def parse(self, response):
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's built-in XPath support
        for each in teacher_list:
            # Instantiate the imported class: one new item per teacher.
            item = ItcastItem()

            # name: xpath() always returns a list; here it holds a single text element
            name = each.xpath("./h3/text()").extract()

            # title: each.xpath("./h4") would instead return the list of all matching <h4> elements
            title = each.xpath("./h4/text()").extract()

            # info: extract() converts the matched selector objects into unicode strings
            info = each.xpath("./p/text()").extract()

            # print(name[0])
            # print(title[0])
            # print(info[0])

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            yield item  # yield makes parse a generator; each item is handed to pipelines.py for processing

Edit the following in settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'itcast.pipelines.ItcastPipeline': 300, 
    # <project name>.pipelines.<the ItcastPipeline class in pipelines.py>
}
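The number attached to each pipeline decides the order in which items pass through them: lower values run first, and values are conventionally chosen from the 0 to 1000 range. The sketch below only illustrates that ordering rule with an ordinary sort; it is not Scrapy's own code, and DedupPipeline is a hypothetical second pipeline added for the example.

```python
ITEM_PIPELINES = {
    "itcast.pipelines.ItcastPipeline": 300,
    "itcast.pipelines.DedupPipeline": 100,   # hypothetical second pipeline
}

# Items visit the pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```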

Edit pipelines.py:

[root@VM_131_54_centos itcast]# cat pipelines.py

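# encoding: utf-8
import json

# The class name must match the one registered in ITEM_PIPELINES.
class ItcastPipeline(object):
    # Optional: open the output file once, when the pipeline is created.
    def __init__(self):
        self.filename = open('te.json', 'w', encoding='utf-8')

    # Required: this method processes every item.
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext)  # the file is opened as UTF-8 text, so no manual encode is needed
        return item

    # Optional: called automatically when the spider finishes.
    def close_spider(self, spider):
        self.filename.close()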
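To see what such a pipeline does without running a crawl, its hooks can be called by hand in the order Scrapy would call them. The sketch below is an assumption-laden variant, not the author's file: JsonLinesPipeline is a hypothetical name, it opens the file in open_spider instead of __init__, it writes to a temporary directory, and plain dicts stand in for items.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="te.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; must return the item for later pipelines.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Drive the hooks by hand; spider=None is enough because they ignore it.
out_path = os.path.join(tempfile.mkdtemp(), "te_demo.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "老师A", "title": "讲师"}, spider=None)
pipeline.process_item({"name": "老师B", "title": "教授"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(lines)
```

With ensure_ascii=False the Chinese text is written as readable characters, which is why the file is opened with an explicit UTF-8 encoding.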

9. Common tool commands:

Logging:

scrapy startproject -h    # show help
scrapy startproject hexian --logfile="./spiderlog.txt"
scrapy startproject --loglevel=DEBUG myfirstproject
    Common log levels:
        CRITICAL
        ERROR
        WARNING
        INFO
        DEBUG
    The --nolog option suppresses log output instead:
    scrapy startproject --nolog mysecond
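These level names are the ones defined by Python's standard logging module, and a threshold hides every message below it. The sketch below shows that filtering with the standard library alone, as a rough analogue of passing --loglevel=INFO; the logger name is made up for the example.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("loglevel_demo")   # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False          # keep the demo output out of the root logger
logger.setLevel(logging.INFO)     # comparable to --loglevel=INFO

logger.debug("debug message")     # below the threshold, dropped
logger.info("info message")
logger.warning("warning message")
logger.critical("critical message")

shown = stream.getvalue().splitlines()
print(shown)
```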

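Global commands (the commands listed by scrapy -h):

(1) The fetch command mainly shows the process of fetching a page.
    For example: scrapy fetch http://www.baidu.com
    If this command is run outside a Scrapy project directory, Scrapy's default spider fetches the page.
    Likewise, scrapy fetch -h lists all the options available for fetch.
(2) The runspider command:
    runspider runs a single spider file directly, without relying on a Scrapy project.
    For example, suppose we have written the following Scrapy spider file in our own project directory: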
    $cat first.py 
    from scrapy.spiders import Spider

    class FirstSpider(Spider):
        name = "first"
        allowed_domains = ["baidu.com"]
        start_urls = ["http://www.baidu.com",]
        def parse(self,response):
            pass

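    We can now run this spider file with scrapy runspider, setting the log level to INFO:
    scrapy runspider --loglevel=INFO first.py
(3) The settings command:
The settings command shows Scrapy's configuration values.
Inside a project it reads the values from that project's settings.py.
scrapy settings --get BOT_NAME

(4) The shell command:
The shell command starts Scrapy's interactive console (scrapy shell).
(5) The startproject command (commonly used)
(6) The version command:
The version command prints Scrapy's version information.
For example, to check the Scrapy version, run:
scrapy version
Adding the -v option shows more information:
scrapy version -v
(7) The view command

Project commands: (Global commands can be used both outside and inside a Scrapy project directory, whereas project commands can generally only be used inside a Scrapy project directory.)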

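(1) The bench command:
bench tests the performance of the local hardware and reports roughly how many pages can be crawled per minute. Real-world results will deviate from this figure.
(2) The genspider command:
genspider creates Scrapy spider files; it is a quick way to create a spider file.
It generates a new spider file from one of the existing spider templates, which is very convenient. In the Scrapy version used for these notes it must be run inside a Scrapy project directory; newer releases also accept it outside one.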
scrapy  genspider -l    
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
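(3) The check command:
Testing spiders is cumbersome, so Scrapy tests them by means of contracts.
The scrapy check command runs contract checks on a spider.
For example, to check the spider file weisuen.py that was created from a template, use
"scrapy check <spider name>". Note that what follows "check" is the spider name, not the file name,
so there is no file extension.
scrapy check weisuen
(4) The crawl command (commonly used)

The crawl command starts a spider; the format is "scrapy crawl <spider name>".
For example, to start the weisuen spider in the myspider project:
    scrapy crawl weisuen
(5) The list command:
The list command shows the spiders currently available.
For example, enter the project directory on the command line and run:
scrapy list    # lists the available spiders
(6) The edit command:
Any spider shown by scrapy list can be opened for editing directly with edit.
scrapy edit first
(7) The parse command: (important)
The parse command fetches a given URL and processes and analyzes it with the corresponding spider.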

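10. The Spider class and spider files:

Spider is the base class for spiders in Scrapy; every spider must inherit from it. In a crawler project the spider file is an extremely important part: the crawling actions and the data extraction are all defined and written in this file.

$ pwd
/home/jokerzhang/python/scrapylearn/gupiao/gupiao/spiders
$ cat first.py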

#from scrapy.spiders import Spider
import scrapy
class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://www.baidu.com",]
    def parse(self,response):
        pass

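Here the name attribute is "first". name is the spider's name, so this spider is called first, not weisuen (a spider created with the genspider command gets its name set this way).
allowed_domains lists the domains the spider is allowed to crawl. If OffsiteMiddleware is enabled, URLs outside these domains are filtered out automatically and are not followed.
start_urls holds the URLs the crawl starts from: unless other URLs are specified, crawling begins with the pages listed in this attribute. Several start URLs can be defined, separated by commas.
There is also a parse method. Unless another callback is specified, it is the default method for handling the responses to the pages the spider fetches; it processes the response and returns the extracted data, and it is also responsible for following links. Besides these generated attributes and methods, Scrapy's Spider has some other commonly used attributes and methods:

start_requests(): by default reads the URLs defined in the start_urls attribute,
                creates a Request object for each URL, and returns an iterable.
make_requests_from_url(url): called by start_requests(); it builds the Request object.
closed(reason): called when the spider is closed.
log(message[, level, component]): adds log messages from within the spider.
__init__(): initializes the spider.

If you do not want to use the start_urls attribute name, you can override the start_requests method: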

# encoding: utf8
import scrapy
from myfirstpjt.items import MyfirstpjtItem

class Gupiao(scrapy.Spider):
    name = "gupiao"
    start_urls = (
        "http://www.baidu.com",
        "http://www.csdn.net",
        "http://www.oschina.net",
    )
    urls2 = (
        "http://www.jd.com",
        "http://sina.com.cn",
        "http://yum.iqianyue.com",
    )

    def start_requests(self):
        # Read the start URLs from the new urls2 attribute instead of start_urls.
        for url in self.urls2:
            yield self.make_requests_from_url(url)

    def parse(self, response):
        item = MyfirstpjtItem()
        # text() selects the title text; extract() turns the match into a list of strings
        item["urlname"] = response.xpath("/html/head/title/text()").extract()
        print(item["urlname"])
