Python Web Scraping Practice (Part 2)

Mr_Stutter

Category: Python Web Crawlers. Tags: python, web crawler

Article link: https://blog.csdn.net/qq_53715621/article/details/113806905

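This post shows how to crawl a list of Pokémon with the Scrapy framework: configuring items.py and settings.py, then writing pipelines.py and the spider. The pipeline processes the data and writes it to a file; the spider issues the requests and parses the responses. Finally, the crawled results are sorted. Scrapy sends requests concurrently, so it is fast and well suited to crawling a large number of pages.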

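Preface

This exercise uses the Scrapy framework to crawl a list of Pokémon, using XPath, regular expressions (re), and CSS selectors to find each Pokémon's number, name, type, and category.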

I. Modify items.py

Add the fields.

import scrapy

class DictionaryItem(scrapy.Item):
    # One Field per value collected for each Pokémon
    num = scrapy.Field()   # number
    name = scrapy.Field()  # name
    atr = scrapy.Field()   # type(s)
    cla = scrapy.Field()   # category

II. Modify settings.py

ITEM_PIPELINES = {
    'dictionary.pipelines.DictionaryPipeline': 300,  # enable the pipeline
}
HTTPERROR_ALLOWED_CODES = [404]  # pass 404 responses to the callbacks instead of filtering them out

III. Write pipelines.py

Process the crawled data.

1. Import libraries

from itemadapter import ItemAdapter
import time

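2. Initialization functions

class DictionaryPipeline:
    def __init__(self):
        self.count = 0  # number of items written so far
        path = 'C:/Users/lenovo/Desktop/py/其他/pokemon-s.txt'
        self.f = open(path, 'w')
        self.time = time.perf_counter()  # start of the timer
    def open_spider(self, spider):
        # Header template; argument 4 is the fill character
        pa = '{0:{4:}<2}{1:{4:}^10}{2:{4:}^9}{3:{4:}^10}'
        # chr(12288) is the full-width space, which keeps CJK columns aligned
        # Header labels: number, name, type, category
        self.f.write(pa.format('序号', '名称', '属性', '分类', chr(12288)))
    def close_spider(self, spider):
        self.f.close()
        print('Finished, {} items written'.format(self.count))
        print('Elapsed {:.2f}s'.format(time.perf_counter() - self.time))
        time.sleep(10)  # pause before the process exits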

3. Item-processing function

    def process_item(self, item, spider):
        try:
            pa = '{0:{4:}>3}{1:{4:}^10}{2:{4:}^10}{3:{4:}^10}'  # row template
            self.f.write('\n' + pa.format(item['num'], item['name'], item['atr'], item['cla'], chr(12288)))
            self.count += 1
            print('\r{} items written'.format(self.count), end=' ')
            print('elapsed {:.2f}s'.format(time.perf_counter() - self.time), end='     ')
        except Exception:
            print('\rWrite failed', end=' ')
            print('elapsed {:.2f}s'.format(time.perf_counter() - self.time), end='     ')
        return item
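
The row template relies on a nested format field to supply the fill character. The sketch below isolates that one technique outside Scrapy; the helper name `format_row` and the sample values are hypothetical and only serve as illustration.

```python
# Standalone sketch of the row template from process_item (no Scrapy needed).
FILL = chr(12288)  # U+3000 IDEOGRAPHIC SPACE, a full-width space

# Argument 4 is substituted into each format spec as the fill character.
ROW = '{0:{4:}>3}{1:{4:}^10}{2:{4:}^10}{3:{4:}^10}'

def format_row(num, name, atr, cla):
    """Pad each field with full-width spaces so CJK columns line up."""
    return ROW.format(num, name, atr, cla, FILL)

# Hypothetical sample values, for illustration only.
row = format_row('#001', '妙蛙种子', '草,毒', '种子宝可梦')
print(repr(row))
```

Padding with an ordinary ASCII space would make the columns drift, because a CJK character is rendered about twice as wide as an ASCII space in most monospaced fonts.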

IV. Write spiders.py

Send the requests and locate the data.

1. Import libraries

import scrapy
import re
from dictionary.items import DictionaryItem

2. Start URL

class DictSpider(scrapy.Spider):
    name = 'dict'
    start_urls = ['https://wiki.52poke.com/wiki/宝可梦列表%EF%BC%88按全国图鉴编号%EF%BC%89/简单版']

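3. Submit the list of URLs

    def parse(self, response):
        start, end = '妙蛙种子', '时拉比'  # first and last Pokémon to crawl
        flag = 0   # set once the first Pokémon has been seen
        flags = 0  # set once the last Pokémon has been submitted
        for title in response.css('a::attr("title")').extract():
            try:
                if title == start:
                    flag = 1
                # Keep titles that begin with a Chinese character
                # (string comparison is lexicographic)
                if title < '\u4e00' or title > '\u9fff':
                    continue
                if re.match(r'第.世代', title):  # skip generation headings, which are not Pokémon
                    continue
                if flags:
                    break
                if flag:
                    if title == end:
                        flags = 1
                    url = 'https://wiki.52poke.com/wiki/' + title
                    yield scrapy.Request(url=url, callback=self.parse_pet)  # parse_pet handles the response
            except Exception:
                print(title)  # print the title that could not be submitted
                continue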
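
The selection logic in parse can be tried without a network connection. The following is a simplified restatement over a plain list of titles, not the spider itself; the function name `select_titles` and the sample list are hypothetical.

```python
import re

def select_titles(titles, start, end):
    """Return the titles from `start` through `end` (inclusive), skipping
    titles that do not begin with a Chinese character and generation
    headings such as '第一世代'."""
    selected = []
    started = False
    for title in titles:
        if title == start:
            started = True
        # Keep only titles whose first character is a CJK unified ideograph.
        if not title or not ('\u4e00' <= title[0] <= '\u9fff'):
            continue
        if re.match(r'第.世代', title):
            continue
        if started:
            selected.append(title)
            if title == end:
                break
    return selected

# Hypothetical title list, for illustration only.
titles = ['首页', '第一世代', '妙蛙种子', 'Bulbasaur', '妙蛙草', '时拉比', '第三世代', '木守宫']
picked = select_titles(titles, '妙蛙种子', '时拉比')
print(picked)
```

In the spider, each selected title would be appended to the wiki base URL and submitted as a request.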

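4. Extract the content

Each field is wrapped in its own try-except, and a missing value is filled in with "--", so one failed lookup does not discard the fields that were found.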

    def parse_pet(self, response):
        item = DictionaryItem()  # create the item
        try:
            item['num'] = response.css('a[title="宝可梦列表（按全国图鉴编号）"]::text').extract()[0]
        except Exception:
            item['num'] = '--'
        try:
            item['name'] = response.css('h1[id="firstHeading"]::text').extract()[0]
        except Exception:
            item['name'] = '--'
        try:
            # Type icon file names; the type is the second word of each match
            atrs = response.xpath('//span/a').re(r'LPLE .+? icon\.png')
            if len(atrs) == 1 or atrs[0] == atrs[1]:
                item['atr'] = atrs[0].split()[1]
            else:
                item['atr'] = atrs[0].split()[1] + ',' + atrs[1].split()[1]
        except Exception:
            item['atr'] = '--'
        try:
            item['cla'] = response.css('td[class^="roundy b"][class*="bgwhite bw"]::text').extract()[0].strip('\n')
        except Exception:
            item['cla'] = '--'
        yield item  # hand the item to the pipeline
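
The type lookup combines a regular expression with a small de-duplication rule. Here is a minimal sketch of that rule on plain strings, assuming the page markup contains icon file names of the form "LPLE <type> icon.png"; the function name `extract_types` and the sample markup are hypothetical, and the type names on the real page may differ.

```python
import re

# The capture group holds the type name between "LPLE " and " icon.png".
TYPE_ICON = re.compile(r'LPLE (.+?) icon\.png')

def extract_types(html):
    """Return one type, two comma-joined types, or '--' if none is found."""
    names = TYPE_ICON.findall(html)
    if not names:
        return '--'
    first_two = names[:2]
    # A page may list the same type twice; keep it only once.
    if len(first_two) == 2 and first_two[0] == first_two[1]:
        return first_two[0]
    return ','.join(first_two)

# Hypothetical markup, for illustration only.
sample = '<a><img src="LPLE Grass icon.png"></a><a><img src="LPLE Poison icon.png"></a>'
print(extract_types(sample))
```

Using a capture group avoids the later `split()[1]` step, at the cost of differing slightly from the code above.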

V. Write the run script

from scrapy import cmdline
# cmdline.execute("scrapy crawl dict".split())
cmdline.execute("scrapy crawl dict -s LOG_ENABLED=False".split())  # hide the log output

VI. Sorting

src = 'C:/Users/lenovo/Desktop/py/其他/pokemon-s.txt'
dst = 'C:/Users/lenovo/Desktop/py/其他/pokemon-s-a.txt'
with open(src, 'r') as f:
    lines = f.read().split('\n')  # split the text into a list of rows
header = lines.pop(0)  # take out the header row
lines.sort(key=lambda x: x[1:4])  # ascending, keyed on characters 1 to 3 of each row
lines.insert(0, header)  # put the header back
with open(dst, 'w') as fa:
    for row in lines:
        fa.write(row + '\n')  # write row by row
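
The slice-based key depends on every row carrying its number at the same character positions. A hedged alternative is to sort on the first run of digits in each row, which also tolerates rows whose number is the "--" placeholder; the names `number_key` and `sort_table` and the sample rows below are hypothetical.

```python
import re

def number_key(row):
    """Sort key: the first run of digits in the row, as an integer.
    Rows without any digits (such as a '--' placeholder) sort last."""
    match = re.search(r'\d+', row)
    return (0, int(match.group())) if match else (1, 0)

def sort_table(text):
    """Keep the first line as the header and sort the remaining rows."""
    header, *rows = text.split('\n')
    rows = [row for row in rows if row.strip()]  # drop blank lines
    return '\n'.join([header] + sorted(rows, key=number_key))

# Hypothetical table, for illustration only.
table = '序号 名称\n#025 皮卡丘\n#001 妙蛙种子\n-- 未知\n#004 小火龙'
print(sort_table(table))
```

Comparing integers rather than text also keeps the order correct if the numbers ever differ in length.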

Summary

[Figure: crawl results]

[Figure: sorted results]

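Scrapy is well suited to crawling many pages: it issues requests concurrently (asynchronously, rather than with one thread per request), so it is fast.
Scrapy returns results in no particular order; add a number field to each item and sort the results afterwards.
In Scrapy, the spider submits requests and extracts the data in callback functions, assigns the data to an item, and yields the item to the pipelines, where the data is processed.
Scrapy is harder to pick up and debug than requests, but the code is more clearly organized and runs more efficiently.
Compared with Python Web Scraping Practice (Part 1), Scrapy was more than ten times as fast as the single-threaded requests version, and the IP was never blocked. XPath and CSS selectors were also more convenient and more accurate than searching with the BeautifulSoup library.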
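
The spider-to-pipeline flow summarized above can be mimicked in plain Python. This is only a toy model of the data flow: Scrapy's real engine schedules requests asynchronously, and every name here (`spider`, `CollectPipeline`, `run`) and every sample record is hypothetical.

```python
def spider():
    """Stand-in for the spider callbacks: yields one dict per page."""
    # Hypothetical records; a real spider would parse them from responses.
    yield {'num': '#001', 'name': '妙蛙种子'}
    yield {'num': '#002', 'name': '妙蛙草'}

class CollectPipeline:
    """Stand-in pipeline with the same three hooks used in this post."""
    def open_spider(self, spider):
        self.rows = []
    def process_item(self, item, spider):
        self.rows.append('{num} {name}'.format(**item))
        return item  # returning the item passes it on to any later pipeline
    def close_spider(self, spider):
        self.count = len(self.rows)

def run(spider_func, pipeline):
    """Drive the hooks in the order open, process each item, close."""
    pipeline.open_spider(spider_func)
    for item in spider_func():
        pipeline.process_item(item, spider_func)
    pipeline.close_spider(spider_func)
    return pipeline

pipeline = run(spider, CollectPipeline())
print(pipeline.count, pipeline.rows)
```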