Scrapy study notes: easy to follow, though the writing may be a little rough

百度pkq

Published 2021-11-04 11:28:56


Category column: Python crawlers. Tags: crawler, python, programming language

Article link: https://blog.csdn.net/wangaolong0427/article/details/121137092


Contents

Preface
I. What is Scrapy?
II. Usage steps

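Preface

As artificial intelligence keeps developing, both big-data analysis and dataset construction rely on crawlers to collect large amounts of data. This article summarizes what I learned while using Scrapy.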

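I. What is Scrapy?

Introduction from the official documentation:
Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

My own take: after using crawler libraries such as urllib and requests, I noticed that every step, from locating the information to fetching it, can be written in many different ways, and you still have to set request headers, Referer values for anti-hotlinking checks and so on to avoid being blocked. Scrapy provides a consistent structure and workflow, which makes the whole crawling process more systematic.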

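Scrapy splits crawling into separate modules, all coordinated by the Scrapy Engine. The flow runs roughly as follows:
1. The engine takes the start URLs from the spider and hands the resulting requests to the Scheduler, which queues them.
2. The Downloader fetches every Request the engine sends it and returns the resulting Response to the engine.
3. The engine passes each Response (the page source) to the Spider, whose callback extracts the data you want.
4. From there the results go two ways: any new Request the spider yields goes back to the Scheduler and runs through the same flow again (requests that fail can also be re-queued by the built-in retry middleware), and any extracted data goes on to the Item Pipeline, whose job is to package and store it.
That is the overall flow.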
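To make the data flow concrete, here is a toy model of that loop in plain Python. It is only a sketch: `FAKE_WEB`, `download`, `parse` and `run` are made-up stand-ins for the Downloader, Spider, Scheduler and Item Pipeline, and none of it is Scrapy's real API (the real engine is asynchronous, this loop is not).

```python
from collections import deque

# Toy stand-ins for Scrapy's components; none of this is Scrapy's real API.
FAKE_WEB = {
    "http://example.test/": "index|http://example.test/a|http://example.test/b",
    "http://example.test/a": "page-a",
    "http://example.test/b": "page-b",
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response body."""
    return FAKE_WEB[url]


def parse(url, body):
    """Spider: yield follow-up URLs (requests) and dict items from a response."""
    for part in body.split("|"):
        if part.startswith("http://"):
            yield part                          # new request, goes back to the scheduler
        else:
            yield {"url": url, "title": part}   # item, goes on to the pipeline


def run(start_urls):
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # duplicate filter
    stored = []                     # Item Pipeline: here it just collects items
    while scheduler:                # Engine: keeps the loop turning
        url = scheduler.popleft()
        body = download(url)
        for result in parse(url, body):
            if isinstance(result, str):
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:
                stored.append(result)
    return stored


items = run(["http://example.test/"])
print(items)
```

The only parts a user would normally write are the start URLs, the body of `parse`, and what happens to `stored`; everything else is what the framework takes care of.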

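Given this fixed flow, what is left for us to do? If everything were hard-wired, the framework would be of little use. The customizable parts are: the URLs you supply when creating the spider, the extraction logic in the Spider, and what the Item Pipeline saves.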

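II. Usage steps

1. Install the framework

pip install scrapy

2. Create a project

# Creating a project looks much like the Django command
Django: django-admin startproject <project name>
Scrapy: scrapy startproject <project name>


# Generate the spider file automatically. Writing it by hand takes some coding experience, so beginners like us should use the command line
scrapy genspider <spider name> <domain>
For example, to create a spider for the movie site http://movie.xxx.com:
scrapy genspider movieSpider movie.xxx.com
This creates a spider file in the project's spiders folder.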

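After creating the project, remember to cd into the project folder before running genspider.
The directory structure has two levels (movies/movies), and the spider file must be created from the first-level directory.

[Screenshot: settings.py]
In settings.py, usually only one value needs to be changed.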

3. Crawl the data

Work in the spider's .py file:


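import scrapy


class MoviespiderSpider(scrapy.Spider):
    name = 'movieSpider'
    allowed_domains = ['movie.xxx.com']   # domains the spider is allowed to crawl
    start_urls = ['http://movie.xxx.com/']  # change to the exact URL to crawl; it should fall under the domain above

    def parse(self, response):  # callback that processes each downloaded response
        pass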

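For example, one site exposed a JS request path that returns JSON-like data, but the format is nasty: the keys have no double quotes, so json.loads cannot parse it directly. After searching Baidu without success, I quoted the keys by hand.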

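import json
import re

import scrapy
from bs4 import BeautifulSoup

from ..items import Info1, Info2  # item classes, assumed to live in the project's items.py


class CsSpider(scrapy.Spider):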
    name = 'cs'
    allowed_domains = ['cs.58.com']

    # start_urls = ['https://j1.58cdn.com.cn/job/pc/full/cate/0.1/jobCates.js?v=0']
    def start_requests(self):
        url = 'https://j1.58cdn.com.cn/job/pc/full/cate/0.1/jobCates.js?v=0'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        baseurl = 'https://cs.58.com/'
        da = str(response.text).replace('var ____catelist=', '').replace(';', '')

        da = da.replace("category:", '"category":')
        da = da.replace("cates:", '"cates":')
        da = da.replace("{name:", '{"name":')
        da = da.replace("dsid:", '"dsid":')
        da = da.replace("listname:", '"listname":')
        da = da.replace("comms_getcatelist:", '"comms_getcatelist":')
        da = da.replace("PID:", '"PID":')
        da = da.replace("businessType:", '"businessType":')
        da = da.replace("cateID:", '"cateID":')
        da = da.replace("cateName:", '"cateName":')
        da = da.replace("cateidList:", '"cateidList":')
        da = da.replace("conditions", '"conditions"')
        da = da.replace("depth", '"depth"')
        da = da.replace("dispBiz", '"dispBiz"')
        da = da.replace("dispCategoryGroup", '"dispCategoryGroup"')
        da = da.replace("dispCategoryID", '"dispCategoryID"')
        da = da.replace("filter", '"filter"')
        da = da.replace("fullPath", '"fullPath"')
        da = da.replace("isVisible", '"isVisible"')
        da = da.replace("listName", '"listName"')
        da = da.replace("order", '"order"')
        da = da.replace("system", '"system"')
        da = da.replace(",type", ',"type"')
        da = da.replace("catelist:", '"catelist":')
        da = da.replace("catename", '"catename"')
        da = da.replace("cateid:", '"cateid":')
        da = da.replace("pid", '"pid"')
        da = da.replace("idpaths", '"idpaths"')
        # print(da)
        data = json.loads(da)
        # pprint.pprint(data)

        cates = data[0]['cates']
        for i in cates:
            category = i['name']
            # print('category:',category)
            for j in i['comms_getcatelist']:
                item = Info1()
                position = j['catename']
                url = baseurl + j['listName'] + '/'
                item['position'] = position
                item['category'] = category
                item['url'] = url
                # print(url)
                yield item
                # Pass the context to the next callback through meta
                yield scrapy.Request(url, meta={'position': position, 'category': category, 'url': url},
                                     callback=self.MainUrl)


    def MainUrl(self, response):
        data = response.text
        position = response.meta['position']
        category = response.meta['category']
        soup = BeautifulSoup(data, 'html.parser')
        ul = soup.find('ul', id="list_con")
        lis = ul.find_all('li') if ul else []
        # Raw string so that \d reaches the regex engine unchanged
        match = re.search(r'site_name = "\d+", post_count = "(\d+)", page_type', data)
        post_count = match.group(1) if match else 0
        for i in lis:
            salary = i.find('p', class_='job_salary')
            if salary is None:  # skip list entries without a salary element
                continue
            item = Info2()
            item['position'] = position
            item['category'] = category
            item['post_count'] = post_count
            item['p'] = salary.text.strip()
            yield item
        # Follow up with the page whose number is shown in the total_page element
        total = soup.find('i', class_="total_page")
        if total is not None:
            url = response.meta['url'].replace("\n", "") + 'pn' + total.text.strip() + '/'
            yield scrapy.Request(url, meta={'position': position, 'category': category, 'url': url},
                                 callback=self.MainUrl2)

    def MainUrl2(self, response):
        data = response.text
        position = response.meta['position']
        category = response.meta['category']
        soup = BeautifulSoup(data, 'html.parser')
        ul = soup.find('ul', id="list_con")
        if ul is None:
            print("No data on the last page")
            return
        match = re.search(r'site_name = "\d+", post_count = "(\d+)", page_type', data)
        post_count = match.group(1) if match else 0
        for i in ul.find_all('li'):
            salary = i.find('p', class_='job_salary')
            if salary is None:  # skip list entries without a salary element
                continue
            # Pack the results into the item as key-value pairs and hand it back with yield.
            # yield does not end the function, so more items or requests can follow;
            # a yielded Request's callback only runs later, once its response arrives.
            item = Info2()
            item['position'] = position
            item['category'] = category
            item['post_count'] = post_count
            item['p'] = salary.text.strip()
            yield item
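The long chain of `replace` calls in `parse` quotes each key by name, so it has to be extended whenever the site adds a key. A single regular expression can quote every bare key instead. The following is a minimal sketch under assumptions: `raw` is a made-up sample in the same shape as the jobCates.js payload, not real data, and the pattern would corrupt string values that themselves contain text like `,word:` or `{word:`.

```python
import json
import re

# Hypothetical sample in the same shape as the jobCates.js payload (not real data).
raw = 'var ____catelist=[{category:"jobs",cates:[{name:"Sales",dsid:1,listName:"yewu"}]}];'

# Strip the "var name=" prefix and the trailing semicolon.
body = raw.split("=", 1)[1].rstrip().rstrip(";")

# Quote every bare identifier that directly follows "{" or "," and precedes ":".
KEY = re.compile(r'([{,]\s*)([A-Za-z_$][\w$]*)\s*:')
quoted = KEY.sub(r'\1"\2":', body)

data = json.loads(quoted)
print(data[0]["cates"][0]["listName"])
```

If the payload uses other JavaScript-only syntax (single-quoted strings, trailing commas), a regular expression is probably not enough and a dedicated parser would be the safer choice.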

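Requests are issued over and over during a crawl. When a request fails, for example because the site's anti-crawling checks kick in, the framework retries it with its own built-in mechanisms and often gets the data in the end. Because requests run concurrently and some are retried, the items do not arrive in the order the requests were made, so the output is poorly ordered.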
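The retry behaviour is controlled from settings.py. The fragment below is a sketch with illustrative values, not the author's configuration; the setting names are standard Scrapy settings. Retrying only helps with transient failures, it does not get past a CAPTCHA or login wall.

```python
# settings.py fragment (values are illustrative, not the author's configuration)

RETRY_ENABLED = True      # the retry middleware is on by default
RETRY_TIMES = 3           # retries per request, on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # responses treated as retryable

DOWNLOAD_DELAY = 1.0      # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 8   # fewer parallel requests: less load and less reordering
```

If a stable order matters, it is usually easier to sort the exported rows afterwards (for example by category and position) than to try to force the crawl itself to run in order.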

Once the data is available, define the item's fields so that it can be packaged:


import scrapy


class Info2(scrapy.Item):
    # job title
    position = scrapy.Field()
    # industry
    category = scrapy.Field()
    # number of hiring companies
    post_count = scrapy.Field()
    # salary
    p = scrapy.Field()

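How the pipeline saves the data. While saving, I found that one field was no longer needed, so I simply left it out when reading from the item, which is convenient: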

import csv


class ScrapyPipeline:
    def __init__(self):
        # self.f = open("itcast_pipeline.json", "w", encoding='utf-8')
        self.f = open("itcast_pipeline.csv", "w", encoding='utf-8', newline='')
        self.writer = csv.writer(self.f)
        self.writer.writerow(('position', 'category', 'url'))

    def process_item(self, item, spider):
        # content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # self.f.write(content)
        # Info2 items have no 'url' field, so fall back to an empty string
        self.writer.writerow((item['position'], item['category'], item.get('url', '')))
        return item

    def close_spider(self, spider):
        self.f.close()
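A pipeline is an ordinary Python class, so it can be tried out by hand without running a crawl. The sketch below is a variant of the idea above rather than the author's code: `CsvExportPipeline`, its `path` argument and the file name `demo_items.csv` are made up for illustration. It opens the file in `open_spider` and uses `csv.DictWriter` so that fields an item lacks are written as empty cells and extra fields are ignored.

```python
import csv


class CsvExportPipeline:
    """Hypothetical pipeline name; writes selected item fields to a CSV file."""

    fields = ("position", "category", "url")

    def __init__(self, path="items.csv"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write the header.
        self.f = open(self.path, "w", encoding="utf-8", newline="")
        self.writer = csv.DictWriter(
            self.f, fieldnames=self.fields, restval="", extrasaction="ignore"
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.f.close()


# Exercise the pipeline by hand, without running a crawl.
pipeline = CsvExportPipeline("demo_items.csv")
pipeline.open_spider(spider=None)
pipeline.process_item({"position": "Sales", "category": "jobs", "url": "https://cs.58.com/yewu/"}, None)
pipeline.process_item({"position": "Driver", "category": "jobs", "post_count": "12"}, None)
pipeline.close_spider(spider=None)

with open("demo_items.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

In a real project Scrapy only calls a pipeline that is listed in the ITEM_PIPELINES setting in settings.py, so the class has to be enabled there before it receives any items.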