The goal of this walkthrough is to scrape the names of recently updated American TV shows from the Meijutt (美剧天堂) "latest 100" page: https://www.meijutt.com/new100.html
1. Environment
- CentOS 7 x64
- Python 2 or Python 3 (this walkthrough uses Python 3)
- virtualenvwrapper for managing the virtual environment
2. Install Scrapy
mkvirtualenv learnScrapypython3 --python=python3  # create a Python 3 virtual environment
cd ~/.virtualenvs/learnScrapypython3/
pip install scrapy
pip list
Installing Scrapy pulls in the following packages automatically:
(learnScrapypython3) [root@vps movie]# pip list
Package Version
asn1crypto 0.24.0
attrs 18.2.0
Automat 0.7.0
cffi 1.11.5
constantly 15.1.0
cryptography 2.4.2
cssselect 1.0.3
hyperlink 18.0.0
idna 2.8
incremental 17.5.0
lxml 4.2.5
parsel 1.5.1
pip 18.1
pyasn1 0.4.4
pyasn1-modules 0.2.2
pycparser 2.19
PyDispatcher 2.0.5
PyHamcrest 1.9.0
pyOpenSSL 18.0.0
queuelib 1.5.0
Scrapy 1.5.1
service-identity 18.1.0
setuptools 40.6.3
six 1.12.0
Twisted 18.9.0
w3lib 1.19.0
wheel 0.32.3
zope.interface 4.6.0
3. Create the project
scrapy startproject movie
cd movie
scrapy genspider meiju meijutt.com
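scrapy genspider creates a spider stub under movie/spiders/. Roughly what the generated ./movie/movie/spiders/meiju.py looks like before editing (the exact content varies slightly between Scrapy versions):
# -*- coding: utf-8 -*-
import scrapy


class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://meijutt.com/']

    def parse(self, response):
        pass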
The full project directory tree now looks like this:
/root/.virtualenvs/learnScrapypython3
├── bin
├── include
├── lib
└── movie
├── movie
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── __pycache__
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── meiju.py
│ └── __pycache__
└── scrapy.cfg
Having worked with Django before, you can see that this layout is very similar to a Django project's directory structure.
Directory and file overview (below, ./ stands for /root/.virtualenvs/learnScrapypython3/):
- ./bin, ./include, ./lib: created when the virtual environment was set up; they can be ignored here.
- ./scrapy.cfg: project configuration, mainly base settings for the Scrapy command-line tool. For now it only records the path to the crawler's settings module and the project name.
- ./movie: the directory holding the whole crawler project.
- ./movie/items.py: defines the data model used to structure scraped data, similar to Django's models.py.
- ./movie/pipelines.py: item-processing behaviour; in this walkthrough it writes the scraped names to a file.
- ./movie/settings.py: configuration such as crawl depth, concurrency and download delay.
- ./movie/spiders/: the spiders directory, where spider files and their crawling rules are written.
- ./movie/middlewares.py: middleware hooks that process the requests and responses passed between the components (spider, scheduler, downloader, engine).
4. Edit the project files
- ./movie/items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass
    name = scrapy.Field()
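An Item behaves like a dict that is restricted to its declared fields; a quick interactive check (hypothetical session):
from movie.items import MovieItem

item = MovieItem()
item['name'] = 'Westworld'   # fine: 'name' is declared above
print(item['name'])          # -> Westworld
# item['year'] = 2018        # would raise KeyError, because 'year' is not a declared field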
- Spider file ./movie/spiders/meiju.py:
# -*- coding: utf-8 -*-
import scrapy

from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        movies = response.xpath('/html/body/div[2]/div[4]/div[1]/ul/li')
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
            yield item
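The XPath used in parse can be checked interactively with scrapy shell before running the full crawl (output is illustrative; extract() returns a plain Python list of title strings):
scrapy shell "https://www.meijutt.com/new100.html"
>>> response.xpath('/html/body/div[2]/div[4]/div[1]/ul/li/h5/a/@title').extract()[:5]
>>> # should print a list of the first five show titles if the XPath still matches the page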
- Add the following line to the settings file ./movie/settings.py:
ITEM_PIPELINES = {'movie.pipelines.MoviePipeline':100}
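The integer is the pipeline's order: when several pipelines are enabled, items pass through them from the lowest number to the highest; the value 100 itself is arbitrary. A hypothetical example with a second pipeline:
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,     # runs first
    'movie.pipelines.CleanupPipeline': 300,   # hypothetical second pipeline, runs later
}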
- Item pipeline script ./movie/pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):

    def process_item(self, item, spider):
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(str(item['name']) + '\n')
        return item
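Reopening the file for every item works, but Scrapy pipelines also provide open_spider/close_spider hooks, so the file could be opened once per crawl instead. A minimal sketch of that variant:
class MoviePipeline(object):

    def open_spider(self, spider):
        # called once when the crawl starts
        self.fp = open("my_meiju.txt", 'a', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the crawl ends
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(str(item['name']) + '\n')
        return item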
5. Run the spider
Start the crawl with the following command from the /root/.virtualenvs/learnScrapypython3/movie directory (in fact it works from any directory inside the ./movie project):
scrapy crawl meiju --nolog
For easier debugging you can instead run it with logging enabled:
scrapy crawl meiju
If there are no errors, a file named my_meiju.txt is created in the current directory, containing the names of the shows listed at https://www.meijutt.com/new100.html.
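As a quick check that does not rely on the pipeline at all, Scrapy's built-in feed exports can dump the items directly to a file (meiju.csv is an arbitrary name):
scrapy crawl meiju -o meiju.csv --nolog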
6. Example 2: scraping images from xiaohuar.com
- Create the project
scrapy startproject xhwpic
cd xhwpic
scrapy genspider xiaohuar xiaohuar.com
- vim ./xhwpic/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class XhwpicItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
- vim ./xhwpic/pipelines.py
# -*- coding: utf-8 -*-
import os

import requests

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class XhwpicPipeline(object):

    def process_item(self, item, spider):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        res = requests.get(url=item['addr'], headers=headers, stream=True)
        if not os.path.exists('./mypic'):
            os.makedirs('./mypic')
        filename = os.path.join('./mypic', item['name'] + '.jpg')
        with open(filename, 'wb') as fp:
            fp.write(res.content)
        return item  # return the item so any later pipelines still receive it
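Downloading with requests inside the pipeline works, but Scrapy also ships an ImagesPipeline that can handle the downloading itself. A rough sketch of that alternative (assumes Pillow is installed; image_urls/images are the field names ImagesPipeline expects by default):
# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = './mypic'

# items.py would then declare:
#     image_urls = scrapy.Field()
#     images = scrapy.Field()
# and the spider would set item['image_urls'] = [addr] instead of item['addr'].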
- vim ./xhwpic/settings.py and add one line:
ITEM_PIPELINES = {'xhwpic.pipelines.XhwpicPipeline':100}
- vim ./xhwpic/spiders/xiaohuar.py
# -*- coding: utf-8 -*-
import scrapy

from xhwpic.items import XhwpicItem


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com']
    url_set = set()

    def parse(self, response):
        # if response.url.startswith("http://www.xiaohuar.com/list-"):
        if response.url.startswith("http://www.xiaohuar.com"):
            allPics = response.xpath('//div[@class="img"]/a')
            for pic in allPics:
                item = XhwpicItem()
                name = pic.xpath('./img/@alt').extract()[0]
                addr = pic.xpath('./img/@src').extract()[0]
                addr = 'http://www.xiaohuar.com' + addr
                item['name'] = name
                item['addr'] = addr
                yield item

        urls = response.xpath("//a/@href").extract()
        for url in urls:
            # if url.startswith("http://www.xiaohuar.com/list-"):
            if url.startswith("http://www.xiaohuar.com"):
                if url in XiaohuarSpider.url_set:
                    pass
                else:
                    XiaohuarSpider.url_set.add(url)
                    yield self.make_requests_from_url(url)
            else:
                pass
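make_requests_from_url works in Scrapy 1.5, but it is deprecated in newer releases; yielding a Request directly is the equivalent, forward-compatible form:
yield scrapy.Request(url, callback=self.parse)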
- Run the spider
scrapy crawl xiaohuar
Image files such as xxx校花xxx.jpg appear in the local ./mypic directory, roughly 940 of them.