The Scrapy crawler framework: scraping the top 25 movie titles from Douban

1. Environment preparation

A Python environment
The Scrapy module
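If Scrapy is missing, it can be installed from the command line (assuming pip is on the PATH; installed versions will differ from the list below):

```shell
pip install scrapy
```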

2. Checking the installed modules

Check in a DOS (command prompt) window:

E:\>pip list
Package          Version
---------------- ---------
asn1crypto       0.24.0
attrs            19.1.0
Automat          0.7.0
beautifulsoup4   4.8.0
bs4              0.0.1
certifi          2019.6.16
cffi             1.12.3
chardet          3.0.4
constantly       15.1.0
cryptography     2.7
cssselect        1.1.0
get              2019.4.13
hyperlink        19.0.0
idna             2.8
incremental      17.5.0
lxml             4.4.1
parsel           1.5.2
Pillow           6.1.0
pip              19.2.2
post             2019.4.13
public           2019.4.13
pyasn1           0.4.6
pyasn1-modules   0.2.6
pycparser        2.19
PyDispatcher     2.0.5
pygame           1.9.6
PyHamcrest       1.9.0
pyOpenSSL        19.0.0
pypiwin32        223
pywin32          224
query-string     2019.4.13
queuelib         1.5.0
requests         2.22.0
Scrapy           1.7.3
service-identity 18.1.0
setuptools       40.8.0
six              1.12.0
soupsieve        1.9.3
Twisted          19.7.0
urllib3          1.25.3
w3lib            1.21.0
wheel            0.33.4
zope.interface   4.6.0

3. Setting up the Scrapy project

Create the project directory and spider program template:

E:\巫师\workspace>scrapy startproject douban
New Scrapy project 'douban', using template directory 'd:\program files (x86)\python\lib\site-packages\scrapy\templates\project', created in:
    E:\巫师\workspace\douban

You can start your first spider with:
    cd douban
    scrapy genspider example example.com

Inspect the result:

E:\巫师\workspace\douban>tree /f
Folder PATH listing for volume New Volume
Volume serial number is 06E1-F8E8
E:.
│  scrapy.cfg
│
└─douban
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

Create the core spider program, then inspect it:

E:\>cd e:\巫师\workspace\douban

e:\巫师\workspace\douban>scrapy genspider moviespider douban.com
Created spider 'moviespider' using template 'basic' in module:
  douban.spiders.moviespider
  
e:\巫师\workspace\douban>tree /f
Folder PATH listing for volume New Volume
Volume serial number is 06E1-F8E8
E:.
│  scrapy.cfg
│
└─douban
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  moviespider.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-37.pyc
    │
    └─__pycache__
            settings.cpython-37.pyc
            __init__.cpython-37.pyc

Test whether the site is reachable:

e:\巫师\workspace\douban>scrapy shell https://www.douban.com/

The test fails with a 403 error:

2020-03-24 09:15:28 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.douban.com/robots.txt> (referer: None)
2020-03-24 09:15:28 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.douban.com/> (referer: None)

Solution: set up a rotating user-agent downloader middleware.
Place the rotate_useragent.py file in E:\巫师\workspace\douban\douban.
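The post does not show the contents of rotate_useragent.py, so here is a minimal sketch of what such a middleware looks like: a class whose `process_request` attaches a randomly chosen User-Agent to each outgoing request. The `user_agent_list` entries below are examples, not the original author's list.

```python
# rotate_useragent.py -- minimal sketch of a rotating User-Agent
# downloader middleware (assumed structure; the original file is not shown).
import random


class RotateUserAgentMiddleware:
    # Example User-Agent strings; a real list would be longer.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # Pick a random UA and attach it to the outgoing request's headers.
        ua = random.choice(self.user_agent_list)
        request.headers.setdefault(b"User-Agent", ua)
```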

Then configure it in settings.py:
Search for "downloader" and add the following:

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 543,
    # Disable the built-in UserAgentMiddleware (the old 'scrapy.contrib.*'
    # path is obsolete; in Scrapy 1.x the class lives here):
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'douban.rotate_useragent.RotateUserAgentMiddleware': 400,
}

Test the site again:

2020-03-24 10:45:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/robots.txt> (referer: None)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6
2020-03-24 10:45:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/> (referer: None)

The 403 error is resolved.

Enable the corresponding options in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 543,
    # Disable the built-in UserAgentMiddleware (the old 'scrapy.contrib.*'
    # path is obsolete; in Scrapy 1.x the class lives here):
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'douban.rotate_useragent.RotateUserAgentMiddleware': 400,
}

ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
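ITEM_PIPELINES enables `DoubanPipeline`, but the post never shows pipelines.py. The template Scrapy generates is a simple pass-through, which is enough here because the CSV is written by the `-o` feed export rather than by the pipeline. A sketch of that assumed default:

```python
# pipelines.py -- the generated template is a pass-through; sufficient here
# because the output file is produced by the -o feed export, not the pipeline.
class DoubanPipeline:
    def process_item(self, item, spider):
        # Return the item unchanged so downstream exporters can see it.
        return item
```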

Contents of items.py:

import scrapy


class DoubanItem(scrapy.Item):
    # Fields for each movie entry
    rank = scrapy.Field()
    title = scrapy.Field()

Contents of moviespider.py:

import scrapy
from ..items import DoubanItem


class MoviespiderSpider(scrapy.Spider):
    name = 'moviespider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie entry on the page sits inside a <div class="item"> block
        movie_items = response.xpath('//div[@class="item"]')
        for item in movie_items:
            movie = DoubanItem()
            movie["rank"] = item.xpath('div[@class="pic"]/em/text()').extract_first()
            movie["title"] = item.xpath('div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()').extract_first()
            yield movie

Run the crawl from the DOS prompt, exporting the items to CSV:

e:\巫师\workspace\douban>scrapy crawl moviespider -o moviespider.csv
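The `-o` flag makes Scrapy's feed export write one CSV column per item field (rank, title). A quick way to read the result back with the standard library; the two rows below are illustrative samples, not the actual crawl output:

```python
import csv
import io

# Illustrative sample in the same shape Scrapy's CSV feed export produces:
# a header row from the item's field names, then one row per yielded item.
sample = "rank,title\n1,肖申克的救赎\n2,霸王别姬\n"

# In practice: open("moviespider.csv", newline="", encoding="utf-8")
for row in csv.DictReader(io.StringIO(sample)):
    print(row["rank"], row["title"])
```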

Check the output file: success!
