1. Environment preparation
Python environment
Scrapy module
2. Check the installed modules
Check from the DOS prompt (Windows command line):
E:\>pip list
Package          Version
---------------- ---------
asn1crypto       0.24.0
attrs            19.1.0
Automat          0.7.0
beautifulsoup4   4.8.0
bs4              0.0.1
certifi          2019.6.16
cffi             1.12.3
chardet          3.0.4
constantly       15.1.0
cryptography     2.7
cssselect        1.1.0
get              2019.4.13
hyperlink        19.0.0
idna             2.8
incremental      17.5.0
lxml             4.4.1
parsel           1.5.2
Pillow           6.1.0
pip              19.2.2
post             2019.4.13
public           2019.4.13
pyasn1           0.4.6
pyasn1-modules   0.2.6
pycparser        2.19
PyDispatcher     2.0.5
pygame           1.9.6
PyHamcrest       1.9.0
pyOpenSSL        19.0.0
pypiwin32        223
pywin32          224
query-string     2019.4.13
queuelib         1.5.0
requests         2.22.0
Scrapy           1.7.3
service-identity 18.1.0
setuptools       40.8.0
six              1.12.0
soupsieve        1.9.3
Twisted          19.7.0
urllib3          1.25.3
w3lib            1.21.0
wheel            0.33.4
zope.interface   4.6.0
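Scrapy 1.7.3 appears in the list, so the environment is ready. If it were missing, it could be installed with pip:
E:\>pip install scrapy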
3. Set up the Scrapy project
Create the project directory and the spider template:
E:\巫师\workspace>scrapy startproject douban
New Scrapy project 'douban', using template directory 'd:\program files (x86)\python\lib\site-packages\scrapy\templates\project', created in:
E:\巫师\workspace\douban
You can start your first spider with:
cd douban
scrapy genspider example example.com
View the result:
E:\巫师\workspace\douban>tree /f
Folder PATH listing for volume 新加卷
Volume serial number is 06E1-F8E8
E:.
│ scrapy.cfg
│
└─douban
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├─spiders
│ │ __init__.py
│ │
│ └─__pycache__
└─__pycache__
Create the core spider program and view the result:
E:\>cd e:\巫师\workspace\douban
e:\巫师\workspace\douban>scrapy genspider moviespider douban.com
Created spider 'moviespider' using template 'basic' in module:
douban.spiders.moviespider
e:\巫师\workspace\douban>tree /f
Folder PATH listing for volume 新加卷
Volume serial number is 06E1-F8E8
E:.
│ scrapy.cfg
│
└─douban
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├─spiders
│ │ moviespider.py
│ │ __init__.py
│ │
│ └─__pycache__
│ __init__.cpython-37.pyc
│
└─__pycache__
settings.cpython-37.pyc
__init__.cpython-37.pyc
Test whether the site is reachable:
e:\巫师\workspace\douban>scrapy shell https://www.douban.com/
The test returns a 403 error:
2020-03-24 09:15:28 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.douban.com/robots.txt> (referer: None)
2020-03-24 09:15:28 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.douban.com/> (referer: None)
Solution: set up a rotating User-Agent downloader middleware.
Place the rotate_useragent.py file under E:\巫师\workspace\douban\douban.
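A minimal sketch of what rotate_useragent.py can contain is shown below; the user-agent strings are only an illustrative sample, and the print call simply logs the chosen agent (which is why a user-agent string shows up in the crawl log further down):

# rotate_useragent.py -- minimal sketch of a rotating User-Agent middleware.
# The class name must match the DOWNLOADER_MIDDLEWARES entry configured below.
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    # Illustrative sample of desktop browser user-agent strings; any pool of
    # real browser strings will do.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) "
        "Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/79.0.3945.130 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; rv:68.0) Gecko/20100101 Firefox/68.0",
    ]

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)  # shows which agent was used, as seen in the log below
            request.headers.setdefault('User-Agent', ua)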
Then configure settings.py:
Search for the downloader middlewares section and add the following entries:
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'douban.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Test the site again:
2020-03-24 10:45:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/robots.txt> (referer: None)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6
2020-03-24 10:45:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/> (referer: None)
The 403 error is resolved.
settings.py configuration
Enable the corresponding settings (the downloader middlewares plus the item pipeline):
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'douban.rotate_useragent.RotateUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
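ITEM_PIPELINES enables douban.pipelines.DoubanPipeline. The pipeline does not need to do anything special here; a minimal sketch that just passes items through to the feed exporter:

# pipelines.py -- minimal sketch; the class name matches the ITEM_PIPELINES entry.
class DoubanPipeline(object):
    def process_item(self, item, spider):
        # Items could be cleaned or stored here; for this example they are
        # simply handed on to the exporter used by the -o option below.
        return item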
items.py contents:
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()
    title = scrapy.Field()
moviespider.py contents:
import scrapy

from ..items import DoubanItem


class MoviespiderSpider(scrapy.Spider):
    name = 'moviespider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie on the Top 250 page sits in a <div class="item"> block.
        movie_items = response.xpath('//div[@class="item"]')
        for item in movie_items:
            movie = DoubanItem()
            movie["rank"] = item.xpath('div[@class="pic"]/em/text()').extract_first()
            movie["title"] = item.xpath('div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()').extract_first()
            yield movie
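Before running the full crawl, the XPath expressions used in parse() can be checked interactively in scrapy shell (run from the project directory so the rotating User-Agent middleware is applied), for example:

e:\巫师\workspace\douban>scrapy shell https://movie.douban.com/top250
>>> first = response.xpath('//div[@class="item"]')[0]
>>> first.xpath('div[@class="pic"]/em/text()').extract_first()        # rank of the first movie
>>> first.xpath('div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()').extract_first()  # its title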
Run the crawl from the DOS prompt:
e:\巫师\workspace\douban>scrapy crawl moviespider -o moviespider.csv
View the output:
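For example, the exported moviespider.csv can be read back with Python's csv module (a minimal sketch; the field names come from DoubanItem):

import csv

with open('moviespider.csv', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        # Each row carries the fields defined in DoubanItem.
        print(row['rank'], row['title'])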
Success!