Target environment:
Python 3.6
Scrapy 1.5.0
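Preparation:
Install Scrapy; see: http://blog.csdn.net/yctjin/article/details/70658811
Check that the installation succeeded by entering this at the command line: scrapy version
If the output matches the screenshot (the Scrapy version is printed), the installation succeeded.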
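The same check can be made from Python itself. The following is only a minimal sketch; the helper name is_installed is my own and not part of Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("scrapy"):
    import scrapy
    print("Scrapy version:", scrapy.__version__)
else:
    print("Scrapy is not installed")
```

If the second message appears, the install step above needs to be repeated for the interpreter you are actually running.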
Creating the project
Open a command line in the folder you prepared and enter the following commands in turn:
scrapy startproject doubanMovie
cd doubanMovie
scrapy genspider doubanMovieSpider movie.douban.com/cinema/nowplaying/guangzhou/
The output looks like this:
D:\pythonProject\python\scrapy\test>scrapy startproject doubanMovie
New Scrapy project 'doubanMovie', using template directory 'e:\python\lib\site-packages\scrapy\templates\project', created in:
    D:\pythonProject\python\scrapy\test\doubanMovie

You can start your first spider with:
    cd doubanMovie
    scrapy genspider example example.com

D:\pythonProject\python\scrapy\test>cd doubanMovie
D:\pythonProject\python\scrapy\test\doubanMovie>scrapy genspider doubanMovieSpider movie.douban.com/cinema/nowplaying/guangzhou/
Created spider 'doubanMovieSpider' using template 'basic' in module:
  doubanMovie.spiders.doubanMovieSpider
Run the command tree /f in the folder. If the directory listing looks like the following, the project was created successfully:
D:.
│  scrapy.cfg
│
└─doubanMovie
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  doubanMovieSpider.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc
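On systems without the tree command, the layout can also be checked from Python. This is just a sketch: the function name missing_files and the list EXPECTED_FILES are my own, and the list covers only the files this tutorial works with.

```python
from pathlib import Path

# Files this tutorial works with, relative to the folder that holds scrapy.cfg.
EXPECTED_FILES = [
    "scrapy.cfg",
    "doubanMovie/items.py",
    "doubanMovie/pipelines.py",
    "doubanMovie/settings.py",
    "doubanMovie/spiders/doubanMovieSpider.py",
]


def missing_files(project_root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).is_file()]


# Run from the outer doubanMovie folder; an empty list means nothing is missing.
print(missing_files("."))
```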
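What the files are for:
Four files matter most: items.py, pipelines.py, settings.py, and doubanMovieSpider.py
items.py:
Defines which fields the spider ultimately collects; an item behaves much like a Python dictionary.
settings.py:
Configures the project, including which components process the scraped content.
pipelines.py:
Once the spider has scraped data from a page, this file determines how that data is processed.
doubanMovieSpider.py:
Determines how the target site is crawled.
Other files:
Files ending in .pyc are bytecode compiled from Python source; __init__.py marks its containing directory as a Python package; middlewares.py holds middleware and is not used for now.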
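To make the split between pipelines.py and settings.py concrete, here is a minimal sketch of what a pipeline could look like. It is not the author's pipeline: the class name MovieNamePipeline and the output file movies.txt are assumptions for illustration, and it assumes moiveName holds a single string. Scrapy calls open_spider, process_item, and close_spider on any pipeline class enabled in settings.py, and the class does not need a Scrapy base class.

```python
class MovieNamePipeline(object):
    """Hypothetical pipeline: write each movie name to a text file."""

    def __init__(self, path="movies.txt"):
        self.path = path  # assumed output file, not from the original post
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(item["moiveName"] + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# In settings.py the pipeline would be enabled like this
# (pipelines with lower numbers run first):
# ITEM_PIPELINES = {"doubanMovie.pipelines.MovieNamePipeline": 300}
```

Keeping the output logic in a pipeline rather than in the spider means the spider only has to yield items.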
Target URL:
https://movie.douban.com/cinema/nowplaying/guangzhou/
The goal is to scrape the movie names from this page.
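The target URL combines a domain and a path, while a spider's allowed_domains attribute is documented as a list of bare domain names. As a small illustration (the variable names here are my own), urllib.parse separates the two parts:

```python
from urllib.parse import urlsplit

TARGET_URL = "https://movie.douban.com/cinema/nowplaying/guangzhou/"

parts = urlsplit(TARGET_URL)
domain = parts.netloc  # the part that belongs in allowed_domains
path = parts.path      # the page within that domain

print(domain)
print(path)
```

Since genspider was given the domain and path together above, it may be worth checking the allowed_domains and start_urls values in the generated spider.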
Writing the spider
1. Since we only need to scrape movie names, modify items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanmovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    moiveName = scrapy.Field()
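Compared with the generated file, the only changes are removing pass and adding the moiveName = scrapy.Field() member, so the class can be used like a dictionary.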
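The "class used like a dictionary" idea can be illustrated without Scrapy installed. The sketch below is a simplified stand-in written with the standard library only. It is not how scrapy.Item is implemented, and the names FieldLimitedDict and MovieItemSketch are hypothetical. It mimics one behaviour of a real item: assigning to a field that was not declared raises KeyError.

```python
class FieldLimitedDict(dict):
    """Stand-in for scrapy.Item: a dict that only accepts declared keys."""

    fields = ()  # subclasses list their allowed keys here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "{} does not support field: {}".format(type(self).__name__, key)
            )
        super().__setitem__(key, value)


class MovieItemSketch(FieldLimitedDict):
    fields = ("moiveName",)  # same field name as in items.py


item = MovieItemSketch()
item["moiveName"] = "Example Movie"  # declared field: accepted
print(dict(item))

try:
    item["rating"] = 9.0             # undeclared field: rejected
except KeyError as error:
    print("rejected:", error)
```

With the real DoubanmovieItem, the assignment and lookup syntax is the same; only the declaration style (scrapy.Field() class attributes) differs.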
2. Define how to crawl, in doubanMovieSpider.py: