I. Create the project:
1. Create a folder named xiaohua on the desktop, and open a command window inside it;
2. Run scrapy startproject downimages to create the downimages project.
II. Add a spider module to the project:
All Scrapy spider modules live in the spiders folder, so the new spider must be created under downimages/spiders.
1. Run cd spiders to enter the spiders directory;
2. Run scrapy genspider DownImages xiaohuar.com, where DownImages is the spider name and xiaohuar.com is the main domain of the target site. This command generates DownImages.py under downimages/spiders.
III. Define the Item:
An Item holds the structured data extracted from unstructured sources. items.py is as follows:
import scrapy


class DownimagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # local paths of the downloaded images
    name = scrapy.Field()         # image title, used as the file name
IV. Write the spider file:
DownImages.py parses the listing pages and extracts the image URLs and names. The code is as follows:
import scrapy
from scrapy.selector import Selector

from downimages.items import DownimagesItem  # note the import path


class DownimagesSpider(scrapy.Spider):
    name = 'DownImages'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        imgs = response.xpath("//div[@class='img']/a/img/@src").extract()
        names = response.xpath("//div[@class='img']/a/img/@alt").extract()
        base = "http://www.xiaohuar.com"
        for img, name in zip(imgs, names):
            if not img.startswith("http"):  # some src values are site-relative
                img = base + img
            yield DownimagesItem(image_urls=img, name=name)
        # follow the "下一页" (next page) link to reach the remaining pages
        next_page = Selector(response).re(r'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)
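The spider above prefixes site-relative src values with the domain and finds the next-page link with a regular expression. Both steps can be exercised on a small HTML snippet without running Scrapy; the standard library's urljoin is a slightly more robust alternative to manual string concatenation. The fragment and URLs below are made up for illustration, not taken from the live site:

```python
import re
from urllib.parse import urljoin

# a made-up fragment shaped like the listing pages this tutorial scrapes
html = (
    '<div class="img"><a href="/p/1.html">'
    '<img src="/d/file/1.jpg" alt="girl-1"></a></div>'
    '<a href="http://www.xiaohuar.com/list-1-1.html">下一页</a>'
)

base = "http://www.xiaohuar.com"

# same idea as the spider's startswith("http") branch, but urljoin also
# handles absolute URLs and missing slashes correctly
src = "/d/file/1.jpg"
full_url = urljoin(base, src)
print(full_url)  # http://www.xiaohuar.com/d/file/1.jpg

# the spider's next-page pattern, applied to the sample fragment
next_page = re.findall(r'<a href="(\S*)">下一页</a>', html)
print(next_page)  # ['http://www.xiaohuar.com/list-1-1.html']
```

Because urljoin leaves absolute URLs untouched, it removes the need for the explicit startswith check.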
V. Write pipelines.py:
pipelines.py persists the data, i.e. downloads and stores the images. The code is as follows:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # the file name was passed along in request.meta
        image_guid = request.meta['name'] + '.jpg'
        return 'full/%s' % image_guid

    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'name': item['name']})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")  # drop items that yielded no files
        item['image_paths'] = image_paths
        return item
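item_completed receives one (success, info) pair per download request, and the list comprehension keeps only the paths of the successful downloads. A minimal illustration with hand-made results (the URL, path, and checksum below are invented):

```python
# shape of the `results` argument that ImagesPipeline passes to item_completed:
# a list of (success_flag, info) tuples, one per requested image
results = [
    (True, {'url': 'http://www.xiaohuar.com/d/file/1.jpg',
            'path': 'full/girl-1.jpg',
            'checksum': 'abc123'}),
    (False, Exception('download failed')),  # stand-in for a failed download
]

# the same comprehension used in ImgPipeline.item_completed
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/girl-1.jpg']
```

The `if ok` filter is what keeps the failed entry from being indexed as a dict, so the comprehension is safe even when some downloads fail.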
VI. Configure settings.py:
settings.py holds the crawler's configuration. The relevant parts are as follows:
# -*- coding: utf-8 -*-
BOT_NAME = 'downimages'

SPIDER_MODULES = ['downimages.spiders']
NEWSPIDER_MODULE = 'downimages.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'downimages (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

ITEM_PIPELINES = {
    # 'downimages.pipelines.DownimagesPipeline': 300,
    'downimages.pipelines.ImgPipeline': 300,
}

IMAGES_STORE = 'images'           # required: where ImagesPipeline saves files
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs
IMAGES_RESULT_FIELD = 'images'    # item field receiving the download results
IMAGES_EXPIRES = 30               # skip images downloaded within the last 30 days
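In ITEM_PIPELINES the value (0-1000) is a priority: pipelines with lower numbers run first, so the order of the dict literal itself does not matter. A small illustration of the ordering, with a hypothetical second pipeline added for contrast:

```python
# priorities as used in ITEM_PIPELINES: lower value = runs earlier
item_pipelines = {
    'downimages.pipelines.SaveToDbPipeline': 500,  # hypothetical second pipeline
    'downimages.pipelines.ImgPipeline': 300,
}

# Scrapy runs pipelines in ascending priority order
execution_order = sorted(item_pipelines, key=item_pipelines.get)
print(execution_order)
# ['downimages.pipelines.ImgPipeline', 'downimages.pipelines.SaveToDbPipeline']
```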
VII. Run the spider:
Run scrapy crawl DownImages from the command line and wait; after a while all the images will have been downloaded.