I've been itching to dig into Scrapy these past few days and wanted to test how fast it crawls under Linux, so I picked meizitu.com as a practice target (I had scraped it before). But the links I collected kept raising errors when parsing and downloading the images, so I switched to a stock-image site instead!
Without further ado, here is the code:
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 21 23:14:09 2016
@author: alis
"""
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from MeiZi.items import MeiziItem
import sys
import urllib

reload(sys)
sys.setdefaultencoding('utf-8')

sys.stdout = open('urls.txt', 'w')  # redirect print output into urls.txt
b = '/media/alis/个人文件资料/Spider/MeiZi/photo/'  # folder to save downloaded images
x = 0  # counter used to number the saved files


class MeiZiSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["tooopen.com"]
    '''start_urls = ["http://www.meizitu.com/a/xinggan.html",
                    "http://www.meizitu.com/a/sifang.html",
                    "http://www.meizitu.com/a/qingchun.html",
                    "http://www.meizitu.com/a/meizi.html",
                    "http://www.meizitu.com/a/xiaoqingxin.htm",
                    "http://www.meizitu.com/a/nvshen.html",
                    "http://www.meizitu.com/a/qizhi.html",
                    "http://www.meizitu.com/a/mote.html",
                    "http://www.meizitu.com/a/bijini.html",
                    "http://www.meizitu.com/a/wangluo.html"
                    ]'''
    start_urls = ['http://www.tooopen.com/img/88.aspx']
    rules = [
        # follow the pagination links inside category 88
        Rule(SgmlLinkExtractor(allow=(r'http://www.tooopen.com/img/88_(\d+)_(\d+)_(\d+).aspx'))),
        # Rule(SgmlLinkExtractor(allow=(r'http://www.meizitu.com/a/meizi_\d+_\d+.html'))),
        # hand each image detail page to parse_item
        Rule(SgmlLinkExtractor(allow=(r'http://www.tooopen.com/view/(\d+).html')), callback="parse_item"),
    ]

    def parse_item(self, response):
        global x
        sel = Selector(response)
        # item = MeiziItem()
        image_urls = sel.xpath('//div[@class="hindendiv"]/a/@data-img').extract()
        for url in image_urls:
            print url  # goes into urls.txt via the redirected stdout
            x += 1
            # urllib.urlretrieve(url, b + '%d.jpg' % x)  # uncomment to actually download
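The two Rule patterns in the spider can be sanity-checked with plain re before running the crawl. The sample URLs below are made-up examples that follow the site's URL scheme, not real pages:

```python
import re

# pagination pages inside category 88, followed but not parsed
page_pat = re.compile(r'http://www.tooopen.com/img/88_(\d+)_(\d+)_(\d+).aspx')
# image detail pages, handed to parse_item
view_pat = re.compile(r'http://www.tooopen.com/view/(\d+).html')

print(bool(page_pat.match('http://www.tooopen.com/img/88_44_1_2.aspx')))  # True: pagination link
print(bool(view_pat.match('http://www.tooopen.com/view/123456.html')))    # True: detail page
print(bool(view_pat.match('http://www.tooopen.com/img/88.aspx')))         # False: start page, no callback
```

Note that CrawlSpider applies every rule to every followed page, so the first rule keeps discovering new listing pages while the second fires the callback only on detail pages.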
Explanation: the spider starts from the initial category page, and the pattern it discovers is
http://www.tooopen.com/img/88_(\d+)_(\d+)_(\d+).aspx
for the pagination links; from each detail page it then extracts the image URLs we want and calls a download function to save them.
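The download step is the commented-out urllib.urlretrieve call in parse_item. A minimal sketch of how it could be factored out, assuming the same folder layout and sequential %d.jpg naming as the script above (the helper names save_path and download_image are mine, not from the original):

```python
import os

try:
    from urllib import urlretrieve          # Python 2, as used in the spider
except ImportError:
    from urllib.request import urlretrieve  # Python 3 equivalent


def save_path(folder, index):
    # Build the sequential file name the spider uses: 1.jpg, 2.jpg, ...
    return os.path.join(folder, '%d.jpg' % index)


def download_image(url, folder, index):
    # Fetch one image URL and store it under its sequential name.
    # This performs a real HTTP request, which is why the spider keeps
    # it commented out while only collecting URLs into urls.txt.
    target = save_path(folder, index)
    urlretrieve(url, target)
    return target
```

Separating path construction from the network call keeps the naming logic testable without hitting the site.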
The next post will cover how to parse and download the meizitu images.