Scrapy is an easy-to-use crawler framework implemented in Python. See the official site for details: http://scrapy.org/
I previously wanted to grab some images to build a photographic mosaic (see the photo-mosaic algorithm post), but I couldn't find a crawler tool I liked, so I rolled my own. Scraping images with Scrapy turns out to be remarkably simple, because it ships with a built-in image-downloading pipeline, ImagesPipeline. A few lines of code are enough for a working image crawler.
Using Scrapy feels a bit like working with Ruby on Rails: create a project and the framework scaffolding is already in place; you just fill in your own code in the corresponding files.
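That workflow can be sketched as follows (the command and the layout it generates are standard Scrapy scaffolding; `image_downloader` is the project name used throughout this post):

```shell
scrapy startproject image_downloader
# generates roughly:
# image_downloader/
#     scrapy.cfg
#     image_downloader/
#         items.py        <- item fields go here
#         pipelines.py    <- the image pipeline goes here
#         settings.py     <- project and image-filter settings go here
#         spiders/        <- the spider goes here
```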
Add the crawling code in the spider file:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from image_downloader.items import ImageDownloaderItem


class ImageDownloaderSpider(CrawlSpider):
    name = "image_downloader"
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://www.sina.com.cn/"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[]), 'parse_item')]

    def parse_item(self, response):
        self.log('page: %s' % response.url)
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img/@src').extract()
        items = []
        for image in images:
            item = ImageDownloaderItem()
            item['image_urls'] = [image]
            items.append(item)
        return items
```
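One caveat with the spider above: `//img/@src` often yields relative URLs, which the image downloader cannot fetch directly. A minimal sketch of normalizing them with the standard library's `urljoin` (plain Python, not part of the original post; the sample URLs are made up):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 it lives in urlparse

# Hypothetical page URL and img/@src values extracted from it
page_url = "http://www.sina.com.cn/news/index.html"
srcs = ["/images/logo.png", "http://i1.sinaimg.cn/a.jpg", "pic/b.gif"]

# Resolve each src against the page URL so the pipeline always gets absolute URLs
image_urls = [urljoin(page_url, src) for src in srcs]
print(image_urls)
# ['http://www.sina.com.cn/images/logo.png',
#  'http://i1.sinaimg.cn/a.jpg',
#  'http://www.sina.com.cn/news/pic/b.gif']
```

In the spider, `item['image_urls'] = [urljoin(response.url, image)]` would apply the same fix.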
Add the fields in the item file:
```python
from scrapy.item import Item, Field


class ImageDownloaderItem(Item):
    image_urls = Field()
    images = Field()
```
Filter and save the images in the pipelines file:
```python
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy.contrib.pipeline.images import ImagesPipeline


class ImageDownloaderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
```
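To make `item_completed` concrete: ImagesPipeline hands it `results` as a list of `(success, info)` two-tuples, where `info` is a dict with `url`, `path`, and `checksum` keys on success, and a failure object otherwise. A standalone sketch of the filtering step (the sample values here are invented):

```python
# Shape of 'results' as ImagesPipeline reports it (values are made up):
results = [
    (True, {'url': 'http://example.com/a.jpg',
            'path': 'full/0a1b2c.jpg',
            'checksum': 'abc123'}),
    (False, IOError('download failed')),  # failed downloads carry the error instead
]

# Same comprehension as in the pipeline above: keep paths of successful downloads
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/0a1b2c.jpg']
```

If every download failed, `image_paths` is empty and the pipeline drops the item.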
Add the project and image-filter settings in the settings file:
```python
IMAGES_MIN_HEIGHT = 50
IMAGES_MIN_WIDTH = 50
IMAGES_STORE = 'image-downloaded/'
DOWNLOAD_TIMEOUT = 1200
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline',
                  'image_downloader.pipelines.ImageDownloaderPipeline']
```
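The `IMAGES_MIN_WIDTH` / `IMAGES_MIN_HEIGHT` settings make ImagesPipeline silently skip images smaller than the thresholds. Conceptually the check is equivalent to this sketch (a simplification for illustration, not the pipeline's actual code):

```python
IMAGES_MIN_WIDTH = 50
IMAGES_MIN_HEIGHT = 50

def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH, min_height=IMAGES_MIN_HEIGHT):
    """Mirror of the minimum-size check applied before an image is saved."""
    return width >= min_width and height >= min_height

print(passes_size_filter(640, 480))  # True
print(passes_size_filter(32, 32))    # False: icon-sized images get dropped
```

This is what keeps tiny icons and tracking pixels out of the mosaic source set.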
Code download: @github
Scrapy's elegant data flow: