A Hands-On Beginner's Guide to Scraping Pictures of Pretty Girls: How to View the Images I Just Crawled (Item, Pipeline)

Latest recommended article published on 2023-07-24 00:05:26

莫失莫忘

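What you will learn

How to turn the image links Scrapy crawled into something you can browse freely

What you need to know

Basic use of Scrapy's Item
Basic use of Scrapy's Pipeline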

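In the previous post we got as far as crawling the image links on the home page of Moko (moko.cc), which of course is not enough. Today we tackle a pressing need first: turning those links into something you can actually look at, ideally without taking up space on your own hard drive (it would not look great if someone found a pile of pictures of pretty girls on your computer, right?).

After the last run we ended up with a pile of links. I imagine what you see looks like this: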

http://img2.moko.cc/users/0/12/3642/post/8e/img2_cover_9395217.jpg

http://img2.moko.cc/users/3/1123/337162/post/15/img2_cover_9397946.jpg

http://img2.moko.cc/users/0/9/2979/post/00/img2_cover_9398441.jpg

http://img2.moko.cc/users/0/39/11717/post/09/img2_cover_9399183.jpg

http://img2.moko.cc/users/0/3/1195/post/29/img2_cover_9407603.jpg

http://img2.moko.cc/users/0/28/8655/post/06/img2_cover_9412871.jpg

http://img2.moko.cc/users/0/14/4494/post/a3/img2_cover_9413495.jpg

http://img2.moko.cc/users/0/53/15961/post/29/img2_cover_9414331.jpg

http://img2.moko.cc/users/5/1513/453930/post/e5/img2_cover_9414928.jpg

http://img2.moko.cc/users/3/933/280178/post/25/img2_cover_9408934.jpg

http://img2.moko.cc/users/20/6037/1811345/post/3a/img2_cover_9410594.jpg

http://img2.moko.cc/users/13/4147/1244226/post/02/img2_cover_9411853.jpg

http://img2.moko.cc/users/0/55/16695/post/a1/img2_mokoshow_9337921.jpg

http://img2.moko.cc/users/15/4787/1436175/post/62/img2_mokoshow_9417045.jpg

http://img2.moko.cc/users/21/6365/1909654/post/a3/img2_mokoshow_9210602.jpg

http://img2.moko.cc/users/15/4646/1394082/post/3e/img2_mokoshow_9196025.jpg

http://img2.moko.cc/users/21/6352/1905871/post/3b/img2_mokoshow_8797685.jpg

http://img2.moko.cc/users/15/4787/1436175/post/e0/img2_mokoshow_9351262.jpg

http://img2.moko.cc/users/21/6324/1897300/project/77/img2_mokoshow_9273297.jpg

http://img2.moko.cc/users/20/6153/1846107/project/de/img2_mokoshow_9174923.jpg

http://img2.moko.cc/users/21/6498/1949497/project/71/img2_mokoshow_9115531.jpg

http://img2.moko.cc/users/0/12/3604/project/88/img2_mokoshow_8170347.jpg

http://img2.moko.cc/users/12/3882/1164727/project/6b/img2_mokoshow_8797891.jpg

http://img2.moko.cc/users/20/6052/1815737/project/3d/img2_mokoshow_8299559.jpg

http://img1.moko.cc/users/6/1901/570354/logo/img1_des_4751873.jpg

http://img2.moko.cc/users/3/955/286676/logo/img2_des_9257459.jpg

http://img2.moko.cc/users/0/148/44474/logo/img2_des_8430030.jpg

http://img2.moko.cc/users/0/67/20175/logo/img2_des_9035655.jpg

http://img2.moko.cc/users/0/18/5503/logo/img2_des_9114555.jpg

http://img2.moko.cc/users/0/15/4629/logo/img2_des_9262920.jpg

http://img2.moko.cc/users/0/19/5932/logo/img2_des_9039246.jpg

http://img2.moko.cc/users/0/42/12745/logo/img2_des_8992854.jpg

http://img1.moko.cc/users/3/1181/354499/logo/img1_des_4966539.jpg

http://img2.moko.cc/users/0/36/10924/logo/img2_des_8566131.jpg

http://img1.moko.cc/users/3/1023/306970/logo/img1_des_1000989.jpg

http://img1.moko.cc/users/0/4/1476/logo/img1_des_7541033.jpg

http://img1.moko.cc/users/0/1/363/logo/img1_des_6695256.jpg

http://img1.moko.cc/users/2/626/187854/logo/img1_des_6211549.jpg

http://img2.moko.cc/users/12/3863/1158905/logo/img2_des_9332634.jpg

http://img2.moko.cc/users/21/6324/1897300/logo/img2_des_9202964.jpg

http://img2.moko.cc/users/15/4579/1373912/logo/img2_des_8423039.jpg

http://img2.moko.cc/users/20/6052/1815737/logo/img2_des_8005855.jpg

http://img2.moko.cc/users/19/5997/1799351/logo/img2_des_7965559.jpg

http://img1.moko.cc/users/12/3882/1164727/logo/img1_des_5806364.jpg

http://img1.moko.cc/users/18/5409/1622983/logo/img1_des_7618588.jpg

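Whereas what I see is the pictures themselves, shown in a browser.

The idea is extremely simple. I turned

http://img2.moko.cc/users/0/12/3642/post/8e/img2_cover_9395217.jpg

into

<img src="http://img2.moko.cc/users/0/12/3642/post/8e/img2_cover_9395217.jpg" alt="" />

then put it in an HTML file, opened it, and there were the pictures~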
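
Before bringing Scrapy into it, the URL-to-tag idea can be tried on its own. The following is only a minimal sketch, written for Python 3; the function name build_gallery_html and the output file name preview.html are made up for illustration and are not part of the tutorial project.

```python
from html import escape


def build_gallery_html(urls):
    """Return a minimal HTML page with one <img> tag per URL."""
    # escape() keeps characters such as & or " in a URL from breaking the tag
    tags = ['<img src="{}" alt="" />'.format(escape(u, quote=True)) for u in urls]
    return '<html><body>\n' + '\n'.join(tags) + '\n</body></html>\n'


if __name__ == '__main__':
    sample = ['http://img2.moko.cc/users/0/12/3642/post/8e/img2_cover_9395217.jpg']
    # Hypothetical output file; open it in a browser to see the picture
    with open('preview.html', 'w') as f:
        f.write(build_gallery_html(sample))
```

Since the page only references the remote URLs, no image data is stored on your own disk, which is exactly the point made above.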

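I know the pictures themselves may be a bit of a letdown. Don't worry: for now we are only crawling the images on the home page. Later we will use click counts to crawl the most popular pictures; for the moment, let's take our time and learn the technique.

The main question here is how to write what we crawled to a file.
For that we need two important Scrapy components, Item and Pipeline. A quick introduction first.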

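Item

Put simply, an Item defines the content we want to crawl.

Pipeline

A Pipeline processes the Item objects that were extracted. Here it is mainly used to save them.

Let's get started~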

Step 1

First, look at the items.py file in the code directory. Its contents are:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

We need to define what we want to crawl. Here we want to store the image URL, so we define a url field and change the class to:

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()

Step 2

Next, we need to record the item in the spider. Open spiders/example01.py and replace the line
print img_url
written last time with

urlItem = TutorialItem()
urlItem['url'] = img_url
yield urlItem

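Of course, at the top of the file you need the import below (a class cannot be pulled in with import items.TutorialItem, so use the from form):

from tutorial.items import TutorialItem

That completes the Item setup. Next, we use a Pipeline to save the links.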
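
Why yield instead of print? A printed URL only reaches the console, while a yielded item is handed to whatever is consuming the callback (in Scrapy, that eventually means the pipelines). Here is a rough Scrapy-free sketch of that difference. The functions parse and run are stand-ins invented for illustration, and a plain dict is used in place of TutorialItem (Scrapy also accepts plain dicts as items).

```python
def parse(img_urls):
    """Stand-in for a spider callback: yield one item per image URL."""
    for img_url in img_urls:
        # Before: print(img_url), and the URL only reached the console.
        # After: yield an item so the caller receives it and can process it.
        yield {'url': img_url}


def run(img_urls):
    """Hypothetical stand-in for the engine: collect whatever parse() yields."""
    return list(parse(img_urls))
```

Calling run() with two URLs returns a list of two dicts, each with a url key, which is the shape of data the pipeline in the next step works on.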

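Step 3

First, edit the pipelines.py file. After the change it looks like this:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from tutorial.items import TutorialItem

class TutorialPipeline(object):
    def process_item(self, item, spider):
        # Print each URL with a recognizable prefix to confirm the pipeline runs
        print('-----------' + item['url'])
        # Return the item so any later pipeline still receives it
        return item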

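Step 4

We need to register the pipeline in settings.py. Edit settings.py and add the following:

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1,}

The pipeline is now ready to use. Try running scrapy crawl example01; if you see log lines prefixed with ----------- being printed, the whole flow works.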
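
A note on the number in ITEM_PIPELINES: it sets the order in which pipelines run. Items pass through the pipeline with the lowest value first, and the values are conventionally chosen in the 0 to 1000 range. The sketch below imitates that rule without Scrapy. TagPipeline, LogPipeline and run_pipelines are all hypothetical names used only to show the ordering, and why process_item should return the item.

```python
class TagPipeline(object):
    def process_item(self, item, spider):
        item['tag'] = '<img src="' + item['url'] + '" alt="" />'
        return item  # hand the item on to the next pipeline


class LogPipeline(object):
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # Only sees 'tag' if TagPipeline ran earlier
        self.seen.append(item.get('tag'))
        return item


def run_pipelines(item, pipelines_with_order):
    """Hypothetical helper mimicking the ordering rule: lowest number first."""
    for pipeline, _order in sorted(pipelines_with_order, key=lambda pair: pair[1]):
        item = pipeline.process_item(item, spider=None)
    return item
```

With only one pipeline registered, as in this post, any valid number works. The order starts to matter once you add a second one.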

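The final step

Printing is not enough. What we want is to save an HTML file so we can look at the pictures. Pictures!
This involves writing a file with Python, so I will just paste the code. If anything is unclear, check the Python documentation; it really is quite simple~
We are still in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from tutorial.items import TutorialItem

class TutorialPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        # Wrap each URL in an <img> tag and append it to the file
        text = '<img src="' + item['url'] + '" alt="" />'
        self.mfile.write(text)
        # Return the item so any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.mfile.close()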
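
If you want to go one step further, here is one possible variant, offered as a sketch rather than as the version used in this series. It opens the file in open_spider (a hook Scrapy calls when the spider starts), wraps the tags in a basic page skeleton, and escapes each URL. The class name HtmlGalleryPipeline and the path argument are made up for illustration, and the code assumes Python 3.

```python
from html import escape


class HtmlGalleryPipeline(object):
    """Variant of TutorialPipeline; the name and default path are illustrative."""

    def __init__(self, path='test.html'):
        self.path = path
        self.mfile = None

    def open_spider(self, spider):
        # Called when the spider starts: create the file and open the page
        self.mfile = open(self.path, 'w')
        self.mfile.write('<html><body>\n')

    def process_item(self, item, spider):
        # One <img> tag per line; escape() guards against & or " in the URL
        self.mfile.write('<img src="{}" alt="" />\n'.format(escape(item['url'], quote=True)))
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: close the page and the file
        self.mfile.write('</body></html>\n')
        self.mfile.close()
```

Because the class only relies on its three hook methods, it can be tried without running a crawl: call open_spider(None), feed process_item a few dicts with a url key, then call close_spider(None) and open the resulting file in a browser.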