Python Scrapy: notes on scraping images

Latest recommended article published on 2024-07-19 14:23:02

huahuizaikai1

Article link: https://blog.csdn.net/huahuizaikai1/article/details/51475583

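Beginner's notes:

Python's Scrapy crawling framework

Example 1: scraping images from Baidu Tieba

Summary:

Scraping images requires Python's Pillow library.

The key part is the image-handling pipeline in pipelines.py.

If you run into encoding problems, first try using Notepad++ to convert the file to UTF-8 (the "without BOM" variant).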
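
The encoding tip above most likely refers to the Notepad++ "UTF-8 without BOM" option. As a rough sketch of the same fix in Python 3, and assuming the file is already UTF-8 and only the leading byte-order mark is in the way, the helper below strips it. The name `strip_utf8_bom` is made up for this illustration.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a UTF-8 BOM; return True if one was removed."""
    with open(path, "rb") as f:
        raw = f.read()
    had_bom = raw.startswith(codecs.BOM_UTF8)
    if had_bom:
        with open(path, "wb") as f:
            f.write(raw[len(codecs.BOM_UTF8):])
    return had_bom


# Demo on a throwaway file that starts with a BOM
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"# -*- coding: utf-8 -*-\n")
removed = strip_utf8_bom(demo_path)
with open(demo_path, "rb") as f:
    cleaned = f.read()
os.remove(demo_path)
```

A file saved in another encoding (GBK, for example) would need to be decoded with that encoding and re-saved as UTF-8, which this sketch does not attempt.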

1. Create the project: on the command line, run scrapy startproject tiebapic

	│  scrapy.cfg
	│
	└─tiebapic
    	│  items.py
    	│  pipelines.py
    	│  settings.py
    	│  __init__.py
    	│
    	└─spiders
            	__init__.py

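The main files are:

scrapy.cfg: the project configuration file
tiebapic/: the project's Python module; your code will be imported from here
tiebapic/items.py: the project's items file
tiebapic/pipelines.py: the project's pipelines file
tiebapic/settings.py: the project's settings file
tiebapic/spiders: the directory that holds the spiders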

Define the Item:

from scrapy.item import Item, Field

class TiebapicItem(Item):
    tieba_image_urls = Field()  # image URLs

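An item is declared by subclassing scrapy.item.Item, and each of its attributes is defined as a scrapy.item.Field object.

Modeling the item this way turns it into a container for the data you want to collect.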

Write the spider:

# -*- coding: utf-8 -*-

from scrapy.selector import Selector  # selector: picks the data we need out of the response (HTML)
from scrapy.spiders import Spider  # base spider class; there also seem to be CrawlSpider and BaseSpider
from tiebapic.items import TiebapicItem  # the item container
from scrapy.http import Request  # to crawl several pages, use Request(url, callback=method),
# or the rules of a CrawlSpider:
# rules = [
#     Rule(LinkExtractor(
#              allow='basic pattern for the page links', restrict_xpaths='XPath limiting where to look'),
#          callback='parse_item',  # callback method
#          follow=True)  # whether to follow links
# ]
import re

class tieba_spider(Spider):
    name = 'tiebapic'  # spider name
    download_delay = 0.2  # download delay in seconds
    s = []  # we want the images of the whole forum, so the start URLs go into a list
    for i in range(10):  # only crawl the images of all threads on the first 10 pages
        s.append("http://tieba.baidu.com/f?kw=二次元&ie=utf-8&pn=" + "%s" % (50 * i))

    start_urls = s

    def parse(self, response):
        sel = Selector(response)

        # url_list holds the address of every thread on the current list page
        url_list = sel.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/@href').extract()
        for url in url_list:
            url = 'http://tieba.baidu.com' + url
            yield Request(url, callback=self.parse_item, dont_filter=True)
            # Note: dont_filter=True bypasses the duplicate filter and the
            # offsite check that is based on allowed_domains

    def parse_item(self, response):
        sel = Selector(response)
        item = TiebapicItem()
        image_url = []
        # The ['BDG_Image'] predicate is a bare string, which is always true,
        # so this selects the src of every <img> on the page
        image_url_list = sel.xpath(".//img['BDG_Image']/@src").extract()
        # Use a regex to keep only the image links we want
        for i in image_url_list:
            q = re.findall(r'(http://imgsrc\.baidu\.com[^"]+\.jpg)', i)
            if q:
                image_url.append(q[0])
        # Yield one item per thread page, after all links have been collected
        if image_url:
            item['tieba_image_urls'] = image_url
            yield item
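
The two pure-Python pieces of the spider, building the list-page URLs and filtering image links with a regex, can be tried without Scrapy. The sketch below targets Python 3 and is illustrative only: the helper names `page_urls` and `filter_jpg_urls` are invented here, percent-encoding the keyword with `urlencode` is an assumption (the spider above concatenates it as-is), and dropping duplicate links is an extra step that the spider does not perform.

```python
import re
from urllib.parse import urlencode

# Same idea as the spider's pattern: JPEG links hosted on imgsrc.baidu.com
JPG_PATTERN = re.compile(r'http://imgsrc\.baidu\.com[^"]+\.jpg')


def page_urls(keyword, pages=10, page_size=50):
    """Build list-page URLs for the first `pages` pages; pn is the thread offset."""
    base = "http://tieba.baidu.com/f?"
    return [
        base + urlencode({"kw": keyword, "ie": "utf-8", "pn": page_size * i})
        for i in range(pages)
    ]


def filter_jpg_urls(srcs):
    """Keep the matching image URLs, in order and without duplicates."""
    seen = set()
    kept = []
    for src in srcs:
        match = JPG_PATTERN.search(src)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))
            kept.append(match.group(0))
    return kept
```

For example, `page_urls("二次元")` yields ten URLs whose `pn` value runs from 0 to 450 in steps of 50, matching the loop in the spider.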

The most important step: write the pipeline (which processes the item data):

# -*- coding: utf-8 -*-

# Python 2 only: these three lines fail on Python 3 and should be removed there
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline  # image handling
from scrapy.exceptions import DropItem  # exception used to discard an item
from scrapy import Request


class TiebapicPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # ImagesPipeline reads item['image_urls'] by default; our field has a
        # different name, so we override this method
        for image_url in item['tieba_image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('No images were downloaded for %s' % item)
        return item
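
The list comprehension in item_completed relies on the shape of `results`: a list of `(success, info)` pairs, where `info` is a dict containing a `path` key when the download succeeded. A minimal Scrapy-free sketch of that filtering step follows. The sample data is invented, the helper name `successful_paths` is hypothetical, and a plain `Exception` stands in for the failure object that Scrapy would pass.

```python
def successful_paths(results):
    """Return the 'path' of every download whose success flag is True."""
    return [info["path"] for ok, info in results if ok]


# Hand-written sample shaped like (success, info) pairs; all values are made up
sample_results = [
    (True, {"url": "http://imgsrc.baidu.com/forum/a.jpg",
            "path": "full/a.jpg",
            "checksum": "abc"}),
    (False, Exception("download failed")),
]
paths = successful_paths(sample_results)
```

When every download fails, the list is empty, which is the case where the pipeline raises DropItem.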

Write the settings:

BOT_NAME = 'tiebapic'

SPIDER_MODULES = ['tiebapic.spiders']
NEWSPIDER_MODULE = 'tiebapic.spiders'
ITEM_PIPELINES = {
    'tiebapic.pipelines.TiebapicPipeline': 1
}
IMAGES_STORE = 'D:/img'  # directory where the images are saved
IMAGES_EXPIRES = 90  # expiry time in days
IMAGES_MIN_HEIGHT = 200  # skip images below this height and width
IMAGES_MIN_WIDTH = 100

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tiebapic (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True  # This one gave me a lot of trouble. Some sites place a robots.txt file in their root directory, and crawlers are expected to follow it.
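
To see what a robots.txt rule does to a URL before crawling, Python 3's standard library can evaluate one offline. This is only a sketch: the robots.txt content, the example.com URLs, and the helper name `is_allowed` are all made up, and it does not show how Scrapy itself applies ROBOTSTXT_OBEY.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string so no network access is needed
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_ROBOTS_TXT, "tiebapic", "http://example.com/private/a.jpg")
allowed = is_allowed(SAMPLE_ROBOTS_TXT, "tiebapic", "http://example.com/public/a.jpg")
```

If a spider returns no pages while ROBOTSTXT_OBEY is enabled, checking the target site's rules like this is one way to tell whether the requests are being filtered.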