Python Web Scraper

I'd been working on license plate recognition and wanted a scraper to collect images. I tried a few programs from GitHub without success. Today I came across a post on batch image scraping with Python and adapted it (the original appears to be Python 2 code). The changes are as follows:

import urllib2

becomes:

import urllib.request
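In Python 3, the names that urllib2 provided (urlopen, Request, and so on) live in the urllib.request module. A minimal offline check (building a Request object needs no network access; the URL is just the one used later in this post):

```python
import urllib.request

# urllib2.Request / urllib2.urlopen from Python 2 are now
# urllib.request.Request / urllib.request.urlopen in Python 3.
req = urllib.request.Request('https://movie.douban.com/',
                             headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_method())  # GET (no data attached, so the method is GET)
print(req.full_url)      # https://movie.douban.com/
```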

I'm using BeautifulSoup 4.6.0, so

from BeautifulSoup import BeautifulSoup

becomes:

from bs4 import BeautifulSoup

Running it then produced the following warning:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 63 of the file c:/Users/Administrator/Desktop/ppyy/test_mul.py. To get rid of this warning, change code that looks like this: 
BeautifulSoup(YOUR_MARKUP})  
to this: 
BeautifulSoup(YOUR_MARKUP, "html5lib") 
markup_type=markup_type))

Following the hint in the warning, change:

 soup = BeautifulSoup(self.getPageContent())

to:

soup = BeautifulSoup(self.getPageContent(),"html5lib")
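As a quick sanity check, a standalone sketch of the same fix on a small hypothetical markup string. It uses the stdlib "html.parser" backend so it runs even where html5lib is not installed; naming any parser explicitly is what silences the warning:

```python
from bs4 import BeautifulSoup

html = '<html><body><img src="a.jpg"><img src="b.png"></body></html>'
# Passing the parser name as the second argument removes the UserWarning;
# "html.parser" is the stdlib backend, "html5lib" the one the warning suggested.
soup = BeautifulSoup(html, 'html.parser')
srcs = [img.get('src') for img in soup.find_all('img')]
print(srcs)  # ['a.jpg', 'b.png']
```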

Python 3 no longer has the built-in cmp function; its functionality lives in the operator module, so add:

import operator

Then change

if cmp(fTail, 'jpg') == 0 or cmp(fTail, 'png') == 0:

to:

if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png'):
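Note that operator.eq(a, b) returns the same boolean as a == b, so in Python 3 the plain comparison (or a membership test) is the more idiomatic spelling:

```python
import operator

fTail = 'jpg'
# All three spellings produce the same boolean result.
a = operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png')
b = fTail == 'jpg' or fTail == 'png'
c = fTail in ('jpg', 'png')  # common shorthand
print(a, b, c)  # True True True
```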

The Python 2 exception syntax also has to change, from

 except BaseException, e:

to:

 except BaseException as e:
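A quick offline illustration of the Python 3-only syntax (the Python 2 comma form is a SyntaxError in Python 3; urlopen rejects a schemeless string with ValueError before touching the network):

```python
import urllib.request

# Python 2: "except BaseException, e:"  -- SyntaxError in Python 3.
# Python 3: "as" is required to bind the exception to a name.
try:
    urllib.request.urlopen('not-a-valid-url')  # no scheme -> ValueError
except ValueError as e:
    caught = type(e).__name__
print(caught)  # ValueError
```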

The last change is converting the print statements to the Python 3 print() function; after that the script runs cleanly.
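For completeness, the print change is just statement to function (shown here with a made-up counter value):

```python
index = 3
# Python 2: print 'Fetched %d images' % index   (print statement)
# Python 3: parentheses are required, because print is a function.
message = 'Fetched %d images' % index
print(message)  # Fetched 3 images
```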

Output:

Complete code:

#!/usr/bin/env python
import urllib.request
import os
from bs4 import BeautifulSoup
import operator

class Spider:
    # Target site
    def __init__(self):
        self.url = 'https://movie.douban.com/'

    # Fetch the raw page content
    def getPageContent(self):
        response = urllib.request.urlopen(self.url)
        return response.read()

    # Extract and save the images
    def getImages(self):
        soup = BeautifulSoup(self.getPageContent(), "html5lib")
        items = soup.find_all('img')
        index = 1
        pathName = "iOSAnimationTutorial"
        self.mkdir(pathName)

        for item in items:
            imageUrl = item.get('src')
            if not imageUrl:  # skip <img> tags without a src attribute
                continue
            fTail = imageUrl.split('.').pop()
            if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png'):
                savePath = pathName + '/' + str(index) + '.' + fTail
                self.saveImage(imageUrl, savePath)
                print('Fetched %d images' % index)
                index = index + 1

    # Save one image to disk
    def saveImage(self, imageUrl, fileNamePath):
        try:
            u = urllib.request.urlopen(imageUrl)
            data = u.read()
            f = open(fileNamePath, "wb")
            f.write(data)
            f.close()
        except BaseException as e:
            print(e)

    # Create the directory if it does not exist
    def mkdir(self, path):
        path = path.strip()
        isExist = os.path.exists(path)
        if not isExist:
            print("Not exist path:", path)
            os.makedirs(path)
            return True
        else:
            print("Already Exist path:", path)
            return False

spider = Spider()
spider.getImages()

References:

Installed the BeautifulSoup library successfully, but why does the import fail?

Replacing Python 2's cmp function in Python 3
