I've been working on license-plate recognition and wanted a crawler to scrape images for it. I tried a few programs from GitHub but never got them working. Today I came across a Python batch-image scraper and adapted it (it appears to have been written for Python 2). The changes are as follows:
import urllib2
change it to:
import urllib.request
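For reference (this mapping is my own addition, not from the original post), Python 2's urllib2 was split across two modules in Python 3, and the common names carry over directly:

```python
# Python 2's urllib2 no longer exists in Python 3; its contents
# now live under urllib.request and urllib.error.
import urllib.request
import urllib.error

# Python 2                 ->  Python 3
# urllib2.urlopen(url)     ->  urllib.request.urlopen(url)
# urllib2.Request(url)     ->  urllib.request.Request(url)
# urllib2.URLError         ->  urllib.error.URLError

# Building a Request object works the same way (not fetched here):
req = urllib.request.Request('https://movie.douban.com/')
print(req.full_url)  # → https://movie.douban.com/
```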
I'm using BeautifulSoup 4.6.0, so
from BeautifulSoup import BeautifulSoup
change it to:
from bs4 import BeautifulSoup
Running it then produced a warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 63 of the file c:/Users/Administrator/Desktop/ppyy/test_mul.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
Following that hint, change:
soup = BeautifulSoup(self.getPageContent())
to:
soup = BeautifulSoup(self.getPageContent(),"html5lib")
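A small illustration of what naming the parser buys you (the markup string here is made up for the demo). Note that `html5lib` must be pip-installed separately; the stdlib `html.parser` also works and is shown below:

```python
from bs4 import BeautifulSoup

markup = '<body><img src="a.jpg"><img src="b.png"></body>'

# Naming the parser explicitly silences the warning and keeps behavior
# identical across machines. "html.parser" ships with Python itself;
# "html5lib" and "lxml" must be installed separately.
soup = BeautifulSoup(markup, 'html.parser')
print([img.get('src') for img in soup.find_all('img')])  # → ['a.jpg', 'b.png']
```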
Python 3 no longer has the cmp function; equivalent functionality lives in the operator module, so add:
import operator
then change
if cmp(fTail, 'jpg') == 0 or cmp(fTail, 'png') == 0:
to:
if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png') :
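For what it's worth (my own observation, not from the original), `cmp(a, b) == 0` is just an equality test, so a plain `==` or `in` does the same job as `operator.eq`:

```python
import operator

fTail = 'jpg'

# Python 2's cmp(a, b) returned -1/0/1; comparing it to 0 tests equality,
# so all three forms below are equivalent:
assert operator.eq(fTail, 'jpg')
assert fTail == 'jpg'
assert fTail in ('jpg', 'png')  # handles both extensions in one test
print('all three forms agree')
```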
Also, the Python 2 statement
except BaseException, e:
must become:
except BaseException as e:
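The comma form is a SyntaxError in Python 3; only `as` binds the caught exception to a name. A minimal illustration:

```python
# Python 2: except ValueError, e:    <- SyntaxError in Python 3
# Python 3: except ValueError as e:
try:
    int('not a number')
except ValueError as e:
    print('caught:', e)
```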
Finally, convert the print statements to function calls, and the script runs fine.
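The print change itself is mechanical: the Python 2 statement becomes a function call, for example:

```python
index = 1
# Python 2 statement form:  print 'Fetched image %d' % index
# Python 3 function form:
print('Fetched image %d' % index)  # → Fetched image 1
```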
Run result:
Complete code:
#!/usr/bin/env python
import os
import operator
import urllib.request
from bs4 import BeautifulSoup


class Spider:
    def __init__(self):
        # Target site URL
        self.url = 'https://movie.douban.com/'

    # Fetch the page content
    def getPageContent(self):
        response = urllib.request.urlopen(self.url)
        return response.read()

    # Collect the images on the page
    def getImages(self):
        soup = BeautifulSoup(self.getPageContent(), "html5lib")
        items = soup.findAll('img')
        index = 1
        pathName = "iOSAnimationTutorial"
        self.mkdir(pathName)
        for item in items:
            imageUrl = item.get('src')
            fTail = imageUrl.split('.').pop()
            if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png'):
                savePath = pathName + '/' + str(index) + '.' + fTail
                self.saveImage(imageUrl, savePath)
                print('Fetched image %d' % index)
                index = index + 1

    # Save one image to disk
    def saveImage(self, imageUrl, fileNamePath):
        try:
            u = urllib.request.urlopen(imageUrl)
            data = u.read()
            f = open(fileNamePath, "wb")
            f.write(data)
            f.close()
        except BaseException as e:
            print(e)

    # Create the directory if it does not exist
    def mkdir(self, path):
        path = path.strip()
        isExist = os.path.exists(path)
        if not isExist:
            print("Not exist path:", path)
            os.makedirs(path)
            return True
        else:
            print("Already Exist path:", path)
            return False


spider = Spider()
spider.getImages()
Reference: