I've been working on license-plate recognition and wanted a crawler to scrape images for it. I tried a few programs from GitHub but never got them working. Today I came across a Python batch-image scraper and adapted it (it appears to have been written for Python 2). The changes are as follows:
import urllib2
change it to:
import urllib.request
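For reference (this mapping is my own addition, not from the original post), Python 2's urllib2 was split across two modules in Python 3, and the common names carry over directly:

```python
# Python 2's urllib2 no longer exists in Python 3; its contents
# now live under urllib.request and urllib.error.
import urllib.request
import urllib.error

# Python 2                 ->  Python 3
# urllib2.urlopen(url)     ->  urllib.request.urlopen(url)
# urllib2.Request(url)     ->  urllib.request.Request(url)
# urllib2.URLError         ->  urllib.error.URLError

# Building a Request object works the same way (not fetched here):
req = urllib.request.Request('https://movie.douban.com/')
print(req.full_url)  # → https://movie.douban.com/
```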
I'm using BeautifulSoup 4.6.0, so
from BeautifulSoup import BeautifulSoup
change it to:
from bs4 import BeautifulSoup
Running it then produced a warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 63 of the file c:/Users/Administrator/Desktop/ppyy/test_mul.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
Following that hint, change:
soup = BeautifulSoup(self.getPageContent())
to:
soup = BeautifulSoup(self.getPageContent(),"html5lib")
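A small illustration of what naming the parser buys you (the markup string here is made up for the demo). Note that `html5lib` must be pip-installed separately; the stdlib `html.parser` also works and is shown below:

```python
from bs4 import BeautifulSoup

markup = '<body><img src="a.jpg"><img src="b.png"></body>'

# Naming the parser explicitly silences the warning and keeps behavior
# identical across machines. "html.parser" ships with Python itself;
# "html5lib" and "lxml" must be installed separately.
soup = BeautifulSoup(markup, 'html.parser')
print([img.get('src') for img in soup.find_all('img')])  # → ['a.jpg', 'b.png']
```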
Python 3 no longer has the cmp function; equivalent functionality lives in the operator module, so add:
import operator
then change
if cmp(fTail, 'jpg') == 0 or cmp(fTail, 'png') == 0:
to:
if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png') :
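For what it's worth (my own observation, not from the original), `cmp(a, b) == 0` is just an equality test, so a plain `==` or `in` does the same job as `operator.eq`:

```python
import operator

fTail = 'jpg'

# Python 2's cmp(a, b) returned -1/0/1; comparing it to 0 tests equality,
# so all three forms below are equivalent:
assert operator.eq(fTail, 'jpg')
assert fTail == 'jpg'
assert fTail in ('jpg', 'png')  # handles both extensions in one test
print('all three forms agree')
```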
Also, the Python 2 statement
except BaseException, e:
must become:
except BaseException as e:
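The comma form is a SyntaxError in Python 3; only `as` binds the caught exception to a name. A minimal illustration:

```python
# Python 2: except ValueError, e:    <- SyntaxError in Python 3
# Python 3: except ValueError as e:
try:
    int('not a number')
except ValueError as e:
    print('caught:', e)
```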
Finally, convert the print statements to function calls, and the script runs fine.
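The print change itself is mechanical: the Python 2 statement becomes a function call, for example:

```python
index = 1
# Python 2 statement form:  print 'Fetched image %d' % index
# Python 3 function form:
print('Fetched image %d' % index)  # → Fetched image 1
```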
Run result:
Complete code:
#!/usr/bin/env python
import os
import operator
import urllib.request
from bs4 import BeautifulSoup


class Spider:
    def __init__(self):
        # Target site URL
        self.url = 'https://movie.douban.com/'

    # Fetch the page content
    def getPageContent(self):
        response = urllib.request.urlopen(self.url)
        return response.read()

    # Collect the images on the page
    def getImages(self):
        soup = BeautifulSoup(self.getPageContent(), "html5lib")
        items = soup.findAll('img')
        index = 1
        pathName = "iOSAnimationTutorial"
        self.mkdir(pathName)
        for item in items:
            imageUrl = item.get('src')
            fTail = imageUrl.split('.').pop()
            if operator.eq(fTail, 'jpg') or operator.eq(fTail, 'png'):
                savePath = pathName + '/' + str(index) + '.' + fTail
                self.saveImage(imageUrl, savePath)
                print('Fetched image %d' % index)
                index = index + 1

    # Save one image to disk
    def saveImage(self, imageUrl, fileNamePath):
        try:
            u = urllib.request.urlopen(imageUrl)
            data = u.read()
            f = open(fileNamePath, "wb")
            f.write(data)
            f.close()
        except BaseException as e:
            print(e)

    # Create the directory if it does not exist
    def mkdir(self, path):
        path = path.strip()
        isExist = os.path.exists(path)
        if not isExist:
            print("Not exist path:", path)
            os.makedirs(path)
            return True
        else:
            print("Already Exist path:", path)
            return False


spider = Spider()
spider.getImages()
Reference: