How to Scrape Image Data with Python_Python 3.6 Crawler Starter Project 1: Scraping Images from the Qiushibaike Website... - CSDN Blog

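1. Overview of the steps

The goal of this small project is to show how to scrape images from a website. A brief outline follows.

The steps are:

1. Write the simplest possible code and confirm that Qiushibaike can be reached through its URL.

2. Decode the data returned by the Qiushibaike server to obtain the complete HTML.

3. Inspect the Qiushibaike HTML, match it with regular expressions, and download the images that users have posted.

That is the whole workflow for this project. Now let's try it out.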

2. Step-by-step implementation

1. The simplest request with urlopen

Python

import urllib.request

# URL of the first page of Qiushibaike
url = "https://www.qiushibaike.com/8hr/page/1/"

# Open the URL with urlopen
response = urllib.request.urlopen(url)

# Read the returned data
data = response.read()
print(data)

Running it produces an error:

Python

D:\python\python.exe E:/技术学习/Python代码/15.爬取糗事百科小项目/easy_1.py
Traceback (most recent call last):
  File "E:/技术学习/Python代码/15.爬取糗事百科小项目/easy_1.py", line 7, in <module>
    response = urllib.request.urlopen(url)
  File "D:\python\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "D:\python\lib\http\client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

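2. Adding exception handling

Oops, an error! Even the simplest code cannot reach the site. Let's improve it by catching the exception instead of crashing outright, starting with the URLError and HTTPError classes mentioned last time.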

Python

import urllib.request
import urllib.error

# URL of the first page of Qiushibaike
url = "https://www.qiushibaike.com/8hr/page/1/"

try:
    # Open the URL with urlopen
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
else:
    # Read the returned data
    data = response.read()
    print(data)

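Run it: it still crashes. Look closely at the cause:

http.client.RemoteDisconnected: Remote end closed connection without response

Now it makes sense. This is not a URLError raised by urllib but an exception raised from http.client, so the two handlers above never see it. Let's modify the code again: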

Python

import urllib.request
import urllib.error
import http.client

# URL of the first page of Qiushibaike
url = "https://www.qiushibaike.com/8hr/page/1/"

try:
    # Open the URL with urlopen
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
except http.client.error as e:
    # Catches exceptions raised by http.client, such as RemoteDisconnected
    print(e)
else:
    # Read the returned data
    data = response.read()
    print(data)
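
Why the third handler catches the error while the first two do not can be checked offline. The sketch below only inspects the standard-library exception classes; the helper name `caught_by` is made up for illustration and makes no request to any site.

```python
import http.client
import urllib.error


def caught_by(exc_type, handler_type):
    """Return True if `except handler_type` would catch an exc_type instance."""
    return issubclass(exc_type, handler_type)


# RemoteDisconnected belongs to http.client, not to urllib.error.
print(caught_by(http.client.RemoteDisconnected, urllib.error.URLError))
print(caught_by(http.client.RemoteDisconnected, http.client.HTTPException))

# http.client.error is a backward-compatible alias of HTTPException.
print(http.client.error is http.client.HTTPException)

# HTTPError is a subclass of URLError, so its handler has to come first.
print(caught_by(urllib.error.HTTPError, urllib.error.URLError))
```

Because `HTTPError` derives from `URLError`, swapping the first two `except` clauses would make the `HTTPError` branch unreachable.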

Run it again and the exception is now caught instead of crashing the script:

Python

D:\python\python.exe E:/技术学习/Python代码/15.爬取糗事百科小项目/easy_1.py
Remote end closed connection without response

Process finished with exit code 0

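The printed error says that the remote host closed the connection without responding. In other words, our request failed.

3. Adding headers to pose as a browser

The simplest crawler above failed with: Remote end closed connection without response. In other words, the server ignored me. My cute little crawler went knocking and got no answer, which is rather sad. Why? This is usually down to the request headers. The server did receive the crawler's request, but the request carried none of the headers a browser would send, so the server decided this visitor looked suspicious and chose not to reply. The fix is to add header information to the request and dress the little crawler up as a browser, so that the server's gatekeeper lets it through.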

Python

import urllib.request
import urllib.error
import http.client
import ssl

# Disable SSL certificate verification globally (convenient for this exercise, but insecure)
ssl._create_default_https_context = ssl._create_unverified_context

# Build the headers that make the request look like a browser
header = {
    'Connection': 'Keep-Alive',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'X-Requested-With': 'XMLHttpRequest',
}

url = "https://www.qiushibaike.com/8hr/page/1/"

try:
    # Build a request that carries the headers
    request = urllib.request.Request(url, headers=header)
    # Open the request with urlopen to fetch the data
    response = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
except http.client.error as e:
    print(e)
else:
    # Read the returned data
    data = response.read()
    print(data)
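
To see what the disguise actually changes without touching the network, the following sketch compares the User-Agent that urllib announces by default with the one set through `headers=`. It is only an illustration: the variable names `default_agent` and `custom_agent` are made up here, and no request is sent.

```python
import urllib.request

url = "https://www.qiushibaike.com/8hr/page/1/"
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
}

# Header that an opener adds when the request does not set its own User-Agent.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

# Request stores header names via str.capitalize(), hence "User-agent".
request = urllib.request.Request(url, headers=header)
custom_agent = request.get_header("User-agent")

print(default_agent)  # starts with "Python-urllib/"
print(custom_agent)   # the browser-like string from the header dict
```

A server that filters on User-Agent can tell the default value apart from a browser at a glance, which is presumably why the bare request was dropped.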

With the rewrite done, run it again:

Python

D:\python\python.exe E:/技术学习/Python代码/15.爬取糗事百科小项目/easy.py
b'\x1f\x8b\x08\x00\xba\xabF[\x02\xff\xed}……

Process finished with exit code 0

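Ha, the request went through and the server returned data... but wait, something is still off. What is this stuff? It isn't HTML. Hold on, let's open 360 Browser and capture the traffic with its developer tools.

1. Open the developer tools in 360 Browser.

2. Click the Network tab.

3. The request list on the left is empty at first; reload the page from the address bar.

4. Find the first entry, which is the request to the main Qiushibaike domain.

5. Click it and look at the response information.

Now it is clear: the returned data is compressed with gzip. To get the HTML page we have to decompress it first. Let's upgrade the code: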

Python

import urllib.request
import urllib.error
import http.client
import ssl
import gzip
import zlib

# Disable SSL certificate verification globally
ssl._create_default_https_context = ssl._create_unverified_context


def gzip_decompress(data):
    try:  # try to decompress
        print('正在使用gzip解压.....')  # "decompressing with gzip..."
        data = gzip.decompress(data)
        print('gzip解压完毕!')  # "gzip decompression finished!"
    except (OSError, EOFError, zlib.error):
        print('未经gzip压缩, 无需解压')  # "not gzip-compressed, nothing to do"
    return data


# deflate compression
def deflate_decompress(data):
    print('正在使用deflate解压.....')  # "decompressing with deflate..."
    try:
        # Raw deflate stream without a zlib header
        result = zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        # Fall back to a zlib-wrapped stream
        result = zlib.decompress(data)
    print('deflate解压完毕!')  # "deflate decompression finished!"
    return result


# Build the headers that make the request look like a browser
header = {
    'Connection': 'Keep-Alive',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'X-Requested-With': 'XMLHttpRequest',
}

url = "https://www.qiushibaike.com/8hr/page/1/"

try:
    # Build a request that carries the headers
    request = urllib.request.Request(url, headers=header)
    # Open the request with urlopen to fetch the data
    response = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
except http.client.error as e:
    print(e)
else:
    # Read the returned data
    content = response.read()
    # Check which compression the server used
    encoding = response.info().get('Content-Encoding')
    # gzip: print the bytes before and after decompression
    if encoding == 'gzip':
        print(content)
        content = gzip_decompress(content)
        print(content)
    # deflate is rare today, but older sites may still use it
    elif encoding == 'deflate':
        content = deflate_decompress(content)
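
The bytes printed earlier start with `\x1f\x8b`, which is the two-byte signature that opens every gzip stream. As a hedged, offline sketch of the same idea, the hypothetical helper `maybe_gunzip` below checks that signature before decompressing; the sample HTML is invented for the demonstration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data only if it starts with the gzip signature."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Made-up page content, compressed locally instead of fetched from a server.
html = "<html><body>糗事百科</body></html>".encode("utf-8")
packed = gzip.compress(html)

print(packed[:2] == GZIP_MAGIC)      # compressed data carries the signature
print(maybe_gunzip(packed) == html)  # round trip restores the original bytes
print(maybe_gunzip(html) == html)    # plain data passes through untouched
```

Checking the `Content-Encoding` header, as the script does, is the more conventional route; the signature check is just a quick way to recognize gzip data when looking at raw bytes.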

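We have already established that Qiushibaike uses gzip, so handling gzip alone would be enough here. The deflate branch is included because sites you scrape later, older ones in particular, may still send deflate, even though it is rarely used nowadays. For more on gzip, readers can...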
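
The deflate branch tries two decodings because servers are not consistent: some send a raw deflate stream, others wrap it in a zlib header. The sketch below, with a hypothetical helper named `inflate` and an invented payload, produces both variants locally and shows that the try/fallback order handles each.

```python
import zlib


def inflate(data: bytes) -> bytes:
    """Decode a 'deflate' body whether or not it carries a zlib header."""
    try:
        # Negative wbits means a raw deflate stream with no header.
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        # Otherwise assume the zlib-wrapped form.
        return zlib.decompress(data)


payload = b"<html>hello</html>" * 10  # made-up body for the demonstration

# zlib-wrapped variant
wrapped = zlib.compress(payload)

# raw variant, produced with negative wbits
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = compressor.compress(payload) + compressor.flush()

print(inflate(raw) == payload)
print(inflate(wrapped) == payload)
```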

Running the upgraded script prints the following:

Python

D:\python\python.exe E:/技术学习/Python代码/test/test.py
b'\x1f\x8b\x08\x00\xda\xcbF[\x02\xff\xed}\xebW\x1bI\x96……
正在使用gzip解压.....
gzip解压完毕!
b'\n\n

This is still not quite the HTML we want. To print it the way the page source looks, the bytes also need to be decoded:

Python

    # (inside the else: branch of the script)
    # Check which compression the server used
    encoding = response.info().get('Content-Encoding')
    # gzip: decompress, then decode the bytes as UTF-8 text
    if encoding == 'gzip':
        print(content)
        content = gzip_decompress(content)
        print(content.decode(encoding='utf-8', errors='strict'))
    # deflate is rare today, but older sites may still use it
    elif encoding == 'deflate':
        content = deflate_decompress(content)

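That is, add print(content.decode(encoding='utf-8', errors='strict')) to decode the bytes as UTF-8, so that what gets printed is the HTML text. With that, after much hardship, our little crawler has finally fetched Qiushibaike's HTML.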
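
The `errors` argument decides what happens when the bytes are not valid UTF-8. A small offline sketch, using an invented byte string that is deliberately cut short to simulate damaged data:

```python
raw = "糗事百科".encode("utf-8")

# Well-formed bytes decode cleanly in strict mode.
text = raw.decode(encoding="utf-8", errors="strict")

# Drop the last byte so the final character is incomplete.
broken = raw[:-1]

try:
    broken.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    # 'strict' refuses to guess and raises instead.
    strict_failed = True

# 'replace' substitutes U+FFFD for whatever cannot be decoded.
lenient = broken.decode("utf-8", errors="replace")

print(text)
print(strict_failed)
print(lenient)
```

Strict mode is a reasonable default while developing, since a decoding error usually means the page is not UTF-8 or was not fully decompressed.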

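3. Scraping the images on one page

Next comes the regular-expression parsing, which had me stuck for a long time.

Python

[HTML excerpt of a post's image tag; only its alt text survives: "孩子的世界就是这么的单纯。" ("A child's world is just that simple.")]

To grab the images we need to look at the page's HTML source and find the images that users posted.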

Python

import urllib.request
import urllib.error
import http.client
import ssl
import gzip
import zlib
import re
import os

# Disable SSL certificate verification globally
ssl._create_default_https_context = ssl._create_unverified_context


def gzip_decompress(data):
    try:  # try to decompress
        print('正在使用gzip解压.....')  # "decompressing with gzip..."
        data = gzip.decompress(data)
        print('gzip解压完毕!')  # "gzip decompression finished!"
    except (OSError, EOFError, zlib.error):
        print('未经gzip压缩, 无需解压')  # "not gzip-compressed, nothing to do"
    return data


# deflate compression
def deflate_decompress(data):
    print('正在使用deflate解压.....')  # "decompressing with deflate..."
    try:
        # Raw deflate stream without a zlib header
        result = zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        # Fall back to a zlib-wrapped stream
        result = zlib.decompress(data)
    print('deflate解压完毕!')  # "deflate decompression finished!"
    return result


# Build the headers that make the request look like a browser
header = {
    'Connection': 'Keep-Alive',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'X-Requested-With': 'XMLHttpRequest',
}

url = "https://www.qiushibaike.com/8hr/page/1/"

try:
    # Build a request that carries the headers
    request = urllib.request.Request(url, headers=header)
    # Open the request with urlopen to fetch the data
    response = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e.code)
except urllib.error.URLError as e:
    print(e.reason)
except http.client.error as e:
    print(e)
else:
    # Read the returned data
    content = response.read()
    # Check which compression the server used
    encoding = response.info().get('Content-Encoding')
    if encoding == 'gzip':
        content = gzip_decompress(content)
    # deflate is rare today, but older sites may still use it
    elif encoding == 'deflate':
        content = deflate_decompress(content)
    # Decode to str so that the str patterns below can be applied
    content = content.decode(encoding='utf-8', errors='strict')

    # Collect every img link, including avatars and the like.
    # NOTE: the original pattern was lost when this page was extracted;
    # the pattern below is an assumption that captures each <img> src value.
    imgre = re.compile(r'<img[^>]*?\bsrc="(.*?)"')
    imglist = imgre.findall(content)  # findall() returns every src captured by imgre
    print(imglist)

    # Loop over the image URLs and save the matching ones locally.
    # Each image is fetched with urlopen() and written to a file named by the counter x.
    x = 0
    dirpath = 'E:/爬虫数据/1.糗百图片/'
    if os.path.exists(dirpath):
        print("文件夹已经存在")  # "folder already exists"
    else:
        os.makedirs(dirpath)
        print("文件夹不存在！刚刚已经创建")  # "folder did not exist; just created it"

    # Compile the filter once: only .jpeg files hosted on pic.qiushibaike.com
    pattern = re.compile(r'\w?//pic\.qiushibaike\.com/.*\.jpeg')
    for imgurl in imglist:
        # Try to match the URL against the filter
        if pattern.match(imgurl):
            # On a match, download the image and save it to the local folder
            try:
                # The src values in Qiushibaike's HTML omit the scheme, so prepend it
                image_data = urllib.request.urlopen("https:" + imgurl).read()
                image_path = dirpath + str(x) + '.jpeg'
                x += 1
                print(image_path)
                # The with block closes the file automatically
                with open(image_path, 'wb') as image_file:
                    image_file.write(image_data)
            except urllib.error.URLError as e:
                print('Download failed')

The scraped images are saved to disk. The script first checks whether the target folder exists: if it does, the image data is written straight into it; if not, the folder is created automatically first. The result of the run:

Python

D:\python\python.exe E:/技术学习/Python代码/test/test.py
正在使用gzip解压.....
gzip解压完毕!
['/static/images/banner.png', '//pic.qiushibaike.com/system/avtnew/2858/2858128
E:/爬虫数据/1.糗百图片/0.jpeg
E:/爬虫数据/1.糗百图片/1.jpeg
E:/爬虫数据/1.糗百图片/2.jpeg
E:/爬虫数据/1.糗百图片/3.jpeg
E:/爬虫数据/1.糗百图片/4.jpeg
E:/爬虫数据/1.糗百图片/5.jpeg
E:/爬虫数据/1.糗百图片/6.jpeg
E:/爬虫数据/1.糗百图片/7.jpeg
E:/爬虫数据/1.糗百图片/8.jpeg
E:/爬虫数据/1.糗百图片/9.jpeg
E:/爬虫数据/1.糗百图片/10.jpeg
E:/爬虫数据/1.糗百图片/11.jpeg
E:/爬虫数据/1.糗百图片/12.jpeg
E:/爬虫数据/1.糗百图片/13.jpeg

Process finished with exit code 0
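
The filtering step can be tried offline on a small piece of markup. Everything in the sketch below is an assumption made for illustration: `sample_html` is invented and is not Qiushibaike's real markup, and the two patterns are one plausible way to capture `src` values and keep only the `.jpeg` files on pic.qiushibaike.com. `urllib.parse.urljoin` is used in place of string concatenation to turn the scheme-less `//host/path` form into a full URL.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://www.qiushibaike.com/8hr/page/1/"

# Hypothetical markup for illustration only; the real page differs.
sample_html = '''
<img src="/static/images/banner.png" alt="banner">
<img src="//pic.qiushibaike.com/system/avtnew/1/avatar.jpg" alt="user">
<img src="//pic.qiushibaike.com/system/pictures/1/medium/app1.jpeg" alt="post">
'''

# Capture the src attribute of every <img> tag.
img_src = re.compile(r'<img[^>]*?\bsrc="([^"]+)"')
# Keep only .jpeg files hosted on pic.qiushibaike.com.
post_image = re.compile(r'//pic\.qiushibaike\.com/.*\.jpeg$')

all_srcs = img_src.findall(sample_html)
# urljoin borrows the scheme of the page URL for scheme-less links.
image_urls = [urljoin(PAGE_URL, src) for src in all_srcs if post_image.match(src)]

print(all_srcs)
print(image_urls)
```

Escaping the dots in the filter matters: an unescaped `.` matches any character, so a looser pattern could accept URLs that merely resemble the intended host or extension.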

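4. Summary

As an exercise, you can take this further and write a version that scrapes the images from any page. In this article we scraped Qiushibaike's images step by step. It may all look simple, but I ran into several nasty pitfalls while writing the code alongside the text. In the next article we will strike while the iron is hot and go on to scrape Qiushibaike's jokes, authors, comment counts, like counts, and so on.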
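
One possible starting point for the multi-page exercise is to generate the page URLs first. The sketch below assumes that later pages follow the same pattern as the first-page URL used in this article, with only the page number changing; that assumption is not verified here, and the names `PAGE_TEMPLATE` and `page_urls` are hypothetical.

```python
# Assumed URL pattern, extrapolated from the first-page URL in this article.
PAGE_TEMPLATE = "https://www.qiushibaike.com/8hr/page/{page}/"


def page_urls(first, last):
    """Build the list of page URLs from first to last, inclusive."""
    return [PAGE_TEMPLATE.format(page=n) for n in range(first, last + 1)]


for page_url in page_urls(1, 3):
    # Each URL could be passed through the same fetch, decompress,
    # decode and extract steps that were applied to page 1.
    print(page_url)
```

When looping over several pages, pausing briefly between requests (for example with `time.sleep`) is a considerate habit and makes it less likely that the server cuts the connection again.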
