Scraping Website Images with BeautifulSoup and requests (Very Detailed)

NO17-MONSter

Tags: python, html, garbled text

Article link: https://blog.csdn.net/Monster_No17/article/details/106277822

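Preparing the libraries

bs4, requests, re, lxml
re ships with Python; bs4, requests and lxml are third-party packages. CSDN has plenty of installation tutorials, so I won't walk through them here.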

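Preparing the website

The site to scrape is the background-image collection on 站长之家 (Chinaz): http://sc.chinaz.com/tupian/beijingtupian.html.

The goal is to download every background image in that collection.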

Fetching the page source with requests

First, fetch the page source with requests.get():

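import re
import requests
# lxml only needs to be installed; BeautifulSoup selects it by name ('lxml')
from bs4 import BeautifulSoup as ba

url = 'http://sc.chinaz.com/tupian/beijingtupian_2.html'
# Adjacent string literals are joined, so no stray whitespace ends up in the header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/83.0.4103.61 Safari/537.36'}
reponse = requests.get(url=url, headers=headers)
print(reponse.text)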

Result (partial):

ï»¿<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>èƒŒæ™¯å›¾ç‰‡ã€èƒŒæ™¯å›¾ç‰‡å¤§å
¨ã€æ¡Œé¢èƒŒæ™¯å›¾ç‰‡ã€èƒŒæ™¯å›¾ç‰‡ç´ æ_ç«™é•¿ç´ æ</title>
<meta name="Keywords" content="èƒŒæ™¯å›¾ç‰‡,èƒŒæ™¯å›¾ç‰‡å¤§å
¨,èƒŒæ™¯å›¾ç‰‡ç´ æ,èƒŒæ™¯å›¾ç‰‡ä¸‹è½½" />
<meta name="description" content="èƒŒæ™¯å›¾ç‰‡æ ç›®æ”¶é›†ç©ºé—´èƒŒæ™¯å›¾ç‰‡,æ‰‹æœºèƒŒæ™¯å›¾ç‰‡,é«˜æ¸
èƒŒæ™¯å›¾ç‰‡,æµªæ¼«èƒŒæ™¯å›¾ç‰‡,æ¡Œé¢èƒŒæ™¯å›¾ç‰‡é«˜æ¸
,æ¸
æ–°æ·¡é›
èƒŒæ™¯å›¾ç‰‡,å¥½çœ‹çš„èƒŒæ™¯å›¾ç‰‡ç´ ææä¾›ç»™å¹¿å¤§ç”¨æˆ·å
è´¹ä¸‹è½½ã€‚" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="x-ua-compatible" content="ie=7" />
<link href="/style/pic_all.css" type="text/css" rel="stylesheet" />

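The output is full of garbled characters, whereas the page source viewed in a browser shows the Chinese text correctly.
That points to an encoding problem, so check which encoding requests assigned to the response: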

reponse = requests.get(url=url, headers=headers)
print(reponse.encoding)

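The output is ISO-8859-1. Some of the sources I checked blame page compression, but the more likely cause is that the server's Content-Type header declares no charset, in which case requests falls back to ISO-8859-1 for text responses.
The page itself is UTF-8 (see the meta tag in the output above), so decoding it as ISO-8859-1 produces the garbled text.
Fix: use encode and decode to convert the fetched source back to UTF-8.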

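reponse = requests.get(url=url, headers=headers).text
# When the page is UTF-8 but requests decoded it as ISO-8859-1, undo the wrong
# decoding and decode again ('ignore' skips any bytes that cannot be decoded)
reponse = reponse.encode("ISO-8859-1")
reponse = reponse.decode("utf-8", 'ignore')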

Print the source again and the encoding problem is gone.
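To see why the encode/decode round trip works, here is a minimal sketch that reproduces the problem without any network access. The sample string and the variable names are illustrative only; they are not taken from the site's response.

```python
# A short piece of Chinese text, standing in for the page title.
original = "背景图片"

# What the server sends: the UTF-8 bytes of the text.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields the garbled text.
garbled = raw_bytes.decode("ISO-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding the
# garbled text back recovers the original bytes, which then decode as UTF-8.
repaired = garbled.encode("ISO-8859-1").decode("utf-8")

print(len(original), len(raw_bytes), len(garbled))  # 4 12 12
print(repaired == original)                         # True
```

requests also lets you assign the encoding on the response object (for example reponse.encoding = 'utf-8') before reading .text, which avoids the round trip altogether.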

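Building the BeautifulSoup object

With the page source in hand, the next step is to parse it and use find_all() to pull out what we need.
Start by studying the page. We don't want the thumbnails shown on the listing page; we want the original images. Clicking an image's title opens the page that holds the original image, so what we need first is the link to that page. To find it, hover over the image title, right-click, and choose Inspect to jump to the corresponding markup.
Then inspect a second image in the same way.

Each one contains a link ending in .htm. Copy it and open it in the browser.

The page that opens shows the image at full size, which is our target. Right-click and inspect it as well.

The source contains the URL of the image itself, and calling requests.get() on that URL retrieves the image.
That gives the overall plan:

1. Get the link to the image's full-size page
2. Open that link
3. Get the URL of the image itself
4. Download the image

First, get the links to the full-size pages. Comparing the markup for two images on the site shows that these links all share one format.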

<a target="_blank" href="http://sc.chinaz.com/tupian/200521063411.htm" alt="咖啡豆黄色背景图片">咖啡豆黄色背景图片</a>
<a target="_blank" href="http://sc.chinaz.com/tupian/200521351214.htm" alt="爱心平铺背景图片">爱心平铺背景图片</a>

First extract these tags from the source. find_all() with attribute matching does it:

sourhtml = ba(reponse, 'lxml')
imagehtml = sourhtml.find_all('a', alt=re.compile('图片'), target="_blank", href=re.compile('http://sc.chinaz.com/tupian'))

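Because the tags are so alike, we can match directly on: tag name 'a', an alt attribute containing "图片" ("picture"), a target attribute equal to "_blank", and an href attribute containing "http://sc.chinaz.com/tupian".
find_all() returns a list-like object, so iterate over it with a for loop to print each tag.
From each tag we then need the value of href, which the get() method returns.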
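The filter can be tried offline on a small HTML fragment. The sketch below reuses the two anchor tags quoted earlier and adds a third, made-up link to show what gets excluded. It assumes Python's built-in "html.parser" instead of lxml so that it runs without the extra dependency; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Two real-looking anchors from the article plus one hypothetical unrelated link.
sample_html = """
<div>
  <a target="_blank" href="http://sc.chinaz.com/tupian/200521063411.htm" alt="咖啡豆黄色背景图片">咖啡豆黄色背景图片</a>
  <a target="_blank" href="http://sc.chinaz.com/tupian/200521351214.htm" alt="爱心平铺背景图片">爱心平铺背景图片</a>
  <a href="/about.html">unrelated link without alt or target</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments filter on attributes; a compiled regex matches by search.
links = soup.find_all(
    "a",
    alt=re.compile("图片"),
    target="_blank",
    href=re.compile("http://sc.chinaz.com/tupian"),
)

# get() returns the attribute value (or None when the attribute is missing).
detail_urls = [a.get("href") for a in links]
print(detail_urls)
```

The third link is dropped because it has neither the alt nor the target attribute, so only the two full-size page links remain.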

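Next, call requests.get() on each link to fetch its page and find the image URL there. The method is the same as above, so I won't repeat the explanation. Here is the code: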

for i in imagehtml:
    reponseimag = requests.get(url=i.get('href'), headers=headers).text
    # Avoid garbled characters
    reponseimag = reponseimag.encode("ISO-8859-1")
    reponseimag = reponseimag.decode("utf-8", 'ignore')
    sour = ba(reponseimag, 'lxml')
    image = sour.find('img', src=re.compile('http://pic.sc.chinaz.com/files/pic/pic9/'))
    if image is None:  # no matching <img> on this page, so skip it
        continue
    print(image.get('src'))  # the image URL is stored in the src attribute

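Finally, fetch each image with requests.get() and save it. The file name comes from the image's title, read with .get('alt'), with the last two characters ("图片") sliced off.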

for i in imagehtml:
    # Get the image URL
    reponseimag = requests.get(url=i.get('href'), headers=headers).text
    reponseimag = reponseimag.encode("ISO-8859-1")
    reponseimag = reponseimag.decode("utf-8", 'ignore')
    sour = ba(reponseimag, 'lxml')
    image = sour.find('img', src=re.compile('http://pic.sc.chinaz.com/files/pic/pic9/'))
    if image is None:  # no matching <img> on this page, so skip it
        continue
    # Save the image
    with open('C:\\Users\\17\\Desktop\\背景图\\{}.jpg'.format(i.get('alt')[0:-2]), 'wb') as fd:
        ima = requests.get(url=image.get('src')).content
        fd.write(ima)
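The file name in the loop above is the alt text with its last two characters removed. A slightly more defensive way to build the path might look like the sketch below. The helper image_path and the relative folder name are hypothetical, not part of the original script; the idea is only to strip the "图片" suffix when it is actually present and to replace characters that Windows rejects in file names.

```python
import re
from pathlib import Path


def image_path(alt_text, out_dir="背景图"):
    """Build an output path for an image from its alt text (hypothetical helper)."""
    # Drop the trailing "图片" only if it is there (the article slices with [0:-2]).
    name = alt_text[:-2] if alt_text.endswith("图片") else alt_text
    # Replace characters that Windows does not allow in file names.
    name = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    # Fall back to a placeholder if nothing is left.
    return Path(out_dir) / "{}.jpg".format(name or "unnamed")


print(image_path("咖啡豆黄色背景图片").name)  # 咖啡豆黄色背景.jpg
```

open() fails if the target folder does not exist, so calling Path(out_dir).mkdir(parents=True, exist_ok=True) once before the loop is a reasonable precaution.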

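With that, the images from this page are saved to the folder.
Exciting, isn't it? But don't celebrate yet. The goal is to scrape every image, and these few won't do. So how do we get the rest? Go to the next page.

Watch how the URL changes, then go to the next page again and watch once more. You have probably spotted the pattern already, so here are the URLs: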

http://sc.chinaz.com/tupian/beijingtupian_2.html  # page 2
http://sc.chinaz.com/tupian/beijingtupian_3.html  # page 3
http://sc.chinaz.com/tupian/beijingtupian_4.html  # page 4
....

Right: the only difference is the number at the end. By changing it we can generate every page URL, and a single for loop does the job.

for j in range(1, 97):  # remember to start from 1, otherwise it fails
    url = 'http://sc.chinaz.com/tupian/beijingtupian_{}.html'.format(j)
    ...
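The URL pattern can be wrapped in a small helper. The sketch below is one possible variant, not the article's method: it assumes that page 1 is the unsuffixed address given at the start of the article (beijingtupian.html) and that pages 2 onward use the _N suffix shown in the list. The article does not show whether beijingtupian_1.html also resolves, and the names BASE and page_url are made up for illustration.

```python
BASE = "http://sc.chinaz.com/tupian/beijingtupian"


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        # Assumption: the first page has no numeric suffix.
        return BASE + ".html"
    return "{}_{}.html".format(BASE, page)


# range(1, 97) covers pages 1 through 96, as in the article's loop.
urls = [page_url(p) for p in range(1, 97)]
print(urls[0])    # http://sc.chinaz.com/tupian/beijingtupian.html
print(urls[1])    # http://sc.chinaz.com/tupian/beijingtupian_2.html
print(len(urls))  # 96
```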

With this loop added, every image gets downloaded. (There are too many results to show them all.)
Here is the complete code:

import re
import requests
from bs4 import BeautifulSoup as ba

# The headers never change, so define them once outside the loop
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
for j in range(1, 97):
    url = 'http://sc.chinaz.com/tupian/beijingtupian_{}.html'.format(j)
    reponse = requests.get(url=url, headers=headers).text
    # Avoid garbled characters
    reponse = reponse.encode("ISO-8859-1")
    reponse = reponse.decode("utf-8", 'ignore')
    # Collect the links to all full-size pages on this listing page
    sourhtml = ba(reponse, 'lxml')
    imagehtml = sourhtml.find_all('a', alt=re.compile('图片'), target="_blank", href=re.compile('http://sc.chinaz.com/tupian'))
    for i in imagehtml:
        # Get the image URL
        reponseimag = requests.get(url=i.get('href'), headers=headers).text
        reponseimag = reponseimag.encode("ISO-8859-1")
        reponseimag = reponseimag.decode("utf-8", 'ignore')
        sour = ba(reponseimag, 'lxml')
        image = sour.find('img', src=re.compile('http://pic.sc.chinaz.com/files/pic/pic9/'))
        if image is None:  # no matching <img> on this page, so skip it
            continue
        # Save the file
        with open('C:\\Users\\17\\Desktop\\背景图\\{}.jpg'.format(i.get('alt')[0:-2]), 'wb') as fd:
            ima = requests.get(url=image.get('src')).content
            fd.write(ima)
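The script above sends thousands of requests with no timeout, no status check, and no pause. The sketch below shows one way the download step could be hardened. The function download_image and its delay parameter are hypothetical additions; session is assumed to be a requests.Session (or any object with a compatible get method), which is also what makes the function easy to try without network access.

```python
import time


def download_image(session, img_url, dest_path, headers=None, timeout=10, delay=0.5):
    """Fetch img_url via session.get() and write the bytes to dest_path.

    Hypothetical helper: returns the number of bytes written.
    """
    resp = session.get(img_url, headers=headers, timeout=timeout)
    # Raise on 4xx/5xx instead of saving an error page under a .jpg name.
    resp.raise_for_status()
    with open(dest_path, "wb") as fd:
        fd.write(resp.content)
    # Short pause between requests to go easy on the server.
    time.sleep(delay)
    return len(resp.content)
```

Inside the inner loop this could replace the with block, for example download_image(session, image.get('src'), path, headers=headers) with session = requests.Session() created once before the outer loop. Reusing one session also lets requests keep the connection open between calls.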

Source of the garbled-text fix: https://blog.csdn.net/qq_44105778/article/details/86021178?utm_source=app
This is my first blog post. If you spot any mistakes, please point them out. Thanks!
