html5 album python: Python Web Crawler 5 – Image Scraping

Latest recommended article published on 2022-05-17 11:51:39

康复师于老师

Article tags: html5, album, python

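To scrape the images on a web page, the first step is to find their URLs. BeautifulSoup is used again here; how to use it was covered in the previous section, "Parsing Web Pages with BeautifulSoup", so I won't repeat that. Here is the code: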

```python
#!/usr/bin/env python3
# encoding: utf-8

from urllib.request import urlopen

from bs4 import BeautifulSoup


def get(url):
    """Fetch a page and return its HTML as a string."""
    response = urlopen(url)
    # Decode the raw bytes into a string using GBK.
    html = response.read().decode("gbk")
    response.close()
    return html


def detect(html):
    """Return every <img> tag that has a data-lazyload-src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    images = soup.select("img[data-lazyload-src]")
    return images


def main():
    html = get("http://pp.163.com/longer-yowoo/pp/10069141.html")
    links = detect(html)
    for link in links:
        print(link.attrs['data-lazyload-src'])


if __name__ == '__main__':
    main()
```

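In the code above, the call soup.select("img[data-lazyload-src]") finds every img tag that has a data-lazyload-src attribute. Once the image tags are collected, the code reads each tag's data-lazyload-src attribute and prints it. There are six in total.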
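To try the selector without touching the network, the small sketch below runs it against an inline HTML string. The markup, the example.com URLs, and the helper name lazy_urls are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div>
  <img src="placeholder.gif" data-lazyload-src="http://example.com/a.jpg">
  <img src="logo.png">
  <img data-lazyload-src="http://example.com/b.jpg">
</div>
"""


def lazy_urls(html):
    """Return the data-lazyload-src value of every matching <img> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # [data-lazyload-src] matches on the attribute's presence, whatever its value.
    return [img["data-lazyload-src"] for img in soup.select("img[data-lazyload-src]")]


if __name__ == "__main__":
    for url in lazy_urls(SAMPLE_HTML):
        print(url)
```

The second img tag has no data-lazyload-src attribute, so the selector skips it.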

Next comes downloading the images themselves. First, look back at one line from the earlier code:

```python
html = response.read().decode("gbk")
```

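This line fetches the page content and converts it to a string. Here, response is the HTTP response, the read method reads the bytes that the server returned, and decode converts those bytes into a string. A string is, underneath, a sequence of bytes, and so is an image. That makes it clear how to get an image: fetch its bytes over HTTP, then save those bytes to disk. Here is how that looks: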
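The relationship between strings and bytes can be seen in a few lines. This is only an illustrative sketch; the helper name to_bytes_and_back is hypothetical, and the sample text is arbitrary.

```python
def to_bytes_and_back(text, encoding="gbk"):
    """Encode text to bytes, then decode those bytes with the same codec."""
    raw = text.encode(encoding)       # str -> bytes, the form sent over HTTP
    return raw, raw.decode(encoding)  # bytes -> str, the step that get() performs


if __name__ == "__main__":
    raw, text = to_bytes_and_back("图片")
    print(type(raw).__name__, len(raw))
    print(type(text).__name__, text)
```

For an image there is no decode step: the bytes returned by read are written to the file unchanged.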

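```python
def download(url, pic_path):
    """Fetch url and save the response body to pic_path."""
    response = urlopen(url)
    img_bytes = response.read()
    response.close()
    # "wb": open for writing, in binary mode.
    with open(pic_path, "wb") as f:
        f.write(img_bytes)
```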

Note the mode argument "wb" passed to open: w means the file is opened for writing, and b means binary mode.
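The effect of the b flag can be checked locally. The sketch below writes a few arbitrary bytes (standing in for image data) to a temporary file, reads them back, and then shows that text mode rejects bytes. The helper name save_bytes and the file name demo.jpg are assumptions made for this example.

```python
import os
import tempfile


def save_bytes(data, path):
    """Write raw bytes to path in binary mode."""
    with open(path, "wb") as f:
        f.write(data)


if __name__ == "__main__":
    # A few arbitrary bytes standing in for image data.
    data = b"\xff\xd8\xff\xe0"
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.jpg")
        save_bytes(data, path)
        with open(path, "rb") as f:
            print(f.read() == data)
        try:
            # Text mode ("w") accepts str only, so writing bytes raises TypeError.
            with open(path, "w") as f:
                f.write(data)
        except TypeError as err:
            print("text mode refused bytes:", err)
```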

Now for the complete program:

```python
#!/usr/bin/env python3
# encoding: utf-8

import os
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get(url):
    """Fetch a page and return its HTML as a string."""
    response = urlopen(url)
    # Decode the raw bytes into a string using GBK.
    html = response.read().decode("gbk")
    response.close()
    return html


def detect(html):
    """Return every <img> tag that has a data-lazyload-src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    images = soup.select("img[data-lazyload-src]")
    return images


def download(url, pic_path):
    """Fetch url and save the response body to pic_path."""
    response = urlopen(url)
    img_bytes = response.read()
    response.close()
    # "wb": open for writing, in binary mode.
    with open(pic_path, "wb") as f:
        f.write(img_bytes)


def main():
    html = get("http://pp.163.com/longer-yowoo/pp/10069141.html")
    images = detect(html)
    # Absolute path at the filesystem root; change it if you cannot write there.
    pic_folder = "/pics"
    # exist_ok avoids a FileExistsError when the script is run a second time.
    os.makedirs(pic_folder, exist_ok=True)
    for i, image in enumerate(images):
        url = image.attrs['data-lazyload-src']
        download(url, pic_folder + "/" + str(i) + ".jpg")


if __name__ == '__main__':
    main()
```

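The code above can still be improved: the name and extension of each downloaded file would ideally be derived from the download URL. Here I took a shortcut and named the files arbitrarily, and the extension was already known in advance.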
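One possible way to derive the name and extension from the URL, using only the standard library, is sketched below. The function names, the default values, and the example.com URLs are assumptions for illustration; the real image URLs may be shaped differently.

```python
import os
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default="image.jpg"):
    """Return the last path segment of url, or default if there is none."""
    path = urlsplit(url).path  # the path only: no query string or fragment
    name = posixpath.basename(unquote(path))
    return name or default


def extension_from_url(url, default=".jpg"):
    """Return the file extension found in url's path, or default."""
    ext = os.path.splitext(filename_from_url(url, ""))[1]
    return ext or default


if __name__ == "__main__":
    url = "http://example.com/photos/abc123.jpg?size=large"
    print(filename_from_url(url))
    print(extension_from_url(url))
```

In main, the call to download could then build its target path from filename_from_url(url) in place of str(i) + ".jpg", provided the names in the URLs are unique.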
