Scraping Memes

qq_46131444

Tags: python

Article link: https://blog.csdn.net/qq_46131444/article/details/106207226

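This post shows how to scrape and download memes. We will scrape the latest-memes listing on doutula.com.
The example covers 50 pages; to scrape the whole listing, change the page count.
First, build the URLs of the first 50 pages: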

import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import os

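# Listing pages differ only in the trailing page number
BASE_PAGE_URL = 'http://www.doutula.com/photo/list/?page='
PAGE_URL_LIST = []
# range's upper bound is exclusive, so 51 covers pages 1 to 50
for x in range(1, 51):
    url = BASE_PAGE_URL + str(x)
    PAGE_URL_LIST.append(url)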

print(PAGE_URL_LIST)
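
The same step can be wrapped in a function so that the page count becomes a parameter. This is only a minimal sketch: the name `build_page_urls` and its arguments are illustrative and not part of the original code, and it assumes pages are numbered from 1.

```python
def build_page_urls(base_url, page_count):
    """Return listing URLs for pages 1..page_count (inclusive)."""
    # range's upper bound is exclusive, hence page_count + 1
    return [base_url + str(page) for page in range(1, page_count + 1)]


BASE_PAGE_URL = 'http://www.doutula.com/photo/list/?page='
urls = build_page_urls(BASE_PAGE_URL, 50)
print(len(urls), urls[0], urls[-1])
```

Changing the second argument is then all it takes to scrape more or fewer pages.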

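Opening any of these links confirms that they are the pages we want to scrape.
All that remains is to extract the image links and download the images, and the crawler is done.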

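Inspecting the page shows that every meme sits inside an a tag, and that the image URL is stored in the data-original attribute of the img element under that a tag. Knowing this, we can scrape the site (the code continues from the listing above).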
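
Before touching the live site, the extraction step can be tried on a small inline snippet. The sketch below is an illustration under stated assumptions: the markup in `SAMPLE_HTML` is hypothetical, modelled on the structure described above, and the class `DataOriginalCollector` is an invented name. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser


class DataOriginalCollector(HTMLParser):
    """Collect the data-original attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'img':
            value = dict(attrs).get('data-original')
            if value:
                self.urls.append(value)


# Hypothetical markup, not copied from the real site
SAMPLE_HTML = '''
<a href="/photo/1"><img class="img-responsive lazy image_dta"
   data-original="http://example.com/images/a.jpg"></a>
<a href="/photo/2"><img class="img-responsive lazy image_dta"
   data-original="http://example.com/images/b.gif"></a>
<a href="/about"><img src="logo.png"></a>
'''

collector = DataOriginalCollector()
collector.feed(SAMPLE_HTML)
print(collector.urls)
```

Images without a data-original attribute, such as the logo in the sample, are skipped rather than raising an error.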

# Fetch the listing pages one by one; requests.get takes a single URL
for page_url in PAGE_URL_LIST:
    response = requests.get(page_url)
    html = response.content.decode('utf-8')

    # Keep a copy of the most recently fetched page on disk for inspection
    with open('doutula.html', 'w', encoding='utf-8') as file_obj:
        file_obj.write(html)

    soup = BeautifulSoup(html, 'lxml')
    img_list = soup.find_all('img', class_="img-responsive lazy image_dta")
    for img in img_list:
        # The image URL is stored in the data-original attribute
        url = img['data-original']
        # The last path segment becomes the local file name
        filename = url.split('/').pop()

Finally, still inside the loop, save each downloaded image to a chosen local directory:

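        # Create the target directory on first use, then download into it
        os.makedirs('images', exist_ok=True)
        path = os.path.join('images', filename)
        urlretrieve(url, filename=path)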
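
The file-name and save steps can also be sketched as two small helpers. The names `local_path_for` and `download_image` are hypothetical, as is the example URL. Unlike the plain `split('/')` above, this version uses `urlsplit` to drop any query string before taking the last path segment, on the assumption that some image URLs may carry one.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_path_for(url, directory='images'):
    """Map an image URL to a path inside `directory`."""
    # urlsplit separates the path from any ?query or #fragment
    filename = posixpath.basename(urlsplit(url).path)
    return os.path.join(directory, filename)


def download_image(url, directory='images'):
    """Download `url` into `directory`, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = local_path_for(url, directory)
    urlretrieve(url, filename=path)
    return path


# Only the path mapping is exercised here; download_image needs network access
print(local_path_for('http://example.com/images/a.jpg?size=large'))
```

Inside the scraping loop, calling `download_image(url)` would then replace the separate file-name and save lines.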

In the end, the scraped results look like this:
[Image]
And with that, this small project is complete!
