Scraping Meme Packs from 站长素材 (chinaz.com) with a Python Web Crawler

I mostly just read group messages, so I have collected very few memes, and whenever a meme battle breaks out in a group chat I end up on the losing side. Having recently taken Prof. Song Tian's (嵩天) "Python 网络爬虫与信息提取" (Python Web Crawling and Information Extraction) course on 中国大学MOOC (China University MOOC), I decided to write a crawler that scrapes memes from the web. A quick search showed that 站长素材 has a rich collection: 446 pages, 10 meme packs per page, over 4,000 packs in total and close to ten thousand individual images. Let's see who dares to start a meme battle with me now.


Page analysis

The first page of meme packs on 站长素材 looks like this:

[Screenshot: the first listing page of meme packs on 站长素材]

Next, analyze the source code of each page's meme pack list:

[Screenshot: source code of the per-page meme pack list]
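Since the screenshot does not reproduce the markup, here is a minimal sketch of the listing-page structure the crawler relies on. The selectors (div.up, div.num_1, and the a tag's title/href attributes) come from the code below; the sample title and URL are invented for illustration:

# Hypothetical listing-page markup, matching the selectors used in getTypeUrlList below.
from bs4 import BeautifulSoup

sample_html = '''
<div class="up">
  <div class="num_1">
    <a title="sample pack" href="http://sc.chinaz.com/biaoqing/sample.html">sample pack</a>
  </div>
</div>
'''
soup = BeautifulSoup(sample_html, "html.parser")
for div in soup.find_all("div", attrs={"class": "up"}):
    a = div.find("div", attrs={"class": "num_1"}).find("a")
    print(a.attrs["title"], a.attrs["href"])  # pack title and its detail-page link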

Then analyze the page that holds all the images of each meme pack:

[Screenshots: a meme pack's detail page and its source code]
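In place of these screenshots, here is a hedged sketch of the detail-page structure the code depends on: a div of class img_text followed by a sibling div holding the img tags. Because next_sibling first returns the whitespace text node between the two divs, the code calls next_sibling.next_sibling. The sample markup and image URLs below are invented:

# Hypothetical detail-page markup, matching how getImgUrlList navigates it.
from bs4 import BeautifulSoup

sample_html = '''<div class="img_text"><h2>sample pack</h2></div>
<div>
  <img src="http://example.com/a.gif">
  <img src="http://example.com/b.gif">
</div>'''
soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", attrs={"class": "img_text"})
# the first next_sibling is the newline text node; the second is the image container
imgDiv = div.next_sibling.next_sibling
print([img.attrs["src"] for img in imgDiv.find_all("img")])  # the pack's image links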

Steps

1. Get the link and title of every meme pack shown on each listing page (the page URL pattern is sketched after this list)

2. Get the links of all the images in each meme pack

3. Download each image from its link; the images of each pack go into a separate folder, named after the pack's title attribute
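The listing pages follow a simple URL pattern. Here is a minimal sketch, under the assumption (taken from main() below) that page 1 is index.html and later pages are index_2.html, index_3.html, and so on:

# Assumed pagination pattern of the listing pages; see main() below.
root = "http://sc.chinaz.com/biaoqing/"
page_urls = [root + "index.html"] + [root + "index_" + str(n) + ".html" for n in range(2, 31)]
print(page_urls[:3])  # the first three listing-page URLs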

Code

# -*- coding: utf-8 -*-
'''
Created on March 18, 2017
@author: lavi
'''
import os
import traceback

import requests
from bs4 import BeautifulSoup

'''
Fetch the text of a page.
'''
def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

'''
Fetch the raw bytes of an image.
'''
def getImgContent(url):
    head = {"user-agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=head, timeout=30)
        # status_code is an int, so convert it before concatenating
        print("status_code:" + str(r.status_code))
        r.raise_for_status()
        return r.content
    except:
        return None

'''
Collect the (title, url) pair of every meme pack on a listing page.
'''
def getTypeUrlList(html, typeUrlList):
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", attrs={"class": "up"})
    for div in divs:
        a = div.find("div", attrs={"class": "num_1"}).find("a")
        title = a.attrs["title"]
        typeUrl = a.attrs["href"]
        typeUrlList.append((title, typeUrl))

'''
Collect every pack's image links, keyed by pack title.
'''
def getImgUrlList(typeUrlList, imgUrlDict):
    for title, url in typeUrlList:
        title_imgUrlList = []
        html = getHtmlText(url)
        soup = BeautifulSoup(html, "html.parser")
        div = soup.find("div", attrs={"class": "img_text"})
        # next_sibling is the whitespace between the divs, so step over it twice
        imgDiv = div.next_sibling.next_sibling
        imgs = imgDiv.find_all("img")
        for img in imgs:
            title_imgUrlList.append(img.attrs["src"])
        imgUrlDict[title] = title_imgUrlList

'''
Download the images; each pack goes into its own folder, named by title.
'''
def getImage(imgUrlDict, file_path):
    head = {"user-agent": "Mozilla/5.0"}
    countdir = 0
    for title, imgUrlList in imgUrlDict.items():
        try:
            dir = file_path + title
            if not os.path.exists(dir):
                os.mkdir(dir)
            countfile = 0
            for imgUrl in imgUrlList:
                path = dir + "/" + imgUrl.split("/")[-1]
                if not os.path.exists(path):
                    r = requests.get(imgUrl, headers=head, timeout=30)
                    r.raise_for_status()
                    # the with-statement closes the file; no explicit close() needed
                    with open(path, "wb") as f:
                        f.write(r.content)
                countfile = countfile + 1
                print("current pack progress: {:.2f}%".format(countfile * 100 / len(imgUrlList)))
            countdir = countdir + 1
            print("overall progress: {:.2f}%".format(countdir * 100 / len(imgUrlDict)))
        except:
            # print_exc() prints the traceback itself; wrapping it in print() would only print None
            traceback.print_exc()

def main():
    # Afraid of filling the disk, I fetch only 30 pages instead of all 446 --
    # roughly 300 meme packs' worth of images.
    pages = 30
    root = "http://sc.chinaz.com/biaoqing/"
    url = "http://sc.chinaz.com/biaoqing/index.html"
    file_path = "e://biaoqing/"
    imgUrlDict = {}
    typeUrlList = []
    html = getHtmlText(url)
    getTypeUrlList(html, typeUrlList)
    getImgUrlList(typeUrlList, imgUrlDict)
    getImage(imgUrlDict, file_path)
    # page 1 is index.html; the later listing pages are index_2.html, index_3.html, ...
    # (range(pages) would instead request the non-existent index_0.html and index_1.html)
    for page in range(2, pages + 1):
        url = root + "index_" + str(page) + ".html"
        imgUrlDict = {}
        typeUrlList = []
        html = getHtmlText(url)
        getTypeUrlList(html, typeUrlList)
        getImgUrlList(typeUrlList, imgUrlDict)
        getImage(imgUrlDict, file_path)

main()
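To try it, save the script and run it with Python 3 (with requests and beautifulsoup4 installed); the images land under e://biaoqing/, one sub-folder per meme pack, named after the pack's title attribute.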

Results

[Screenshots: the downloaded meme pack folders]

If you have been losing meme battles in your group chats, run the program above... No need to thank me; March is Learn-from-Lei-Feng Month. Haha, now come and have a meme battle with me!
