Crawling and storing scenic-spot images
The main work recently was crawling and storing images of scenic spots. Showing a picture next to each spot greatly improves the user experience, and since the images rarely change, this part is not crawled in real time; instead the images are crawled ahead of time and stored, so that when they are needed they can simply be looked up in the stored format.
After comparing several sites (Mafengwo, Qyer, and the ticket site crawled earlier), Mafengwo was chosen as the image source (for now only one image is crawled per scenic spot).
Mafengwo has an "all scenic spots" module for every city, and that module is exactly what needs to be crawled.
However, this data is not exposed cleanly either in the page source or in any dynamically loaded response, so the crawler instead slices out a "mappoints" data structure embedded in the page's JavaScript. The code is as follows:
def getAllImg(self):
    '''
    Crawl every city's "all scenic spots" page on Mafengwo and store
    one image URL (plus name, description and rank) per spot.
    :return: None -- the result is written to allScenic.json
    '''
    # Requires at module level: from bs4 import BeautifulSoup / import json
    html = self.getHtml('http://www.mafengwo.cn/mdd/')
    soup = BeautifulSoup(html, 'html.parser')
    # Visible part of the destination list; the 'hide' block holds the rest
    china1 = soup.find('div', {'class': 'hot-list clearfix'})
    china2 = soup.find('div', {'class': 'hot-list clearfix hide'})  # not used yet
    allCityList = {}
    for city in china1.find_all('dd'):
        for a in city.find_all('a'):
            # Build each city's scenic-spot page from the id in its href,
            # e.g. http://www.mafengwo.cn/jd/11065/gonglve.html
            city_id = a['href'][29:a['href'].find('.html')]
            allCityList[a.string] = 'https://www.mafengwo.cn/jd/' + city_id + '/gonglve.html'
    for city, url in allCityList.items():
        detail = self.getHtml(url)
        detail = detail.replace(' ', '').replace('\n', '')
        # The spot data lives in a JS variable embedded in the page; slice out
        # the JSON between the 'mapponints' key (sic -- spelled that way in the
        # page source) and the following M.closure(...) call
        start = detail.find('mapponints') + 11
        end = detail.find('M.closure(function(require)') - 1
        try:
            info = json.loads(detail[start:end])
        except ValueError:
            continue  # layout changed or no spot data for this city
        onecitylist = {}
        for scenic in info:
            # Strip the JS escaping and switch to the image CDN host
            img = 'https://n1-q.mafengwo.net/' + scenic['img'].replace('\\', '').replace('mfwStorage', 's')
            onecitylist[scenic['name']] = {
                'fname': scenic['foreign_name'],
                'description': scenic['description'],
                'rank': scenic['rank'],
                'imgsrc': img,
            }
        self.list[city] = onecitylist
    print(self.list)
    with open('allScenic.json', 'w') as file:
        json.dump(self.list, file)  # ensure_ascii defaults to True, so non-ASCII becomes \uXXXX
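The trickiest step above is slicing the embedded JSON out of the raw HTML by string offsets. A minimal sketch of that technique on a synthetic page fragment (the markup below is an assumption for illustration, not Mafengwo's real page):

```python
import json

# Hypothetical page fragment: spot data assigned to a JS variable,
# followed by the M.closure(...) call that marks the end of the data.
html = '<script>var mapponints = [{"name":"WestLake","rank":1}]; ' \
       'M.closure(function(require){});</script>'

# Normalize whitespace exactly as the crawler does
detail = html.replace(' ', '').replace('\n', '')

# 'mapponints' is 10 chars; +11 also skips the '=' so the slice
# begins at the opening '[' of the JSON array
start = detail.find('mapponints') + 11
# -1 backs up over the ';' that precedes M.closure(...)
end = detail.find('M.closure(function(require)') - 1

info = json.loads(detail[start:end])
print(info[0]['name'])   # the parsed spot data is ordinary Python dicts
```

Because the slice boundaries are tied to exact strings in the page, any change to Mafengwo's page layout breaks the parse, which is why the real method wraps `json.loads` in a try/except and skips that city.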
The stored result (Unicode-escaped):
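Once allScenic.json is on disk, serving an image at display time is just a nested dictionary lookup, as the pre-crawl design intends. A sketch of the query side (the helper name is made up; the file layout follows the code above):

```python
import json

def get_scenic_img(city, scenic_name, path='allScenic.json'):
    """Return the stored image URL for one scenic spot, or None if missing."""
    # json.load decodes the \uXXXX escapes back into ordinary text keys
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    # Structure: {city: {scenic name: {'fname', 'description', 'rank', 'imgsrc'}}}
    return data.get(city, {}).get(scenic_name, {}).get('imgsrc')
```

Using `dict.get` with empty-dict defaults keeps the lookup safe when a city or spot was skipped during the crawl.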