Crawling and Storing Scenic-Spot Details and Food Data
Scenic-spot details (name, city, images, introduction, rating, ranking, etc.) are scraped from the Mafengwo site, and food data is scraped from Ctrip's food pages; both datasets are stored as JSON.
Images were already scraped from Mafengwo earlier; the remaining spot information is collected the same way, and the results are saved with:
filename = 'allScenic.json'
with open(filename, 'w') as file:
    json.dump(self.list, file)
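Because the scraped data contains Chinese text, `json.dump` with its defaults writes ASCII escape sequences. A minimal sketch of the same save step with readable output (the `{city: {spot: details}}` layout and the sample values are assumptions inferred from the lookup code that indexes `data[city][name]`):

```python
import json

# made-up sample mirroring the assumed {city: {spot: details}} layout;
# the real data (self.list) is built by the crawler
sample = {"北京": {"故宫": {"score": "4.8", "rank": 1}}}

filename = 'allScenic.json'
with open(filename, 'w', encoding='utf-8') as file:
    # ensure_ascii=False keeps Chinese characters readable in the file
    json.dump(sample, file, ensure_ascii=False, indent=2)
```

With `ensure_ascii=False` the file stores `北京` literally instead of `\u5317\u4eac`, which makes it much easier to inspect by hand.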
This persists the data. A separate externally callable function then implements fuzzy lookup of scenic-spot information:
def getInfo(request):
    '''
    Return the details of one scenic spot.
    :param request: POST with 'city' (the spot's city) and 'name' (the spot's name)
    :return: JsonResponse with the spot's details, or an empty dict on failure
    '''
    city = request.POST.get('city')
    name = request.POST.get('name')
    try:
        filename = 'allScenic.json'
        with open(filename, 'r') as file:
            data = json.load(file)  # renamed from `list` to avoid shadowing the built-in
        # fuzzy-match the requested city against the stored city keys
        result = process.extractBests(city, data.keys(), score_cutoff=80, limit=1)
        info = data[result[0][0]][name]
        return JsonResponse({'data': info})
    except Exception:
        return JsonResponse({'data': {}})
This returns all stored information for the matching scenic spot.
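To illustrate the lookup itself, here is a minimal self-contained sketch. The stdlib's `difflib.get_close_matches` stands in for fuzzywuzzy's `process.extractBests` so the example has no external dependency, and the data is made up:

```python
import difflib

# made-up data in the assumed {city: {spot: details}} layout
data = {"北京": {"故宫": {"score": "4.8"}}, "上海": {"外滩": {"score": "4.7"}}}

def lookup(city, name):
    # difflib.get_close_matches plays the role of process.extractBests here:
    # it returns the closest city key above a similarity cutoff, or nothing
    matches = difflib.get_close_matches(city, data.keys(), n=1, cutoff=0.6)
    if not matches:
        return {}
    return data[matches[0]].get(name, {})

print(lookup("北京", "故宫"))
```

As in the view above, a miss at either level (no similar city, or no such spot) falls back to an empty dict rather than raising.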
Food data comes mainly from Ctrip and is grouped by city; the goal is that the front end sends a city name and gets back all of that city's specialty foods:
def getXC(self, city):
    '''
    Main entry: scrape the specialty-food listing pages for one city.
    :param city: city name
    :return: list of food records for that city
    '''
    p = 1
    l = []
    count = 1
    while p != 3:  # listing pages 1 and 2
        tmp = self.base + self.getCityUrl(city) + "/s0-p" + str(p) + ".html"
        html = self.getHtml(tmp)
        soup = BS(html, 'html.parser')
        vs = soup.find_all(name="div", attrs={"class": "rdetailbox"})
        print("len(vs)", len(vs))
        for j in range(len(vs)):
            try:
                # link to the detail sub-page
                href = vs[j].find(name="a", attrs={"target": "_blank"}).attrs["href"]
                # request the sub-page for the detailed description
                res = self.getHtml(self.base2 + href)
                soupi = BS(res, "html.parser")
                vis = soupi.find_all(name="li", attrs={"class": "infotext"})
                introduce = [v.get_text() for v in vis]
                # image links sit inside <a href="javascript:void(0)"> wrappers
                imglinks = soupi.find_all(name="a", attrs={"href": "javascript:void(0)"})
                srcs = [link.find('img')['src'] for link in imglinks]
                item = {}  # renamed from `tmp` to avoid clobbering the page URL above
                item["id"] = count
                item["name"] = vs[j].find(name="a", attrs={"target": "_blank"}).string
                item["name"] = item["name"].replace(" ", "").replace("\n", "")
                item["introduce"] = introduce
                item["img"] = srcs
                item["city"] = city
                count = count + 1
                l.append(item)
            except Exception as e:
                print(e)
        p = p + 1
    return l
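The per-city lists returned by `getXC` presumably get collected into a single `{city: foods}` dict before being dumped to `allFood.json`, since that is the layout the lookup function reads. A sketch of that aggregation step, with a stub standing in for the real scraper (the stub, the city list, and the sample record are all made up for illustration):

```python
import json

def fake_getXC(city):
    # stub standing in for the real scraper's getXC(city)
    return [{"id": 1, "name": "示例美食", "introduce": [], "img": [], "city": city}]

cities = ["北京", "上海"]  # the real city list would come from the crawler

# aggregate per-city results into one {city: foods} dict
allFood = {city: fake_getXC(city) for city in cities}

with open('allFood.json', 'w', encoding='utf-8') as file:
    json.dump(allFood, file, ensure_ascii=False)
```

Keying the file by city means the lookup view can answer a query with a single dict access after fuzzy-matching the city name.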
This is the code that scrapes and stores the specialty foods for each city. Again, a separate externally callable function exposes the data:
def getInfo(request):
    '''
    :param request: POST with 'city' (city name)
    :return: JsonResponse with all foods for that city, or an empty dict on failure
    '''
    city = request.POST.get('city')
    try:
        filename = 'allFood.json'
        with open(filename, 'r') as file:
            data = json.load(file)  # renamed from `list` to avoid shadowing the built-in
        # fuzzy-match the requested city against the stored city keys
        result = process.extractBests(city, data.keys(), score_cutoff=80, limit=1)
        info = data[result[0][0]]
        return JsonResponse({'data': info})
    except Exception:
        return JsonResponse({'data': {}})
With this, the functionality of this part is essentially complete.
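For reference, the front end calls these views with a form-encoded POST carrying the `city` field that `request.POST.get('city')` reads. A stdlib-only sketch of building such a request (the endpoint path and host are placeholders, not taken from the source; the request is constructed but not sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

# build the form body the Django view reads via request.POST.get('city')
body = urlencode({'city': '成都'}).encode('utf-8')

req = Request('http://localhost:8000/food/getInfo',  # placeholder URL
              data=body,
              headers={'Content-Type': 'application/x-www-form-urlencoded'})
# sending req would yield {'data': [...]} on a match, {'data': {}} otherwise
print(body)
```

Supplying `data=` makes `urllib` issue a POST, matching what the views expect.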