Scraping Scenic-Spot Images (Improved)
Testing showed that the data scraped earlier was far from complete: China has an estimated two thousand-plus cities and regions and tens of thousands of scenic spots, yet the previous crawler had only captured a small fraction of each city's spots. The task now is to scrape all of the data.
At first I stuck with Mafengwo (马蜂窝), but after repeated attempts I found that requests to a spot's detail page no longer returned the page source; instead they returned a piece of obfuscated JavaScript:
<script>document.cookie=('_')+('_')+('j')+('s')+('l')+('_')+('c')+('l')+('e')+('a')+('r')+('a')+('n')+('c')+('e')+('=')+(-~0+'')+([2]*(3)+'')+((1<<1)+'')+(2+'')+(2+6+'')+((2<<2)+'')+(+!+[]*2+'')+(-~[2]+'')+(1+3+'')+((1<<1)+'')+('.')+(-~[2]+'')+((2^1)+'')+(-~(8)+'')+('|')+('-')+(-~0+'')+('|')+((1+[0])/[2]+'')+('J')+('T')+('c')+('e')+('q')+('b')+(1+5+'')+('U')+('x')+('q')+('X')+('h')+('b')+(3+6+'')+('J')+('V')+('D')+('F')+('H')+('R')+('%')+(0+1+0+1+'')+('F')+('N')+('r')+('%')+(-~1+'')+('F')+('Q')+('c')+('%')+((1+[2]>>2)+'')+('D')+(';')+('m')+('a')+('x')+('-')+('a')+('g')+('e')+('=')+((2^1)+'')+(3+3+'')+(~~''+'')+(~~''+'')+(';')+('p')+('a')+('t')+('h')+('=')+('/');location.href=location.pathname+location.search</script>
After consulting several online sources, I learned that this code produces a new cookie (__jsl_clearance), and only by decoding it can the page be accessed correctly.
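The obfuscation is really just one long JavaScript string-concatenation expression. In principle, the part between document.cookie= and ;location.href can be cut out and handed to a JS interpreter to recover the cookie string; below is a minimal sketch, assuming the third-party js2py package (this direction never worked reliably for me against the live site):

import re

import js2py  # third-party JS interpreter: pip install js2py

def decode_jsl_clearance(html):
    # Extract the expression assigned to document.cookie, dropping the
    # trailing location.href redirect.
    expr = re.search(r"document\.cookie=(.+?);location\.href", html).group(1)
    # Evaluates to something like '__jsl_clearance=<timestamp>...;max-age=3600;path=/'
    return js2py.eval_js(expr)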
But the code proved too convoluted, online searches turned up nothing useful, and repeated attempts (requests, selenium) all failed, so in the end I used Mafengwo only for scraping the spot images and switched to a different site for the detailed information.
After comparing several sites, I settled on Qyer (穷游网).
I chose to crawl the full city list from Qyer's China city-list pages (https://place.qyer.com/china/citylist-0-0-{page}/, about 170 pages in total) to obtain each city's ID. Watching the network tab in the developer tools shows that paging through a city's sight list fires a JSON request, so we can iterate over every page of that endpoint to collect all the spots.
Given the sheer volume of data, the crawler also uses multiple processes to speed things up.
import json
import time

import parsel
import requests

allList = {}  # city name -> {spot name -> spot info}

def get_data(html_data):
    """Parse one city-list page, then page through every sight of each city."""
    global allList
    selector = parsel.Selector(html_data)
    lis = selector.xpath('//ul[@class="plcCitylist"]/li')
    if len(lis) < 15:
        # A full city-list page holds 15 entries; fewer usually means a bad response.
        print(str(len(lis)), html_data)
    for li in lis:
        travel_place = li.xpath('.//h3/a/text()').get()  # destination name
        travel_place = travel_place.replace('\xa0', '')
        onecitylist = {}
        pid = li.xpath('.//p[@class="addPlanBtn"]/@data-pid').get()  # city ID for the POI endpoint
        page = 1
        while page < 500:
            headers = {
                'authority': 'place.qyer.com',
                'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
                'accept': 'application/json, text/javascript, */*; q=0.01',
                'x-requested-with': 'XMLHttpRequest',
                'sec-ch-ua-mobile': '?0',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
                'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'origin': 'https://place.qyer.com',
                'sec-fetch-site': 'same-origin',
                'sec-fetch-mode': 'cors',
                'sec-fetch-dest': 'empty',
                'referer': 'https://place.qyer.com/hong-kong/sight/',
                'accept-language': 'zh-CN,zh;q=0.9',
                'cookie': '_qyeruid=CgIBAWC4PHp2qkJVmSkPAg==; new_uv=1; new_session=1; _guid=Rc6e04dd-9db6-cafc-afa0-e9515fac0d3f; ql_guid=QL5c19c9-1f38-4377-82a6-18242efa0235; source_url=https%3A//www.qyer.com/; isnew=1622686857075; __utma=253397513.1025643824.1622686844.1622825628.1622888267.5; __utmc=253397513; __utmz=253397513.1622888267.5.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmt=1; ql_created_session=1; ql_stt=1622888267070; ql_vts=5; frombaidu=1; PHPSESSID=23bf82872f55096d2a8720eb8dfeb705; __utmb=253397513.10.10.1622888267; ql_seq=10',
            }
            params = (
                ('action', 'list_json'),
            )
            data = {
                'page': page,
                'type': 'city',
                'pid': pid,
                'sort': '32',
                'subsort': 'all',
                'isnominate': '-1',
                'haslastm': 'false',
                'rank': '6',
            }
            print(page)
            page = page + 1
            response = requests.post('https://place.qyer.com/poi.php',
                                     headers=headers, params=params, data=data)
            # Equivalent form with the query string inlined, in case the
            # params version is not reproduced exactly:
            # response = requests.post('https://place.qyer.com/poi.php?action=list_json',
            #                          headers=headers, data=data)
            try:
                result = json.loads(response.text)
            except Exception:
                # Not JSON (blocked or rate-limited): back off and retry this page.
                print('error' + response.text)
                time.sleep(3)
                page = page - 1
                continue
            poi_list = result['data']['list']
            if len(poi_list) == 0:
                break  # ran past the last page of this city's sights
            for dat in poi_list:
                name = dat['cnname']
                enname = dat['enname']
                score = dat['grade']
                imgsrc = dat['photo']
                rank = dat['rank']
                try:
                    dis = dat['comments'][0]['text']
                except Exception:
                    dis = ''
                # An earlier version fetched 'https:' + dat['url'] and parsed the
                # description out of div.compo-detail-info with BeautifulSoup;
                # taking comments[0]['text'] avoids one extra request per spot.
                onesceniclist = {
                    'fname': enname,
                    'description': dis,
                    'rank': rank,
                    'imgsrc': imgsrc,
                    'score': score,
                }
                onecitylist[name] = onesceniclist
        allList[travel_place] = onecitylist
        print('City: ' + travel_place)
        time.sleep(3)
    return allList
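Judging from the parsing code above, the JSON returned by poi.php has roughly this shape (the field names are the ones the code reads; the values here are placeholders for illustration):

result = {
    "data": {
        "list": [
            {
                "cnname": "...",   # Chinese name of the spot
                "enname": "...",   # English name
                "grade": "...",    # score shown on the site
                "photo": "...",    # cover image URL
                "rank": 1,         # rank within the city
                "comments": [{"text": "..."}],  # first comment doubles as the description
            },
        ],
    },
}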
In practice, fetching a city-list page sometimes returns nothing, so the crawl was still missing cities. I made the program sleep for a while after each page before continuing, but the problem persisted.
I then rewrote the program to crawl city by city, but with so many cities this would inevitably take much longer.
In the end I settled on crawling in chunks of only 5-10 city-list pages at a time, so that when something goes wrong it can be caught promptly and the affected pages re-fetched.
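The pool below submits one main(url) task per city-list page. main itself is not shown in this write-up; here is a minimal sketch of what it has to do, namely fetch the page HTML and hand it to get_data — the retry loop and the headers are my own illustrative additions:

def main(url):
    # Hypothetical sketch: fetch one city-list page and parse it with
    # get_data(). Retry a few times, since the page sometimes comes back
    # without the city list, as described above.
    for attempt in range(3):
        resp = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
        if resp.status_code == 200 and 'plcCitylist' in resp.text:
            return get_data(resp.text)
        time.sleep(3)
    return {}  # give up on this page; it can be re-crawled in the next chunk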
from concurrent.futures import ALL_COMPLETED, ProcessPoolExecutor, wait

all_task = []
with ProcessPoolExecutor(max_workers=13) as executor:
    for page in range(1, 4):  # 172 city-list pages in total; crawl a few at a time
        url = f'https://place.qyer.com/china/citylist-0-0-{page}/'
        all_task.append(executor.submit(main, url))
        time.sleep(1)
    wait(all_task, return_when=ALL_COMPLETED)

# Worker processes do not share the parent's allList, so each task returns
# its own dict and the results are merged here. (An earlier variant consumed
# tasks with as_completed() and dumped allList to allScenic.json after each
# merge.)
for task in all_task:
    allList.update(task.result())
Even so, scraping everything still took a great deal of time, but in the end it all came down successfully, and the site can now show much richer content for each scenic spot.
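For reference, the merged dict can be saved the way the commented-out code above hints at; a small sketch (the ensure_ascii=False argument is my addition, so the Chinese names stay readable in the file):

import json

with open('allScenic.json', 'w', encoding='utf-8') as file:
    # Keep Chinese place names as-is instead of \uXXXX escapes.
    json.dump(allList, file, ensure_ascii=False)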