When I first came across AJAX I was admittedly a bit lost; the task was to scrape a page whose listing data is loaded by an AJAX POST request.
url = 'http://hotel.elong.com/ajax/list/asyncsearch'
header={'Accept':'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding':'gzip, deflate',
'Accept-Language' :'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Connection': 'keep-alive',
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'X-Requested-With':'XMLHttpRequest',
'Content-Length':'1062'}
data5 ={
'listRequest.orderFromID':50,
'listRequest.pageIndex':4}
file = open('juidian.txt','w')
First, to scrape a page loaded via POST, you have to send the site's particular headers and form parameters along with the request. Then comes pagination: the value that changes from page to page also lives in those parameters, so the following code flips through the pages:
for j in range(0, 20):
    data5['listRequest.pageIndex'] = j
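That pagination idea can be sketched offline with the standard library's urlencode, which produces the same body that requests.post(url, data=data5) would send; the parameter names are the ones from the capture above:

```python
from urllib.parse import urlencode

# Form parameters from the captured request; only pageIndex changes per page.
data5 = {
    'listRequest.orderFromID': 50,
    'listRequest.pageIndex': 4,
}

def page_body(page_index):
    # Mutate the page number, then encode the body that would be POSTed.
    data5['listRequest.pageIndex'] = page_index
    return urlencode(data5)

bodies = [page_body(j) for j in range(3)]
```

Each element of `bodies` differs only in the `listRequest.pageIndex` value.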
Apart from the changed headers and parameters, the rest follows the usual approach for scraping JSON responses.
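Those "usual JSON methods" amount to json.loads plus ordinary dict/list access. A minimal offline sketch, using an invented fragment shaped like the review payload handled in the full code below (the contents/content/createTimeString field names come from that code; the values are made up):

```python
import json

# Invented sample shaped like one page of the reviews payload.
sample = '''{"contents": [
    {"content": "Clean room, friendly staff", "createTimeString": "2017-10-29"},
    {"content": "A bit noisy at night", "createTimeString": "2017-10-30"}
]}'''

data = json.loads(sample)
reviews = [(i['content'], i['createTimeString']) for i in data['contents']]
```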
The harder part is that the reviews on each hotel's page need paginating too. It then turned out that some query-string parameters have no effect on what the URL points to, which is why the URL ended up so long:
for s in range(0, 1):
    for k in data4:
        url3 = 'http://hotel.elong.com/ajax/detail/gethotelreviews/?hotelId=' + str(
            k) + '&recommendedType=0&pageIndex=' + str(
            s) + '&mainTagId=0&subTagId=0&code=9253708&elongToken=j9cpvej8-4dea-4d07-a3b9-54b27a2797e9&ctripToken=88cf3b41-c4a2-4e49-a411-16af6b55ebec&_=1509281280799'
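Instead of concatenating one long string, the query part can be built from a dict with urllib's urlencode. Only hotelId and pageIndex actually vary; the token and timestamp parameters are omitted in this sketch for brevity, and the hotel id used here is a made-up example:

```python
from urllib.parse import urlencode

BASE = 'http://hotel.elong.com/ajax/detail/gethotelreviews/'

def reviews_url(hotel_id, page_index):
    # Static parameters copied from the captured request; the first two vary.
    params = {
        'hotelId': hotel_id,
        'pageIndex': page_index,
        'recommendedType': 0,
        'mainTagId': 0,
        'subTagId': 0,
        'code': '9253708',
    }
    return BASE + '?' + urlencode(params)

url3 = reviews_url('40101519', 0)  # hypothetical hotel id
```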
Also, nesting two for loops calls for care: in Python the indentation is the structure, and one misaligned line can ruin the whole run.
This task also introduced me to a timing function:
import time
time.sleep(1)
time.sleep() suspends execution of the calling thread for the given number of seconds.
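A quick way to see the pause, timing the call with time.time():

```python
import time

start = time.time()
time.sleep(0.2)          # the calling thread does nothing for at least 0.2 s
elapsed = time.time() - start
```

In the scraper this throttles the requests so each page has time to load and the server is not hammered.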
r=requests.post(url,data=data5,headers=header).text
Here requests sends the form data via POST and the server answers with JSON. requests can do much more besides: GET requests, file uploads, and so on.
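Those call shapes can be inspected without touching the network by preparing (not sending) requests; the example.com URLs below are placeholders, not part of the original task:

```python
import requests

# GET with query parameters appended to the URL.
get_req = requests.Request('GET', 'http://example.com/api',
                           params={'q': 'hotel'}).prepare()

# POST with an urlencoded form body, as used in this scraper.
post_req = requests.Request('POST', 'http://example.com/api',
                            data={'listRequest.pageIndex': 0}).prepare()

# POST uploading a file, which requests encodes as multipart/form-data.
upload_req = requests.Request('POST', 'http://example.com/upload',
                              files={'f': ('notes.txt', b'hello')}).prepare()
```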
While working on this task nothing would load at all; in the end it turned out the headers were written wrong.
The complete code is as follows:
#!/usr/bin/python
#coding:utf-8
import re
import json
import urllib2
import requests
import time
from bs4 import BeautifulSoup
url = 'http://hotel.elong.com/ajax/list/asyncsearch'
header={'Accept':'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding':'gzip, deflate',
'Accept-Language' :'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Connection': 'keep-alive',
'Cookie' :'CookieGuid=158948cd-563a-4211-8a3e-a92bd263cba0; page_time=1509628060819%2C1509628152402%2C1509628219681%2C1509693907739%2C1509693938986%2C1509693967862%2C1509701358453%2C1509701412108%2C1509764088450%2C1509764288344%2C1509772980043%2C1509773457586%2C1509773753700%2C1509774349211%2C1509774650832%2C1509775873404%2C1509775901516%2C1509776293716%2C1509776386781%2C1509776424331%2C1509776587625%2C1509776889277%2C1509778200633%2C1509779461446; _RF1=111.225.131.187; _RSG=zs4ixjEH93BfifEVp1.6NB; _RDG=28df89f13e8f75279825ba18abc6dd550e; _RGUID=b0c944eb-7bb0-4dbd-b5c6-f81cdf3a9b29; ShHotel=CityID=0101&CityNameCN=%E5%8C%97%E4%BA%AC%E5%B8%82&CityName=%E5%8C%97%E4%BA%AC%E5%B8%82&OutDate=2017-11-06&CityNameEN=beijing&InDate=2017-11-05; _fid=j9jp7iwv-bf9b-4c1d-8c4f-893b780f205c; newjava1=a79d58d364ea53c8c9161ec3b2f45c8d; JSESSIONID=EB741B472421B3E455397D577D009DE8; SessionGuid=6a99e379-6ae4-4a74-b0a4-f94dfe60ab5a; Esid=1ec4caeb-c805-4d96-8ddf-43a6a23eed52; com.eLong.CommonService.OrderFromCookieInfo=Status=1&Orderfromtype=1&Isusefparam=0&Pkid=50&Parentid=50000&Coefficient=0.0&Makecomefrom=0&Cookiesdays=0&Savecookies=0&Priority=8000; fv=pcweb; s_cc=true; s_sq=%5B%5BB%5D%5D; s_visit=1',
'Host':'hotel.elong.com',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0',
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'X-Requested-With':'XMLHttpRequest',
'Content-Length':'1062'}
data5 ={
'listRequest.orderFromID':50,
'listRequest.pageIndex':4}
file = open('juidian.txt','w')
def getnews(url2):
    # Scrape the hotel detail page and save the text of each hdetail_view block.
    page = urllib2.urlopen(url2)
    html = page.read()
    soup = BeautifulSoup(html, "html.parser")
    for tag1 in soup.find_all('div', class_="hdetail_view"):
        m_1 = tag1.get_text(strip=True)
        m2 = m_1 + '\n'
        file.write(m2.encode('utf-8') + '\n')
def getcontent(url3):
    # Fetch one page of reviews (JSON) and save each review's text and date.
    r = requests.get(url3).text
    data = json.loads(r)
    for i in data["contents"]:
        m_1 = i['content'].encode('utf-8')           # review text
        m_3 = i['createTimeString'].encode('utf-8')  # review date
        file.write("用户评论内容:{}".format(m_1) + '\n')  # "review content: ..."
        file.write("用户评论时间:{}".format(m_3) + '\n')  # "review time: ..."
for j in range(0, 20):
    data5['listRequest.pageIndex'] = j
    r = requests.post(url, data=data5, headers=header).text
    # Pull the hotelIds list out of the JSON response with regexes.
    data0 = re.findall(u'\"hotelIds(.*?)filterStatusMap', r)
    data1 = data0[0]
    data2 = re.sub('\"\:\"', '', data1)
    data3 = re.sub('","', '', data2)
    data4 = data3.split(',')
    time.sleep(1)
    for s in range(0, 1):
        for k in data4:
            url3 = 'http://hotel.elong.com/ajax/detail/gethotelreviews/?hotelId=' + str(
                k) + '&recommendedType=0&pageIndex=' + str(
                s) + '&mainTagId=0&subTagId=0&code=9253708&elongToken=j9cpvej8-4dea-4d07-a3b9-54b27a2797e9&ctripToken=88cf3b41-c4a2-4e49-a411-16af6b55ebec&_=1509281280799'
            url2 = 'http://hotel.elong.com/' + str(k)
            getnews(url2)
            getcontent(url3)
            time.sleep(2)
file.close()
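The regex extraction of hotelIds in the loop above can be exercised offline on an invented response fragment (the hotel ids here are made up; the real response embeds hotelIds the same way, between quotes before the filterStatusMap key):

```python
import re

# Invented fragment mimicking the slice of the response that holds hotelIds.
r = '"hotelIds":"40101519,40102233,40103847","filterStatusMap...'

data0 = re.findall(r'"hotelIds(.*?)filterStatusMap', r)
data1 = data0[0]                       # '":"40101519,...","'
data2 = re.sub(r'"\:"', '', data1)     # strip the leading '":"'
data3 = re.sub(r'","', '', data2)      # strip the trailing '","'
data4 = data3.split(',')               # one hotel id per element
```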