Preface
Today I tried capturing some packets with the Chrome browser. It turned out to be quite interesting, so I put together a small demo.
Packet Capture
Packet capture, as I understand it, means inspecting the individual requests the browser makes while you visit a website.
Chrome's developer tools let you view these requests.
As shown below.
Here is how to bring up that panel. (Experts can skip ahead.)
Open the Ctrip website, click Trains in the navigation bar, and press F12 (the panel on the right side of the screenshot above appears). Then click Network; the Network tab groups requests into categories, as follows.
XHR, as I understand it, covers requests that fetch data, meaning their responses contain the information we want, such as train ticket listings (admittedly, this understanding is still fairly shallow).
I found a blog post that explains XHR:
https://blog.csdn.net/m_s_l/article/details/89460964
JS, CSS, Img, Media, and Font correspond to JavaScript files, CSS stylesheets, image files, media files, and font files, respectively.
I'm not entirely sure about Doc; opening it seems to show the page's HTML document.
WS and Manifest rarely show up, and I'm not clear on what they are either.
Next, click one of the requests and the panel shown on the right appears. Click Preview to see the data that request returned; below is the train ticket data we want. (You have to try the requests one by one; the one shown corresponds to the getTransferList request.)
Click Headers to see the request's header information.
As shown below.
Here we can read the request's header data (the Request Headers section),
as well as the request parameters, the params (the Query String Parameters section in the figure below).
In the params, departureStation is clearly the departure city,
arrivalStation the arrival city,
and departDateStr the travel date.
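To see what those three parameters look like on the wire, they can be percent-encoded into a query string with Python's standard library; this is the same encoding requests applies internally when given params= (the station names below are just the ones from this example):

```python
from urllib.parse import urlencode

# The three query parameters seen under Query String Parameters
params = {
    "departureStation": "上海",
    "arrivalStation": "潍坊",
    "departDateStr": "2020-06-08",
}

# requests performs the same percent-encoding when building the URL
print(urlencode(params))
# departureStation=%E4%B8%8A%E6%B5%B7&arrivalStation=%E6%BD%8D%E5%9D%8A&departDateStr=2020-06-08
```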
That covers the Chrome side. Now Python comes in. We need the headers from the request (the Request Headers shown above; the fields whose names start with ":" can apparently be left out of the actual request) and the params data (which we can define ourselves). Calling requests.get(url=yoururl, headers=headers, params=params) then fetches the data we want.
The headers are as follows:
headers = {
"accept": "application/json, text/plain, */*",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cookie": "_RSG=PU4FwIKVxx9.sBQdraHoLB; _RDG=28f1bbc8914ef92dbc250ea6fa7e8f7399; _RGUID=9efb7d44-ba3c-4799-b572-464d6768bdbe; _ga=GA1.2.1456400253.1543146928; _abtest_userid=e77ce008-59ec-483d-b404-f15f2fcd00a0; MKT_CKID=1584281102684.70ish.yppc; _gcl_aw=GCL.1584281103.EAIaIQobChMIyYL__dKc6AIVxKiWCh2_cAPJEAAYASAAEgLcfPD_BwE; _gcl_dc=GCL.1584281103.EAIaIQobChMIyYL__dKc6AIVxKiWCh2_cAPJEAAYASAAEgLcfPD_BwE; GUID=09031060310697161187; _gac_UA-3748357-1=1.1584282420.EAIaIQobChMIi8qm7Nec6AIVgqqWCh0_KglCEAAYASAAEgLy8fD_BwE; StartCity_Pkg=PkgStartCity=475; __utma=1.1456400253.1543146928.1585987948.1585990496.2; __utmz=1.1585990496.2.2.utmcsr=ctrip.com|utmccn=(referral)|utmcmd=referral|utmcct=/; _RF1=112.243.9.2; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuary=&SmartLinkHost=&SmartLinkLanguage=zh; MKT_CKID_LMT=1588528318213; _gid=GA1.2.1622222748.1588528318; MKT_Pagesource=PC; HotelCityID=7split%E9%9D%92%E5%B2%9BsplitQingdaosplit2020-5-4split2020-05-05split2; gad_city=96e43befd48178e35a28c490547b37c1; Union=OUID=index&AllianceID=4897&SID=155952&SourceID=&createtime=1588561884&Expires=1589166683614; _jzqco=%7C%7C%7C%7C1588528318407%7C1.2042095316.1584281102671.1588529040897.1588561883630.1588529040897.1588561883630.undefined.0.0.144.144; __zpspc=9.23.1588561883.1588561883.1%232%7Cwww.baidu.com%7C%7C%7C%7C%23; _gat=1; _bfi=p1%3D108001%26p2%3D0%26v1%3D204%26v2%3D0; appFloatCnt=131; _bfa=1.1543146924604.2g8hwy.1.1588528315406.1588561880984.18.205.10650041414; _bfs=1.3",
"referer": r"https://trains.ctrip.com/pages/booking/search?ticketType=0&fromCn=%25E5%258C%2597%25E4%25BA%25AC&toCn=%25E6%25BD%258D%25E5%259D%258A&day=2020-05-04&mkt_header=&allianceID=&sid=&ouid=&orderSource=",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"x-requested-with": "XMLHttpRequest",
}
Hmm, one annoyance: the headers copied out of the browser need the colons, quotes, and commas added by hand. Alternatively, a Python regular expression can do the job; see the end of this article.
Now let's wrap the request in a function:
def getinfo(start, end, time="2020-05-04"):
    data = {
        "departureStation": start,
        "arrivalStation": end,
        "departDateStr": time
    }
    res = requests.get(url="https://trains.ctrip.com/pages/booking/getTransferList", params=data, headers=headers)
    print(res.text)
    return res
Run the function:
getinfo("上海","潍坊",time="2020-06-08")
This produces the following result:
It shows the train ticket data we want for June 8, 2020, from Shanghai (上海) to Weifang (潍坊).
The full code:
import requests
headers = {
"accept": "application/json, text/plain, */*",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cookie": "_RSG=PU4FwIKVxx9.sBQdraHoLB; _RDG=28f1bbc8914ef92dbc250ea6fa7e8f7399; _RGUID=9efb7d44-ba3c-4799-b572-464d6768bdbe; _ga=GA1.2.1456400253.1543146928; _abtest_userid=e77ce008-59ec-483d-b404-f15f2fcd00a0; MKT_CKID=1584281102684.70ish.yppc; _gcl_aw=GCL.1584281103.EAIaIQobChMIyYL__dKc6AIVxKiWCh2_cAPJEAAYASAAEgLcfPD_BwE; _gcl_dc=GCL.1584281103.EAIaIQobChMIyYL__dKc6AIVxKiWCh2_cAPJEAAYASAAEgLcfPD_BwE; GUID=09031060310697161187; _gac_UA-3748357-1=1.1584282420.EAIaIQobChMIi8qm7Nec6AIVgqqWCh0_KglCEAAYASAAEgLy8fD_BwE; StartCity_Pkg=PkgStartCity=475; __utma=1.1456400253.1543146928.1585987948.1585990496.2; __utmz=1.1585990496.2.2.utmcsr=ctrip.com|utmccn=(referral)|utmcmd=referral|utmcct=/; _RF1=112.243.9.2; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuary=&SmartLinkHost=&SmartLinkLanguage=zh; MKT_CKID_LMT=1588528318213; _gid=GA1.2.1622222748.1588528318; MKT_Pagesource=PC; HotelCityID=7split%E9%9D%92%E5%B2%9BsplitQingdaosplit2020-5-4split2020-05-05split2; gad_city=96e43befd48178e35a28c490547b37c1; Union=OUID=index&AllianceID=4897&SID=155952&SourceID=&createtime=1588561884&Expires=1589166683614; _jzqco=%7C%7C%7C%7C1588528318407%7C1.2042095316.1584281102671.1588529040897.1588561883630.1588529040897.1588561883630.undefined.0.0.144.144; __zpspc=9.23.1588561883.1588561883.1%232%7Cwww.baidu.com%7C%7C%7C%7C%23; _gat=1; _bfi=p1%3D108001%26p2%3D0%26v1%3D204%26v2%3D0; appFloatCnt=131; _bfa=1.1543146924604.2g8hwy.1.1588528315406.1588561880984.18.205.10650041414; _bfs=1.3",
"referer": r"https://trains.ctrip.com/pages/booking/search?ticketType=0&fromCn=%25E5%258C%2597%25E4%25BA%25AC&toCn=%25E6%25BD%258D%25E5%259D%258A&day=2020-05-04&mkt_header=&allianceID=&sid=&ouid=&orderSource=",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"x-requested-with": "XMLHttpRequest",
}
def getinfo(start, end, time="2020-05-04"):
    data = {
        "departureStation": start,
        "arrivalStation": end,
        "departDateStr": time
    }
    res = requests.get(url="https://trains.ctrip.com/pages/booking/getTransferList", params=data, headers=headers)
    print(res.text)
    return res
getinfo("上海","潍坊",time="2020-06-08")
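Rather than just printing res.text, the JSON body could be parsed with the standard json module. Note that the field names below (data, trainList, trainNo, price) are made up for illustration; the real structure of the getTransferList response has to be read off the Preview tab. A minimal sketch against a hypothetical fragment:

```python
import json

# Hypothetical fragment shaped like a ticket-list response;
# inspect the real getTransferList payload in DevTools before relying on any field names.
raw = '{"data": {"trainList": [{"trainNo": "G123", "price": 553.0}]}}'

payload = json.loads(raw)
for train in payload["data"]["trainList"]:
    print(train["trainNo"], train["price"])
```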
Appendix: code for processing the headers
# Copied straight from the browser, the headers look like the block below; quotes and commas need to be added
headers="""
Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
"""
# Process the headers copied and pasted from the browser
# Then copy the printed output back into the script
import re
def getheaders(headers):
    # Replace each ": " separator with "': '" to quote the key and open the value
    headers = re.sub(r": ", r"': '", headers)
    # Close each value with a quote and comma at the line break, then open the next key
    headers = re.sub(r"\n", "',\n'", headers)
    print(headers)
getheaders(headers)
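The regex trick above only prints text that still needs a little manual cleanup at the start and end. As an alternative sketch, plain string splitting can turn the copied block straight into a Python dict (headers_to_dict is my own helper name, not something from requests); splitting on the first ": " only matters because header values can themselves contain colons:

```python
def headers_to_dict(raw):
    # Build a dict from a header block copied out of DevTools,
    # skipping blank lines and the ":method"-style pseudo-header fields
    result = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        # Split on the first ": " only; values may contain colons themselves
        key, _, value = line.partition(": ")
        result[key] = value
    return result

print(headers_to_dict("""
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
"""))
```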
Note: requests such as logins may be cleared from the list when the page navigates; to keep them, tick "Preserve log" in the Network tab.
Packet capture turned out to be pretty interesting, so I wrote it down. If you enjoyed it, don't forget to leave a like!
My knowledge is limited, so if anything here is wrong, feel free to discuss it in the comments.