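Preface
This article was inspired by:
Main text
This article uses the cookies of Douban Movie as an example and walks through how to obtain and parse them.
The cookies we see in the browser look roughly like this: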
Cookie:bid=hZdgjLJMNv4; _vwo_uuid_v2=AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2cfa95d3910cc2914; gr_user_id=2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba; ll="118316"; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D; ap=1; _pk_id.100001.4cf6=270eb4959a2a2414.1489750475.1.1489750559.1489750475.; _pk_ses.100001.4cf6=*; __utma=30149280.1851478845.1488968861.1489658025.1489750475.5; __utmb=30149280.0.10.1489750475; __utmc=30149280; __utmz=30149280.1489750475.5.5.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.721177542.1489750475.1489750475.1489750475.1; __utmb=223695111.0.10.1489750475; __utmc=223695111; __utmz=223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
As preparation, we first load the cookie string into a Python dict:
In [57]: from http.cookies import SimpleCookie
In [58]: s = SimpleCookie('''bid=hZdgjLJMNv4; _vwo_uuid_v2=AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2
...: cfa95d3910cc2914; gr_user_id=2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba; ll="118316"; _pk_ref.100001.4cf6
...: =%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D; ap=1; _pk_id.100001
...: .4cf6=270eb4959a2a2414.1489750475.1.1489750559.1489750475.; _pk_ses.100001.4cf6=*; __utma=30149280.
...: 1851478845.1488968861.1489658025.1489750475.5; __utmb=30149280.0.10.1489750475; __utmc=30149280; __
...: utmz=30149280.1489750475.5.5.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided);
...: __utma=223695111.721177542.1489750475.1489750475.1489750475.1; __utmb=223695111.0.10.1489750475; _
...: _utmc=223695111; __utmz=223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmc
...: tr=(not%20provided)''')
In [59]: cookies = {k: v.value for k, v in s.items()}
In [60]: cookies
{'__utma': '223695111.721177542.1489750475.1489750475.1489750475.1',
'__utmb': '223695111.0.10.1489750475',
'__utmc': '223695111',
'__utmz': '223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)',
'_pk_id.100001.4cf6': '270eb4959a2a2414.1489750475.1.1489750559.1489750475.',
'_pk_ref.100001.4cf6': '%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D',
'_pk_ses.100001.4cf6': '*',
'_vwo_uuid_v2': 'AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2cfa95d3910cc2914',
'ap': '1',
'bid': 'hZdgjLJMNv4',
'gr_user_id': '2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba',
'll': '118316'}
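Note that the Cookie header above carries two sets of `__utma`/`__utmb`/`__utmc`/`__utmz` values, while the dict keeps only one value per name. The following is a minimal sketch of one way to keep the duplicates. The helper name `parse_cookie_header` is made up for illustration, and it relies on plain string splitting, so unlike `SimpleCookie` it does not strip quotes or validate cookie syntax.

```python
from collections import defaultdict


def parse_cookie_header(header):
    """Map each cookie name to the list of values it carries, in order."""
    grouped = defaultdict(list)
    for pair in header.split(';'):
        # Split on the first '=' only, so values containing '=' stay intact.
        name, sep, value = pair.strip().partition('=')
        if sep:  # skip fragments that have no '='
            grouped[name].append(value)
    return dict(grouped)


# Shortened sample taken from the header above.
sample = ('bid=hZdgjLJMNv4; __utmc=30149280; '
          '__utmz=30149280.1489750475.5.5.utmcsr=google|utmccn=(organic); '
          '__utmc=223695111')
parsed = parse_cookie_header(sample)
print(parsed['__utmc'])
```

With this shape, both `__utmc` values remain available instead of the later one silently replacing the earlier one.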
First, notice the cookies whose names start with __utm. Google Analytics uses them to analyze visitor information:
__utma stores the amount of visits (for each visitor), the time of the first visit, the previous visit, and the current visit
__utma records visit times:
In [67]: from datetime import datetime
In [68]: for ts in cookies['__utma'].split('.'):
...:     print(datetime.fromtimestamp(int(ts)))
...:
1977-02-02 09:31:51
1992-11-08 07:05:42
2017-03-17 19:34:35
2017-03-17 19:34:35
2017-03-17 19:34:35
1970-01-01 08:00:01
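Converting every dot-separated field to a date yields a few meaningless results, because not every field is a timestamp. The sketch below names the fields instead. The six-field layout (domain hash, visitor ID, first visit, previous visit, current visit, session count) is an assumption based on how the legacy Google Analytics cookie is commonly described, and `parse_utma` and its key names are made up for illustration. It uses UTC rather than local time, so the hours differ from the output above.

```python
from datetime import datetime, timezone


def parse_utma(value):
    """Split a __utma value into named fields (field names are illustrative)."""
    domain_hash, visitor_id, first, previous, current, sessions = value.split('.')

    def to_utc(ts):
        # Interpret the field as a Unix timestamp in seconds.
        return datetime.fromtimestamp(int(ts), tz=timezone.utc)

    return {
        'domain_hash': domain_hash,
        'visitor_id': visitor_id,
        'first_visit': to_utc(first),
        'previous_visit': to_utc(previous),
        'current_visit': to_utc(current),
        'session_count': int(sessions),
    }


# The first __utma value from the Cookie header at the top of the article.
utma = parse_utma('30149280.1851478845.1488968861.1489658025.1489750475.5')
for name in ('first_visit', 'previous_visit', 'current_visit'):
    print(name, utma[name].isoformat())
print('session_count', utma['session_count'])
```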
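Only the third, fourth, and fifth values are real timestamps: the first visit, the previous visit, and the current visit. The first two fields and the last one are not times, which is why they come out as meaningless dates in 1977, 1992, and 1970 (going by the description quoted above, the last field is presumably the visit count).
__utmb and __utmc are used to check approximately how fast people leave: when a visit starts, and approximately ends (c expires quickly).
__utmb and __utmc are used to work out how long you stay on Douban. In this capture __utmb ends with the same timestamp (1489750475), while __utmc holds only the number that also leads __utma. They are not shown in detail here.
__utmz records whether the visitor came from a search engine (and if so, the search keyword used), a link, or from no previous page (e.g. a bookmark).
__utmz records how you reached Douban, whether through a search engine or some other link: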
In [70]: cookies['__utmz']
Out[70]: '223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)'
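The value above packs several key/value pairs after a numeric prefix. As a rough sketch, it can be split into a dict so the individual parts are easier to read. `parse_utmz` is a hypothetical helper, and reading `utmcsr` as the source, `utmcmd` as the medium, and `utmctr` as the search term follows the usual description of these keys rather than anything stated in this article.

```python
from urllib.parse import unquote


def parse_utmz(value):
    """Split a __utmz value into its numeric prefix and campaign key/value pairs."""
    # The first four dot-separated fields are numeric; the rest is the campaign part.
    *prefix, campaign = value.split('.', 4)
    fields = {}
    for item in campaign.split('|'):
        key, _, val = item.partition('=')
        fields[key] = unquote(val)  # e.g. '(not%20provided)' -> '(not provided)'
    return prefix, fields


prefix, fields = parse_utmz(
    '223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)'
    '|utmcmd=organic|utmctr=(not%20provided)'
)
print(fields)
```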
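It is easy to see that I arrived at Douban from a Google search.
_pk_id* your ID
_pk_ses* usually carries no data
_pk_ref* similar to the Referer header in HTTP; it records the page you came from: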
In [75]: from urllib.parse import unquote
In [79]: unquote(cookies['_pk_ref.100001.4cf6'])
Out[79]: '["","",1489750475,"https://www.google.com.hk/"]'
In [80]: cookies['_pk_ref.100001.4cf6']
Out[80]: '%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D'
In [81]: import json; json.loads(unquote(cookies['_pk_ref.100001.4cf6']))[2]
Out[81]: 1489750475
In [82]: datetime.fromtimestamp(_)
Out[82]: datetime.datetime(2017, 3, 17, 19, 34, 35)
Decoding the URL-encoded value in the cookie shows that I was redirected from https://www.google.com.hk/, and that the cookie also carries a timestamp recording when.
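The decoding steps above can be wrapped into one small function. This is only a sketch: `parse_pk_ref` is a made-up name, the positions of the timestamp and the referrer are taken from the single sample shown here, and the decoded text is parsed with `json.loads` rather than evaluated as Python, since a cookie value is untrusted input.

```python
import json
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_pk_ref(value):
    """Decode a _pk_ref cookie value into its timestamp and referrer URL."""
    fields = json.loads(unquote(value))
    # Positions 2 and 3 come from the decoded sample above; the meaning of
    # the first two (empty) entries is not covered here.
    return {
        'timestamp': datetime.fromtimestamp(fields[2], tz=timezone.utc),
        'referrer': fields[3],
    }


pk_ref = parse_pk_ref(
    '%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D'
)
print(pk_ref['referrer'], pk_ref['timestamp'].isoformat())
```

The timestamp is reported in UTC here, so it reads 11:34:35 instead of the local 19:34:35 seen in the session.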
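The remaining cookies are treated here as ones set by Douban itself rather than by the analytics tools above, although a name such as _vwo_uuid_v2 suggests that some of them may also come from third-party tools.
Everything so far is what the browser sends to the server when opening Douban. So which cookies does the server set for us? Let's test: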
In [92]: import requests
In [93]: for i in range(10):
...:     r = requests.get('https://movie.douban.com/tag/2016?start=0&type=T')
...:     print(r.headers['Set-Cookie'])
...:
bid=YE31t9f2CtY; Expires=Sat, 17-Mar-18 13:11:42 GMT; Domain=.douban.com; Path=/
bid=AkV_9uN6CxQ; Expires=Sat, 17-Mar-18 13:11:43 GMT; Domain=.douban.com; Path=/
bid=8EhJ9dCj1pw; Expires=Sat, 17-Mar-18 13:11:44 GMT; Domain=.douban.com; Path=/
bid=G4O0c55MbGU; Expires=Sat, 17-Mar-18 13:11:52 GMT; Domain=.douban.com; Path=/
bid=UtW6FWxzk5E; Expires=Sat, 17-Mar-18 13:11:54 GMT; Domain=.douban.com; Path=/
bid=QzZ_sVbO4Qs; Expires=Sat, 17-Mar-18 13:11:56 GMT; Domain=.douban.com; Path=/
bid=dLPTZc4Kh7Q; Expires=Sat, 17-Mar-18 13:11:58 GMT; Domain=.douban.com; Path=/
bid=imFq99iN5f8; Expires=Sat, 17-Mar-18 13:12:00 GMT; Domain=.douban.com; Path=/
bid=Q-bdkpxA0zM; Expires=Sat, 17-Mar-18 13:12:15 GMT; Domain=.douban.com; Path=/
bid=3rv0SUwSG2c; Expires=Sat, 17-Mar-18 13:12:17 GMT; Domain=.douban.com; Path=/
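After requesting the Douban Movie page 10 times, we can see that the only cookie the server sets is bid, with a different value on each request. It is set with Domain=.douban.com and Path=/, which means the cookie is sent to the server whenever we visit any Douban page. Let's look at the expiry time: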
In [102]: import dateutil.parser
In [103]: dateutil.parser.parse('17-Mar-18 13:11:42 GMT')
Out[103]: datetime.datetime(2018, 3, 17, 13, 11, 42, tzinfo=tzutc())
The cookie is valid for one year, so it is most likely an ID used to track the user.
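The same checks can be done without a third-party date parser. The sketch below splits one of the Set-Cookie values above into its attributes, computes the lifetime, and tests which hosts the cookie would be sent to. `parse_set_cookie` and `domain_matches` are illustrative helpers, the domain check is a simplified version of what browsers do, and the `received` time is an assumption (the exact request time is not recorded in the session).

```python
from datetime import datetime, timezone


def parse_set_cookie(header):
    """Return (name, value, attributes) from a single Set-Cookie header value."""
    first, *rest = [part.strip() for part in header.split(';')]
    name, _, value = first.partition('=')
    attributes = {}
    for part in rest:
        key, _, val = part.partition('=')
        attributes[key.lower()] = val  # attribute names are case-insensitive
    return name, value, attributes


def domain_matches(host, cookie_domain):
    """Simplified domain match: the exact host or any subdomain of cookie_domain."""
    bare = cookie_domain.lstrip('.')
    return host == bare or host.endswith('.' + bare)


header = 'bid=YE31t9f2CtY; Expires=Sat, 17-Mar-18 13:11:42 GMT; Domain=.douban.com; Path=/'
name, value, attrs = parse_set_cookie(header)

# Parse the two-digit-year date format used in the Expires attribute.
expires = datetime.strptime(attrs['expires'], '%a, %d-%b-%y %H:%M:%S GMT')
expires = expires.replace(tzinfo=timezone.utc)

# Assumption: the response was received exactly one year before Expires.
received = datetime(2017, 3, 17, 13, 11, 42, tzinfo=timezone.utc)
print(name, value, (expires - received).days)
print(domain_matches('movie.douban.com', attrs['domain']))
print(domain_matches('www.google.com.hk', attrs['domain']))
```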
Summary:
This article showed some ways to analyze cookies with Python and identified the cookie that Douban uses to track users. A follow-up will cover how to forge cookies.