Analyzing Cookies with Python

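Preface

This article was inspired by:

Main text

Using the Cookies of Douban Movie as an example, this article walks through obtaining Cookies and parsing them.

The Cookies we see in the browser look roughly like this: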

Cookie:bid=hZdgjLJMNv4; _vwo_uuid_v2=AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2cfa95d3910cc2914; gr_user_id=2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba; ll="118316"; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D; ap=1; _pk_id.100001.4cf6=270eb4959a2a2414.1489750475.1.1489750559.1489750475.; _pk_ses.100001.4cf6=*; __utma=30149280.1851478845.1488968861.1489658025.1489750475.5; __utmb=30149280.0.10.1489750475; __utmc=30149280; __utmz=30149280.1489750475.5.5.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.721177542.1489750475.1489750475.1489750475.1; __utmb=223695111.0.10.1489750475; __utmc=223695111; __utmz=223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)

As preparation, we first load it into a Python dictionary:

In [57]: from http.cookies import SimpleCookie

In [58]: s = SimpleCookie('''bid=hZdgjLJMNv4; _vwo_uuid_v2=AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2cfa95d3910cc2914; gr_user_id=2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba; ll="118316"; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D; ap=1; _pk_id.100001.4cf6=270eb4959a2a2414.1489750475.1.1489750559.1489750475.; _pk_ses.100001.4cf6=*; __utma=30149280.1851478845.1488968861.1489658025.1489750475.5; __utmb=30149280.0.10.1489750475; __utmc=30149280; __utmz=30149280.1489750475.5.5.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.721177542.1489750475.1489750475.1489750475.1; __utmb=223695111.0.10.1489750475; __utmc=223695111; __utmz=223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)''')

In [59]: cookies = {k: v.value for k, v in s.items()}

In [60]: cookies
Out[60]:
{'__utma': '223695111.721177542.1489750475.1489750475.1489750475.1',
 '__utmb': '223695111.0.10.1489750475',
 '__utmc': '223695111',
 '__utmz': '223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)',
 '_pk_id.100001.4cf6': '270eb4959a2a2414.1489750475.1.1489750559.1489750475.',
 '_pk_ref.100001.4cf6': '%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D',
 '_pk_ses.100001.4cf6': '*',
 '_vwo_uuid_v2': 'AD40AA237919D79C67460DEFD37AFAA4|65f61f85190c51b2cfa95d3910cc2914',
 'ap': '1',
 'bid': 'hZdgjLJMNv4',
 'gr_user_id': '2d7956ee-7cd2-4fad-8a7d-d0b2265ceeba',
 'll': '118316'}
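
Note that the browser header contains each of the `__utm*` names twice, while the dictionary above keeps only the last value of each. The following is a minimal sketch, assuming we want to keep every value, that splits the header by hand instead of relying on `SimpleCookie`. The helper name `parse_cookie_header` and the shortened sample header are made up for illustration.

```python
from collections import defaultdict


def parse_cookie_header(header):
    """Split a Cookie header into {name: [value, ...]}, keeping duplicates."""
    # Drop an optional leading "Cookie:" label copied from the browser.
    if header.lower().startswith('cookie:'):
        header = header[len('cookie:'):]
    grouped = defaultdict(list)
    for pair in header.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only, because values may contain "=".
        name, _, value = pair.partition('=')
        # Strip the optional double quotes around a value, as in ll="118316".
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        grouped[name].append(value)
    return dict(grouped)


# Shortened, illustrative header with one duplicated name.
sample = 'Cookie:ap=1; ll="118316"; __utmc=30149280; __utmc=223695111'
print(parse_cookie_header(sample))
```

Keeping a list per name makes it visible that Douban's page carries two Google Analytics trackers, something the plain dictionary hides.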

First, note the cookies whose names begin with __utm. Google Analytics uses them to analyze visitor information:

__utma stores the amount of visits (for each visitor), the time of the first visit, the previous visit, and the current visit

__utma records the visit times:

In [67]: from datetime import datetime

In [68]: for ts in cookies['__utma'].split('.'):
    ...:     print(datetime.fromtimestamp(int(ts)))
    ...:
1977-02-02 09:31:51
1992-11-08 07:05:42
2017-03-17 19:34:35
2017-03-17 19:34:35
2017-03-17 19:34:35
1970-01-01 08:00:01
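
Only three of the six dot-separated fields in `__utma` are Unix timestamps, which is why the first, second, and last lines of the output above are not meaningful dates. Here is a minimal sketch, assuming the commonly documented field order (domain hash, visitor ID, first visit, previous visit, current visit, visit count). The names `Utma` and `parse_utma` are hypothetical, and the times are converted to UTC rather than local time.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Utma(NamedTuple):
    domain_hash: str
    visitor_id: str
    first_visit: datetime
    previous_visit: datetime
    current_visit: datetime
    visit_count: int


def parse_utma(value):
    """Parse a __utma value, assuming the six-field layout described above."""
    domain_hash, visitor_id, first, previous, current, count = value.split('.')

    def to_utc(ts):
        # Interpret the field as seconds since the Unix epoch, in UTC.
        return datetime.fromtimestamp(int(ts), tz=timezone.utc)

    return Utma(domain_hash, visitor_id,
                to_utc(first), to_utc(previous), to_utc(current), int(count))


utma = parse_utma('223695111.721177542.1489750475.1489750475.1489750475.1')
print(utma.first_visit.isoformat())  # 2017-03-17T11:34:35+00:00
print(utma.visit_count)              # 1
```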

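Starting from the third value, these are the times of our first visit, previous visit, and current visit. The last value is the visit count rather than a timestamp.

__utmb and __utmc are used to check approximately how fast people leave: when a visit starts, and approximately ends (c expires quickly).

__utmb and __utmc are used to work out how long you stayed on Douban. In the data above only __utmb carries a timestamp (its last field, 1489750475), while __utmc holds a single number, so they are not shown here.

__utmz records whether the visitor came from a search engine (and if so, the search keyword used), a link, or from no previous page (e.g. a bookmark).

__utmz records how you arrived at Douban, whether through a search engine or some other link: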

In [70]: cookies['__utmz']
Out[70]: '223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)'
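
The `__utmz` value above can be split into its parts mechanically. This is a minimal sketch, assuming the layout seen in this sample: four dot-separated leading numbers followed by `|`-separated `key=value` pairs. The function name `parse_utmz` and the result keys `prefix` and `campaign` are hypothetical, and the meaning of the individual leading numbers is left uninterpreted.

```python
from urllib.parse import unquote


def parse_utmz(value):
    """Split a __utmz value into its leading numbers and campaign fields."""
    # Only the first four dots are separators; the campaign part is kept whole.
    *prefix, campaign = value.split('.', 4)
    fields = {}
    for item in campaign.split('|'):
        key, _, val = item.partition('=')
        # Values may be percent-encoded, e.g. "(not%20provided)".
        fields[key] = unquote(val)
    return {'prefix': prefix, 'campaign': fields}


utmz = parse_utmz('223695111.1489750475.1.1.utmcsr=google|utmccn=(organic)'
                  '|utmcmd=organic|utmctr=(not%20provided)')
print(utmz['campaign']['utmcsr'])  # google
print(utmz['campaign']['utmctr'])  # (not provided)
```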

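It is easy to see that I arrived at Douban from a Google search.

_pk_id* is your ID

_pk_ses* usually contains no data

_pk_ref* is similar to the Referer header in HTTP and records the page you came from: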

In [75]: import json
    ...: from urllib.parse import unquote

In [79]: unquote(cookies['_pk_ref.100001.4cf6'])
Out[79]: '["","",1489750475,"https://www.google.com.hk/"]'

In [80]: cookies['_pk_ref.100001.4cf6']
Out[80]: '%5B%22%22%2C%22%22%2C1489750475%2C%22https%3A%2F%2Fwww.google.com.hk%2F%22%5D'

In [81]: json.loads(unquote(cookies['_pk_ref.100001.4cf6']))[2]  # json.loads instead of eval: cookie data is untrusted input
Out[81]: 1489750475

In [82]: datetime.fromtimestamp(_)
Out[82]: datetime.datetime(2017, 3, 17, 19, 34, 35)

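After URL-decoding the cookie value, we can see that I was redirected from https://www.google.com.hk/, and that the value also contains a timestamp recording when.

The remaining Cookies are set by Douban itself and do not belong to any analytics platform.

Everything above is what the browser sends to the server when opening Douban. So which Cookies does the server set for us? Let's test it: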

In [92]: import requests

In [93]: for i in range(10):
    ...:     r = requests.get('https://movie.douban.com/tag/2016?start=0&type=T')
    ...:     print(r.headers['Set-Cookie'])
    ...:
bid=YE31t9f2CtY; Expires=Sat, 17-Mar-18 13:11:42 GMT; Domain=.douban.com; Path=/
bid=AkV_9uN6CxQ; Expires=Sat, 17-Mar-18 13:11:43 GMT; Domain=.douban.com; Path=/
bid=8EhJ9dCj1pw; Expires=Sat, 17-Mar-18 13:11:44 GMT; Domain=.douban.com; Path=/
bid=G4O0c55MbGU; Expires=Sat, 17-Mar-18 13:11:52 GMT; Domain=.douban.com; Path=/
bid=UtW6FWxzk5E; Expires=Sat, 17-Mar-18 13:11:54 GMT; Domain=.douban.com; Path=/
bid=QzZ_sVbO4Qs; Expires=Sat, 17-Mar-18 13:11:56 GMT; Domain=.douban.com; Path=/
bid=dLPTZc4Kh7Q; Expires=Sat, 17-Mar-18 13:11:58 GMT; Domain=.douban.com; Path=/
bid=imFq99iN5f8; Expires=Sat, 17-Mar-18 13:12:00 GMT; Domain=.douban.com; Path=/
bid=Q-bdkpxA0zM; Expires=Sat, 17-Mar-18 13:12:15 GMT; Domain=.douban.com; Path=/
bid=3rv0SUwSG2c; Expires=Sat, 17-Mar-18 13:12:17 GMT; Domain=.douban.com; Path=/
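
Each `Set-Cookie` line above carries a name, a value, and several attributes. As a minimal sketch, the standard library's `SimpleCookie` can separate them. The header string is copied from the first line of the output, and the variable names are only for illustration.

```python
from http.cookies import SimpleCookie

header = ('bid=YE31t9f2CtY; Expires=Sat, 17-Mar-18 13:11:42 GMT; '
          'Domain=.douban.com; Path=/')

jar = SimpleCookie()
jar.load(header)

# Each cookie is stored as a Morsel; attributes are read with lowercase keys.
morsel = jar['bid']
attrs = {name: morsel[name] for name in ('expires', 'domain', 'path')}

print(morsel.value)  # YE31t9f2CtY
print(attrs)
```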

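We request the Douban Movie page 10 times. Each time the server sets a single item, bid, with a different value, since none of our requests sends a cookie back. The cookie is scoped to the root path of the Douban domain, which means it will be sent to the server whenever we visit any Douban page. Let's look at the expiry time:

In [102]: import dateutil.parser

In [103]: dateutil.parser.parse('17-Mar-18 13:11:42 GMT')
Out[103]: datetime.datetime(2018, 3, 17, 13, 11, 42, tzinfo=tzutc())

This Cookie is valid for one year, so it is presumably the ID used to track us as a user.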
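
The one-year figure can be checked with a little arithmetic. Below is a minimal sketch using only the standard library. It assumes the cookie was issued exactly one year before its expiry, because the transcript does not record the request time, so `issued` is an illustrative value; the helper name `cookie_lifetime` is hypothetical. Parsing the day and month names with `strptime` also assumes an English locale.

```python
from datetime import datetime, timezone

# Format of the Expires attribute as it appears in the Set-Cookie headers.
EXPIRES_FORMAT = '%a, %d-%b-%y %H:%M:%S GMT'


def cookie_lifetime(expires, issued_at):
    """Return how long a cookie lives, given its Expires string and issue time."""
    expires_at = datetime.strptime(expires, EXPIRES_FORMAT)
    # The header is in GMT, so attach UTC explicitly before subtracting.
    expires_at = expires_at.replace(tzinfo=timezone.utc)
    return expires_at - issued_at


# Assumed issue time (not recorded in the transcript).
issued = datetime(2017, 3, 17, 13, 11, 42, tzinfo=timezone.utc)
lifetime = cookie_lifetime('Sat, 17-Mar-18 13:11:42 GMT', issued)
print(lifetime.days)  # 365
```

In practice the issue time could be taken from the response's `Date` header rather than hard-coded.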

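Summary:

This article introduced some ways of analyzing Cookies with Python and identified the Cookie that Douban uses to track users. A follow-up will cover how to forge Cookies.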
