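Scraping Baidu Index
The original article is here: 如何用Python下载百度指数的数据 (How to download Baidu Index data with Python), from 小小明-代码实体's column on the CSDN blog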
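00. Preface
I have long wanted to scrape Baidu Index on a daily schedule and show the numbers on a dashboard, and I finally came across an expert's method for doing it. Happy days. This post lists the result produced at every step (handy for the day my brain stops cooperating and I can't remember what a step is for).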
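Notes:
- Baidu Index requires a logged-in account (the login is only there to obtain the cookie; once you have it, you probably don't need to log in again)
- The cookie is found in the browser's developer tools (F12 in Chrome): open any request to the site and scroll down through its headers until you see it
- Baidu Index accepts at most 5 comparison keywords per query; anything beyond that needs another query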
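Since only five comparison keywords fit in one query, a longer keyword list has to be split into batches first. Below is a minimal sketch of that splitting. The names `MAX_WORDS` and `chunk_keywords` are made up for this illustration and do not come from the original article.

```python
MAX_WORDS = 5  # limit noted above: at most 5 comparison keywords per query


def chunk_keywords(keys, size=MAX_WORDS):
    """Split a keyword list into consecutive batches of at most `size` items."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


if __name__ == "__main__":
    # Seven placeholder keywords end up in one batch of five and one of two.
    for batch in chunk_keywords(["k1", "k2", "k3", "k4", "k5", "k6", "k7"]):
        print(batch)
```

Each batch can then be sent as its own query and the results merged afterwards.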
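01. Baidu Index scraper
First come the request headers. As far as I understand, they make the scraper look like a normal user browsing the site, so that anti-scraping checks don't block it... I'll come back and fill in the details once I've studied this properly.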
headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": "too long to show here; copy it from the browser whenever you need it"
}
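Because the cookie is long and tied to a login, one option is to keep it out of the source file and read it at run time. The sketch below assumes the cookie string copied from the browser has been stored in an environment variable. The variable name `BAIDU_INDEX_COOKIE` and the helper `build_headers` are hypothetical, introduced only for this example, and the header set is shortened for brevity.

```python
import os

# Shortened copy of the headers shown above (assumption: the rest are added the same way).
BASE_HEADERS = {
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://index.baidu.com/v2/index.html",
}


def build_headers(cookie=None):
    """Return a copy of BASE_HEADERS with the Cookie field filled in.

    If no cookie is passed, fall back to the (hypothetical) environment
    variable BAIDU_INDEX_COOKIE.
    """
    if cookie is None:
        cookie = os.environ.get("BAIDU_INDEX_COOKIE", "")
    if not cookie:
        raise ValueError("no cookie given and BAIDU_INDEX_COOKIE is not set")
    headers = dict(BASE_HEADERS)  # copy, so the shared template stays untouched
    headers["Cookie"] = cookie
    return headers
```

Failing early on a missing cookie makes the cause obvious, instead of leaving it to surface later as an unexpected response from the site.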
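Next comes the decryption function, and this is the key part! (The original author searched all of the page's scripts for "decrypt" and found the decryption function that way, which shows how much a function's name matters.)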
def decrypt(ptbk, index_data):  ## ptbk='pm1r-DUTRjYaVX95801+27.36-,%49';index_data='r1rpXar11Xja9DrXammpma91rrammjpa9DpU'
    # The first half of ptbk holds the cipher characters, the second half their plaintext
    n = len(ptbk) // 2  ## 15
    a = dict(zip(ptbk[:n], ptbk[n:]))  ## {'p': '5', 'm': '8', '1': '0', 'r': '1', '-': '+', 'D': '2', 'U': '7', 'T': '.', 'R': '3', 'j': '6', 'Y': '-', 'a': ',', 'V': '%', 'X': '4', '9': '9'}
    # Replace every character of index_data with its plaintext counterpart
    return "".join([a[s] for s in index_data])
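To see the substitution at work, here is the same function run on the sample `ptbk` and `index_data` values quoted in the comments above. Splitting the result into a list of integers at the end is an extra step added for illustration, not part of the original function.

```python
def decrypt(ptbk, index_data):
    """Map each cipher character to its plaintext via the two halves of ptbk."""
    n = len(ptbk) // 2
    mapping = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join(mapping[s] for s in index_data)


# Sample values taken from the comments in this post
ptbk = "pm1r-DUTRjYaVX95801+27.36-,%49"
index_data = "r1rpXar11Xja9DrXammpma91rrammjpa9DpU"

plain = decrypt(ptbk, index_data)
values = [int(v) for v in plain.split(",")]  # illustrative extra step

print(plain)        # 10154,10046,9214,8858,9011,8865,9257
print(len(values))  # 7
```

The seven values line up with the seven days from 2021-11-20 to 2021-11-26 in the sample response.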
Then comes the main part: fetch the Baidu Index values, collect them into a list, and return the result.
import json
from datetime import date, timedelta

import requests


def get_index_data(keys, start=None, end=None):  ## keys=['丰田','特斯拉','大众']
    words = [[{"name": key, "wordType": 1}] for key in keys]
    # Compact JSON; unlike str().replace(), this keeps spaces and quotes inside keywords intact
    words = json.dumps(words, ensure_ascii=False, separators=(",", ":"))  ##'[[{"name":"丰田","wordType":1}],[{"name":"特斯拉","wordType":1}],[{"name":"大众","wordType":1}]]'
    today = date.today()  ## 2021-11-28
    if start is None:
        start = str(today - timedelta(days=8))  ## '2021-11-20'
    if end is None:
        end = str(today - timedelta(days=2))  ## '2021-11-26'
    # The duplicated area=0 parameter from the original URL has been dropped
    url = f'http://index.baidu.com/api/SearchApi/index?area=0&word={words}&startDate={start}&endDate={end}'  ## 'http://index.baidu.com/api/SearchApi/index?area=0&word=[[{"name":"丰田","wordType":1}],[{"name":"特斯拉","wordType":1}],[{"name":"大众","wordType":1}]]&startDate=2021-11-20&endDate=2021-11-26'
    print(words, start, end)
    res = requests.get(url, headers=headers)  ## <Response [200]>
    data = res.json()['data']  ## {'userIndexes': [ {'word': [{'name': '丰田', 'wordType': 1}], 'all': {'startDate': '2021-11-20', 'endDate': '2021-11-26', 'data': 'r1rpXar11Xja9DrXammpma91rrammjpa9DpU'}, 'pc': {'startDate': '2021-11-20', 'endDate': '2021-11-26', 'da