I have been writing web crawlers lately, and some sites require request headers. Copying a site's headers one by one into a dict by hand is tedious. Then I noticed that Chrome can generate something called a cURL command, which contains the page's headers, so I wrote a function that extracts them automatically and returns them as a dict.
First, here is how to get a page's cURL command: open Chrome DevTools (F12), switch to the Network tab, reload the page, then right-click the request and choose Copy → Copy as cURL.
Following those steps gives you something like this:
curl "https://www.baidu.com/" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7" -H "Cookie: BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-125^%^3A; BDRCVFR^[PaHiFN6tims^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_CK_SAM=1; PSINO=6; BD_HOME=0; H_PS_645EC=4e59hoONzQuutKCtXCPV6g0wJR542dDzWVAf8xeRV9zRC1YqNm24lgBZVnaFvrG3DTCl7KqX; BDRCVFR^[PGnakqNNAQT^]=9xWipS8B-FspA7EnHc1QhPEUf; H_PS_PSSID=1458_21120_27401_27376_26350" --compressed
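Note the stray `^` characters in the Cookie value above: on Windows, Chrome's "Copy as cURL (cmd)" variant escapes special characters for cmd.exe with carets. A small helper (the name `strip_cmd_carets` is my own) can undo that before parsing, on the assumption that every caret in the command is an escape character:

```python
import re

def strip_cmd_carets(curl):
    """Remove Windows cmd caret escapes (^%, ^[, ...) from a copied cURL command.

    In cmd.exe a caret escapes the character that follows it, so dropping
    the caret and keeping the next character restores the original value.
    """
    return re.sub(r'\^(.)', r'\1', curl)
```

Running the raw command through this first makes the header values match what the browser actually sent.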
Here is the code that does the conversion:
import re
import json


def curl_to_headers(curl):
    # Drop the trailing --compressed flag and all double quotes,
    # then split on -H so each chunk holds one "Name: value" pair.
    curl = re.sub('--compressed', '', curl)
    curl = re.sub('"', '', curl)
    curl = curl.split('-H')
    headers = {}
    for i in curl[1:]:
        # Non-greedy match up to the first colon only, so values
        # that themselves contain ':' stay intact.
        s = re.match('(.*?):(.*)', i.strip())
        headers[s.group(1).strip()] = s.group(2).strip()
    print(json.dumps(headers, indent=4))
    return headers


if __name__ == '__main__':
    curl = input('Paste the cURL command to convert: ')
    print('\n', '*' * 50, '\n')
    curl_to_headers(curl)
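The `split('-H')` approach above works for typical commands, but it breaks if a header value happens to contain `-H` or relies on shell quoting. As an alternative sketch, the standard library's `shlex` can tokenize the command the way a shell would (the function name `curl_to_headers_shlex` is my own):

```python
import shlex

def curl_to_headers_shlex(curl):
    """Parse a 'Copy as cURL' command into (url, headers) using shlex.

    Tokenizing with shlex respects the shell quoting, so header values
    containing '-H', quotes, or spaces no longer break the parse.
    """
    tokens = shlex.split(curl)
    url = None
    headers = {}
    it = iter(tokens[1:])          # skip the leading 'curl'
    for tok in it:
        if tok == '-H':
            # The next token is the whole "Name: value" string;
            # partition on the first colon only.
            name, _, value = next(it).partition(':')
            headers[name.strip()] = value.strip()
        elif not tok.startswith('-'):
            url = tok              # the bare token is the URL
    return url, headers
```

Flags such as `--compressed` fall through both branches and are ignored automatically, so there is no need to strip them out first.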
You can run the file directly to generate the headers, or import the function from it as a module.
The generated headers look like this:
{
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cookie": "BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-125^%^3A; BDRCVFR^[PaHiFN6tims^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_CK_SAM=1; PSINO=6; BD_HOME=0; H_PS_645EC=4e59hoONzQuutKCtXCPV6g0wJR542dDzWVAf8xeRV9zRC1YqNm24lgBZVnaFvrG3DTCl7KqX; BDRCVFR^[PGnakqNNAQT^]=9xWipS8B-FspA7EnHc1QhPEUf; H_PS_PSSID=1458_21120_27401_27376_26350"
}
I also sometimes need to try a link with requests and look at what the page returns, which means rewriting the same boilerplate each time. So I extended the function above to emit a complete, ready-to-run script: paste it into a file and it runs as-is.
import json
import re


def curl_to_headers(curl):
    # Template for the generated script; {url} and {headers}
    # are filled in below via str.format.
    text = """
import requests

url = '{url}'
headers = {headers}
resp = requests.get(url, headers=headers).text
print(resp)
"""
    curl = re.sub('--compressed', '', curl)
    curl = re.sub('"', '', curl)
    curl = curl.split('-H')
    headers = {}
    for i in curl[1:]:
        s = re.match('(.*?):(.*)', i.strip())
        headers[s.group(1).strip()] = s.group(2).strip()
    # curl[0] is 'curl <url> '; slice off the 'curl ' prefix and
    # strip the surrounding whitespace so the URL is clean.
    text = text.format(url=curl[0][5:].strip(),
                       headers=json.dumps(headers, indent=4))
    print(text)


if __name__ == '__main__':
    curl = input('Paste the cURL command to convert: ')
    print('\n', '*' * 50, '\n')
    curl_to_headers(curl)
Running it produces:
import requests

url = 'https://www.baidu.com/'
headers = {
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cookie": "BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; sug=3; sugstore=0; ORIGIN=0; bdime=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-^%^3A; BDRCVFR^[rePVrIVEn7n^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_HOME=0; H_PS_PSSID=1458_21120_27376_26350_27244_27542"
}
resp = requests.get(url, headers=headers).text
print(resp)
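Rather than hard-coding the Cookie header into the generated script, you can also let requests manage cookies itself: the raw cookie string splits cleanly into a dict that can be passed via the `cookies=` parameter of `requests.get`. A minimal sketch (the function name `cookie_header_to_dict` is my own):

```python
def cookie_header_to_dict(cookie_value):
    """Split a raw 'Cookie' header value into a name -> value dict.

    partition('=') splits on the first '=' only, so cookie values
    that themselves contain '=' are preserved intact.
    """
    cookies = {}
    for pair in cookie_value.split(';'):
        pair = pair.strip()
        if pair:
            name, _, value = pair.partition('=')
            cookies[name] = value
    return cookies
```

Passing the result as `requests.get(url, headers=headers, cookies=cookie_header_to_dict(...))` (with "Cookie" removed from `headers`) keeps the cookies visible and editable as individual values.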