Web scraping — generating page headers from cURL

I've been writing some scrapers recently, and certain sites require headers to be sent with each request. Copying a site's headers into a dict one by one is tedious. Then I noticed that Chrome can produce a so-called cURL command that contains all of a page's headers, so I wrote a function that extracts them automatically and returns them as a dict.

First, how to get a page's cURL command: open Chrome DevTools (F12), go to the Network tab, right-click the request you care about, and choose Copy → Copy as cURL. On Windows, pick the cmd variant — that is what the code below expects, since it quotes values with double quotes and escapes special characters with ^.

Following those steps gives you something like this:

curl "https://www.baidu.com/" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7" -H "Cookie: BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-125^%^3A; BDRCVFR^[PaHiFN6tims^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_CK_SAM=1; PSINO=6; BD_HOME=0; H_PS_645EC=4e59hoONzQuutKCtXCPV6g0wJR542dDzWVAf8xeRV9zRC1YqNm24lgBZVnaFvrG3DTCl7KqX; BDRCVFR^[PGnakqNNAQT^]=9xWipS8B-FspA7EnHc1QhPEUf; H_PS_PSSID=1458_21120_27401_27376_26350" --compressed

Here is the code that does the conversion:

import re
import json

def curl_to_headers(curl):
    # Drop the trailing --compressed flag and strip all double quotes,
    # then split the remaining command on its -H options.
    curl = re.sub('--compressed', '', curl)
    curl = re.sub('"', '', curl)
    parts = curl.split('-H')

    headers = {}
    for part in parts[1:]:
        # Each part looks like 'Name: value'; split on the first colon
        # so colons inside the value survive.
        m = re.match('(.*?):(.*)', part.strip())
        if m:
            headers[m.group(1).strip()] = m.group(2).strip()

    print(json.dumps(headers, indent=4))
    return headers

if __name__ == '__main__':
    curl = input('Paste the cURL command to convert: ')
    print('\n', '*' * 50, '\n')
    curl_to_headers(curl)
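To make the split step concrete, here is what it produces on a shortened command (a toy example of my own): the first element is the curl <url> part, and every later element is one 'Name: value' header string.

cmd = 'curl https://www.baidu.com/ -H Connection: keep-alive -H Cache-Control: max-age=0'
parts = cmd.split('-H')
# parts[0] -> 'curl https://www.baidu.com/ '
# parts[1] -> ' Connection: keep-alive '
# parts[2] -> ' Cache-Control: max-age=0'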

You can run the file directly to generate the headers, or import the function from other code (a usage sketch follows the JSON below).
The generated headers look like this:

{
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cookie": "BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-125^%^3A; BDRCVFR^[PaHiFN6tims^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_CK_SAM=1; PSINO=6; BD_HOME=0; H_PS_645EC=4e59hoONzQuutKCtXCPV6g0wJR542dDzWVAf8xeRV9zRC1YqNm24lgBZVnaFvrG3DTCl7KqX; BDRCVFR^[PGnakqNNAQT^]=9xWipS8B-FspA7EnHc1QhPEUf; H_PS_PSSID=1458_21120_27401_27376_26350"
}
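As a quick illustration of calling it as a module — assuming the script above is saved as curl2headers.py (a filename of my choosing):

import requests
from curl2headers import curl_to_headers

curl_cmd = 'curl "https://www.baidu.com/" -H "User-Agent: Mozilla/5.0" -H "Accept-Language: zh-CN,zh;q=0.9"'
headers = curl_to_headers(curl_cmd)
resp = requests.get('https://www.baidu.com/', headers=headers)
print(resp.status_code)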

Sometimes I also need to hit a link with requests just to see what the page returns, and I was rewriting the same boilerplate every time. So I extended the code above slightly so that it emits a complete, ready-to-run script: copy, paste, and it runs.

import json
import re

def curl_to_headers(curl):
    # Template for the generated, ready-to-run script.
    text = """
import requests

url = '{url}'
headers = {headers}
resp = requests.get(url, headers=headers).text

print(resp)
"""

    curl = re.sub('--compressed', '', curl)
    curl = re.sub('"', '', curl)
    parts = curl.split('-H')

    headers = {}
    for part in parts[1:]:
        m = re.match('(.*?):(.*)', part.strip())
        if m:
            headers[m.group(1).strip()] = m.group(2).strip()

    # parts[0] is 'curl <url> '; drop the leading 'curl ' and any stray whitespace.
    url = parts[0][5:].strip()
    text = text.format(url=url, headers=json.dumps(headers, indent=4))
    print(text)

if __name__ == '__main__':
    curl = input('Paste the cURL command to convert: ')
    print('\n', '*' * 50, '\n')
    curl_to_headers(curl)

Running it produces:

import requests

url = 'https://www.baidu.com/'
headers = {
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cookie": "BIDUPSID=69190AC2F9371D83695F00687C3AB96D; PSTM=1537321028; BD_UPN=12314753; BAIDUID=F6C3199E0D95B688A7B4A126DB21CF65:FG=1; __cfduid=d980a4ba128a4742f240b9a792da30f1e1540616515; sug=3; sugstore=0; ORIGIN=0; bdime=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; MCITY=-^%^3A; BDRCVFR^[rePVrIVEn7n^]=9xWipS8B-FspA7EnHc1QhPEUf; delPer=0; BD_HOME=0; H_PS_PSSID=1458_21120_27376_26350_27244_27542"
}
resp = requests.get(url, headers=headers).text

print(resp)
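Two caveats before leaning on this too hard. First, the generated script drops --compressed but keeps the Accept-Encoding: gzip, deflate, br header; requests decodes gzip and deflate transparently, but Brotli (br) only if the brotli package is installed, so if the server answers with Brotli you may get garbled text — either install brotli or delete br from the header. Second, splitting on the literal string '-H' works here but would break if a header value ever contained '-H', and the cmd-style ^ escapes (visible in the Cookie values above) are left in the parsed values. Here is a sturdier sketch of the parsing step, using the standard library's shlex to tokenize the command instead (the function name is my own, and it still does not undo cmd's ^ escaping):

import shlex

def curl_to_headers_shlex(curl):
    # Tokenize the command the way a shell would, instead of
    # splitting on the literal string '-H'.
    tokens = iter(shlex.split(curl)[1:])  # skip the leading 'curl'
    url, headers = None, {}
    for token in tokens:
        if token == '-H':
            # The next token is a complete 'Name: value' string.
            name, _, value = next(tokens).partition(':')
            headers[name.strip()] = value.strip()
        elif not token.startswith('-'):
            url = token
    return url, headers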