General Crawler Example 3: Baidu Search

Requirements:
1. Save a Baidu search results page to a local file
2. Allow the search query to be customized

URL: Uniform Resource Locator. A URL maps to exactly one page, but one page can be reached through multiple different URLs.

Steps:
1. Import requests

import requests

Define the request headers (requests accepts them as a dict):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
  1. When using request headers, keep the following in mind (see the sketch after this list):
    1. No matter which site you request, always send request headers, and include at least the User-Agent (UA)
    2. If you still cannot get the page content after adding the UA, try adding the Cookie as well
    3. If the Cookie still does not help, copy over the complete set of browser request headers (headers whose names start with ":" should not be included)
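
A minimal sketch of this escalation, assuming a hypothetical target URL and a placeholder cookie value that you would copy from your own browser:

import requests

# Hypothetical target URL, used only for illustration
url = 'https://www.baidu.com/s?wd=python'

# Step 1: send the User-Agent only
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(url=url, headers=headers)

# Step 2: if the body is still not the real page, add the Cookie
# copied from your own browser (placeholder value below)
headers['Cookie'] = 'PASTE_YOUR_BROWSER_COOKIE_HERE'
response = requests.get(url=url, headers=headers)

# Step 3: if that still fails, copy every header from the browser's
# request, skipping pseudo-headers whose names start with ':'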

2. Send the request and receive the response

response = requests.get(url='https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xbc8cddba00193fb1&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&rsv_dl=tb&oq=python%255C&rsv_btype=t&inputT=152&rsv_t=3b34X8aZ%2BOR4wFoWNsRSDFZ%2BXUSLwZhzd8ANNBnR5W0KeDLL3B5iMMn0mURtzf3wVFns&rsv_sug3=10&rsv_sug1=9&rsv_sug7=100&rsv_pq=8619014a003e7ae2&rsv_sug2=0&rsv_sug4=824', headers=headers)
# print(response.text)

Save the file:

with open('baidu.html','w',encoding='utf-8') as fp:
    fp.write(response.text)
  1. Problem found: requesting the page directly, the downloaded content does not match the page source shown in the browser
  2. Cause: the page uses anti-scraping measures
  3. The most basic countermeasure is to send proper request headers (the sketch below compares both requests)
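
One quick way to confirm the anti-scraping behavior is to compare the body of a bare request with one that carries a browser UA. A rough sketch; the length comparison is only a heuristic:

import requests

url = 'https://www.baidu.com/s?wd=python'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

bare = requests.get(url=url)                      # default python-requests UA
with_ua = requests.get(url=url, headers=headers)  # browser UA

# A much shorter body usually means the server returned a stub or
# verification page instead of the real search results
print(len(bare.text), len(with_ua.text))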

Request Headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
# Accepted content types. q is a weight factor in the range 0-1: the higher the value, the higher the priority; if q is omitted it defaults to 1
Accept-Encoding: gzip, deflate, br
# Accepted content encodings
Accept-Language: zh-CN,zh;q=0.9
# Accepted languages
Cache-Control: max-age=0
# Caching policy
Connection: keep-alive
# Connection type; keep-alive means a persistent connection
Cookie: BAIDUID=14E5BDDC5DF587C63FFA6790E6BC1B1E:FG=1; BIDUPSID=14E5BDDC5DF587C63FFA6790E6BC1B1E; PSTM=1569939300; BD_UPN=12314753; BDUSS=HdjVDk0aWlRN24yVERWRUJCZ1FKeGN4SVR-MzQyeUlIOVRQakM3SXo4VzQxakJlRVFBQUFBJCQAAAAAAAAAAAEAAADtZEg~4erc-OzhyMvYr8zD2K0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAALhJCV64SQleb; H_WISE_SIDS=140842_143862_142064_141910_128698_143579_143298_142207_142112_143879_142357_140631_139056_141745_143161_138904_142510_143942_139175_142918_142780_131246_137746_138165_138883_141942_127969_142874_140065_143997_140593_134047_143060_141808_140351_138425_143469_143922_143275_141930_131423_107315_131115_138595_143478_140797_143549_142576_110085; BD_HOME=1; H_PS_PSSID=32292_1437_31672_31253_32045_32231_31321_32298; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; BD_CK_SAM=1; PSINO=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_645EC=f25eqgY0usptplBtTg61c6tAdftIHE92SDpWrPYsKsGO6oM7XwcsVeyJzfIC33VZTwOj; BDSVRTM=146; COOKIE_SESSION=3_0_9_5_6_4_0_0_9_2_2_0_1593_0_0_0_1595409644_0_1595467044%7C9%23658285_21_1594742369%7C9
# HTTP itself is stateless, so the Cookie is used to carry session state
Host: www.baidu.com
# Target host
Referer: https://www.baidu.com/s?wd=python%5C&rsv_spt=1&rsv_iqid=0xbc8cddba00193fb1&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_dl=tb&rsv_sug3=8&rsv_sug1=7&rsv_sug7=101&rsv_sug2=0&rsv_btype=i&inputT=1511&rsv_sug4=2232
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
# Identifies the client; checking the UA is the first line of anti-scraping defense, since servers can use it to spot crawlers
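
Because HTTP is stateless, pasting the Cookie by hand is one option; requests also offers a Session object that stores and resends cookies automatically. A minimal sketch:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

session = requests.Session()
session.headers.update(headers)  # sent with every request on this session

# Cookies set by the first response are stored on the session and
# sent automatically with all later requests
session.get('https://www.baidu.com/')
print(session.cookies.get_dict())

response = session.get('https://www.baidu.com/s', params={'wd': 'python'})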

Inspect the default request headers:

print(response.request.headers)  # {'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
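
Any key you pass in headers overrides the matching default, which can be verified on response.request.headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(url='https://www.baidu.com/', headers=headers)

# The custom UA replaces the python-requests default; the other
# defaults (Accept, Accept-Encoding, Connection) are kept
print(response.request.headers)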


URL when searching for python: 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xbc8cddba00193fb1&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&rsv_dl=tb&oq=python%255C&rsv_btype=t&inputT=152&rsv_t=3b34X8aZ%2BOR4wFoWNsRSDFZ%2BXUSLwZhzd8ANNBnR5W0KeDLL3B5iMMn0mURtzf3wVFns&rsv_sug3=10&rsv_sug1=9&rsv_sug7=100&rsv_pq=8619014a003e7ae2&rsv_sug2=0&rsv_sug4=824'
URL when searching for php: 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&tn=baiduhome_pg&wd=php&rsv_spt=1&oq=python&rsv_pq=aba343a70020eb7b&rsv_t=16f7e1RsXXXnKeKB9nXxDMpc8Fh0lmxI97uopKhVYfI0dqgJWrnM%2BPfDhDhVAspYdY7M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=3&rsv_sug1=3&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=770&rsv_sug4=770'
The php URL simplifies (parameters trimmed) to: https://www.baidu.com/s?wd=php (delete one parameter at a time and re-request; if removing it does not change the page, the parameter can be deleted)
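
This trimming can also be scripted: drop one parameter at a time, re-request, and keep the deletion if the page still looks intact. A rough sketch, assuming that a matching status code plus a 20% body-length tolerance is an acceptable "page unchanged" heuristic:

import requests
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

def trim_params(url):
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    baseline = requests.get(url, headers=headers)
    for key in list(params):
        kept = {k: v for k, v in params.items() if k != key}
        test_url = urlunsplit(parts._replace(query=urlencode(kept)))
        resp = requests.get(test_url, headers=headers)
        # Crude check: same status code and roughly the same body size
        if resp.status_code == baseline.status_code and \
                abs(len(resp.text) - len(baseline.text)) < 0.2 * len(baseline.text):
            params = kept  # the parameter was not needed
    return urlunsplit(parts._replace(query=urlencode(params)))

print(trim_params('https://www.baidu.com/s?ie=utf-8&f=8&wd=php'))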
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

Customize the search query:

wd = input('Enter a search query: ')

Define the parameter dict:

params = {
    'wd': wd
}
response = requests.get(url='https://www.baidu.com/s', params=params, headers=headers)
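
requests percent-encodes the params dict and appends it to the base URL itself; response.url shows the final address:

# The params dict is percent-encoded and appended automatically
print(response.url)  # e.g. https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB when wd is '爬虫'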

Save the file:

with open('{}.html'.format(wd),'w',encoding='utf-8') as fp:
    fp.write(response.text)
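
If the saved file comes out garbled, the encoding requests decoded with may not match the page's actual encoding; a common fix (not specific to this page) is to align it with requests' own guess before reading .text:

# Align the decode encoding with the one guessed from the body bytes
response.encoding = response.apparent_encoding
with open('{}.html'.format(wd), 'w', encoding='utf-8') as fp:
    fp.write(response.text)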