爬虫学习记录-2

  • url的组成

  • 请求对象的定制

如果将http改成https并且还是用原代码就会出现爬取失败,原因是有反爬的出现

url = 'https://www.baidu.com'
response = urllib.request.urlopen(url)
context = response.read().decode('utf-8')
print(context)

所以需要ua,注意ua为字典类型的数据,冒号两端需要加引号,又因为urlopen方法不能存储字典,headers传递不进去,所以需要请求对象的定制

request = urllib.request.Request(url=url,headers=headers)
url = 'https://www.baidu.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.76'}
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
context = response.read().decode('utf-8')
print(context)

*注:Request后的括号里不能直接写(url,headers),因为Request括号里的内容分别为url、data、headers

  • get请求的quote方法

将中文转换成Unicode编码格为(需要调用urllib.parse库)

name = urllib.parse.quote('周杰伦')

  • get请求的urlencode方法

当需要将多个中文转换成Unicode编码时,quote方法比较繁琐且需要拼接,而urlencode方法比较简洁,urlencode方法要求参数以字典形式存在

date = {
    'q':'周杰伦',
    'sex':'男',
}
new_date = urllib.parse.urlencode(date)

  • post请求

post的请求参数一定要转换成Unicode编码才能拼接,因此同样需要调用urllib.parse包,post请求的参数不是拼接在url后面,而是在请求对象定制中的Request()传入data参数

date = {'kw':'spider'}
request = urllib.request.Request(url=url,data=data,headers=headers)

*注:此时的data为字符串类型,而当发送请求时,data应为字节型数据,因此post请求的参数必须要进行编码,在urlencode(data)后面加上.encode('utf-8')

data = urllib.parse.urlencode(data).encode('utf-8')

此时运行得到的结果为json类型,需要转换成json字符串类型转换成json对象类型,调用json包

obj = json.loads(content)
print(obj)
url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.76'}
date = {'kw':'spider'}
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url,data=data,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

obj = json.loads(content)
print(obj)

到此步按照老师视频上的步骤就结束了,但结果遇到了反爬,可能是老师的课录制的时间比较早,当时百度翻译还没有运用反爬,因此按照下一个视频内容,将请求头内容给到headers,需要注意的是要将Accept-Encoding此句注释掉

url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
headers = {
    'Accept': '*/*',
    'Accept-Language':',',
    'Acs-Token':'1673597114416_1673684731734_wNOw40KHDoThOttvnz3E2+MzvFTqIVBg5D/YSb5if1cu/Db9cBl1IUgfrJWPt70kOc6LYONVr6JrBXv0aOchqC5qEg5eelF+samJ/TWckKCFcH8wbh5MzFVJHk7kwiVmsiuMGu/8xxWlj1ZupaRVj1fipZnNb7y8E2hE5FE5ZVtrzWmKmy7oTUoXoc9e5z1j1jB4tS8m6/j5FKjExiu+Ry3y9mXspK2Aps2Iwfk1RGv7myyfdtQe+bSu5QHowee0bPSTeT3wVfZ6o3Mz3UctfthBWS/8VEYe1Z8/ATrimYjmWZd9DKnMIVkYkLcKx+pAyKrKwUm2lFqNoAktV21qwg==',
    'Connection': 'keep-alive',
    'Content-Length': '117',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'BIDUPSID=A9033E673693226562565BC0C5D5CB90; PSTM=1579152228; __yjs_duid=1_c3eab7486e61dc9f7b99b255ee9201311626010676599; BDUSS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BDUSS_BFESS=VdEamY4VThWWnU5bmV5a2prWkhQNGVJQklOUH5mU3pncFUwV2N-Nlh5RnZ4YWxpRVFBQUFBJCQAAAAAAQAAAAEAAACRKBoQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG84gmJvOIJiV; BAIDUID=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; H_PS_PSSID=36557_37647_37906_38013_36920_37989_37796_37938_26350_22157_38008_37881; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDUID_BFESS=8BA77C9C54F42CDB9CCFC4413033A89A:FG=1; delPer=0; PSINO=1; BA_HECTOR=ag0l8g040l008h25a1052gd91hs4l0t1l; ZFY=MCepCr8PT4:B4:Ab6HFt:AkaBEcXXM6sCgCmHeAwBnng14:C; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1673681256; APPGUIDE_10_0_2=1; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1673684716; ab_sr=1.0.1_YWExYTY3MmM3MDliNzk3MmJjMWUzNzJlNGZiMjg1MTBjNDBhN2UxMjEzNjM2ZGE5YTkyMjBjNmYyNzQzZDk5ZDQ3YWEwNTcyODQ0NjczZWZjZjJlYjBiNGZiYjA3OWZkMTBiZGYxZTY1MWQxNWE5NGQwNTBmMGI5MjFkY2QzNTI4MzNhMzkzMDljMjJhMzlhYTEzMzFlMTMwYzBhNjczNTUyMWI3M2VjODNkMzU0Njg1N2UxZDcwMGU5MTk1OThl',
    'Host': 'fanyi.baidu.com',
    'Origin': 'https://fanyi.baidu.com',
    'Referer': 'https://fanyi.baidu.com/',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Microsoft Edge";v="108"',
    'sec-ch-ua-mobile':'?0',
    'sec-ch-ua-platform': 'Windows',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.76',
    'X-Requested-With': 'XMLHttpRequest',
}
data = {
    'from':'en',
    'to':'zh',
    'query':'spider',
    'simple_means_flag':'3',
    'sign':'63766.268839',
    'token':'3613d79ecb5f10d666bf10d01faa0588',
    'domain':'common'
}
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url,data=data,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

obj = json.loads(content)
print(obj)

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值