2021/4/21 Web Scraping, Lesson 4 (Network Request Modules, Part 2)

笔记本IT

Tags: python, json, garbled text, web scraping

Article link: https://blog.csdn.net/httpsssss/article/details/116023486

Contents

1. Rewriting Last Lesson's Baidu Tieba Code
2. POST Example (A Simple Translator)
3. The requests Module
4. Proxy IPs
5. Supplement

1. Rewriting Last Lesson's Baidu Tieba Code

Link to the previous lesson
Original script --> function-based version --> object-oriented version

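2. POST Example (A Simple Translator)

# Goal: a small translation tool
import urllib.request
import urllib.parse
import json

# Ask the user for the text to translate
content = input('Enter the text to translate: ')
# Target URL for the request. The "_o" has to be removed from the path (translate_o -> translate);
# you learn this either (1) from experience or (2) by reverse-engineering the site's JS.
url = 'https://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'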
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}
# Form data sent with the request
data = {
    'i': content,
    'from': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15880623642174',
    'sign': 'c6c2e897040e6cbde00cd04589e71d4e',
    'ts': '1588062364217',
    'bv': '42160534cfa82a6884077598362bbc9d',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_CLICKBUTTION'
}

# urlencode() returns a str, but urlopen() needs the POST body as bytes
data = urllib.parse.urlencode(data)
data = data.encode('utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode('utf-8')

# print(type(html))
# Parse the data
# JSON-formatted str --> Python dict
r_dict = json.loads(html)
# print(type(r_dict), r_dict)
r = r_dict['translateResult']  # [[{"src":"你好","tgt":"hello"}]]
result = r[0][0]['tgt']  # [{"src":"你好","tgt":"hello"}] -> {"src":"你好","tgt":"hello"}
print(result)

'''
Simple parsing of a JSON-formatted string (str):
  {"type":"ZH_CN2EN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"你好","tgt":"hello"}]]}
'''

What it does: Chinese --> English; other languages --> Chinese.
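The parsing step can be made a little more defensive. The following is a minimal sketch that works offline on the sample response shown above; the helper name `extract_translation` is hypothetical, and treating a non-zero `errorCode` as a failure is an assumption based only on that sample (where a successful call has `errorCode` 0).

```python
import json


def extract_translation(raw_json):
    """Return the translated text, or None if the response has an unexpected shape.

    Assumption: a non-zero "errorCode" means the request was rejected.
    """
    payload = json.loads(raw_json)  # JSON str --> Python object
    if not isinstance(payload, dict) or payload.get('errorCode') != 0:
        return None
    try:
        # translateResult is a list of lists of {"src": ..., "tgt": ...} dicts
        return payload['translateResult'][0][0]['tgt']
    except (KeyError, IndexError, TypeError):
        return None


sample = '{"type":"ZH_CN2EN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"你好","tgt":"hello"}]]}'
print(extract_translation(sample))  # hello
```

Returning None instead of raising keeps the calling script short; a real tool might prefer to raise an exception that reports the error code.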

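3. The requests Module

import requests

'''
response = requests.get(url=base_url, params=params, headers=headers)
1. url is the base URL, without the query parameters
2. the key-value pairs in params become the query parameters
From here on we simply write:
response = requests.get(url, headers=headers)

'''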

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}

url = 'https://qq.yh31.com/zjbq/2920180.html'
response = requests.get(url, headers=headers)  # returns a Response object
print(response.content.decode('utf-8'))  # response.content is the raw bytes; decode them as UTF-8
# response.encoding = 'utf-8'
# print(response.text)  # returns a str

'''
response.content is the data exactly as fetched from the site, with no processing applied
response.text is what requests gets by decoding response.content; requests first guesses an encoding
Usage:
response.text returns Unicode text (str): text only
response.content returns binary data (bytes): images, text, files, and so on
'''

'''
print(response.content)  # the raw bytes
Fixing garbled text: response.content.decode('utf-8') converts bytes --> str (Unicode)
'''

'''
If the output is garbled:
Option 1: response.content.decode('utf-8')
Option 2: set response.encoding = 'utf-8', then read response.text
'''
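When the page's encoding is not known in advance, option 1 can be tried with more than one codec. The sketch below is an illustration only: the helper `decode_body` and the candidate list `('utf-8', 'gbk')` are assumptions, and the bytes are built locally to stand in for `response.content` so that no network access is needed.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Decode raw bytes with the first candidate encoding that does not fail.

    Returns (text, encoding_used). If every candidate fails, the first one is
    used with undecodable bytes replaced, so the caller always gets a str.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors='replace'), encodings[0]


# Stand-in for response.content from a GBK-encoded page
gbk_page = '你好'.encode('gbk')
text, used = decode_body(gbk_page)
print(used, text)
```

A codec that decodes without error is not guaranteed to be the right one, so the order of the candidates matters; checking the page's declared charset first is usually more reliable.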

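import requests
# import json

# Ask the user for the text to translate
content = input('Enter the text to translate: ')
# Target URL for the request. The "_o" has to be removed; you learn this either (1) from experience
# or (2) by reverse-engineering the site's JS. The original URL was:
# url = 'https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
url = 'https://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'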
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}
# Form data sent with the request
data = {
    'i': content,
    'from': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15880623642174',
    'sign': 'c6c2e897040e6cbde00cd04589e71d4e',
    'ts': '1588062364217',
    'bv': '42160534cfa82a6884077598362bbc9d',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_CLICKBUTTION'
}

res = requests.post(url, data=data, headers=headers)
res.encoding = 'utf-8'  # optional
html = res.text
print(html)
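The `params` argument of `requests.get` noted earlier in this section turns key-value pairs into the query string of the URL. The sketch below approximates that step with the standard library only, so it runs without requests installed; `build_url` is a hypothetical helper and the parameter names are made up for illustration.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a query string (roughly what params= does)."""
    separator = '&' if '?' in base_url else '?'
    return base_url + separator + urlencode(params, doseq=True)


# Hypothetical parameters, for illustration only
print(build_url('http://httpbin.org/get', {'q': 'hello', 'page': 2}))
# http://httpbin.org/get?q=hello&page=2
```

With requests itself, the final URL of a request can be inspected through `response.url`.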

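4. Proxy IPs

Anonymity levels of proxy IPs
1. Transparent: the server knows you are using a proxy IP and also knows your real IP
2. Anonymous: the server knows you are using a proxy IP but does not know your real IP
3. High anonymity: the server knows neither that you are using a proxy IP nor your real IP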

import requests
import random

# Candidate proxy addresses (host:port)
ips = ['223.241.51.225:766', '123.96.26.72:36410', '113.237.242.51:23564', '124.94.188.12:3617', '117.95.24.132:894']
url = 'http://httpbin.org/ip'

for i in range(5):
    try:
        ip = random.choice(ips)
        # Route plain-HTTP requests through the randomly chosen proxy
        res = requests.get(url, proxies={'http': 'http://' + ip}, timeout=0.5)
        print(res.text)
    except Exception as e:
        print('Request failed:', e)
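The loop above can be wrapped into reusable helpers. This is a minimal sketch with hypothetical names (`build_proxies`, `fetch_with_rotation`); the actual request is passed in as a function, so the rotation logic can be tried without a working proxy. One detail worth knowing: requests selects a proxy by the scheme of the target URL, so a `proxies` dict with only an `'http'` key is not used for `https://` URLs.

```python
import random


def build_proxies(address):
    """Build a requests-style proxies dict from a 'host:port' string."""
    proxy_url = 'http://' + address
    # Cover both schemes so https:// targets also go through the proxy
    return {'http': proxy_url, 'https': proxy_url}


def fetch_with_rotation(fetch, addresses, attempts=5):
    """Call fetch(proxies) with randomly chosen proxies until one call succeeds.

    fetch is any callable that takes a proxies dict and raises on failure.
    """
    last_error = None
    for _ in range(attempts):
        address = random.choice(addresses)
        try:
            return fetch(build_proxies(address))
        except Exception as exc:  # a dead proxy, a timeout, and so on
            last_error = exc
    raise RuntimeError('all proxy attempts failed') from last_error
```

With requests, a call could look like `fetch_with_rotation(lambda p: requests.get('http://httpbin.org/ip', proxies=p, timeout=3).text, ips)`.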

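Finding proxy IPs:
https://h.wandouip.com/user/login
Checking your IP:
cmd -> ipconfig shows the internal (LAN) IP: a private address that is used inside the local network and is not routable on the public Internet
ipip.net or http://httpbin.org/ip shows the public IP you use on the Internet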

5. Supplement

Using encode and decode
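As a quick illustration of the two methods: `str.encode` turns text into bytes, and `bytes.decode` turns bytes back into text. The sketch below also shows why the urllib POST example in section 2 has to encode its form data before sending it. It uses only the standard library.

```python
import urllib.parse

text = '你好'

# str --> bytes: the same text has different byte representations
as_utf8 = text.encode('utf-8')
as_gbk = text.encode('gbk')
print(len(text), len(as_utf8), len(as_gbk))  # 2 6 4

# bytes --> str: decoding with the matching codec restores the text
print(as_utf8.decode('utf-8') == text)  # True

# urlencode() returns a str; a POST body for urllib must be bytes
form = urllib.parse.urlencode({'i': text})
body = form.encode('utf-8')
print(form)  # i=%E4%BD%A0%E5%A5%BD
print(type(body))
```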
