Crawler GET & POST Submission Methods

Latest recommended article published on 2024-03-26 16:09:40

Lank蓝柯

Category: Notes. Tags: Python crawler

Article link: https://blog.csdn.net/MrWanC/article/details/88715863

# The urllib.parse module handles URL encoding and decoding
from urllib import parse

d = {'id':1,
     'name':'tom'
     }
url = 'http://www.magedu.com/python'
u = parse.urlencode(d)  # the first argument of urlencode must be a mapping (such as a dict) or a sequence of two-element tuples
print(u)
'''
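After encoding with urlencode, append the result to the URL:
http://www.magedu.com/python?id=1&name=tom  (the part after ? is the query string; this is a typical GET request)
If instead body = 'id=1&name=tom' is sent as data, it is a typical POST request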
'''
d2 = parse.urlencode({
        'url':'http://www.magedu.com/python',
        'p_url':'http://www.magedu.com/python?id=1&name=张三'
        })
print(d2)

Output:

id=1&name=tom
p_url=http%3A%2F%2Fwww.magedu.com%2Fpython%3Fid%3D1%26name%3D%E5%BC%A0%E4%B8%89&url=http%3A%2F%2Fwww.magedu.com%2Fpython
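The comment above notes that urlencode also accepts a sequence of two-element tuples. A minimal sketch of that form follows; the sample pairs and the `tag` key are made up for illustration. A tuple sequence keeps the order you give it and allows a key to repeat, and `doseq=True` expands a list value into one pair per element.

```python
from urllib import parse

# A sequence of two-element tuples keeps the given order and allows repeated keys.
pairs = [('id', 1), ('name', 'tom'), ('id', 2)]  # illustrative data
encoded = parse.urlencode(pairs)
print(encoded)  # id=1&name=tom&id=2

# With doseq=True, a list value expands into one key=value pair per element.
multi = parse.urlencode({'tag': ['a', 'b']}, doseq=True)  # 'tag' is a hypothetical key
print(multi)  # tag=a&tag=b
```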

# The page uses UTF-8 encoding
# https://www.baidu.com/s?wd=中
# After URL encoding, the URL above becomes:
# https://www.baidu.com/s?wd=%E4%B8%AD

from urllib import parse

u = parse.urlencode({'wd':'中'}) # encode
url = 'https://www.baidu.com/s?{}'.format(u) # append the encoded '中' to the URL
print(url)

print('中'.encode('utf-8')) # the UTF-8 bytes of '中'

print(parse.unquote(u)) # decode
print(parse.unquote(url))

Output:

https://www.baidu.com/s?wd=%E4%B8%AD
b'\xe4\xb8\xad'
wd=中
https://www.baidu.com/s?wd=中
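For a single value rather than a whole dict, `parse.quote` and `parse.quote_plus` do the same percent-encoding directly. The sketch below uses made-up sample strings (`'a/b'`, `'a b'`, the key `q`) to show two details that are easy to trip over: `quote` leaves `/` alone by default, and `urlencode` encodes a space as `+` because it uses `quote_plus` by default.

```python
from urllib import parse

quoted = parse.quote('中')            # percent-encode the UTF-8 bytes of '中'
print(quoted)                         # %E4%B8%AD
print(parse.unquote(quoted))          # 中

# quote leaves '/' unescaped by default; pass safe='' to escape it too.
print(parse.quote('a/b'))             # a/b
print(parse.quote('a/b', safe=''))    # a%2Fb

# urlencode uses quote_plus by default, so a space becomes '+', not '%20'.
print(parse.quote('a b'))             # a%20b
print(parse.quote_plus('a b'))        # a+b
print(parse.urlencode({'q': 'a b'}))  # q=a+b
```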

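Crawler submission methods
The most commonly used HTTP methods for exchanging data are GET and POST.
GET: the data is passed in the URL as the query string, so it travels in the request line at the start of the HTTP message rather than in the body.
POST: the data is submitted in the body of the HTTP message.
For form submissions, the data consists of key-value pairs joined with &, for example a=1&b=abc.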
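The difference can be checked without sending anything over the network. In this sketch, the same encoded parameters are attached once to the URL and once as the body; `Request.get_method()` reports which method urllib would use. The `http://httpbin.org/get` endpoint is an assumption chosen for illustration, and no request is actually sent.

```python
from urllib import parse
from urllib.request import Request

params = parse.urlencode({'a': 1, 'b': 'abc'})  # 'a=1&b=abc'

# GET: the parameters travel in the URL as the query string.
get_req = Request('http://httpbin.org/get?' + params)  # endpoint assumed for illustration
# POST: the same parameters travel in the body, which must be bytes.
post_req = Request('http://httpbin.org/post', data=params.encode())

# Nothing is sent here; we only inspect the Request objects.
print(get_req.get_method(), get_req.full_url)  # GET http://httpbin.org/get?a=1&b=abc
print(post_req.get_method(), post_req.data)    # POST b'a=1&b=abc'
```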

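# GET method
# Task: go to the Bing search site and obtain a search URL such as http://cn.bing.com/search?q=马哥教育
# Have the program search Bing for the keyword and save the returned result to an HTML file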
from urllib import parse
#http://cn.bing.com/search?q=马哥教育
base_url = 'http://cn.bing.com/search'

d = {
    'q':'马哥教育' 
}
u = parse.urlencode(d)
url = '{}?{}'.format(base_url, u)

print(url)
print(parse.unquote(url))

from urllib.request import urlopen, Request
ua = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'

req = Request(url, headers = {  # req is a Request object
        'User-agent':ua
        })

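res = urlopen(req)  # res is a response object
with res: # file-like object, so use a with statement
    with open('o:/bing.html', 'wb') as f: # file object, so use a with statement
        f.write(res.read())  # the with block flushes and closes the file on exit

print('Success')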

Output:

http://cn.bing.com/search?q=%E9%A9%AC%E5%93%A5%E6%95%99%E8%82%B2
http://cn.bing.com/search?q=马哥教育
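To confirm that a generated URL carries the intended parameters, it can be taken apart again. This is a small sketch using `urlsplit` and `parse_qs` on the encoded URL printed above; it only parses the string and does not contact Bing.

```python
from urllib import parse

url = 'http://cn.bing.com/search?q=%E9%A9%AC%E5%93%A5%E6%95%99%E8%82%B2'

parts = parse.urlsplit(url)          # split into scheme, netloc, path, query, fragment
print(parts.netloc)                  # cn.bing.com
print(parts.path)                    # /search

# parse_qs decodes the percent-encoding and maps each key to a list of values.
query = parse.parse_qs(parts.query)
print(query)                         # {'q': ['马哥教育']}
```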

# POST method
# http://httpbin.org/ is a test site
from urllib import parse
from urllib.request import urlopen, Request
import simplejson

url = 'http://httpbin.org/post' #POST
data = parse.urlencode({'name':'张三,@=/&*', 'age':'6'}) #body
ua = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'

req = Request(url, headers={
        'User-agent':ua
        })

print('1:',data) # data prints without a b'' prefix: it is a str, but urlopen requires POST data to be bytes or an iterable of bytes
print('2:',data.encode()) # encoding data gives bytes, which satisfies the POST data type requirement
with urlopen(req, data=data.encode()) as res:  # POST request: data must not be None (urlopen sends a GET when data is None)
    text = res.read() 
    print('3:',text) # text is the JSON response, as bytes
    d = simplejson.loads(text) # parse the JSON text
    print('4:',d)
    print('5:',type(d))  # shows that simplejson converted the JSON into a dict
    
'''
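This program completes a POST exchange as follows:
the data is submitted; whenever data is not None, urlopen issues a POST request. If the site responds, it returns some data.
Here the response is JSON, so simplejson converts it to a dict; any other JSON library would work as well.

Sometimes a site returns HTML rather than JSON, so test the exchange first and then decide how to parse the response.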
'''

Output:

1: name=%E5%BC%A0%E4%B8%89%2C%40%3D%2F%26%2A&age=6
2: b'name=%E5%BC%A0%E4%B8%89%2C%40%3D%2F%26%2A&age=6'
3: b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "age": "6", \n    "name": "\\u5f20\\u4e09,@=/&*"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "47", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36"\n  }, \n  "json": null, \n  "origin": "61.167.119.254, 61.167.119.254", \n  "url": "https://httpbin.org/post"\n}\n'
4: {'json': None, 'data': '', 'form': {'name': '张三,@=/&*', 'age': '6'}, 'url': 'https://httpbin.org/post', 'files': {}, 'args': {}, 'headers': {'Accept-Encoding': 'identity', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'Content-Length': '47', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}, 'origin': '61.167.119.254, 61.167.119.254'}
5: <class 'dict'>
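As the note in the program says, other JSON libraries work too, and a response is not always JSON. The sketch below uses the standard library's `json` module and a hypothetical helper, `parse_body`, that chooses a parser based on the Content-Type value. The sample bytes are a shortened version of the response shown above; in a real run the content type could be read from the response headers, for example with `res.headers.get('Content-Type', '')`.

```python
import json


def parse_body(raw: bytes, content_type: str):
    """Hypothetical helper: return parsed JSON for JSON bodies, else decoded text."""
    if 'json' in content_type.lower():
        return json.loads(raw.decode('utf-8'))
    # Fall back to plain text (for example HTML); undecodable bytes are replaced.
    return raw.decode('utf-8', errors='replace')


# Shortened sample of the httpbin response shown above.
sample = b'{"form": {"age": "6", "name": "\\u5f20\\u4e09,@=/&*"}}'
d = parse_body(sample, 'application/json')
print(d['form']['name'])  # 张三,@=/&*
print(type(d))            # <class 'dict'>

page = parse_body(b'<html></html>', 'text/html; charset=utf-8')
print(page)               # <html></html>
```

Keying the decision on Content-Type is only one option; trying `json.loads` and catching `json.JSONDecodeError` would be another.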