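A static web page is a page in plain HTML: it has no backend database, contains no programs, and is not interactive. Updating it is relatively cumbersome, but its data is comparatively easy to scrape.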
Library used: Requests
Requests usage example
import requests
r=requests.get("https://mp.csdn.net/mp_blog/creation/editor?spm=1011.2124.3001.6192")
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
print(r.content)  #response body as bytes; gzip- and deflate-encoded response data is decoded automatically
'''
200
text/html; charset=utf-8
utf-8
b'<!DOCTYPE html><html><head><meta charset="utf-8"><meta http-equiv=...
'''
#The encoding Requests uses can be changed; r.text is then decoded with the new encoding (r.content stays the raw bytes)
r.encoding='ISO-8859-1'
print(r.encoding)
print(r.text)
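To see why the choice of encoding matters, here is a minimal sketch that needs no network access. It decodes the same bytes with two codecs, which is roughly what happens to r.text when r.encoding is changed. The sample string is an assumption chosen for illustration.

```python
# Raw bytes as a server might send them (UTF-8 encoded text, illustrative sample).
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # the correct codec
as_latin1 = raw.decode("ISO-8859-1")   # the wrong codec: no error, but garbled text

print(as_utf8)     # café
print(as_latin1)   # cafÃ©
```

ISO-8859-1 maps every byte to a character, so the wrong codec never raises an error; the only symptom is garbled text.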
JSON (JavaScript Object Notation) represents data as a combination of objects and arrays.
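As a small illustration, the standard json module turns a JSON object into a dict and a JSON array into a list. The body string below is a made-up sample, not a real response; with Requests, r.json() performs the same kind of parsing on the response body.

```python
import json

# Illustrative JSON text: an object containing another object and an array.
body = '{"args": {"key1": "value1", "key2": "value2"}, "tags": ["a", "b"]}'

data = json.loads(body)          # JSON object -> dict, JSON array -> list
print(data["args"]["key1"])      # value1
print(data["tags"])              # ['a', 'b']
```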
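Passing URL parameters
There are two ways to build the URL.
Build it directly: http://httpbin.org/get?key1=value1
Or pass a dictionary through the params argument and let Requests build the query string: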
import requests
key_dict={'key1':'value1','key2':'value2'}
r=requests.get('http://httpbin.org/get',params=key_dict)
print(r.url)
print(r.text)
'''
http://httpbin.org/get?key1=value1&key2=value2
{
"args": {
"key1": "value1",
"key2": "value2"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.31.0",
"X-Amzn-Trace-Id": "Root=1-65a6782e-23eda628341abb5738e9de51"
},
"origin": "111.33.117.248",
"url": "http://httpbin.org/get?key1=value1&key2=value2"
}
'''
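For the direct approach, the query string can be assembled by hand. The sketch below uses urllib.parse.urlencode from the standard library, which is one possible way to do it and not something Requests requires. It produces the same URL that r.url printed above.

```python
from urllib.parse import urlencode

base_url = "http://httpbin.org/get"
key_dict = {"key1": "value1", "key2": "value2"}

# urlencode joins the pairs with "&" and percent-encodes special characters.
query = urlencode(key_dict)
url = f"{base_url}?{query}"
print(url)  # http://httpbin.org/get?key1=value1&key2=value2
```

Passing the dictionary through params is usually less error-prone, since the encoding step cannot be forgotten.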
Web page encoding
Passing allow_redirects=False disables redirect following, so the status code of the redirect itself is returned directly.
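The difference can be seen without touching a real site. The sketch below starts a throwaway local server whose /old path answers with a 302 redirect to /new, then requests it both ways. The handler class, the paths, and the port selection are all assumptions made for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"new page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

followed = requests.get(base + "/old", timeout=5)
not_followed = requests.get(base + "/old", allow_redirects=False, timeout=5)
server.shutdown()

# Default behaviour: the redirect is followed and recorded in r.history.
print(followed.status_code, [h.status_code for h in followed.history])
# With allow_redirects=False the 302 itself is returned.
print(not_followed.status_code, not_followed.headers["Location"])
```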
Customizing request headers
#Get the response headers, returned as a dictionary-like object
r=requests.get("https://mp.csdn.net/mp_blog/creation/editor?spm=1011.2124.3001.6192")
print(r.headers)
'''
{'Server': 'openresty', 'Date': 'Tue, 16 Jan 2024 12:52:37 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Tue, 09 Jan 2024 06:48:35 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=20', 'Vary': 'Accept-Encoding', 'ETag': 'W/"659cec43-1478"', 'Strict-Transport-Security': 'max-age=864000', 'Content-Encoding': 'gzip'}
'''
#Get the headers of the request that was sent
print(r.request.headers)
'''
{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
'''
Sending a POST request
Pass a dictionary to the data parameter of requests.post; it is automatically form-encoded when the request is sent.
key_dict={'key1':'value1','key2':'value2'}
r=requests.post('http://httpbin.org/post',data=key_dict)
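To inspect what "form-encoded" means without sending anything, a request can be prepared but not sent. This is a minimal sketch using requests.Request and its prepare() method; the variable names are chosen for illustration.

```python
import requests

key_dict = {"key1": "value1", "key2": "value2"}

# Build the request object and prepare it; nothing goes over the network.
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data=key_dict
).prepare()

print(prepared.body)                     # key1=value1&key2=value2
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```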
Setting a timeout
Set the timeout to 20 s; an exception is raised if it is exceeded.
requests.get('http://httpbin.org',timeout=20)
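A timeout is usually paired with exception handling. The sketch below simulates a slow site with a local server that waits one second before answering, and uses a much shorter timeout than the 20 s above so the example finishes quickly. The handler, the delay, and the timeout value are assumptions for illustration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical slow server used only for this demonstration."""

    def do_GET(self):
        time.sleep(1.0)  # simulate a slow response
        try:
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

try:
    requests.get(url, timeout=0.2)
    outcome = "ok"
except requests.exceptions.Timeout:
    # Raised when the server does not respond within the timeout.
    outcome = "timed out"
finally:
    server.shutdown()

print(outcome)
```

Without a timeout argument, Requests waits indefinitely, so a crawler can hang on a single unresponsive host.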
Accessing through a proxy
proxies={
"http":"http://10.10.1.10:3128",
"https":"http://10.10.1.10:1080",
}
#If the proxy requires a username and password
proxies={
"http":"http://user:pass@10.10.1.10:3128/",
}
requests.get("http://httpbin.org",proxies=proxies)
Disguising the request headers
headers={'User-Agent':'alexkh'}
r=requests.get('http://httpbin.org',headers=headers)
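When many requests should carry the same disguised header, one option is to set it once on a requests.Session instead of passing headers to every call. The sketch below only prepares a request, so nothing is sent; it shows that the session header replaces the default python-requests User-Agent.

```python
import requests

session = requests.Session()
# Headers set here are merged into every request made through this session.
session.headers.update({"User-Agent": "alexkh"})

prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/get")
)
print(prepared.headers["User-Agent"])  # alexkh
```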