urllib/request

Latest recommended article published on 2021-08-06 23:32:41

嗨嗨嗨2232

Category: Web crawlers

Article link: https://blog.csdn.net/haihaihai2232/article/details/100154754

urllib

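urllib is part of the standard library. It is a package that bundles the following modules:

urllib.request opens and reads URLs
urllib.error contains the exceptions raised by urllib.request
urllib.parse parses URLs
urllib.robotparser parses robots.txt files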
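
Of the four modules, urllib.robotparser is the only one the rest of this post does not revisit, so here is a minimal sketch of how it might be used. The robots.txt rules, the crawler name `my-crawler`, and the example.com URLs are all made up for illustration; the rules are parsed from a string, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of downloaded
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # feed the rule lines directly to the parser

# 'my-crawler' is an assumed user-agent name
blocked = parser.can_fetch('my-crawler', 'http://example.com/private/page.html')
allowed = parser.can_fetch('my-crawler', 'http://example.com/index.html')
print(blocked, allowed)
```

For a live site you would typically call set_url() with the site's robots.txt address and then read() instead of parse().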

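The urllib.request module

This module defines functions and classes for opening URLs (mostly HTTP), with support for basic and digest authentication, redirects, cookies, and more.
urlopen(url, data=None)

urlopen sends an HTTP GET request when data is None and returns a file-like response object.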

from urllib.request import urlopen
from http.client import HTTPResponse

response: HTTPResponse = urlopen('http://www.bing.com')  # GET, because data is None

print(response, type(response))

with response:
    print(1, response.geturl())  # the final URL, after any redirects
    print(2, response.status, response.reason)  # status code and reason phrase
    print(3, response.info())  # response headers
    print(4, response.read())  # response body as bytes
print(response.closed)  # True: leaving the with block closed the response
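
The module list at the top mentions urllib.error, and a real crawler usually needs it as soon as it calls urlopen. The following is a rough sketch of one way to wrap the call; the helper name `fetch` and its return convention are assumptions made for this example, not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Hypothetical helper: return (status, body), or (None, reason) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500
        return exc.code, exc.read()
    except URLError as exc:
        # No usable answer: DNS failure, refused connection, unknown scheme, ...
        return None, str(exc.reason)


# An unsupported scheme fails locally, without touching the network
status, detail = fetch('nosuchscheme://example')
print(status, detail)
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the URLError branch would swallow it.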

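The Request class, which lets you change the User-Agent (UA)
Request(url, data=None, headers={})
The constructor builds a request object. You can pass a dict of headers, and the data argument determines whether the request is a GET or a POST.
add_header(key, val) adds a key-value pair to the headers.

The first argument of urlopen can also be a Request instance.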

from urllib.request import urlopen, Request

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.0"
url = 'https://movie.douban.com'

request = Request(url)
request.add_header('User-Agent', ua)
print(type(request))

response = urlopen(request, timeout=20)  # accepts either a Request object or a URL string
print(type(response))

with response:
    print(1, response.geturl())  # differs from the original URL if the request was redirected
    print(2, response.info())  # response headers
    print(3, response.read())  # response body as bytes
# Request stores header names via str.capitalize(), so look up 'User-agent'
print(4, request.get_header('User-agent'))
print(5, 'user-agent'.capitalize())  # 'User-agent'
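
The last two prints above are easy to misread, so here is a small offline sketch of the same behaviour. The UA string `demo-agent/1.0` is a placeholder chosen for this example; no request is sent.

```python
from urllib.request import Request

request = Request('https://movie.douban.com')
request.add_header('User-Agent', 'demo-agent/1.0')  # placeholder UA string

# add_header() stores the name as 'User-Agent'.capitalize(), i.e. 'User-agent'
stored_name = 'User-Agent'.capitalize()
found = request.get_header(stored_name)
missing = request.get_header('User-Agent')  # exact-case lookup finds nothing

print(stored_name, found, missing)
```

The lookup is case-sensitive, which is why the original snippet asks for 'User-agent' rather than 'User-Agent'.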

The urllib.parse module

This module can encode and decode URLs.

from urllib import parse

u = parse.urlencode({
    'url': 'http://www.magedu.com/python',
    'url_p': 'http://www.magedu.com/python?id=1&name=张三'
})
print(u)
# Output:
# url=http%3A%2F%2Fwww.magedu.com%2Fpython&url_p=http%3A%2F%2Fwww.magedu.com%2Fpython%3Fid%3D1%26name%3D%E5%BC%A0%E4%B8%89

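The value after each % sign is written in hexadecimal. The path portion of a URL generally does not need Chinese characters, but the query parameters may contain slashes, question marks, equals signs, and similar symbols that are meant as data rather than as delimiters. If they were sent to the server unchanged, the receiver could not tell which characters are delimiters and which are data. For safety, such characters are URL-encoded, and Chinese characters are encoded the same way: the text is first converted to a byte sequence using the chosen character encoding, and each byte is then written as a two-digit hexadecimal string prefixed with %.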
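
To make the byte-by-byte rule concrete, here is a small sketch that encodes a string by hand and compares the result with urllib.parse.quote. The helper `percent_encode` is written only for this illustration and encodes every byte, whereas quote leaves unreserved ASCII characters untouched.

```python
from urllib.parse import quote, unquote


def percent_encode(text, encoding='utf-8'):
    """Illustrative helper: write every byte of text as %XX."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode(encoding))


name = '张三'
manual = percent_encode(name)   # two characters, three UTF-8 bytes each
library = quote(name)           # the standard-library equivalent
decoded = unquote(manual)       # back to the original text

print(manual, library, decoded)
```

The encoded form matches the tail of the urlencode output shown earlier.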

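Submission methods

The most common HTTP methods for exchanging data are GET and POST.
With GET, the data travels in the URL, so it sits in the header section of the HTTP message (in the request line) rather than in the body.
With POST, the data is submitted in the body of the HTTP message.
In both cases the data is a set of key-value pairs, with multiple parameters joined by the & symbol, for example a=1&b=abc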
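
The difference between the two methods can be sketched without sending anything, because urllib.request.Request picks the method from its data argument. The address http://example.com/search is a placeholder used only for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'a': 1, 'b': 'abc'}
query = urlencode(params)  # key-value pairs joined with &

# GET: the encoded pairs are appended to the URL
get_request = Request('http://example.com/search?' + query)

# POST: the same pairs go into the body, which must be bytes
post_request = Request('http://example.com/search', data=query.encode('ascii'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either object to urlopen would send the corresponding request.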

The GET method
Connect to the Bing search engine, build a search URL, and fetch the results page

from urllib import parse
from urllib.request import Request, urlopen

baseurl = 'http://cn.bing.com/search'

keyword = input('>>>')
data = parse.urlencode({'q': keyword})  # encode the search term for the query string
url = '{}?{}'.format(baseurl, data)

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.0"

request = Request(url, headers={'User-agent': ua})
response = urlopen(request)

with response:
    # Save the raw bytes of the results page to disk
    with open('E:/mage.html', 'wb') as f:
        html = response.read()
        f.write(html)
print('Done')

Handling JSON data

Data rendered through AJAX is usually obtained by sending a GET or POST request to an endpoint that returns JSON.

from urllib import parse  # needed for urlencode below
from urllib.request import Request, urlopen

baseurl = 'https://movie.douban.com/j/search_subjects'

data = parse.urlencode({
    'type': 'movie',
    'tag': '热门',
    'page_limit': 50,
    'page_start': 0
})
url = '{}?{}'.format(baseurl, data)

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.0"
# /j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0
request = Request(url, headers={'User-agent': ua})
response = urlopen(request)

with response:
    # The body is JSON text; it is saved as raw bytes here
    with open('E:/mage.html', 'wb') as f:
        html = response.read()
        f.write(html)
print('Done')
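
The snippet above only saves the raw bytes. To work with the data itself, the body can be decoded with the standard json module. The sketch below uses a made-up payload in place of response.read(); the `subjects`, `title`, and `rate` keys are assumptions for illustration and may not match what the server actually returns.

```python
import json

# Hypothetical body standing in for response.read(); the real shape may differ
body = b'{"subjects": [{"title": "Example Movie", "rate": "8.0"}]}'

payload = json.loads(body.decode('utf-8'))  # bytes -> str -> Python objects
titles = [item['title'] for item in payload['subjects']]

print(titles)
```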

requests

requests is built on urllib3 but wraps it in a friendlier API. Install it with: $ pip install requests

import requests
from requests.models import Response

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.0"
url = 'http://movie.douban.com/'

# The header name must be 'User-Agent' for the server to see the custom UA
response: Response = requests.request('GET', url, headers={'User-Agent': ua})
with response:
    print(1, response.content[:100])  # raw body as bytes
    print(2, response.text[:100])  # body decoded to str
    print(response.status_code)
    print(response.headers)  # response headers
    print(response.request.headers)  # request headers

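Internally, requests sends every call through a Session object. Reusing one Session across calls preserves session state, such as cookies, over multiple exchanges with the server.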

import requests

ua = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.0"
urls = ['https://www.baidu.com/s?wd=magedu', 'https://www.baidu.com/s?wd=magedu']
session = requests.Session()

with session:
    for url in urls:
        # The same session is reused, so cookies from the first response carry over
        response = session.get(url, headers={'User-Agent': ua})
        # response = requests.request('GET', url, headers={'User-Agent': ua})
        with response:
            print(response.headers)
            print(response.cookies)
On the second visit, the session sends the cookies it received earlier along with the request.
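
The cookie behaviour can be checked offline by asking the session to prepare a request without sending it. In this sketch the cookie name and value (`sessionid`, `abc123`) and the UA string are invented; the cookie is set by hand to stand in for one received from an earlier response.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'demo-agent/1.0'})  # placeholder UA string

# Stand-in for a cookie that an earlier response would have set
session.cookies.set('sessionid', 'abc123')

request = requests.Request('GET', 'https://www.baidu.com/s?wd=magedu')
prepared = session.prepare_request(request)  # merges session state; sends nothing

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

Headers set on the session work the same way, so the User-Agent only needs to be configured once instead of being passed on every call.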
