Web Scraping with the urllib Module

Latest recommended article published on 2024-08-15 01:55:01

m0_62213025

Tags: python

Article link: https://blog.csdn.net/m0_62213025/article/details/122685681

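urllib is Python's built-in HTTP request library, used to fetch web page content. Its main submodules are:

urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module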

A simple GET request

import urllib.request
response = urllib.request.urlopen('http://baidu.com')
print(response.read().decode('utf-8'))

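decode() converts bytes to str; encode() converts str to bytes.

urllib.request.urlopen() opens a URL and returns a file-like object (an http.client.HTTPResponse for HTTP URLs), which supports the usual file-object operations.

The object returned by urlopen() provides these methods:

- read(), readline(), readlines(), fileno(), close(): used the same way as on a file object

- info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server

- getcode(): returns the HTTP status code; for an HTTP request, 200 means the request succeeded and 404 means the URL was not found (with the default opener, a 404 is raised as urllib.error.HTTPError rather than returned)

- geturl(): returns the URL of the resource actually retrieved, which differs from the requested URL if a redirect was followed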
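
To try the methods listed above without depending on a live website, the sketch below opens a data: URL, which urllib.request handles locally. The helper name summarize_response and the sample URL are illustrative assumptions, not part of the original article; for a real HTTP URL, getcode() would return the status code instead of None.

```python
import urllib.request


def summarize_response(url):
    """Open url and collect what the response object exposes."""
    with urllib.request.urlopen(url) as response:
        return {
            "code": response.getcode(),    # None for non-HTTP schemes
            "url": response.geturl(),      # URL actually retrieved
            "content_type": response.info().get_content_type(),
            "body": response.read().decode("utf-8"),
        }


# A data: URL carries its own content, so no network access is needed.
summary = summarize_response("data:text/plain;charset=utf-8,hello%20urllib")
print(summary)
```

Swapping in an http:// URL should exercise the same calls against a real server.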

A simple POST request

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'hello': 'world'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

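The data argument must be a bytes object (a byte stream), not a str. Passing data makes urlopen() send a POST instead of a GET.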
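
The bytes requirement and the switch from GET to POST can be checked offline by building Request objects without sending them. This is a minimal sketch; the form fields are made up for illustration.

```python
from urllib import parse, request

form = {"hello": "world", "lang": "python"}   # hypothetical form fields

encoded = parse.urlencode(form)   # a str such as 'hello=world&lang=python'
data = encoded.encode("utf-8")    # bytes, which is what urlopen()/Request expect

get_req = request.Request("http://httpbin.org/get")
post_req = request.Request("http://httpbin.org/post", data=data)

print(type(encoded).__name__, type(data).__name__)
# The HTTP method is inferred from whether data is present.
print(get_req.get_method(), post_req.get_method())
```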

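from urllib import parse
# url = 'xxx'
a = parse.quote('文字', encoding='gbk')  # percent-encode the text using GBK
# url = url + a

A property of URLs: a URL must not contain non-ASCII characters, so such data has to be percent-encoded first with parse.quote() from the parse module.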
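
As a rough illustration of how the encoding argument changes the result, the sketch below quotes the same text as UTF-8 (the default) and as GBK, then reverses both with unquote(). Which encoding a given site expects is something you would need to check yourself.

```python
from urllib import parse

text = "文字"

utf8_quoted = parse.quote(text)                  # UTF-8 is the default encoding
gbk_quoted = parse.quote(text, encoding="gbk")   # some older sites expect GBK

print(utf8_quoted)
print(gbk_quoted)

# unquote() reverses the process when given the same encoding.
utf8_back = parse.unquote(utf8_quoted)
gbk_back = parse.unquote(gbk_quoted, encoding="gbk")
print(utf8_back == text, gbk_back == text)

# quote() leaves "/" alone by default; pass safe="" to encode it as well.
print(parse.quote("a/b c"), parse.quote("a/b c", safe=""))
```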

Timeout handling

import urllib.request 
response = urllib.request.urlopen('http://httpbin.org/get',timeout=1) 
print(response.read())

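The timeout parameter (in seconds) is useful when the network is poor or the server is misbehaving: if the request takes longer than the limit, an exception is raised.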

import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the cause of the error
        print('time out!')
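
One caveat with the pattern above: in CPython, a timeout that occurs while waiting for the response (rather than while connecting) can surface as socket.timeout directly instead of being wrapped in URLError. The helper below is a sketch that handles both cases; the name fetch_with_timeout is hypothetical.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts raised while connecting or sending arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is passed on to the caller
    except socket.timeout:
        # Timeouts raised while waiting for the response are not wrapped.
        return None
```

A call such as fetch_with_timeout('http://httpbin.org/get', 1) would then return either the body bytes or None.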

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)               # status code, to check whether the request succeeded
print(response.getheaders())         # response headers, as a list of (name, value) tuples
print(response.getheader('Server'))  # one specific response header
print(response.read().decode('utf-8'))  # response body; it is bytes, so decode it as UTF-8

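The Request object

Calling urlopen() with a bare URL string only covers simple requests, because there is no way to add headers. To send custom headers, build a Request object and pass it to urlopen().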

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Host': 'httpbin.org'
}
form = {'name': 'jay'}  # form fields to send; avoids shadowing the built-in dict
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Create the URL string and the header and form dictionaries separately, then pass them to the Request object.

from urllib import request, parse
url = 'http://httpbin.org/post'
form = {'name': 'cq'}
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Alternatively, headers can be added one at a time to an existing Request object by calling add_header() repeatedly.

req.add_header(key, value) adds a single header.
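
To see what add_header() actually stores, a Request can be inspected before it is sent. The sketch below runs offline; the User-Agent string is a made-up placeholder.

```python
from urllib import request

req = request.Request("http://httpbin.org/post", data=b"name=cq", method="POST")
req.add_header("user-agent", "my-crawler/0.1")   # hypothetical UA string
req.add_header("Accept-Language", "en-US")

# Request stores header names capitalized: 'user-agent' becomes 'User-agent'.
print(req.header_items())
print(req.get_header("User-agent"))
print(req.has_header("Accept-language"))
```

Because of this normalization, look headers up with the capitalized form when calling get_header() or has_header().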

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
for item in cookie:
    print(item.name + '=' + item.value)

Print the cookies received with the response.

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
cookie.save(ignore_discard=True, ignore_expires=True)  # write cookie.txt


import http.cookiejar, urllib.request
filename = 'cookie2.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')  # the cookie jar is filled in automatically once the response arrives
cookie.save(ignore_discard=True, ignore_expires=True)  # write cookie2.txt

Cookies can be saved to a file in two formats: the Mozilla/Netscape cookies.txt format (MozillaCookieJar) and the libwww-perl Set-Cookie3 format (LWPCookieJar).

import http.cookiejar, urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Loading cookies from a text file in this way lets a later session keep its login state.
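
The save and load steps can be tried without any network access by placing a cookie in the jar by hand. This is only a sketch: the cookie name, value, and domain are invented, and in real use the server sets the cookies for you.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("session", "abc123", "example.com"))
# Session cookies are skipped on save unless ignore_discard=True.
jar.save(ignore_discard=True, ignore_expires=True)

restored = http.cookiejar.MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
loaded = {c.name: c.value for c in restored}
print(loaded)
```

The ignore_discard flag matters on both sides: without it, session cookies like this one are neither written nor read back.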

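Exception handling

# Parent class: URLError only carries a reason
from urllib import request, error
try:
    response = request.urlopen('http://www.bai.com/index.html')
except error.URLError as e:
    print(e.reason)

# Subclass: HTTPError carries more attributes
from urllib import request, error
try:
    response = request.urlopen('http://abc.123/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:  # non-HTTP failures such as DNS errors end up here
    print(e.reason)

For exception handling, the key point is that there are two exception types, HTTPError and URLError, and that HTTPError is a subclass of URLError, so HTTPError should be caught first.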
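
The parent/subclass relationship can be confirmed without making a request. In the sketch below the exceptions are constructed by hand purely for illustration (urlopen() would normally raise them), and describe_error is a hypothetical helper.

```python
import email.message
from urllib import error


def describe_error(exc):
    """Summarize a urllib exception, checking the subclass first."""
    if isinstance(exc, error.HTTPError):   # the server answered with an error status
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):    # no usable answer (DNS failure, refused, ...)
        return "URL error: %s" % exc.reason
    return "other: %r" % exc


# Hand-built exceptions for illustration only.
http_exc = error.HTTPError("http://example.com/missing", 404, "Not Found",
                           email.message.Message(), None)
url_exc = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))
print(describe_error(http_exc))
print(describe_error(url_exc))
```

If the two isinstance checks were swapped, every HTTPError would be reported as a plain URL error, which is why the subclass is tested first.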

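URL handling

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)  # scheme, netloc, path, params, query, fragment
print(type(result))

from urllib.parse import urlparse
# The URL has no scheme, so scheme='https' is used as the default
# (without a leading // the host ends up in path, not netloc)
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
# The URL already has a scheme, so the scheme argument is ignored
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
# With allow_fragments=False the fragment is folded into the query
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

from urllib.parse import urlparse
# With no query present, the fragment is folded into the path instead
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

Parsing: urlparse() breaks a URL into its components.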
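
The value returned by urlparse() is a named tuple, so its parts can be read by name or by index. The sketch below also uses parse_qs() to split the query string; the extra tag parameters are added here only to show repeated keys.

```python
from urllib.parse import parse_qs, urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5&tag=a&tag=b#comment")

# Fields can be read by name or by position.
print(result.scheme, result.netloc, result.path)
print(result[0] == result.scheme)

# hostname and port are derived from netloc.
print(result.hostname, result.port)

# parse_qs() turns the query string into a dict of lists.
query = parse_qs(result.query)
print(query)
```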

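from urllib.parse import urlunparse
# Six components: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))


from urllib.parse import urljoin
# Join two URLs.
# The second URL takes priority: components it has are kept, and those it lacks are filled in from the first.
print(urljoin('http://www.baidu.com', 'HAA.HTML'))
print(urljoin('https://wwww.baidu.com', 'https://www.baidu.com/index.html;question=2'))

Building and joining URLs: urlunparse() assembles a URL from its six parts, and urljoin() resolves one URL against another.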
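
A few more urljoin() cases show how much of the base URL survives, which is what a crawler relies on when resolving links found in a page. The base URL and links below are illustrative.

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/a/b.html"

cases = {
    "c.html": urljoin(base, "c.html"),                     # relative to the base directory
    "/c.html": urljoin(base, "/c.html"),                   # relative to the site root
    "?page=2": urljoin(base, "?page=2"),                   # replaces only the query
    "//example.com/x": urljoin(base, "//example.com/x"),   # keeps only the scheme
}
for link, absolute in cases.items():
    print(link, "->", absolute)
```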

# Convert a dictionary directly into URL query parameters
from urllib.parse import urlencode
params = {'name': 'germey', 'age': '122'}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
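
urlencode() also takes care of spaces, non-ASCII text, and list values. The sketch below uses made-up parameters and a hypothetical /s path to show this, and decodes the result again with parse_qs().

```python
from urllib.parse import parse_qs, urlencode

params = {"name": "germey", "q": "web 爬虫", "tag": ["a", "b"]}

# Spaces become "+", non-ASCII text is percent-encoded as UTF-8,
# and doseq=True expands a list into repeated keys.
query = urlencode(params, doseq=True)
url = "http://www.baidu.com/s?" + query
print(url)

# parse_qs() decodes the query string back into a dict of lists.
decoded = parse_qs(query)
print(decoded)
```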