Web Scraping: Using urllib in Python

真你假我

Article link: https://blog.csdn.net/zhangzejia/article/details/79644497

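1. Introduction to urllib

urllib is Python's built-in HTTP request library. It includes the following modules:

urllib.request: sends HTTP requests

urllib.error: exception classes raised by urllib.request

urllib.parse: URL parsing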
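
urllib.parse needs no network connection, so it is an easy module to try first. The following is a minimal sketch, assuming a hypothetical helper named split_url (it is not part of urllib) and a Baidu search URL as sample input:

```python
from urllib.parse import urlparse, parse_qs

def split_url(url):
    """Break a URL into its main components (hypothetical helper)."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        # parse_qs maps each query key to a list of values
        "query": parse_qs(parts.query),
    }

info = split_url("http://www.baidu.com/s?wd=hello")
print(info)
```

Each query value comes back as a list because a key may appear more than once in a query string.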

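2. urllib.request.urlopen()

import urllib.request

url = "http://www.baidu.com"
# urllib.request.urlopen() sends a basic HTTP request, much like a browser fetching a page.
# Parameters: url (the address to fetch); data (the request body, usually None;
# if given, the request is sent as a POST); timeout (in seconds).
response = urllib.request.urlopen(url)
# Three ways to read the body (the response is a stream, so it can only be consumed once):
# (1) read(): reads the whole body and returns it as a bytes object
# (2) readline(): reads a single line; handy for a quick check that the request worked
# (3) readlines(): reads the whole body and returns a list of bytes objects, one per line
# data = response.read()
# data1 = response.readlines()
data2 = response.readline()
print(data2)
# Status code of the response; 200 means success
# print(response.getcode())
# Response headers sent by the server
# print(response.info())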
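
read(), readline(), and readlines() all return bytes rather than str, so the body usually has to be decoded before it can be treated as text. The sketch below picks the charset from the Content-Type header and falls back to UTF-8. decode_body is a hypothetical helper, and the header object is built by hand so that the example runs offline; with a live response you would pass response.headers instead.

```python
from email.message import Message

def decode_body(raw, headers, default="utf-8"):
    """Decode response bytes using the charset declared in the headers."""
    # get_content_charset() returns None when no charset is declared
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors="replace")

# Hand-built headers standing in for response.headers (assumption for the demo)
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"

text = decode_body("智能".encode("gbk"), headers)
print(text)
```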

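3. Clearing the cache, encoding, and decoding

import urllib.request
import urllib.parse

url = "http://www.baidu.com"
response = urllib.request.urlopen(url)
# Clean up temporary files that earlier urlretrieve() calls may have left behind
urllib.request.urlcleanup()
# Encode: percent-encode reserved characters, e.g. the colon becomes %3A
ur = urllib.parse.quote(url)
print(ur)
# Decode: turn %3A back into a colon
ur1 = urllib.parse.unquote(ur)
print(ur1)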
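
Which characters quote() escapes depends on its safe parameter, which defaults to "/". A short sketch of the difference, using the same Baidu URL and a Chinese keyword as sample inputs:

```python
from urllib.parse import quote, unquote

url = "http://www.baidu.com"

# Default safe="/" leaves slashes alone but escapes the colon
encoded_default = quote(url)
# Adding ":" to safe keeps the whole URL unchanged
encoded_safe = quote(url, safe=":/")
# Non-ASCII text is converted to UTF-8 bytes, then percent-encoded
encoded_cn = quote("智能")

print(encoded_default)
print(encoded_safe)
print(encoded_cn)
print(unquote(encoded_cn))
```

In practice it is usually safer to quote only the individual query values rather than a whole URL, so the scheme and separators are never escaped.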

4. Modifying HTTP headers with urllib.request.build_opener()

# On its own, urllib.request.urlopen() offers no direct way to configure authentication, cookies,
# proxies, or other advanced HTTP features. For those, use urllib.request.build_opener() to create
# a custom opener object. Its open() method works like urllib.request.urlopen(). For example,
# to set the User-Agent header so the request looks like it comes from a browser:
import urllib.request

url = 'http://www.sina.com.cn/'
header = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [header]
file = opener.open(url)
data = file.read()
print(data)
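
Cookies are one of the features mentioned above that need a custom opener. A minimal sketch, assuming a hypothetical helper named make_browser_like_opener and a shortened User-Agent string; no request is sent, so it runs offline:

```python
import http.cookiejar
import urllib.request

def make_browser_like_opener(user_agent):
    """Build an opener that stores cookies and sends a custom User-Agent."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor saves cookies from responses into the jar
    # and sends them back on later requests made through this opener
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

opener, jar = make_browser_like_opener("Mozilla/5.0")
print(opener.addheaders)
print(len(jar))  # number of cookies stored so far
```

After calling opener.open(url) against a site that sets cookies, iterating over jar would list them.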

5. Modifying HTTP headers and Host with urllib.request.Request()

# urlopen() can send a simple request from a URL, but its few parameters cannot describe a complete request.
# To add headers such as User-Agent and Host, wrap the request in a Request object.
import urllib.request

url = 'http://www.sina.com.cn'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
    'Host': 'www.sina.com.cn'
}
req = urllib.request.Request(url=url, headers=header)
response = urllib.request.urlopen(req)
html = response.read()
print(response.getcode())
# read() returns bytes, so open the file in binary mode and write the bytes as they are
with open("1.html", 'wb') as f:
    f.write(html)
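
A Request object can be inspected before it is sent, which helps confirm that the headers are what you expect. A small offline sketch, with a shortened User-Agent value used purely for illustration:

```python
import urllib.request

req = urllib.request.Request(
    url="http://www.sina.com.cn",
    headers={"User-Agent": "Mozilla/5.0", "Host": "www.sina.com.cn"},
)

# No data was supplied, so the method defaults to GET
print(req.get_method())
print(req.full_url)
# Header names are stored via str.capitalize(), so look them up as "User-agent"
print(req.get_header("User-agent"))
print(req.header_items())
```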

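6. Using a proxy

# By default urllib picks up the HTTP proxy from the http_proxy environment variable.
# Some sites count how many requests an IP makes within a time window and block it when there are too many.
# Routing requests through proxy servers, and switching to another proxy every so often, is one way around that.
import urllib.request

def use_proxy(proxy_addr, url):
    # Create the proxy handler
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    # Create an opener that uses it
    opener = urllib.request.build_opener(proxy)
    # Install it globally so that urlopen() also goes through the proxy
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data

# Proxy addresses are normally written as "host:port"
proxy_addr = '122.114.31.177'
url = "http://news.sina.com.cn/"
data = use_proxy(proxy_addr, url)
print(data)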
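
Switching proxies periodically can be sketched with a generator that hands out one opener per proxy in turn. This is only an illustration: opener_cycle is a hypothetical helper, the addresses in PROXIES are local placeholders rather than working proxies, and the openers are used directly instead of being installed globally.

```python
import itertools
import urllib.request

# Placeholder addresses for illustration only; replace with proxies you control
PROXIES = ["127.0.0.1:8080", "127.0.0.1:8081"]

def opener_cycle(proxies):
    """Yield (address, opener) pairs, looping over the proxies forever."""
    for addr in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": addr})
        yield addr, urllib.request.build_opener(handler)

openers = opener_cycle(PROXIES)
addr, opener = next(openers)
print(addr)
# A real fetch through the current proxy would look like:
# html = opener.open("http://news.sina.com.cn/", timeout=10).read()
```

Calling next(openers) before each batch of requests moves on to the next proxy without touching the global urlopen() behaviour.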

7. GET requests

# Query made of ASCII characters
import urllib.request
url = "http://www.baidu.com/s?wd="
keyword = "hello"
url_all = url + keyword
req = urllib.request.Request(url_all)
file = urllib.request.urlopen(req)
data = file.read()
print(data)
# A query containing Chinese characters must be percent-encoded first
import urllib.request
import urllib.parse
url = "http://www.baidu.com/s?wd="
keyword = "智能"
key_code = urllib.parse.quote(keyword)
url_all = url + key_code
req = urllib.request.Request(url_all)
file = urllib.request.urlopen(req)
print(file.read())
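
When a query has more than one parameter, urllib.parse.urlencode() can build and escape the whole query string in one step. A minimal sketch; the pn parameter is an assumption added only to show a second key:

```python
from urllib.parse import urlencode

base = "http://www.baidu.com/s"
# "pn" is a made-up second parameter for illustration
params = {"wd": "智能", "pn": 10}

# urlencode() percent-encodes each value and joins the pairs with "&"
url_all = base + "?" + urlencode(params)
print(url_all)
```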

8. POST requests

# POST request; the form data is encoded with urllib.parse
import urllib.request
import urllib.parse
url = "http://www.iqianyue.com/mypost/"
# urlencode() turns the dict into a form-encoded string; encode() then converts it to UTF-8 bytes
postdata = urllib.parse.urlencode({
    'name': 'zhang',
    'pass': 'dfajlka'
}).encode("utf-8")
# Passing data to Request makes it a POST request
req = urllib.request.Request(url, postdata)
file = urllib.request.urlopen(req)
data = file.read()
print(data)
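
The effect of supplying data can be checked without sending anything. The sketch below builds a request against the same URL and inspects it offline; the form field names are assumptions about what the target page expects.

```python
import urllib.parse
import urllib.request

# Field names are assumed for illustration
form = {"name": "zhang", "pass": "dfajlka"}
postdata = urllib.parse.urlencode(form).encode("utf-8")

req = urllib.request.Request("http://www.iqianyue.com/mypost/", data=postdata)

# Supplying data switches the default method from GET to POST
print(req.get_method())
print(req.data)
# Decoding the body again confirms what the server would receive
print(urllib.parse.parse_qs(postdata.decode("utf-8")))
```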

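9. Timeouts and exception handling

## Causes of URLError
# 1. The server cannot be reached
# 2. The remote URL does not exist
# 3. No network connection
# 4. An HTTPError was raised (HTTPError is a subclass of URLError)
#############################################
# Common HTTP status codes:
# 301 moved permanently to a new URL
# 302 redirected temporarily to another URL
# 304 requested resource not modified
# 400 bad request
# 403 forbidden
# 500 internal server error
# 501 server does not support the functionality the request needs
import urllib.request
import urllib.error
try:
    url = "https://www.csdns.net/"
    file = urllib.request.urlopen(url)
# Catch HTTPError first, because it is a subclass of URLError
except urllib.error.HTTPError as e:
    print(e.code, e.reason)
except urllib.error.URLError as e:
    print(e.reason)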
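
The two exception types carry different information: HTTPError has a status code, while a plain URLError only has a reason. The offline sketch below constructs both by hand to show the difference; describe_error is a hypothetical helper and the example.com URL is a placeholder.

```python
import urllib.error

def describe_error(exc):
    """Return a short description of a urllib exception."""
    # Test HTTPError first: it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    return "other error: %s" % exc

# Exceptions built by hand so no network access is needed
http_err = urllib.error.HTTPError("http://example.com/", 404, "Not Found", None, None)
url_err = urllib.error.URLError("no network")

print(describe_error(http_err))
print(describe_error(url_err))
```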

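# Use timeout to set a time limit, in seconds
import urllib.request
for i in range(1, 51):
    try:
        # 0.005 s is very short, so many of these attempts are likely to time out
        file = urllib.request.urlopen("http://www.sina.com.cn", timeout=0.005)
        data = file.read()
        print(data)
    except Exception as e:
        print('Exception: ' + str(e))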
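
A timeout is often paired with a bounded number of retries instead of a fixed loop. The following is a rough sketch under stated assumptions: fetch_with_retry is a hypothetical helper, the default values are arbitrary, and the opener parameter exists only so a stand-in function can replace urlopen() in tests.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, timeout=5, delay=1.0,
                     opener=urllib.request.urlopen):
    """Fetch url, retrying on network errors up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        # URLError and socket timeouts are both subclasses of OSError
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    raise last_error

# Example call (needs network access):
# html = fetch_with_retry("http://www.sina.com.cn", attempts=3, timeout=5)
```

If every attempt fails, the last exception is re-raised so the caller can still decide how to handle it.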
