Concepts:
urllib is Python's built-in HTTP request library. It contains four modules:
request: sends requests
error: handles exceptions
parse: utility module for URL handling
robotparser: parses robots.txt to determine which pages may be crawled (rarely used)
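As a quick illustration of the rarely used robotparser module, the sketch below parses a hypothetical robots.txt (the rules and URLs are invented for the demo) fed in as text, so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as text instead of
# being fetched over the network
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Rules are matched in order: /private/ pages are blocked,
# everything else is allowed
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/a'))   # False
```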
Sending Requests
-
1.1 urlopen()
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
```
With just two lines of code, we printed the source of the page, including its links, image addresses, and text.
Next, let's use type() to see what urlopen() returns:

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
```

Output:

<class 'http.client.HTTPResponse'>
This is an HTTPResponse object. It provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), as well as attributes such as msg, status, reason, debuglevel, and closed, so we can read different pieces of information by calling the appropriate method or attribute.
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Strict-Transport-Security'))
```

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48940'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 15 Jan 2019 10:28:43 GMT'), ('Via', '1.1 varnish'), ('Age', '212'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2134-IAD, cache-tyo19929-TYO'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 404'), ('X-Timer', 'S1547548123.453413,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
max-age=63072000; includeSubDomains
The above is a basic urllib request. We can also pass extra parameters to urlopen(). Its full signature is:

```python
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```
-
The data parameter
Here we pass a parameter word whose value is hello. The data must be a bytes object, so we use urlencode() to convert the dict into a query string and then encode it as UTF-8.

```python
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read())
```

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "119.4.133.18",
  "url": "http://httpbin.org/post"
}
-
The timeout parameter
Sets a timeout in seconds: the maximum time to wait for the page to respond before an exception is raised.
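A sketch of the timeout behaviour, using a throwaway local server that deliberately responds too slowly (the server, port, and delay are all invented for this demo), so no external site is involved. Depending on where the timeout strikes, urllib raises either urllib.error.URLError or socket.timeout:

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)              # respond slower than the client's timeout
        try:
            self.send_response(200)
            self.end_headers()
        except OSError:
            pass                   # client already gave up
    def log_message(self, *args):  # silence request logging
        pass

# Throwaway local server on a random free port, in a daemon thread
server = http.server.HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

timed_out = False
try:
    urllib.request.urlopen(url, timeout=0.5)
except (urllib.error.URLError, socket.timeout):
    timed_out = True
print('TIME OUT' if timed_out else 'OK')  # TIME OUT
```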
-
1.2 Request
Basic usage:

```python
import urllib.request

request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read())
```
We still send the request with urlopen(), but pass a Request object instead of a bare URL. The Request constructor has the following signature:

```python
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```
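A sketch of constructing a Request with data, headers, and method, without actually sending it (the httpbin.org URL and the header values are just placeholders):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form data must be a bytes object, as with urlopen()
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Host': 'httpbin.org'
}
req = Request('http://httpbin.org/post', data=data, headers=headers, method='POST')

print(req.get_method())              # POST
print(req.get_header('User-agent'))  # Mozilla/5.0 (header keys are stored capitalized)
print(req.data)                      # b'name=germey'
```

Passing the finished req to urlopen(req) would then send it exactly as in the example above.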
Exception Handling
- URLError
Has a reason attribute giving the cause of the error.
-
HTTPError
A subclass of URLError that handles HTTP request errors specifically.
It has three attributes: code (the HTTP status code), reason, and headers (the response headers).
To sum up, here is a practical example (catch the subclass error first, then the parent class):

```python
from urllib import request, error

try:
    response = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('request successfully')
```
URL Parsing
-
urlparse()
Identifies a URL and splits it into its parts:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=6#comment')
print(result)
```

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=6', fragment='comment')
The standard format of a URL (6 parts):

scheme://netloc/path;params?query#fragment

scheme: protocol
netloc: domain name
path: access path
params: parameters for the last path segment
query: query conditions
fragment: anchor, used to jump to a position within the page
For example: http://www.baidu.com/index.html;user?id=6#comment
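Since urlparse() returns a named tuple (ParseResult), each of the six parts can be read either by attribute name or by index:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=6#comment')

# By attribute and by position - both refer to the same field
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com
print(result.query)              # id=6
print(result.fragment)           # comment
```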
-
urlunparse()
The inverse of urlparse(): it builds a URL from a sequence of its 6 parts.

```python
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
result = urlunparse(data)
print(result)
```

Output:

http://www.baidu.com/index.html;user?id=6#comment
-
urlencode()
This method is very useful for building GET requests.
For example:

```python
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
```

Output:

http://www.baidu.com?name=germey&age=22

We first declare the parameters as a dict, then use urlencode() to serialize them into a GET query string.
- quote
Converts content to URL-encoded form, e.g. turning Chinese characters into percent-encoded URL text.
```python
from urllib.parse import quote

keyword = '美食'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
```

Output:

https://www.baidu.com/s?wd=%E7%BE%8E%E9%A3%9F
- unquote
Decodes URL-encoded content. Let's decode the output from above:
```python
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E9%A3%9F'
print(unquote(url))
```

Output:

https://www.baidu.com/s?wd=美食
References
Cui Qingcai, Python 3 Web Crawler Development in Practice (《Python3网络爬虫开发实战》)