Python Web Scraping - Using urllib (Part 2)

Category: Python web scraping

Article link: https://blog.csdn.net/Venusing1998/article/details/86765313

The previous article used urlopen() to open an HTTP page. But how do we get information such as the response headers and the status code?

Checking the type

What does request.urlopen() actually return?

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')
print(type(response))

The output is as follows:

<class 'http.client.HTTPResponse'>

So it is an http.client.HTTPResponse object.

Let's see which methods and attributes it has.

print(dir(response))

The output is as follows:

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']

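Setting aside the special attributes and methods that every object has, this class has its own methods and attributes, such as begin and chunk_left.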
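The dir() listing is long, so it can help to hide the underscore-prefixed names. The following is a minimal sketch of that idea; public_names is a hypothetical helper introduced here for illustration, not part of urllib or the original article.

```python
import http.client


def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


# Inspecting the class itself needs no network access. A live response object
# additionally lists per-instance attributes such as status and headers.
names = public_names(http.client.HTTPResponse)
print(names)
```

Calling public_names(response) on the object returned by urlopen() should give the same kind of list for a real response.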

Getting the headers

Among them, the headers attribute and the getheader() and getheaders() methods give access to the response headers.

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')
print(response.headers)
print(response.getheaders())

The output is as follows:

Connection: close
Server: gunicorn/19.9.0
Date: Tue, 05 Feb 2019 04:48:55 GMT
Content-Type: application/json
Content-Length: 236
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Via: 1.1 vegur


[('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Date', 'Tue, 05 Feb 2019 04:48:55 GMT'), ('Content-Type', 'application/json'), ('Content-Length', '236'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true'), ('Via', '1.1 vegur')]

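The first approach prints the headers as text (response.headers is an http.client.HTTPMessage object, and print() shows its string form); the second returns a list of (name, value) tuples.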
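getheader() is listed above but not demonstrated, so here is a rough sketch of reading a single header. To keep it runnable without internet access, it assumes a small local server in place of httpbin.org: DemoHandler and fetch_header_views are hypothetical names made up for this example, and the opener with an empty ProxyHandler is only there so the local request bypasses any proxy set in the environment.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for httpbin.org so the sketch runs offline."""

    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the server's per-request log lines


def fetch_header_views():
    """Fetch one page from the local server and read its headers several ways."""
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler keeps the request to 127.0.0.1 away from any proxy
    # configured in the environment; opener.open() returns the same kind of
    # object as urllib.request.urlopen().
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = 'http://127.0.0.1:%d/get' % server.server_address[1]
        with opener.open(url) as response:
            return {
                'headers_type': type(response.headers).__name__,
                'by_method': response.getheader('Content-Type'),
                'by_index': response.headers['content-type'],  # case-insensitive
                'missing': response.getheader('X-Missing', 'n/a'),
                'as_dict': dict(response.getheaders()),
            }
    finally:
        server.shutdown()
        server.server_close()


views = fetch_header_views()
print(views)
```

The same calls should work on the response from urllib.request.urlopen('https://httpbin.org/get'). getheader() takes an optional default that is returned when the header is absent, and dict(response.getheaders()) is a convenient form as long as no header name repeats.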

Getting the status code

Looking through the same list, we find the code and status attributes.

import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get')

print(response.code)
print(response.status)

Both attributes give the status code.
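One detail worth knowing when reading status codes: for error statuses such as 404, urlopen() raises urllib.error.HTTPError instead of returning a response, and the code is then available on the exception. The sketch below illustrates this under the assumption of a local test server; StatusHandler, status_of, and demo are hypothetical names used only here.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: 200 for /get, 404 for every other path."""

    def do_GET(self):
        code = 200 if self.path == '/get' else 404
        self.send_response(code)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the server's per-request log lines


def status_of(opener, url):
    """Return (status code, reason phrase), whether or not urllib raises."""
    try:
        with opener.open(url) as response:
            # response.code holds the same integer as response.status
            return response.status, response.reason
    except urllib.error.HTTPError as err:
        # Error statuses arrive as an exception that carries the code
        code, reason = err.code, err.reason
        err.close()
        return code, reason


def demo():
    server = HTTPServer(('127.0.0.1', 0), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Empty ProxyHandler: do not route the local request through a proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    base = 'http://127.0.0.1:%d' % server.server_address[1]
    try:
        return status_of(opener, base + '/get'), status_of(opener, base + '/nope')
    finally:
        server.shutdown()
        server.server_close()


ok, missing = demo()
print(ok)
print(missing)
```

With urlopen() against a real site, the same try/except pattern applies.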

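The data parameter

The data parameter is optional. If you supply it, it must be a byte stream, that is, an object of type bytes, which you can produce with bytes(). In addition, once data is passed, the request is no longer sent as GET but as POST.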

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

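Here we pass one parameter, word, with the value hello. It has to be converted to bytes. The conversion uses bytes(): its first argument must be a str, so we use urlencode() from the urllib.parse module to turn the parameter dict into a string; the second argument specifies the encoding, here utf8.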

The output is as follows:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "116.227.107.42", 
  "url": "http://httpbin.org/post"
}
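The submitted parameter shows up under form, and Content-Length is 10, the length of the encoded body. The switch from GET to POST can also be checked without sending anything, as in this small sketch, which builds urllib.request.Request objects and asks for their method. Using Request here is a choice made for illustration; the article itself only passes data to urlopen().

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

# urlencode() builds the form string; str.encode() yields the same bytes as
# bytes(..., encoding='utf8') in the example above.
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
print(data)        # b'word=hello'
print(len(data))   # 10

# The Request objects are only constructed here, never sent.
without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)
print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```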
