Getting a web page's encoding with urllib.request

qq_16253013

Article link: https://blog.csdn.net/qq_16253013/article/details/78145261


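I want to write a general-purpose function that runs regular expressions over a remote file. The file is opened with urllib.request.urlopen, and a JD.com product review URL serves as the test case.
urlopen returns an http.client.HTTPResponse object, which is assigned to result.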

from urllib import request

url = "https://club.jd.com/comment/productPageComments.action?&productId=12693904332&score=0&sortType=5&page=0&pageSize=10"
result = request.urlopen(url)

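Normally, calling result.read() is enough to get the remote file's contents. But read() returns bytes, not str, and when the remote file's encoding differs from the one you decode with (Python defaults to UTF-8), the text comes out garbled or decoding fails outright.
In that case, decode the bytes with bytes.decode(), passing the encoding the remote file actually uses.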

html = result.read()  # read() already returns bytes, so no bytes() wrapper is needed
html = html.decode('gbk')
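
To make the mismatch concrete, here is a small offline sketch. The sample string is made up for illustration and stands in for a GBK-encoded response body; it is not taken from the JD response.

```python
# Hypothetical sample text standing in for a GBK-encoded response body.
raw = "京东评价".encode("gbk")

# Decoding with the matching codec recovers the original text.
text_ok = raw.decode("gbk")

# Decoding the same bytes as UTF-8 fails in strict mode...
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...and with errors="replace" it yields U+FFFD placeholder characters instead.
text_lossy = raw.decode("utf-8", errors="replace")

print(text_ok, strict_failed)
```

Either way the bytes are only readable once the codec passed to decode() matches the one the server used.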

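That raises a new question: how do we find out which encoding the remote file uses? A quick Baidu search shows that http.client.HTTPResponse has an info() method. It returns an http.client.HTTPMessage object representing the response headers sent by the remote server.
The remote file's encoding is in there.
Let's print it.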

print(result.info())

This prints the following:

Server: JDWS/1.0.0
Date: Wed, 27 Sep 2017 11:09:34 GMT
Content-Type: text/html;charset=GBK
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding

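We can see Content-Type: text/html;charset=GBK, so the encoding is GBK.
But the goal is to detect the encoding automatically, so reading it off by eye won't do: this time it's GBK, the next URL might be UTF-8.
Time to look for another way.
It occurred to me that Python must have some way to list all of an object's attributes and methods; another Baidu search turned up the dir function. What we want to inspect is the http.client.HTTPMessage object, that is, whatever result.info() returns.
So let's print that: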

print(dir(result.info()))

The output:

    ['__bytes__', '__class__', '__contains__', '__delattr__', '__delitem__',
    '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
    '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__',
    '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__',
    '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__',
    '__subclasshook__', '__weakref__', '_charset', '_default_type', '_get_params_preserve',
    '_headers', '_payload', '_unixfrom', 'add_header', 'as_bytes', 'as_string', 'attach',
    'defects', 'del_param', 'epilogue', 'get', 'get_all', 'get_boundary', 'get_charset', 'get_charsets',
    'get_content_charset', 'get_content_disposition', 'get_content_maintype',
    'get_content_subtype', 'get_content_type', 'get_default_type', 'get_filename', 'get_param',
    'get_params', 'get_payload', 'get_unixfrom', 'getallmatchingheaders', 'is_multipart', 'items',
    'keys', 'policy', 'preamble', 'raw_items', 'replace_header', 'set_boundary', 'set_charset',
    'set_default_type', 'set_param', 'set_payload', 'set_raw', 'set_type', 'set_unixfrom', 'values', 'walk']
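
A list this long is easier to scan after filtering. The sketch below is one way to do that, not part of the original walkthrough: it builds an HTTPMessage by hand (so it runs without a network request) and keeps only the public names that mention "charset". The header value is copied from the response shown earlier.

```python
from http.client import HTTPMessage

# Build a header object by hand so the example runs offline.
headers = HTTPMessage()
headers["Content-Type"] = "text/html;charset=GBK"

# Keep only public attribute names that mention "charset".
candidates = [
    name for name in dir(headers)
    if "charset" in name and not name.startswith("_")
]
print(candidates)
```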

Scanning the list for something useful, get_content_type stands out. Maybe that's the one.

print(result.info().get_content_type())

This prints:

text/html

Not what we need, so keep looking.
Next candidate: get_charset. Maybe that's the one.

print(result.info().get_charset())

This prints:

None

Huh?
Still not what we need. Keep looking.
Next candidate: get_content_charset. Maybe that's the one.

print(result.info().get_content_charset())

This prints:

gbk
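
The three methods tried so far can be compared side by side without hitting the network. This is a sketch on a hand-built HTTPMessage carrying the same Content-Type value as the JD response; the comments reflect my understanding of the email.message.Message methods that HTTPMessage inherits.

```python
from http.client import HTTPMessage

headers = HTTPMessage()
headers["Content-Type"] = "text/html;charset=GBK"

# Media type only, without any parameters.
content_type = headers.get_content_type()

# Charset object attached to the message payload via set_charset();
# nothing was attached here, so this is None.
payload_charset = headers.get_charset()

# The charset parameter of the Content-Type header, lowercased.
header_charset = headers.get_content_charset()

print(content_type, payload_charset, header_charset)
```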

Found it! That settles the encoding problem.
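
Putting the pieces together, a general fetch-and-decode helper might look like the sketch below. The names decode_body and fetch_text and the UTF-8 fallback are my own assumptions for illustration, not something the post defines; the fallback covers servers that send no charset parameter, in which case get_content_charset() returns None.

```python
from http.client import HTTPMessage
from urllib import request


def decode_body(body: bytes, headers: HTTPMessage, default: str = "utf-8") -> str:
    """Decode body with the charset from Content-Type, else with `default`."""
    charset = headers.get_content_charset() or default
    # errors="replace" keeps going if a few bytes do not match the declared charset.
    return body.decode(charset, errors="replace")


def fetch_text(url: str, default: str = "utf-8") -> str:
    """Fetch url and return its body as str (hypothetical helper)."""
    with request.urlopen(url) as result:
        return decode_body(result.read(), result.info(), default)
```

Calling fetch_text(url) with the JD URL from the start of the post should then decode the body as GBK without hardcoding it. A server that declares a charset name Python does not recognize would still raise LookupError, and pages that declare their encoding only in an HTML meta tag are not handled here.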
