Web Scraping: Detect a Site's Encoding, Decode the Page, and Write It to a txt File

chaowanghn

Article link: https://blog.csdn.net/chaowanghn/article/details/54581010

To fetch content from the web, use the urllib.request module from the Python 3 standard library.

import urllib.request

def main():
    response=urllib.request.urlopen("http://placekitten.com/g/500/600")
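    # Alternatively, build a Request object first:
    # req=urllib.request.Request("http://placekitten.com/g/500/600")
    # response=urllib.request.urlopen(req)

    # print(response.geturl())   # the URL that was actually retrieved
    # print(response.info())     # the response headers
    # print(response.getcode())  # the HTTP status code; 200 means a normal response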

    cat_image=response.read()

    with open("cat_500_600.jpg","wb") as f:
        f.write(cat_image)

if __name__=="__main__":
    main()
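
The response object can also be used in a with-statement, so it is closed as soon as the bytes have been read. The following is only a minimal sketch of the same download written that way. The helper name download and the 10-second timeout are assumptions made for illustration and are not part of the original script.

```python
import urllib.request


def download(url, path, timeout=10):
    """Fetch url and save the raw bytes to path (hypothetical helper)."""
    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # Binary mode: the bytes are written exactly as received.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling download("http://placekitten.com/g/500/600", "cat_500_600.jpg") should save the same image as the script above and return the number of bytes written.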

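Before you can read a page's text, you need to know its encoding so that you can decode the raw bytes. The detect() function in the chardet package is a convenient way to guess it.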

import urllib.request
import chardet

response=urllib.request.urlopen("http://baidu.com").read()
print("Encoding used by this page: %s" %(chardet.detect(response)))

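Encoding used by this page: {'encoding': 'ascii', 'confidence': 1.0}
The confidence is 100%, so chardet reports this Baidu response as ASCII (the short redirect page returned here contains only ASCII characters). Next, decode it and print the page.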

response=response.decode("ascii")
print (response)

The result is:

<html>
<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">
</html>
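
Decoding as ASCII works here only because every byte in the response is below 0x80. As a small illustration of what happens otherwise, the sketch below decodes a made-up byte string (not taken from any real page) with the strict default and with the "ignore" and "replace" error handlers.

```python
raw = "café".encode("utf-8")  # b'caf\xc3\xa9': two bytes are not ASCII

try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # strict decoding rejects any byte >= 0x80

ignored = raw.decode("ascii", "ignore")    # silently drops the bad bytes
replaced = raw.decode("ascii", "replace")  # puts U+FFFD in place of each one
correct = raw.decode("utf-8")              # the right codec recovers the text

# ascii() keeps the output printable on any console encoding.
print(strict_ok, ascii(ignored), ascii(replaced), ascii(correct))
```

The "ignore" handler avoids the exception but loses data, so it is best kept as a last resort after the encoding has been identified as well as possible.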

Now let's look at the Xiaomuchong forum (muchong.com).

import urllib.request
import chardet

response=urllib.request.urlopen("http://muchong.com/bbs/index.php").read()
print("Encoding used by this page: %s" %(chardet.detect(response)))

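Encoding used by this page: {'encoding': 'GB2312', 'confidence': 0.99}
So chardet is 99% confident that the page is encoded as GB2312. It may actually be GBK, however (GBK is an extension of GB2312).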

Because GBK is backward compatible with GB2312, whenever GB2312 is detected you can simply use GBK to encode and decode.
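
The relationship between the two codecs can be checked directly in Python. The sketch below is illustrative only: the sample strings are my own choice, with 镕 picked as a commonly cited character that GBK covers but GB2312 does not.

```python
text = "中文"
rare = "镕"  # commonly cited as present in GBK but absent from GB2312

# Bytes produced by the GB2312 codec decode identically with the GBK codec.
same_under_gbk = text.encode("gb2312").decode("gbk") == text

# The reverse does not hold: GB2312 cannot encode the rarer character...
try:
    rare.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# ...and cannot decode its GBK bytes either.
try:
    rare.encode("gbk").decode("gb2312")
    gb2312_reads_gbk = True
except UnicodeDecodeError:
    gb2312_reads_gbk = False

print(same_under_gbk, in_gb2312, gb2312_reads_gbk)
```

This is why decoding with GBK is the safer choice when the detector reports GB2312: it accepts everything GB2312 does, plus the extra characters.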

The following script detects the encoding a page uses:

import urllib.request
import chardet

def main():
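    url=input("Enter a URL: ")

    response=urllib.request.urlopen(url)
    html=response.read()

    # Detect the page encoding. chardet returns None when it cannot guess,
    # so fall back to UTF-8 (an assumption) in that case.
    encode=chardet.detect(html)["encoding"] or "utf-8"
    if encode.upper()=="GB2312":
        encode="GBK"

    print("Encoding used by this page: %s" %encode)

    html=html.decode(encode,"ignore")
    # "ignore" is optional. I pass it so that bytes that cannot be decoded are skipped instead of raising an error.
    print("Page content: "+html)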


if __name__=="__main__":
    main()
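
Besides guessing from the bytes, the server often declares a charset in the Content-Type header, which urllib exposes as response.headers.get_content_charset(). The sketch below is one possible way to combine the two sources, not the author's method. The function pick_encoding, its parameter names, and the UTF-8 fallback are all hypothetical choices.

```python
def pick_encoding(header_charset, detected, fallback="utf-8"):
    """Choose a codec name: HTTP header first, then the detector's guess.

    header_charset stands for response.headers.get_content_charset() and
    detected for chardet.detect(html)["encoding"]; either may be None.
    """
    name = header_charset or detected or fallback
    # Treat GB2312 as GBK, since GBK is a superset of GB2312.
    if name.lower() == "gb2312":
        name = "gbk"
    return name


# GB2312-encoded bytes for a two-character sample string.
sample = b"\xd6\xd0\xce\xc4"
print(ascii(sample.decode(pick_encoding(None, "GB2312"))))
```

In the script above it could be called as pick_encoding(response.headers.get_content_charset(), chardet.detect(html)["encoding"]). Headers can be wrong too, so preferring them over detection is a design choice rather than a rule.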

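Writing the page content to a txt file can still fail with an encoding error such as: UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 82968: illegal multibyte sequence. Without an explicit encoding, open() uses the platform's default encoding (GBK here, judging by the error message), which cannot represent every character on the page. The fix is to pass the encoding to open(), that is, open(filename,"x",encoding=encode), where encode is the name of the encoding.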

import urllib.request
import chardet

def main():
    url="http://www.csdn.net"
    response=urllib.request.urlopen(url)
    html=response.read()

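    # Detect the page encoding. chardet returns None when it cannot guess,
    # so fall back to UTF-8 (an assumption) in that case.
    encode=chardet.detect(html)["encoding"] or "utf-8"
    if encode.upper()=="GB2312":
        encode="GBK"

    print("Encoding used by this page: %s" %encode)

    html=html.decode(encode,"ignore")
    # "ignore" is optional. I pass it so that bytes that cannot be decoded are skipped instead of raising an error.
    # print("Page content: "+html)


    # Mode "x" creates the file and raises FileExistsError if it already exists.
    file_1=open("csdn.txt","x",encoding=encode)
    print (html,file=file_1)
    file_1.close()

    # Or write it with write():
    # with open("csdn.txt","x",encoding=encode) as file_1:
    #     file_1.write(html)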

if __name__=="__main__":
    main()

Here, open("csdn.txt","x",encoding=encode) adds the encoding= argument, which solves the problem.
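
The failure and the fix can be reproduced without any network access. The sketch below is an illustration under my own assumptions: the sample text and the temporary file name are made up, and it writes the output as UTF-8 rather than in the detected encoding, which is a different choice from the script above. Once the bytes have been decoded to a str, the output file does not have to use the page's original encoding.

```python
import os
import tempfile

text = "Copyright \xa9 example"  # \xa9 is the copyright sign from the error

try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False  # the same failure as in the error message above

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "page.txt")

# UTF-8 can represent every Unicode character, so the write succeeds.
with open(path, "x", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    restored = f.read()

print(gbk_ok, restored == text)
```

If the file really must be written in a narrower encoding such as GBK, passing errors="replace" to open() is another option, at the cost of replacing the characters that encoding cannot represent.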
