Getting a Web Page's Character Encoding with a Python Crawler

Latest recommended article published 2024-05-02 21:47:38

残阳次杨

Tags: python, crawler, getting web page encoding

Article link: https://blog.csdn.net/weixin_44032983/article/details/100985153

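A web page's character encoding is the encoding the page declares for its own text; common ones include ASCII, GBK, UTF-8, and ISO-8859-1. Many pages declare it in the content attribute of a meta tag, for example content="text/html; charset=utf-8". Building on that, this post gives a way to read the declared encoding.
The code is as follows: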

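'''
Note: I ran this in IDLE with Python 3.7 (64-bit) and the bs4 library installed.
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getCharsetList(url):
    # Open the page and build a BeautifulSoup object; the connection is closed afterwards
    with urlopen(url) as response:
        bsObj = BeautifulSoup(response, "html.parser")

    # First narrow the search to meta tags whose content attribute contains text/html
    metaTagList = bsObj.find_all('meta', content=re.compile('text/html'))

    # List that collects the declared encodings
    charsetList = []

    # Read the content attribute of each matching tag with get()
    for metaTag in metaTagList:
        charData = str(metaTag.get('content'))
        # Capture the value after "charset="; skip tags that do not declare one
        # (the original fixed offset of position + 8 returned garbage in that case)
        match = re.search(r'charset\s*=\s*([\w.:-]+)', charData, re.IGNORECASE)
        if match:
            charsetList.append(match.group(1))

    return charsetList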
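
The function above only looks at meta tags whose content attribute contains text/html. Many newer pages use the shorter HTML5 form, meta charset="utf-8", which that filter does not match. The following is a minimal sketch of one way to cover both forms. To keep it runnable without network access or extra packages, it swaps bs4 for the standard library's html.parser and works on an HTML string. The names MetaCharsetParser, extract_charsets and sample are hypothetical and are not part of the original post.

```python
from html.parser import HTMLParser
import re

# Matches the value after "charset=" inside a content attribute
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

class MetaCharsetParser(HTMLParser):
    """Hypothetical helper: collects encodings declared in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes
        attr_map = {name: (value or "") for name, value in attrs}
        # HTML5 form: <meta charset="utf-8">
        if attr_map.get("charset"):
            self.charsets.append(attr_map["charset"].strip())
            return
        # Older form: <meta http-equiv="Content-Type" content="text/html; charset=gbk">
        match = CHARSET_RE.search(attr_map.get("content", ""))
        if match:
            self.charsets.append(match.group(1))

def extract_charsets(html_text):
    """Return every encoding declared in the meta tags of html_text, in order."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charsets

# Made-up sample page, used only for illustration
sample = '''<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta charset="utf-8">
<meta name="description" content="no encoding here">
</head><body></body></html>'''

print(extract_charsets(sample))  # ['gb2312', 'utf-8']
```

The third meta tag has a content attribute but no charset, so it is skipped and does not add a bogus entry to the list.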

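Below is the encoding obtained with Baidu as the url (screenshots of my code and of the IDLE output):

Screenshot: the code with Baidu as the url
Screenshot: output on my machine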
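
Besides the meta tag, a server can also declare the encoding in the HTTP Content-Type response header. The object returned by urlopen exposes this through response.headers.get_content_charset(). The sketch below shows one possible way to combine the two sources. The helper name pick_charset, the fallback of "utf-8", and the header-first priority are assumptions made for illustration. An email.message.Message object stands in for response.headers so the example runs offline.

```python
from email.message import Message

def pick_charset(headers, meta_charsets, default="utf-8"):
    """Hypothetical helper: prefer the HTTP header, then the first
    meta declaration, then an assumed default."""
    # get_content_charset() returns None when the header has no charset parameter
    header_charset = headers.get_content_charset()
    if header_charset:
        return header_charset
    if meta_charsets:
        return meta_charsets[0].lower()
    return default

# Offline stand-in for response.headers (urlopen returns a compatible object)
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=GBK"

print(pick_charset(fake_headers, ["utf-8"]))  # gbk
print(pick_charset(Message(), ["GB2312"]))    # gb2312
print(pick_charset(Message(), []))            # utf-8
```

With a live page, the list returned by getCharsetList(url) could be passed as meta_charsets and response.headers as headers.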