小编典典
import cgi  # deprecated since Python 3.11, removed in 3.13

_, params = cgi.parse_header('text/html; charset=utf-8')
print(params['charset'])  # -> utf-8
Or use the response object:
import urllib2  # Python 2 only

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3 (urllib.request): response.headers.get_content_charset(default)
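On recent Python 3 versions the cgi module is no longer available, so the same header parsing can be sketched with the standard library's email.message instead. This is only an illustration: the helper name parse_content_type is hypothetical and not part of any library.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (mime type, charset or None).

    Hypothetical helper for illustration; it relies only on the standard
    library's email.message.Message header parsing.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type('text/html; charset=utf-8'))
```

Both returned values are lowercased by email.message, which makes later comparisons simpler.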
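In general, the server may lie about the encoding or not report it at all (the default then depends on the content type), or the encoding may be specified inside the response body, e.g. in a <meta> element in HTML documents or in the XML declaration of XML documents. As a last resort, the encoding can be guessed from the content itself.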
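The fallback order described above (header first, then a declaration inside the body, then a guess) could be sketched roughly as follows. The function name pick_encoding, the regular expressions, and the fixed 'utf-8' fallback are assumptions made for this illustration; a real detector such as chardet would take the place of the final fallback, and the regexes are much looser than what an HTML or XML parser does.

```python
import re

# Simplified patterns (assumptions): they only look for an ASCII-compatible
# declaration near the start of the body.
XML_DECL_RE = re.compile(br'^<\?xml[^>]*encoding=["\']([\w.-]+)["\']')
META_CHARSET_RE = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def pick_encoding(header_charset, body, fallback='utf-8'):
    """Choose an encoding name for `body` (bytes).

    header_charset: charset taken from the Content-Type header, or None.
    fallback: placeholder for a content-based guess (hypothetical default).
    """
    # 1. Trust the HTTP header when it names a charset.
    if header_charset:
        return header_charset.lower()
    # 2. Look for an XML declaration or an HTML <meta> charset in the body.
    head = body[:1024]
    match = XML_DECL_RE.search(head) or META_CHARSET_RE.search(head)
    if match:
        return match.group(1).decode('ascii').lower()
    # 3. Last resort: a guess (here just a fixed default).
    return fallback


html = b'<html><head><meta charset="windows-1251"></head></html>'
print(pick_encoding(None, html))
```

Because the server may report a wrong charset, some callers prefer to check the body first; the order used here simply mirrors the paragraph above.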
You can use requests to get Unicode text:
import requests # pip install requests
r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding
Or use BeautifulSoup to parse HTML (converting it to Unicode as a side effect):
from bs4 import BeautifulSoup # pip install beautifulsoup4
soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...
Or use bs4.UnicodeDammit directly on arbitrary content (not necessarily HTML):
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8
2021-01-20