Thanks, everyone, for the answers.
There were indeed several problems, though. Here is the code after I corrected it.
# python3
import urllib.request
import chardet  # third-party: pip install chardet

# `page` (the page number) is assumed to be defined earlier
url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html = response.read()  # bytes in Python 3
print('python3 response.read()', type(html))
checkCode = chardet.detect(html)  # guess the encoding from the raw bytes
print('checkCode', checkCode)
_html = html.decode(checkCode['encoding'])  # bytes -> str
print('python3 response.read().decode(gb2312)', type(_html))
# Output:
python3 response.read() <class 'bytes'>
checkCode {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
python3 response.read().decode(gb2312) <class 'str'>
python3 type(requests.get(url).content) <class 'bytes'>
python3 type(requests.get(url).content.decode('gb2312')) <class 'str'>
Below is the Python 2 code:
import urllib2
import chardet  # third-party: pip install chardet

# `page` (the page number) is assumed to be defined earlier
url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
html = response.read()  # byte str in Python 2
print('python2 response.read()', type(html))
checkCode = chardet.detect(html)  # guess the encoding from the raw bytes
print('checkCode', checkCode)
# Earlier attempt, left commented out: a byte str has to be decoded, not encoded
#gbk_html = html.encode(checkCode['encoding']).decode('utf-8')
_html = html.decode('gb2312')  # str -> unicode
print('python2 response.read().decode(\'gb2312\')', type(_html))
# Output:
('python2 response.read()', <type 'str'>)
('checkCode', {'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'})
("python2 response.read().decode('gb2312')", <type 'unicode'>)
('python2 type(requests.get(url).content)', <type 'str'>)
("python2 type(requests.get(url).content.decode('gb2312'))", <type 'unicode'>)
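Although Python 2 and Python 3 return different types for the raw response (a byte str in Python 2, bytes in Python 3), once the data is decoded to unicode it can be printed. In short, the real problem was that I had misunderstood how input and output encodings work.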
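To make the bytes-versus-text distinction concrete without hitting the network, here is a minimal Python 3 sketch using only the standard library. The helper name `decode_bytes` and the candidate encoding list are my own assumptions for illustration and are not part of the code above; it simply tries each encoding in turn rather than detecting one with chardet. The `sample` value stands in for what `response.read()` returns.

```python
def decode_bytes(raw, candidates=("utf-8", "gb18030")):
    """Decode raw bytes with the first candidate encoding that succeeds.

    Hypothetical helper: the candidate list is an assumption, with
    gb18030 chosen because it is a superset of GB2312 and GBK.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to the first candidate and replace
    # undecodable bytes instead of raising.
    return raw.decode(candidates[0], errors="replace")


# Stand-in for response.read(): GB2312-encoded bytes, not text.
sample = "中文".encode("gb2312")
text = decode_bytes(sample)

print(type(sample))  # the raw data is bytes
print(type(text))    # after decoding it is str (unicode text)
print(text)
```

Trying UTF-8 first is a deliberate choice: GB2312 bytes rarely form valid UTF-8, so a wrong guess tends to fail loudly instead of silently producing garbage. For real pages, chardet's detection as used above remains the more general approach.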