import wad.detection
import cchardet
import requests
url='https://www.baidu.com'
# Identify the technologies the site uses
det=wad.detection.Detector()
print(det.detect(url))
# Detect the page's encoding with cchardet
html=requests.get(url)
result=cchardet.detect(html.content)
print(result)
html.encoding=result['encoding']
print(html.encoding)
# Look up the site owner's information (did not work)
# import whois
# imagination=whois.whois("www.douban.com")
# print(imagination)
Output:
{'https://www.baidu.com/': [{'app': 'jQuery', 'ver': '1', 'type': 'JavaScript Libraries'}]}
{'encoding': 'UTF-8', 'confidence': 0.9900000095367432}
UTF-8
Method 2 for checking the encoding (chardet):
chardet with requests:
import chardet
data=requests.get(url)
print(chardet.detect(data.content))
Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
chardet with urllib:
import urllib
from urllib import request
res=request.Request(url)
data=request.urlopen(res)
print(chardet.detect(data.read()))
Do not decode first. chardet.detect() expects bytes, not str, so the following is wrong:
print(chardet.detect(data.read().decode('utf-8')))
Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Method 3 for checking the encoding (requests):
import requests
data=requests.get(url)
print(data.encoding)
print(data.apparent_encoding)
# Find the encoding the page itself declares
# requests.utils.get_encodings_from_content(response.text)[0]
print(requests.utils.get_encodings_from_content(data.text))
print(requests.utils.get_encoding_from_headers(data.headers))
Output:
ISO-8859-1
utf-8
['utf-8']
ISO-8859-1
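A Response object exposes two encodings: encoding, which is taken from the response headers, and apparent_encoding, which is detected from the response body.
requests decodes the body with the charset given in the Content-Type response header to produce the text attribute of the data object. If no charset is given, requests falls back to ISO-8859-1 for text content, so any Chinese on the page comes out garbled.
apparent_encoding analyzes the content itself, so it usually reports the encoding the page actually uses.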
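To see why the ISO-8859-1 fallback garbles Chinese text, the following minimal sketch reproduces the effect with the standard library only. The sample string is made up for illustration and no network request is involved.

```python
# Hypothetical sample: the bytes a UTF-8 page would send for some Chinese text.
raw = "百度一下".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never raises, but the result is mojibake.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so the damage is reversible.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

With requests, the usual way to avoid the problem is to set data.encoding = data.apparent_encoding before reading data.text.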
Counterexample:
print(requests.utils.get_encoding_from_headers(data.headers))
print(data.headers['Content-Encoding'])
print(data.headers)
Output:
ISO-8859-1
gzip
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 24 Sep 2021 02:56:06 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
With a request header added:
url="https://www.baidu.com"
header={
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
import cchardet
import chardet
import requests
data=requests.get(url,headers=header)
print(data.encoding)
print(data.apparent_encoding)
print(requests.utils.get_encodings_from_content(data.text))
print(requests.utils.get_encoding_from_headers(data.headers))
print(data.headers['Content-Encoding'])
print(data.headers)
Output:
utf-8
utf-8
['utf-8']
utf-8
gzip
{'Bdpagetype': '1', 'Bdqid': '0xa6b23a4a00001c33', 'Cache-Control': 'private', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Fri, 24 Sep 2021 03:08:12 GMT', 'Expires': 'Fri, 24 Sep 2021 03:07:36 GMT', 'P3p': 'CP=" OTI DSP COR IVA OUR IND COM ", CP=" OTI DSP COR IVA OUR IND COM "', 'Server': 'BWS/1.1', 'Set-Cookie': 'BAIDUID=4E3F7D5FB1ABBC456994F37E8359A1EF:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BIDUPSID=4E3F7D5FB1ABBC456994F37E8359A1EF; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, PSTM=1632452892; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BAIDUID=4E3F7D5FB1ABBC452B179B0C1AAECB4D:FG=1; max-age=31536000; expires=Sat, 24-Sep-22 03:08:12 GMT; domain=.baidu.com; path=/; version=1; comment=bd, BDSVRTM=0; path=/, BD_HOME=1; path=/, H_PS_PSSID=34644_34441_34068_31254_34551_34584_34106_26350_34627_34424_22160_34691_34675; path=/; domain=.baidu.com', 'Strict-Transport-Security': 'max-age=172800', 'Traceid': '1632452892045537332212011727245652532275', 'X-Frame-Options': 'sameorigin', 'X-Ua-Compatible': 'IE=Edge,chrome=1', 'Transfer-Encoding': 'chunked'}
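The two runs differ only in the Content-Type header: 'text/html' without a charset gave ISO-8859-1, while 'text/html;charset=utf-8' gave utf-8. The sketch below approximates that lookup with the standard library. The function name encoding_from_content_type is hypothetical, and the fallback rule is an assumption based on the outputs above, not a copy of the requests implementation.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Return the charset named in a Content-Type value, else a fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset).strip("'\"")
    # Assumed fallback: text/* without a charset is treated as ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("text/html;charset=utf-8"))
```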
Method 4 for checking the encoding (urllib.request):
import urllib
from urllib import request
res=request.Request(url)
data=request.urlopen(res)
print(data.headers)
Output:
Bdpagetype: 1
Bdqid: 0x8be0aba3001d6028
Cache-Control: private
Content-Type: text/html;charset=utf-8
Date: Fri, 24 Sep 2021 02:45:07 GMT
Expires: Fri, 24 Sep 2021 02:44:57 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=203CF8DE7E836247AFA410E8AF44297A:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=203CF8DE7E836247AFA410E8AF44297A; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1632451507; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=203CF8DE7E8362478857F5D88A8C0B15:FG=1; max-age=31536000; expires=Sat, 24-Sep-22 02:45:07 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=1; path=/
Set-Cookie: H_PS_PSSID=34647_34068_31254_34551_34524_34585_34504_26350_34725_34425_34691; path=/; domain=.baidu.com
Traceid: 1632451507035516877810079244682625114152
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Frame-Options: sameorigin
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked
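Rather than reading the charset out of the printed headers by eye, it can be pulled from the headers object directly. The headers returned by urlopen are an http.client.HTTPMessage, which provides get_content_charset(). The sketch below builds such an object from a hypothetical two-line header block so that it runs offline; on a live response the same call would be data.headers.get_content_charset().

```python
import email
import http.client

# Hypothetical header block, shaped like the output above.
raw_headers = b"Content-Type: text/html;charset=utf-8\r\nServer: BWS/1.1\r\n\r\n"

# Parse into the same class that urllib.request uses for response headers.
headers = email.message_from_bytes(raw_headers, _class=http.client.HTTPMessage)

print(headers.get_content_type())
print(headers.get_content_charset())
```

When the Content-Type header carries no charset, get_content_charset() returns None, so a detector such as chardet is still needed in that case.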
Summary:
Ways to check a web page's encoding:
No. | Method | requests | urllib.request |
---|---|---|---|
1 | cchardet | html=requests.get(url); result=cchardet.detect(html.content); print(result); html.encoding=result['encoding']; print(html.encoding) | res=request.Request(url,headers=header); data=request.urlopen(res); print(cchardet.detect(data.read())) |
2 | chardet | Same as above, using chardet.detect | Same as above, using chardet.detect |
3 | headers | ①data.encoding ②data.apparent_encoding ③requests.utils.get_encodings_from_content(data.text) ④requests.utils.get_encoding_from_headers(data.headers) ⑤data.headers['Content-Type'] | data.headers['Content-Type'] |
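- About text and content:
text is decoded automatically, so a wrong encoding guess produces garbled characters
content holds the raw bytes of the response, which are usually decoded with 'utf-8'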
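When working from the raw bytes, one option is to look for the charset the page declares in a meta tag and decode with that, falling back to 'utf-8'. The following is a minimal sketch under those assumptions. The names META_CHARSET and decode_html and the sample page are hypothetical, and the regular expression is a simplification that only covers common meta tag forms.

```python
import re

# Simplified pattern for <meta charset="..."> and <meta ... content="...; charset=...">.
META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def decode_html(raw, default="utf-8"):
    """Decode raw page bytes with the charset declared in a meta tag, if any."""
    match = META_CHARSET.search(raw)
    encoding = match.group(1).decode("ascii") if match else default
    # An unknown encoding name would raise LookupError here; undecodable
    # bytes are replaced instead of raising.
    return raw.decode(encoding, errors="replace"), encoding


# Hypothetical GBK-encoded page used only for this demonstration.
sample = '<html><head><meta charset="gbk"></head><body>百度</body></html>'.encode("gbk")
text, used = decode_html(sample)
print(used)
print(text)
```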