Wad and cchardet

import wad.detection
import cchardet
import requests
url='https://www.baidu.com'
# Detect the technologies the site uses
det=wad.detection.Detector()
print(det.detect(url))

# Use cchardet to detect the page's encoding
html=requests.get(url)
result=cchardet.detect(html.content)
print(result)
html.encoding=result['encoding']
print(html.encoding)

# Look up the site owner's information (did not work)
# import whois
# imagination=whois.whois("www.douban.com")
# print(imagination)

Output:
{'https://www.baidu.com/': [{'app': 'jQuery', 'ver': '1', 'type': 'JavaScript Libraries'}]}
{'encoding': 'UTF-8', 'confidence': 0.9900000095367432}
UTF-8
Encoding detection, method 2 (chardet):

chardet with requests:

import chardet

data=requests.get(url)
print(chardet.detect(data.content))

Output:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

chardet with urllib:

from urllib import request

res=request.Request(url)
data=request.urlopen(res)
print(chardet.detect(data.read()))
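Do not write it like this: chardet.detect expects bytes, and recent versions raise a TypeError when given an already-decoded str:
print(chardet.detect(data.read().decode('utf-8')))
Output of the correct version: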
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
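The reason detection needs the undecoded bytes can be seen without any third-party library. The following is a minimal stdlib-only sketch (the sample string is an arbitrary choice for illustration, not taken from the page): the same text yields different byte sequences under UTF-8 and GBK, but once decoded both become the same str, so nothing about the original encoding is left for a detector to examine.

```python
# Illustration only: the sample text is an arbitrary choice.
sample = "中文"

utf8_bytes = sample.encode("utf-8")
gbk_bytes = sample.encode("gbk")

# The byte sequences differ; this difference is what a detector inspects.
print(utf8_bytes)
print(gbk_bytes)

# After decoding, both are the same str; the encoding information is gone.
same_after_decoding = utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")
print(same_after_decoding)
```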
Encoding detection, method 3 (requests):
import requests
data=requests.get(url)
print(data.encoding)
print(data.apparent_encoding)
# Look up the encoding declared inside the page itself
# requests.utils.get_encodings_from_content(response.text)[0]
print(requests.utils.get_encodings_from_content(data.text))
print(requests.utils.get_encoding_from_headers(data.headers))

Output:
ISO-8859-1
utf-8
['utf-8']
ISO-8859-1

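A Response object exposes two encodings: encoding, which requests guesses from the response headers, and apparent_encoding, which is detected from the response body.
requests decodes the body with the charset given in the Content-Type response header to produce the text attribute of the data object. If the header carries no charset, requests falls back to ISO-8859-1 for text content types, so any Chinese on the page comes out garbled.
apparent_encoding analyzes the body itself, so it usually reports the page's actual encoding.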
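The garbling caused by the ISO-8859-1 fallback can be reproduced offline. This is a minimal sketch using only the standard library; the sample string is an assumption chosen for illustration. It decodes UTF-8 bytes as ISO-8859-1, which never fails because every byte maps to some character, and then shows that the damage is reversible as long as the original bytes can be rebuilt.

```python
# Sample text is hypothetical; any non-ASCII string shows the same effect.
original = "百度一下"
raw = original.encode("utf-8")

# Wrong decoding: succeeds silently, one (wrong) character per byte.
garbled = raw.decode("ISO-8859-1")
print(repr(garbled))

# Re-encoding with the same wrong codec restores the original bytes,
# which can then be decoded correctly.
recovered = garbled.encode("ISO-8859-1").decode("utf-8")
print(recovered)
```

With requests, the usual remedy is to assign data.encoding = data.apparent_encoding before reading data.text, which has the same effect as assigning the cchardet result to html.encoding earlier.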
Counterexample (the response headers carry no charset):

print(requests.utils.get_encoding_from_headers(data.headers))
print(data.headers['Content-Encoding'])
print(data.headers)

Output:
ISO-8859-1
gzip
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 24 Sep 2021 02:56:06 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
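The ISO-8859-1 result follows from the Content-Type header being plain text/html with no charset parameter. The sketch below approximates that header rule with the standard library only. The helper name charset_like_requests is hypothetical, and the logic is a simplified approximation of what requests does, not its actual implementation.

```python
from email.message import Message


def charset_like_requests(content_type):
    """Hypothetical helper approximating the header rule described above."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter exists
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"  # fallback for text types without a charset
    return None


# The two Content-Type values seen in this article's outputs.
print(charset_like_requests("text/html"))
print(charset_like_requests("text/html;charset=utf-8"))
```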

With a User-Agent request header added:
url="https://www.baidu.com"
header={

    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
import cchardet
import chardet
import requests
data=requests.get(url,headers=header)
print(data.encoding)
print(data.apparent_encoding)

print(requests.utils.get_encodings_from_content(data.text))
print(requests.utils.get_encoding_from_headers(data.headers))
print(data.headers['Content-Encoding'])
print(data.headers)

Output:
utf-8
utf-8
['utf-8']
utf-8
gzip
{'Bdpagetype': '1', 'Bdqid': '0xa6b23a4a00001c33', 'Cache-Control': 'private', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Fri, 24 Sep 2021 03:08:12 GMT', 'Expires': 'Fri, 24 Sep 2021 03:07:36 GMT', 'P3p': 'CP=" OTI DSP COR IVA OUR IND COM ", CP=" OTI DSP COR IVA OUR IND COM "', 'Server': 'BWS/1.1', 'Set-Cookie': 'BAIDUID=4E3F7D5FB1ABBC456994F37E8359A1EF:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BIDUPSID=4E3F7D5FB1ABBC456994F37E8359A1EF; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, PSTM=1632452892; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BAIDUID=4E3F7D5FB1ABBC452B179B0C1AAECB4D:FG=1; max-age=31536000; expires=Sat, 24-Sep-22 03:08:12 GMT; domain=.baidu.com; path=/; version=1; comment=bd, BDSVRTM=0; path=/, BD_HOME=1; path=/, H_PS_PSSID=34644_34441_34068_31254_34551_34584_34106_26350_34627_34424_22160_34691_34675; path=/; domain=.baidu.com', 'Strict-Transport-Security': 'max-age=172800', 'Traceid': '1632452892045537332212011727245652532275', 'X-Frame-Options': 'sameorigin', 'X-Ua-Compatible': 'IE=Edge,chrome=1', 'Transfer-Encoding': 'chunked'}

Encoding detection, method 4 (urllib.request):
from urllib import request

res=request.Request(url)
data=request.urlopen(res)
print(data.headers)
 
Output:
Bdpagetype: 1
Bdqid: 0x8be0aba3001d6028
Cache-Control: private
Content-Type: text/html;charset=utf-8
Date: Fri, 24 Sep 2021 02:45:07 GMT
Expires: Fri, 24 Sep 2021 02:44:57 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=203CF8DE7E836247AFA410E8AF44297A:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=203CF8DE7E836247AFA410E8AF44297A; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1632451507; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=203CF8DE7E8362478857F5D88A8C0B15:FG=1; max-age=31536000; expires=Sat, 24-Sep-22 02:45:07 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=1; path=/
Set-Cookie: H_PS_PSSID=34647_34068_31254_34551_34524_34585_34504_26350_34725_34425_34691; path=/; domain=.baidu.com
Traceid: 1632451507035516877810079244682625114152
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Frame-Options: sameorigin
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked
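Unlike requests, urllib does not decode the body for you, but the headers object returned by urlopen is an http.client.HTTPMessage, which can report the charset directly. The sketch below runs offline: it builds an HTTPMessage from two header lines copied from the output above, and the body bytes and the helper name decode_body are assumptions made for illustration.

```python
import email
from http.client import HTTPMessage

# Two header lines copied from the output above, followed by a blank line.
raw_headers = "Content-Type: text/html;charset=utf-8\nServer: BWS/1.1\n\n"
headers = email.message_from_string(raw_headers, _class=HTTPMessage)


def decode_body(body, headers, default="utf-8"):
    """Hypothetical helper: decode bytes using the header charset if present."""
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors="replace")


# Stand-in for data.read(); a real response body would come from urlopen.
body = "<title>百度一下</title>".encode("utf-8")
text = decode_body(body, headers)

print(headers.get_content_charset())
print(text)
```

On a live response the same call would be data.headers.get_content_charset(), which returns None when the server sends no charset.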

Summary:

Ways to check a page's encoding:

Each numbered method is shown in its requests and urllib.request variants:

1. cchardet
   requests:
     html=requests.get(url)
     result=cchardet.detect(html.content)
     print(result)
     html.encoding=result['encoding']
     print(html.encoding)
   urllib.request:
     res=request.Request(url,headers=header)
     data=request.urlopen(res)
     print(cchardet.detect(data.read()))
2. chardet
   requests: same as above
   urllib.request: same as above
3. headers
   requests:
     ① data.encoding
     ② data.apparent_encoding
     ③ requests.utils.get_encodings_from_content(data.text)
     ④ requests.utils.get_encoding_from_headers(data.headers)
     ⑤ data.headers['Content-Type']
   urllib.request:
     data.headers['Content-Type']
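  • About text and content:
    text is decoded automatically, so a wrong encoding guess produces garbled characters
    content holds the raw bytes, which are usually decoded with 'utf-8'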
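When working from content rather than text, the encoding has to be chosen by hand. One option is to look for a charset declaration inside the bytes themselves, in the spirit of requests.utils.get_encodings_from_content. The following is a rough sketch under stated assumptions: the regular expression is a simplification, the helper name decode_content is hypothetical, and the sample page is made up for illustration.

```python
import re

# Simplified pattern: finds charset=... inside a <meta> tag in raw bytes.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_content(raw, default="utf-8"):
    """Hypothetical helper: decode bytes using the declared charset, if any."""
    match = META_CHARSET.search(raw)
    encoding = match.group(1).decode("ascii") if match else default
    return raw.decode(encoding, errors="replace"), encoding


# Made-up page encoded as GBK, standing in for response.content.
sample = '<html><head><meta charset="gbk"></head><body>中文</body></html>'.encode("gbk")
text, encoding = decode_content(sample)
print(encoding)
print(text)
```

A declared charset that Python does not recognize would raise LookupError here, so real code may want to catch that and fall back to a detector such as chardet.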