Python crawler HTML mojibake: adaptive encoding detection to fix garbled web pages

Latest recommended article published 2023-04-26 18:46:47

白如新


Tags: Python crawler HTML mojibake

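Copyright notice: This is the author's original article, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_30995429/article/details/114446285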


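# coding=utf-8
import chardet  # character-set detection (third-party package)
import urllib.parse
import urllib.request
import re
import ssl

# Skip SSL certificate verification (insecure; acceptable only for testing)
ssl._create_default_https_context = ssl._create_unverified_context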

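# Matches charset=xxx or charset: "xxx" in a header value or a <meta> tag
rr = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)

def getCode(string):
    """Return the charset name found in string, or '' if there is none."""
    p = rr.findall(string)
    if len(p) > 0:
        print('Encoding: ' + p[0])
        return p[0]
    print('Encoding not found')
    return ''

#getCode(r'iiifjjd charset:" utf_8iidi-oo">')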

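def getHtml(url):
    headers = {
        "User-Agent": 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Referer': url
    }
    values = {
        'name': 'hao_hao',
        'ie': 'utf-8'
    }
    data = urllib.parse.urlencode(values)
    req = urllib.request.Request(url=url + '?' + data, headers=headers)
    #req = urllib.request.Request(url+'?'+data)
    response = urllib.request.urlopen(req)
    # Read the body once up front: the response stream cannot be read twice,
    # so calling readlines() first would leave read() with nothing to return
    the_page = response.read()
    # 1. Look for the encoding in the response headers
    page = getCode(response.headers['Content-Type'] or '')
    # 2. Look for the encoding in the page source
    if page == '':
        for line in the_page.splitlines():
            # charset names are ASCII, so undecodable bytes can be dropped here
            page = getCode(line.decode('ascii', 'ignore'))
            if page != '': break
    # 3. Fall back to chardet content analysis. It reports the GBK page
    #    https://mm.taobao.com/search_tstar_model.html as GB2312, so it is
    #    unreliable; use it only when the first two methods fail
    if page == '':
        chardit1 = chardet.detect(the_page)
        page = chardit1['encoding'] or ''  # chardet may return None
        print('chardet detection\r\nEncoding: ' + page)
    # Print the response headers
    print(response.headers)
    # Close the connection if needed
    #response.close()
    # No encoding could be determined
    if page == '': return ''
    return the_page.decode(page)  # decode the bytes to str
    #return the_page.decode(page).encode('utf-8')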

print('===============================================')
# GBK
html = getHtml("https://mm.taobao.com/search_tstar_model.html")
print(html)
print('===============================================')
# UTF-8
html = getHtml("http://kyfw.12306.cn/otn/leftTicket/init")
print(html)
print('===============================================')
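
The three-step lookup above (response header, then page source, then content analysis) can also be tried offline on raw bytes. The following is only a minimal sketch under stated assumptions, not the author's method: `detect_encoding`, `normalize` and `SUPERSETS` are hypothetical names, step 3 swaps chardet for plain trial decoding so that the sketch needs only the standard library, and GB2312/GBK labels are widened to GB18030 (a superset of both) as one possible way around the GBK-versus-GB2312 mislabeling mentioned in the comments.

```python
import codecs
import re

# Same pattern as rr in the script above
CHARSET_RE = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)

# Hypothetical helper table: widen GB2312/GBK labels to their superset GB18030
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def normalize(name):
    """Return a codec name Python knows, or '' if the label is unknown."""
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        return ""
    return SUPERSETS.get(canonical, canonical)


def detect_encoding(content_type, body):
    """Guess the encoding of body (bytes); content_type may be None."""
    # 1. charset declared in the Content-Type header
    match = CHARSET_RE.search(content_type or "")
    if match and normalize(match.group(1)):
        return normalize(match.group(1))
    # 2. charset declared near the top of the page source (e.g. a <meta> tag);
    #    the declaration itself is ASCII, so other bytes can be ignored
    head = body[:4096].decode("ascii", "ignore")
    match = CHARSET_RE.search(head)
    if match and normalize(match.group(1)):
        return normalize(match.group(1))
    # 3. Assumption: trial decoding instead of chardet. UTF-8 is strict,
    #    so it goes first; GB18030 is the fallback for Chinese pages
    for candidate in ("utf-8", "gb18030"):
        try:
            body.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    return ""


if __name__ == "__main__":
    page = '<meta charset="gb2312"><p>中文</p>'.encode("gbk")
    encoding = detect_encoding("text/html", page)
    print(encoding, page.decode(encoding))
```

In `getHtml`, the same idea would amount to passing `response.headers['Content-Type']` and the bytes from `response.read()` to such a function and decoding with the result; whether trial decoding or chardet is the better last resort depends on which encodings the crawled sites actually use.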
