Removing garbled characters from HTML in Python

Memory_and_Dream

Category: Python web scraping

Article link: https://blog.csdn.net/Memory_and_Dream/article/details/108007846

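Sometimes a web page contains garbled bytes that cause XPath parsing to fail. I searched Baidu and Google for a long time without finding a solution, so in the end I wrote my own replacement method. It reads the position reported in the decode error message and removes the byte at that position, then retries, up to 10 times.
The method is as follows: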

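import re

def remove_error_code(byte_string, charset):
    """Decode byte_string, dropping the byte at each reported error position."""
    result = None  # stays None if decoding still fails after 10 attempts
    for try_times in range(10):
        try:
            result = byte_string.decode(charset)
            break
        except UnicodeDecodeError as e:
            # The message looks like "... in position 12: invalid start byte"
            match = re.search(r'in position (\d+)', str(e))
            if not match:
                break
            index = int(match.group(1))  # group(1) is the number; group(0) is the whole match
            # Remove the offending byte and retry
            byte_string = byte_string[:index] + byte_string[index + 1:]
    return result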
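
As a possible alternative to parsing the error message, the sketch below reads the position directly from the exception. It relies on the `start` and `end` attributes that Python's `UnicodeDecodeError` provides. The function name `strip_undecodable`, the `max_removals` parameter, and the final fallback to `errors="ignore"` are assumptions made for illustration and are not part of the original method.

```python
def strip_undecodable(byte_string, charset, max_removals=10):
    """Hypothetical variant: drop undecodable byte ranges using exception attributes."""
    for _ in range(max_removals):
        try:
            return byte_string.decode(charset)
        except UnicodeDecodeError as e:
            # e.start and e.end delimit the offending bytes (end is exclusive)
            byte_string = byte_string[:e.start] + byte_string[e.end:]
    # Assumption: after max_removals attempts, discard whatever still fails to decode
    return byte_string.decode(charset, errors="ignore")


# Example: one stray 0xff byte in the middle of otherwise valid UTF-8 HTML
html = "<p>ok</p>".encode("utf-8") + b"\xff" + "<p>好</p>".encode("utf-8")
print(strip_undecodable(html, "utf-8"))  # prints <p>ok</p><p>好</p>
```

If losing every undecodable byte is acceptable from the start, calling `byte_string.decode(charset, errors="ignore")` directly gives a similar result in one step. The loop is mainly useful when you want to cap or log how many bytes are removed.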
