Web page encoding issues in Python crawlers

Latest recommended article published on 2024-04-11 20:24:07

lavender_hhl


Article link: https://blog.csdn.net/weixin_39643135/article/details/117445547


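I'm working on a crawler that needs to scrape news pages from many different websites, and encoding turns out to be an important issue. The scraped content is usually assumed to be UTF-8 (requests itself falls back to ISO-8859-1 when a text response declares no charset in its headers), but Chinese web pages often use 'gb2312' or 'gbk', so the extracted body text comes out garbled.

Solution: first detect the page's own encoding, then decode the content with that encoding.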

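import chardet
import requests
from bs4 import BeautifulSoup

# The original snippet used `headers` without defining it;
# this User-Agent value is only a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}

# Crawl a page with requests
def crawler(url):
    html = requests.get(url, headers=headers)
    # Detect the page's own encoding with either
    # html.apparent_encoding or chardet.detect(html.content)['encoding']
    # Decode using the detected encoding
    html.encoding = chardet.detect(html.content)['encoding']
    soup = BeautifulSoup(html.text, 'lxml')
    # ... other processing ...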
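
As a network-free illustration of the same idea, here is a minimal sketch that uses only the standard library. It reads the charset a page declares in its own HTML and decodes the raw bytes with it. The helper names `sniff_meta_charset` and `decode_page` and the sample page are hypothetical and not part of the original post; real pages may declare no charset or the wrong one, which is why the post relies on chardet instead.

```python
import re

def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of the HTML, or `default`."""
    match = re.search(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
                      raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default

def decode_page(raw: bytes) -> str:
    """Decode raw page bytes using the charset the page declares."""
    encoding = sniff_meta_charset(raw)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # Unknown codec name in the page: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")

# Simulated raw bytes of a GB2312-encoded page (hypothetical sample).
page = ('<html><head><meta charset="gb2312"></head>'
        '<body>中文新闻</body></html>').encode("gb2312")

wrong = page.decode("utf-8", errors="replace")  # wrong codec: garbled text
right = decode_page(page)                       # declared codec: readable text

print(sniff_meta_charset(page))
print("中文新闻" in wrong, "中文新闻" in right)
```

Decoding with the wrong codec does not raise here only because of `errors="replace"`; the Chinese characters are silently turned into replacement characters, which is the kind of garbled output described above.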

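Note: on 'gbk'-encoded pages, some characters may appear in the scraped content as "\u3000". This is normal. U+3000 is the ideographic (full-width) space, and it has been decoded correctly; it only shows up as an escape sequence when the string's repr is displayed, for example when printing a list of strings.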
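
The behaviour in the note can be checked with a short sketch. The sample string is made up for illustration, and collapsing whitespace with `split()` is just one possible cleanup, not something the original post prescribes.

```python
# A string containing an ideographic space (U+3000), as often found in Chinese pages.
text = "标题\u3000正文"

# Round-trip through GBK: the character survives encoding and decoding intact.
decoded = text.encode("gbk").decode("gbk")

# repr() shows the escape sequence, even though the character itself is valid.
print(repr(decoded))

# str.split() with no arguments treats U+3000 as whitespace,
# so this collapses it into a single ASCII space.
cleaned = " ".join(decoded.split())
print(cleaned)
```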
