Fixing garbled Chinese text when scraping web pages

Latest recommended article published on 2024-05-15 17:03:16

fareatm


Category columns: Python, Web scraping. Article tag: web scraping

Article link: https://blog.csdn.net/fareatm/article/details/81590146


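The page is encoded as gb2312, and the Chinese text extracted from it comes out garbled. The fix below assumes the response body was decoded as ISO-8859-1 rather than with the page's real encoding: turn the garbled string back into the original bytes, then decode those bytes as GBK (a superset of gb2312):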

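# paper is presumably a Scrapy selector; take the text of the first <a> element
str1 = paper.css('a::text').extract_first()
# Recover the original bytes from the wrongly decoded string
str1 = str1.encode('iso-8859-1')
# Decode the bytes with the page's actual encoding
print(str1.decode('gbk'))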
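
The round trip can be reproduced without a crawler. The sketch below is an illustration, not the author's code: the sample string and the helper name `repair_mojibake` are made up, and it assumes the text was wrongly decoded as ISO-8859-1 when the bytes were really GBK.

```python
def repair_mojibake(text, wrong="iso-8859-1", right="gbk"):
    """Undo a decode done with the wrong codec (hypothetical helper)."""
    # Encoding with the wrong codec recovers the original bytes,
    # because ISO-8859-1 maps every byte 0-255 to exactly one character.
    raw = text.encode(wrong)
    # Decoding those bytes with the right codec yields the intended text.
    return raw.decode(right)


original = "中文标题"                      # sample text, chosen for illustration
page_bytes = original.encode("gbk")        # what a gb2312/GBK page sends
garbled = page_bytes.decode("iso-8859-1")  # what a wrong decode produces
repaired = repair_mojibake(garbled)

print(repr(garbled))
print(repaired)
```

If the crawler exposes the raw response bytes (for example `response.body` in Scrapy), decoding those bytes as GBK directly avoids the detour through ISO-8859-1.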

Python string prefixes r, b, u, and f; converting between str and bytes; format

What the string prefixes r, b, u, and f mean:

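b'input\n' # bytes literal; its repr is shown with a leading b.
Output:
b'input\n'

r'input\n' # raw string: backslashes are not treated as escapes, so \n here is two characters, a backslash and an n, not a newline.
Output:
'input\\n'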
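
A quick way to see the difference is to compare lengths. The following sketch is only an illustration; the variable names and the sample regular expression are not from the original.

```python
import re

escaped = 'input\n'   # \n is one newline character
raw = r'input\n'      # \n stays as a backslash followed by n

print(len(escaped))        # 6
print(len(raw))            # 7
print(raw == 'input\\n')   # True

# Raw strings are commonly used for regular expressions,
# where backslashes should reach the regex engine untouched.
numbers = re.findall(r'\d+', 'page 12 of 340')
print(numbers)             # ['12', '340']
```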

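u'input\n' # Unicode string. In Python 3 every str is already Unicode, so the u prefix is optional and kept only for compatibility with Python 2 code.
Output:
'input\n'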

import time
t0 = time.time()
time.sleep(1)
name = 'processing'
print(f'{name} done in {time.time() - t0:.2f} s')  # the f prefix allows Python expressions inside the braces
Output (the measured time may vary slightly):
processing done in 1.00 s

str.format works much like the f prefix: braces mark the replacement fields, and a colon introduces the format spec.
coord = (3, 5)
'X: {0[0]};  Y: {0[1]}'.format(coord)
'X: 3;  Y: 5'

'{0}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'a, b, a'

'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'

'{:,}'.format(1234567890)
'1,234,567,890'

points = 19
total = 22
'Correct answers: {:.2%}'.format(points/total)
'Correct answers: 86.36%'
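
The same format specs work inside f-strings. As a rough side-by-side sketch, the block below rewrites three of the `str.format` calls as f-strings; the values 19 and 22 for `points` and `total` are assumed sample values that reproduce the 86.36% result.

```python
coord = (3, 5)
points, total = 19, 22   # assumed sample values

as_format = [
    'X: {0[0]};  Y: {0[1]}'.format(coord),
    '{:,}'.format(1234567890),
    'Correct answers: {:.2%}'.format(points / total),
]
as_fstring = [
    f'X: {coord[0]};  Y: {coord[1]}',
    f'{1234567890:,}',
    f'Correct answers: {points / total:.2%}',
]

# Both spellings produce identical strings.
print(as_format == as_fstring)
print(as_fstring)
```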

Converting between str and bytes:

'€20'.encode('utf-8')
# b'\xe2\x82\xac20'
b'\xe2\x82\xac20'.decode('utf-8')
# '€20'
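
The bytes produced depend on the codec, which is why decoding with the wrong one gives garbled text or an error. A small sketch, using the single character '中' as an arbitrary sample:

```python
text = '中'

utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')
print(utf8_bytes)  # b'\xe4\xb8\xad'
print(gbk_bytes)   # b'\xd6\xd0'

# Decoding with a codec that does not match raises an error ...
try:
    gbk_bytes.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... unless an error handler is given; 'replace' substitutes U+FFFD.
lenient = gbk_bytes.decode('utf-8', errors='replace')
print(strict_ok, repr(lenient))
```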

s1 = '123'
print(s1)
print(type(s1))
s2 = b'123'
print(s2)
print(type(s2))

Output, showing the difference:
123
<class 'str'>
b'123'
<class 'bytes'>

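In Python 2, a plain string literal is a byte string (the str type holds bytes).
In Python 3, a string literal is Unicode text (str), and bytes is a separate type.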
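
In Python 3 the two types also behave differently once created. The sketch below, with arbitrary sample values, shows a few of the visible differences:

```python
s = '123'
b = b'123'

print(s[0])      # 1   (a one-character str)
print(b[0])      # 49  (an int, the byte value of the character 1)
print(list(b))   # [49, 50, 51]
print(s == b)    # False: a str and a bytes object never compare equal

# One character can take several bytes once encoded.
char_count = len('中')
byte_count = len('中'.encode('utf-8'))
print(char_count, byte_count)  # 1 3
```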

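str to bytes:
bytes('123', encoding='utf8')
str.encode('123')  # same as '123'.encode(); the default encoding is UTF-8

bytes to str:
str(b'123', encoding='utf-8')
bytes.decode(b'123')  # same as b'123'.decode(); the default encoding is UTF-8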

# bytes object
b = b"example"

# str object
s = "example"

# str to bytes
bytes(s, encoding="utf8")

# bytes to str
str(b, encoding="utf-8")

# an alternative method
# str to bytes
str.encode(s)

# bytes to str
bytes.decode(b)
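
When a value may arrive as either type, as often happens in scraping code, the conversions can be wrapped in small helpers. This is one possible sketch; `to_text` and `to_bytes` are hypothetical names, not part of the standard library.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_text(b'example'))
print(to_bytes('example'))
print(to_text('€20'.encode('utf-8')))
```

Passing `encoding='gbk'` to `to_text` covers the gb2312 page case when the raw bytes are available.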
