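[WeChat official account: 大邓和他的python]
Garbled, mis-encoded text is a constant nuisance in text analysis. Usually there is not much we can do about an encoding problem, so we simply skip whatever cannot be decoded: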
text = open(file, errors='ignore').read()
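But that throws information away. So how do we tame the gremlins that keep causing trouble in text analysis?
Python to the rescue! ftfy ("fixes text for you") can clean up garbled data for us.
Installation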
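To make the information loss from errors='ignore' concrete, here is a small standard-library sketch. The sample string and the choice of Latin-1 as the "real" encoding are assumptions made purely for illustration.

```python
# Hypothetical sample: text that was saved as Latin-1, not UTF-8.
raw = "café résumé".encode("latin-1")

# errors='ignore' silently drops every byte that is not valid UTF-8,
# so each accented letter simply disappears.
lossy = raw.decode("utf-8", errors="ignore")
print(lossy)    # caf rsum

# Decoding with the encoding the bytes were actually written in keeps everything.
correct = raw.decode("latin-1")
print(correct)  # café résumé
```

Nothing warns you that characters went missing, which is why silently ignoring errors is risky for any downstream word counts.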
!pip3 install ftfy==5.6
Mojibake (ง'⌣')ง examples
I found these odd-looking strings in the official documentation. You have probably run into data like this yourself.
(ง'⌣')ง
Ã¼nicode
Broken text&hellip; it&#x2019;s flubberific!
HTML entities &lt;3
&macr;\\_(ã\x83\x84)_/&macr;
\ufeffParty like\nit&rsquo;s 1999!
ＬＯＵＤ　ＮＯＩＳＥＳ
This â€” should be an em dash
This text was never UTF-8 at all\x85
\033[36;44mI'm blue, da ba dee da ba doo...\033[0m
\u201chere\u2019s a test\u201d
This string is made of two things:\u2029 1. Unicode\u2028 2. Spite
ftfy.fix_text: a cure for all kinds of mismatched text
The fix_text function in ftfy can tame most mojibake of the (ง'⌣')ง kind.
from ftfy import fix_text
fix_text("(ง'⌣')ง")
"(ง'⌣')ง"
fix_text('Ã¼nicode')
'ünicode'
fix_text('Broken text&hellip; it&#x2019;s flubberific!')
"Broken text… it's flubberific!"
fix_text('HTML entities &lt;3')
'HTML entities <3'
fix_text("&macr;\\_(ã\x83\x84)_/&macr;")
'¯\\_(ツ)_/¯'
fix_text('\ufeffParty like\nit&rsquo;s 1999!')
"Party like\nit's 1999!"
fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ')
'LOUD NOISES'
fix_text('Ãºnico')
'único'
fix_text('This â€” should be an em dash')
'This — should be an em dash'
fix_text('This text is sad .â\x81”.')
'This text is sad .⁔.'
fix_text('The more you know ðŸŒ\xa0')
'The more you know 🌠'
fix_text('This text was never UTF-8 at all\x85')
'This text was never UTF-8 at all…'
fix_text("\033[36;44mI'm blue, da ba dee da ba doo...\033[0m")
"I'm blue, da ba dee da ba doo..."
fix_text('\u201chere\u2019s a test\u201d')
'"here\'s a test"'
text = "This string is made of two things:\u2029 1. Unicode\u2028 2. Spite"
fix_text(text)
'This string is made of two things:\n 1. Unicode\n 2. Spite'
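Where do strings like 'Ã¼nicode' come from in the first place? Typically the text was encoded as UTF-8 and then decoded with the wrong single-byte codec. The standard-library sketch below reproduces that round trip and reverses it by hand. It is only an illustration of the idea, not ftfy's actual implementation, and it assumes we already know the wrong codec was Latin-1.

```python
# How mojibake arises: UTF-8 bytes read back with the wrong codec.
original = "ünicode"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # Ã¼nicode

# Undoing the same two steps recovers the text. This only works because we
# know which wrong codec was applied; detecting that is the hard part.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # ünicode
```

The value of fix_text is that it works out whether such a repair applies, and with which codec, without being told.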
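ftfy.fix_file: the same cure for files
The examples above all fix strings, but ftfy can also process garbled files directly. I won't demo that here. Next time you run into mojibake, just remember that there is a library called ftfy ("fixes text for you") whose fix_text and fix_file can help.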