Python Data Analysis from Beginner to Advanced: Quick Text Processing (with Code)

Latest recommended article published 2023-12-11 11:45:42

Python_魔力猿

Tags: python, data analysis, programming languages

Article link: https://blog.csdn.net/weixin_68789096/article/details/133015610

🍁1. Cleaning Text

Basic cleaning of unstructured text data usually starts with a few built-in string methods:

strip
split
replace

# Create the text
text_data = ['   Interrobang. By Aishwarya Henriette   ',
             'Parking And goding. by karl fautier',
             '   Today is the night. by jarek prakash    ']

# Strip whitespace from both ends of each string
stripwhitespace = [string.strip() for string in text_data]

stripwhitespace

['Interrobang. By Aishwarya Henriette', 'Parking And goding. by karl fautier', 'Today is the night. by jarek prakash']

# Remove periods (this runs on the original text_data, so the outer whitespace is kept)
remove_periods = [string.replace('.','') for string in text_data]

remove_periods

['   Interrobang By Aishwarya Henriette   ', 'Parking And goding by karl fautier', '   Today is the night by jarek prakash    ']

# Define a function that upper-cases a string
def capitalizer(string):
    return string.upper()

[capitalizer(string) for string in remove_periods]

['   INTERROBANG BY AISHWARYA HENRIETTE   ', 'PARKING AND GODING BY KARL FAUTIER', '   TODAY IS THE NIGHT BY JAREK PRAKASH    ']

# Use a regular expression to replace every ASCII letter with 'x'
import re

def replace_letters_with_x(string):
    return re.sub(r'[a-zA-Z]','x',string)

[replace_letters_with_x(string) for string in remove_periods]

['   xxxxxxxxxxx xx xxxxxxxxx xxxxxxxxx   ', 'xxxxxxx xxx xxxxxx xx xxxx xxxxxxx', '   xxxxx xx xxx xxxxx xx xxxxx xxxxxxx    ']
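The list at the start of this section names split, but the examples above never call it. The following is a minimal sketch of how strip, replace, and split might be chained into one cleaning step. The helper name clean_and_split is hypothetical and is not part of the original article.

```python
text_data = ['   Interrobang. By Aishwarya Henriette   ',
             'Parking And goding. by karl fautier',
             '   Today is the night. by jarek prakash    ']

def clean_and_split(string):
    """Strip outer whitespace, drop periods, then split into words.

    Hypothetical helper for illustration only.
    """
    cleaned = string.strip().replace('.', '')
    # split() with no arguments splits on any run of whitespace
    return cleaned.split()

tokens = [clean_and_split(string) for string in text_data]
print(tokens[0])
```

Calling split() with no arguments also ignores leading and trailing whitespace, so the strip() call is kept here mainly to make each step explicit.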

🍂2. Parsing and Cleaning HTML

# Parse the HTML with Beautiful Soup

from bs4 import BeautifulSoup

# Create some HTML
html = """
        <div class='full_name'><span style='font-weight:bold'>
        Masege Azra"
    
    """

# Create the soup object (the 'lxml' parser requires the lxml package to be installed)
soup = BeautifulSoup(html, 'lxml')

soup.find('div')

<div class="full_name"><span style="font-weight:bold">
        Masege Azra"
    
    </span></div>
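The example above locates the div but stops before the "cleaning" step, which is getting the text out of the markup. With the soup object, Beautiful Soup's get_text() method on the found tag is the usual route. As a dependency-free illustration of the same idea, here is a rough sketch using only the standard library's html.parser module. This is a swapped-in technique, not the Beautiful Soup approach used in this article, and the names TextExtractor, html_to_text, and sample are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty pieces of text
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.chunks)

# Hypothetical, well-formed sample markup (not the snippet from the article)
sample = "<div class='full_name'><span style='font-weight:bold'>Masege</span> Azra</div>"
print(html_to_text(sample))
```

Unlike Beautiful Soup with lxml, this sketch makes no attempt to repair broken markup; it simply gathers whatever text the parser reports.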

🍃3. Removing Punctuation

import unicodedata
import sys

text_data = ['Hi!!!! I. love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right??!!']

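# Build a dictionary that maps every Unicode punctuation code point to None
# (translate() deletes any character mapped to None)
punctuation = dict.fromkeys(i for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)).startswith('P'))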

[string.translate(punctuation) for string in text_data]

['Hi I love This Song', '10000 Agree LoveIT', 'Right']
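Scanning every Unicode code point is thorough, but it does noticeably more work than needed when the text is known to be plain ASCII. In that case a lighter option may be enough: a translation table built from string.punctuation. This is a sketch under that ASCII-only assumption, and the name ascii_table is illustrative.

```python
import string

text_data = ['Hi!!!! I. love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right??!!']

# The third argument to str.maketrans lists characters to delete
ascii_table = str.maketrans('', '', string.punctuation)

cleaned = [s.translate(ascii_table) for s in text_data]
print(cleaned)
```

Note that string.punctuation also contains characters such as $ and +, which Unicode classifies as symbols rather than punctuation, so the two approaches can give different results on other inputs.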

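🌍4. Tokenizing Text

Here we introduce the jieba library. jieba is designed for Chinese word segmentation; on the English sentence below it also returns the spaces as separate tokens.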

import jieba

# Create the text
string = 'The science of study is the technology of tomorrow'

seg = jieba.lcut(string)
print(seg)

['The', ' ', 'science', ' ', 'of', ' ', 'study', ' ', 'is', ' ', 'the', ' ', 'technology', ' ', 'of', ' ', 'tomorrow']
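For English-only text, where the whitespace tokens in the output above are usually unwanted, a regular expression may be sufficient. The following is a minimal sketch using the standard re module rather than jieba. It assumes words consist only of letters, digits, and underscores, and the name words is illustrative.

```python
import re

string = 'The science of study is the technology of tomorrow'

# \w+ matches runs of word characters, so spaces and punctuation are skipped
words = re.findall(r'\w+', string)
print(words)
```

This pattern would split a contraction such as "don't" into two tokens, so it is only a starting point for more careful tokenization.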

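Of course, this article covers only the most basic text-processing methods used in data cleaning. Later posts will cover current mainstream NLP methods and code.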
