Text Processing in pandas

知识的宝藏

Published on 2024-08-19 20:47:12

Article tags: pandas

Article link: https://blog.csdn.net/TalorSwfit20111208/article/details/141334603

pandas provides a set of tools and methods for processing and manipulating text data stored in a DataFrame. The main methods and features are outlined below:

Working with text data

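Most of pandas' text functionality is exposed through the str accessor, which is available on Series and Index objects. A DataFrame has no str accessor of its own, so you select a column first. The accessor applies a string method to every element of the Series or Index.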
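To make the point about where the str accessor lives concrete, here is a minimal sketch. The column names and values are made up for illustration, and applying a method to a whole DataFrame with apply is just one possible approach that assumes every column holds strings.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
df = pd.DataFrame({"Name": [" Alice ", "Bob  "], "City": [" Paris", "Rome "]})

# .str is available on a Series (a single column) ...
names = df["Name"].str.strip()

# ... and on an Index, such as the column labels.
lower_cols = df.columns.str.lower()

# A DataFrame has no .str accessor, so apply the method column by column.
# This assumes every column contains strings.
stripped = df.apply(lambda col: col.str.strip())

has_str = hasattr(df, "str")

print(names.tolist())             # ['Alice', 'Bob']
print(list(lower_cols))           # ['name', 'city']
print(stripped["City"].tolist())  # ['Paris', 'Rome']
print(has_str)                    # False
```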

Common string methods

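Extracting, splitting, joining, and testing substrings:
- str.extract(): extracts the capture groups of the first regular expression match.
- str.slice(): slices each string by position.
- str.slice_replace(): replaces a positional slice of each string with another string.
- str.split(): splits each string on a delimiter.
- str.rsplit(): like split, but splits starting from the right.
- str.cat(): concatenates strings, either element-wise with other data or into a single string.
- str.contains(): tests whether each string contains a pattern (treated as a regular expression by default).
- str.startswith(): tests whether each string starts with a given prefix.
- str.endswith(): tests whether each string ends with a given suffix.
- str.repeat(): repeats each string.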
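Case conversion:
- str.lower(): converts to lowercase.
- str.upper(): converts to uppercase.
- str.capitalize(): uppercases the first character and lowercases the rest.
- str.title(): uppercases the first letter of each word and lowercases the rest.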
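String formatting (trimming and padding):
- str.strip(): removes whitespace from both ends.
- str.lstrip(): removes whitespace from the left side.
- str.rstrip(): removes whitespace from the right side.
- str.center(): centers each string within a given width, filling both sides with the specified character.
- str.pad(): pads each string to a given width on the left (the default), the right, or both sides.
- str.zfill(): pads the left side with zeros up to a given width.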
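String replacement:
- str.replace(): replaces occurrences of a substring or pattern. Since pandas 2.0 the pattern is treated as a literal string by default; pass regex=True to use a regular expression.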
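Regular expressions and length:
- str.match(): tests whether a regular expression matches at the start of each string.
- str.findall(): returns a list of all matches of a pattern in each string.
- str.count(): counts the occurrences of a pattern in each string.
- str.len(): returns the length of each string.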
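Several of the methods listed above are easier to tell apart with concrete input. The following is a small sketch on made-up Series values (none of the data comes from the article); it contrasts split with rsplit and match with contains, and shows a few of the slicing, padding, and counting methods.

```python
import pandas as pd

# Hypothetical sample values chosen to make the differences visible
s = pd.Series(["a-b-c", "x-y", "42"])

# split and rsplit only differ when n limits the number of splits.
left = s.str.split("-", n=1)     # [['a', 'b-c'], ['x', 'y'], ['42']]
right = s.str.rsplit("-", n=1)   # [['a-b', 'c'], ['x', 'y'], ['42']]

# Positional slicing and slice replacement
first_char = s.str.slice(0, 1)                          # ['a', 'x', '4']
swapped = s.str.slice_replace(start=0, stop=1, repl="Z")  # ['Z-b-c', 'Z-y', 'Z2']

# With no other data, cat joins all elements into one string.
joined = s.str.cat(sep="|")      # 'a-b-c|x-y|42'

# pad fills on the left by default; zfill pads with zeros on the left.
padded = s.str.pad(5, side="left", fillchar="*")  # ['a-b-c', '**x-y', '***42']
zeros = s.str.zfill(5)                            # ['a-b-c', '00x-y', '00042']

t = pd.Series(["Python 3", "I like Python"])
at_start = t.str.match("Python")      # [True, False]: anchored at the start
anywhere = t.str.contains("Python")   # [True, True]: matches anywhere

u = pd.Series(["a1b22", "none"])
numbers = u.str.findall(r"\d+")   # [['1', '22'], []]
digits = u.str.count(r"\d")       # [3, 0]
lengths = u.str.len()             # [5, 4]
```

Note that match anchors the pattern at the start of each string, while contains searches the whole string, which is why the two results differ for the second element.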

Example code

Here is a simple example showing how to use the string methods in pandas:

import pandas as pd

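# Create a simple DataFrame
data = {'Text': ['Hello, World!', 'Python is fun.', 'Data Science']}
df = pd.DataFrame(data)

# Use the str accessor
# Convert to lowercase
df['Lowercase'] = df['Text'].str.lower()

# Extract the first word that is exactly five characters long (NaN if there is none)
# expand=False returns a Series instead of a one-column DataFrame
df['Substring'] = df['Text'].str.extract(r'(\b\w{5}\b)', expand=False)

# Replace a literal substring
df['Replaced'] = df['Text'].str.replace('World', 'Earth', regex=False)

# Check whether the text contains a specific word
df['Contains_Python'] = df['Text'].str.contains('Python')

# Print the result
print(df)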

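Output

             Text       Lowercase  Substring        Replaced  Contains_Python
0   Hello, World!   hello, world!      Hello   Hello, Earth!            False
1  Python is fun.  python is fun.        NaN  Python is fun.             True
2    Data Science    data science        NaN    Data Science            False

Only the first row contains a word of exactly five characters, so Substring is NaN for the other two rows.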
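The regular expression handling in the example is worth a closer look. The sketch below reuses the same three strings and shows, under the assumption of default pandas settings, how extract reports rows without a match, how a named group sets the column name, and how the regex flag changes what contains does with a dot. The group name first is an illustrative choice, not something from the article.

```python
import pandas as pd

s = pd.Series(["Hello, World!", "Python is fun.", "Data Science"])

# \b\w{5}\b only matches words of exactly five characters;
# rows without such a word come back as missing values.
five = s.str.extract(r"\b(\w{5})\b", expand=False)
print(five.isna().tolist())    # [False, True, True]

# A named group becomes the column name of the resulting DataFrame.
first_word = s.str.extract(r"(?P<first>\w+)")
print(first_word["first"].tolist())   # ['Hello', 'Python', 'Data']

# contains treats the pattern as a regex by default, so "." matches any character.
regex_dot = s.str.contains(".")                  # [True, True, True]
literal_dot = s.str.contains(".", regex=False)   # [False, True, False]
print(regex_dot.tolist(), literal_dot.tolist())
```

When the search text is plain text rather than a pattern, passing regex=False (or escaping it with re.escape) avoids surprises from characters such as the dot.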