Exploring Pandas and Tiktoken for Data Processing

Article link: https://blog.csdn.net/ylong52/article/details/141108589

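Processing and analyzing text data is a common task in data analysis and machine learning, and Python offers powerful libraries for the job. This post looks at two particularly useful ones, pandas and tiktoken, along with a few key pandas operations such as df.dropna() and working with the df.Summary column.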

Pandas: The Swiss Army Knife of Data Manipulation

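pandas is an open-source data analysis and manipulation library. It provides fast, flexible, and expressive data structures designed to make data cleaning and analysis simpler. Here is a simple pandas example: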

import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
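
As a small follow-up to the DataFrame above, the sketch below shows three everyday operations: selecting a column, filtering rows with a boolean condition, and computing a summary statistic. The variable names `ages`, `over_30`, and `mean_age` are chosen here for illustration and are not part of the original example.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Select a single column (the result is a Series)
ages = df["Age"]

# Keep only the rows where Age is at least 30 (boolean indexing)
over_30 = df[df["Age"] >= 30]

# Compute a simple summary statistic over the column
mean_age = ages.mean()

print(over_30)
print(mean_age)
```

Boolean indexing returns a new DataFrame, so `df` itself still contains all three rows afterwards.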

Tiktoken: A Handy Tool for Text Processing

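tiktoken is OpenAI's library for converting text into tokens, and it is typically used in natural language processing tasks. A token is the basic unit of text that a model works with; it can be a word, a character, or a subword. Encoding turns a text string into a list of integers, where each integer represents one token of the input. For example: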

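import tiktoken

# Suppose we have a piece of text
text = "Hello, how are you?"

# Load an encoding first, then encode with it.
# "cl100k_base" is one of tiktoken's built-in encodings, used here as an example.
encoding = tiktoken.get_encoding("cl100k_base")
encoded_text = encoding.encode(text)

# Print the encoded result
print(encoded_text)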

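The output is a list of integers, for example in this form (the numbers below are placeholders; the actual token IDs depend on the encoding you choose):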

[1, 2, 3, 4, 5]
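
To make the three token granularities mentioned above (words, characters, subwords) more concrete, here is a toy sketch in plain Python. It is not how tiktoken works internally: tiktoken uses byte pair encoding with a large learned vocabulary, while this sketch uses a greedy longest-match over a tiny, hypothetical vocabulary (`TOY_VOCAB`). All function names here are made up for illustration.

```python
def word_tokens(text):
    """Word-level tokens: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level tokens: every character is its own token."""
    return list(text)


def subword_tokens(text, vocab):
    """Greedy longest-match against a vocabulary of subword pieces.

    Falls back to a single character when no vocabulary entry matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = ["token", "iz", "ation"]
TOKEN_TO_ID = {tok: idx for idx, tok in enumerate(TOY_VOCAB)}

pieces = subword_tokens("tokenization", TOKEN_TO_ID)
# Encoding: map each piece to its integer ID
# (this lookup only works for pieces that are in the toy vocabulary)
ids = [TOKEN_TO_ID[p] for p in pieces]
# Decoding: map the IDs back to pieces and join them
decoded = "".join(TOY_VOCAB[i] for i in ids)

print(word_tokens("Hello, how are you?"))
print(pieces, ids, decoded)
```

The point of the sketch is the round trip: text becomes a list of integers, and the integers can be mapped back to the original text.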

df.dropna(): A Handy Tool for Cleaning Data

Missing values are common in real data. pandas provides the convenient df.dropna() method to remove rows or columns that contain them. By default it drops every row with at least one missing value and returns a new DataFrame, leaving the original unchanged, so the result has to be assigned back (or to a new variable) to take effect.
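
The behavior of dropna() described in this section can be sketched with a small DataFrame that actually contains missing values. The `reviews` data below is made up for illustration; `subset` and `axis` are standard parameters of `DataFrame.dropna`.

```python
import pandas as pd

# Hypothetical sample data with missing values (None)
reviews = pd.DataFrame({
    "Summary": ["Great product", None, "Excellent service"],
    "Text": ["I loved this product.", "It did not meet my expectations.", None],
    "Score": [5, 2, 5],
})

# Default: drop every row that has at least one missing value.
# dropna() returns a new DataFrame; `reviews` itself is not modified.
complete_rows = reviews.dropna()

# Only require a value in specific columns
with_summary = reviews.dropna(subset=["Summary"])

# Drop columns (instead of rows) that contain any missing value
complete_cols = reviews.dropna(axis=1)

print(complete_rows)
print(with_summary)
print(complete_cols)
```

Because the original is untouched, calling `reviews.dropna()` on its own line without using the return value has no lasting effect.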

df.Summary: Working with Summary Text

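In text datasets, a Summary column usually holds a short summary of each text. During analysis we may need to process these summaries in specific ways. For example, we can combine the Summary column with another text column to prepare for later text analysis: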

# Suppose we have the following DataFrame, which includes summary information
data_with_summary = {
    'Summary': ['Great product', 'Not as expected', 'Excellent service'],
    'Text': ['I loved this product, it was exactly what I needed.', 
             'The product did not meet my expectations.', 
             'The service was above and beyond.']
}
df_with_summary = pd.DataFrame(data_with_summary)

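# Combine the Summary and Text columns
df_with_summary['combined'] = (df_with_summary['Summary'] + "; " + df_with_summary['Text'])

# Display the combined column
print(df_with_summary['combined'])

0 Great product; I loved this product, it was exactly what I needed.
1 Not as expected; The product did not meet my expectations.
2 Excellent service; The service was above and beyond.

Each row is the Summary value and the Text value joined by a semicolon and a space. (pandas may abbreviate long strings when printing; they are shown in full here.)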
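
One detail worth illustrating is how string concatenation interacts with missing values, since this is why cleaning usually comes before combining. The sketch below uses made-up data (`df_missing`) and shows two possible ways to handle it; which one fits depends on whether incomplete rows should be kept.

```python
import pandas as pd

# Hypothetical sample data: the second row has no Summary
df_missing = pd.DataFrame({
    "Summary": ["Great product", None],
    "Text": ["I loved this product.", "It did not meet my expectations."],
})

# A missing operand makes the whole concatenated value missing
naive = df_missing["Summary"] + "; " + df_missing["Text"]

# Option 1: drop incomplete rows before combining
cleaned = df_missing.dropna(subset=["Summary", "Text"])
combined_clean = cleaned["Summary"] + "; " + cleaned["Text"]

# Option 2: keep every row and substitute an empty string
combined_filled = (
    df_missing["Summary"].fillna("") + "; " + df_missing["Text"].fillna("")
)

print(naive)
print(combined_clean)
print(combined_filled)
```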

Finally, here are the same steps applied to data read from a CSV file:

import pandas as pd
import tiktoken

# Load the data: read the CSV file from the given URL with pd.read_csv
# (pass index_col=0 if the file's first column should become the index).
input_datapath = "https://gitee.com/skyqi/21st-century/raw/master/amazon-fine-food-reviews.csv"
df = pd.read_csv(input_datapath)
# Select columns: keep the six columns "Time", "ProductId", "UserId", "Score", "Summary" and "Text".
df = df[["Time","ProductId","UserId","Score","Summary","Text"]]
# Drop missing values: dropna() returns a new DataFrame, so assign the result back.
df = df.dropna()
# Combine fields: join "Summary" and "Text" into a new field "combined", formatted as "Title: " + summary + "; Content: " + text.
df["combined"] = ("Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip())
# Show the first 10 rows of the processed DataFrame
print(df.head(10))
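
For trying the same pipeline without network access, the sketch below wraps the steps in a function and feeds it a tiny inline CSV. The `SAMPLE_CSV` rows and the function name `prepare_reviews` are hypothetical stand-ins, not the real dataset; only the column names are taken from the script above.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the real CSV file
SAMPLE_CSV = """Time,ProductId,UserId,Score,Summary,Text
1351123200,B001,U1,5,  Great product ,I loved it.
1351209600,B002,U2,2,,It did not meet my expectations.
1351296000,B003,U3,4,Good value,  Worth the price.
"""


def prepare_reviews(source):
    """Load reviews, drop incomplete rows, and build the combined field."""
    df = pd.read_csv(source)
    df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
    # Assign the result of dropna(); copy() gives an independent DataFrame
    df = df.dropna().copy()
    df["combined"] = (
        "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
    )
    return df


reviews = prepare_reviews(io.StringIO(SAMPLE_CSV))
print(reviews["combined"].tolist())
```

The second sample row has an empty Summary, so it is removed by dropna(), and str.strip() trims the stray spaces around the remaining values. The same function accepts a file path or URL in place of the StringIO object.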