Differences between n-gram, word-frequency, part-of-speech, and word-vector features, with a combined-use example

Latest recommended article published on 2024-09-26 20:12:06

Ai玩家hly


Tags: feature engineering, python, artificial intelligence, n-gram, part-of-speech / word-frequency / word-vector text feature processing

Article link: https://blog.csdn.net/qq_45003504/article/details/139997668


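n-gram (2-gram here): captures short phrases and local word-order patterns in the text.
Word frequency: reflects how often each word occurs, a rough signal of its importance and distribution in the text.
Part of speech: adds grammatical context, such as whether a word is a noun or a verb.
Word vectors: capture semantic relationships between words.
Feature combination example: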

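n-gram features:
我喜欢 (I like)
喜欢吃 (like to eat)
吃苹果 (eat apples)
Word-frequency features:
我 (I): 2 occurrences
喜欢 (like): 2 occurrences
吃 (eat): 2 occurrences
苹果 (apple): 2 occurrences
Part-of-speech features:
我: pronoun
喜欢: verb
吃: verb
苹果: noun
Word-vector features:
我: [vector representation]
喜欢: [vector representation]
吃: [vector representation]
苹果: [vector representation]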
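
The n-gram and word-frequency features listed above can be computed with a few lines of standard-library Python. The sketch below is only an illustration: the post does not show the source text, so the corpus (the sentence 我 喜欢 吃 苹果 appearing twice, already segmented into words) and the helper name `word_ngrams` are assumptions chosen to reproduce the counts in the list.

```python
from collections import Counter

# Assumption: the corpus is the same pre-segmented sentence appearing twice.
# The original post does not show the text behind the listed features.
sentences = [
    ["我", "喜欢", "吃", "苹果"],
    ["我", "喜欢", "吃", "苹果"],
]


def word_ngrams(tokens, n):
    """Return the word-level n-grams of one tokenized sentence as joined strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# 2-gram features, built per sentence so no n-gram crosses a sentence boundary
bigram_counts = Counter(bg for sent in sentences for bg in word_ngrams(sent, 2))

# Word-frequency features over the whole corpus
word_freq = Counter(tok for sent in sentences for tok in sent)

print(list(bigram_counts))  # ['我喜欢', '喜欢吃', '吃苹果']
print(dict(word_freq))      # {'我': 2, '喜欢': 2, '吃': 2, '苹果': 2}
```

Part-of-speech tags and word vectors are not derived here because they normally come from an external tagger and a pretrained embedding model.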

Feature fusion example:

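# Define a function that converts a text into a feature vector
def text_to_features(text, word_to_index, bigrams, trigrams, word_freq, pos_tags):
    # Initialize the feature vector
    features = []

    # Split the text into a list of words. str.split() needs whitespace-separated
    # tokens, so Chinese text must be segmented before it is passed in.
    words = text.split()

    # Build the word-level n-grams that actually occur in the text
    text_bigrams = {"".join(words[i:i + 2]) for i in range(len(words) - 1)}
    text_trigrams = {"".join(words[i:i + 3]) for i in range(len(words) - 2)}

    # Add n-gram features: 1 if the candidate n-gram occurs in the text, else 0
    for bigram in bigrams:
        features.append(1 if bigram in text_bigrams else 0)
    for trigram in trigrams:
        features.append(1 if trigram in text_trigrams else 0)

    # Add word-frequency features (0 for words missing from the frequency table)
    for word in words:
        features.append(word_freq.get(word, 0))

    # Add part-of-speech features, encoding each tag as an integer ID so the
    # vector stays numeric. IDs follow order of first appearance; a real
    # pipeline would use a fixed tag inventory shared across all texts.
    pos_to_id = {tag: i for i, tag in enumerate(dict.fromkeys(pos_tags))}
    for pos in pos_tags:
        features.append(pos_to_id[pos])

    # Add word-vector features. Only the vocabulary index is stored here; it can
    # be used later to look up the actual vector in an embedding table.
    for word in words:
        # -1 marks an unknown word (0 is already the index of a real word);
        # alternatively, use a special vector to represent unknown words
        features.append(word_to_index.get(word, -1))

    # Return the feature vector
    return features

# Example text: "我今天心情很好，阳光明媚。" segmented into words, punctuation removed
text = "我 今天 心情 很好 阳光 明媚"

# Assume we have a pre-built vocabulary and index
word_to_index = {"我": 0, "今天": 1, "心情": 2, "很好": 3, "阳光": 4, "明媚": 5}
bigrams = ["我今天", "今天心情", "心情很好", "很好阳光", "阳光明媚"]
trigrams = ["我今天心情", "今天心情很好", "心情很好阳光", "很好阳光明媚"]
word_freq = {"我": 1, "今天": 1, "心情": 1, "很好": 1, "阳光": 1, "明媚": 1}
# Tags: pronoun, time word, noun, adverb, noun, adjective (one per word, in order)
pos_tags = ["代词", "时间词", "名词", "副词", "名词", "形容词"]

# Build and show the fused feature vector
features = text_to_features(text, word_to_index, bigrams, trigrams, word_freq, pos_tags)
print(features)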
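
The function above stores a vocabulary index for each word rather than an actual vector, so the result grows with the length of the text. One common way to get a fixed-length feature is to average the word vectors and concatenate the result with the n-gram indicators. The following is a minimal sketch of that idea, not the method used in this post: the 3-dimensional `embeddings` table holds made-up values rather than trained vectors, and the names `average_vector` and `fuse` are hypothetical.

```python
# Hypothetical 3-dimensional embeddings; the values are made up for illustration.
embeddings = {
    "我": [0.1, 0.3, 0.5],
    "今天": [0.2, 0.1, 0.0],
    "心情": [0.4, 0.4, 0.2],
    "很好": [0.3, 0.0, 0.1],
}
DIM = 3


def average_vector(words, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        # No known word at all: fall back to a zero vector
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def fuse(ngram_flags, words, embeddings, dim):
    """Concatenate binary n-gram indicators with the averaged word vector."""
    return list(ngram_flags) + average_vector(words, embeddings, dim)


words = ["我", "今天", "心情", "很好", "阳光", "明媚"]
ngram_flags = [1, 1, 1, 1, 1]  # assumed: all five candidate bigrams occur in the text
fused = fuse(ngram_flags, words, embeddings, DIM)
print(len(fused))  # 8: five n-gram indicators plus a 3-dimensional average
```

Averaging discards word order, which is one reason to keep the n-gram indicators alongside it.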