Unsupervised Learning - Latent Semantic Analysis (LSA) for Clustering

草明

Published on 2024-01-24 05:15:00


Article link: https://blog.csdn.net/galoiszhou/article/details/135741901



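Latent Semantic Analysis (LSA) is an unsupervised learning method for uncovering latent semantic structure in text data. Two of its main applications are topic modeling of text documents and information retrieval.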

Below is a simple tutorial that implements LSA in Python with the scikit-learn library.

Step 1: Import libraries

import numpy as np
import matplotlib.pyplot as plt  # needed for the scatter plot in Step 6
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

Step 2: Prepare the text data

# Sample text data
documents = [
    "Natural language processing is a field of artificial intelligence.",
    "Text analysis involves processing and understanding written language.",
    "Machine learning algorithms are used in natural language processing.",
    "Topic modeling is a technique in text analysis.",
    "Latent semantic analysis is a type of topic modeling."
]

Step 3: Vectorize the text

Vectorize the text data with TF-IDF (Term Frequency-Inverse Document Frequency).

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

Step 4: Apply Latent Semantic Analysis (LSA)

# Perform latent semantic analysis with TruncatedSVD
n_components = 2  # number of latent semantic dimensions
lsa = TruncatedSVD(n_components=n_components)
lsa_result = lsa.fit_transform(X)
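As a sanity check on what TruncatedSVD computes, the sketch below compares it with a full SVD from NumPy on the same small matrix. It is only an illustration under stated assumptions: converting `X` to a dense array is fine for five documents but not for a large corpus, and `random_state=0` is an arbitrary choice added for reproducibility.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "Natural language processing is a field of artificial intelligence.",
    "Text analysis involves processing and understanding written language.",
    "Machine learning algorithms are used in natural language processing.",
    "Topic modeling is a technique in text analysis.",
    "Latent semantic analysis is a type of topic modeling.",
]
X = TfidfVectorizer().fit_transform(documents)

lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_result = lsa.fit_transform(X)

# Full SVD on a dense copy of X (acceptable only because the matrix is tiny)
U, s, Vt = np.linalg.svd(X.toarray(), full_matrices=False)

# The two largest singular values should agree with TruncatedSVD
print("TruncatedSVD:", lsa.singular_values_)
print("Full SVD    :", s[:2])

# Document coordinates are the projection of X onto the component vectors
projected = X @ lsa.components_.T
print("Projection matches fit_transform:", np.allclose(lsa_result, projected))
```

In other words, `lsa_result` holds the coordinates of each document in the two-dimensional latent space, and `lsa.components_` holds the term weights that define each latent dimension.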

Step 5: Inspect the LSA results

# Inspect the LSA results
print("LSA Components:")
print(lsa.components_)
print("\nLSA Explained Variance Ratio:")
print(lsa.explained_variance_ratio_)

Step 6: Visualize the LSA results

# Visualize the LSA results
plt.scatter(lsa_result[:, 0], lsa_result[:, 1], c='blue', marker='o')
plt.title('Latent Semantic Analysis')
plt.xlabel('LSA Component 1')
plt.ylabel('LSA Component 2')
plt.show()

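In this example, we first vectorize the text data with TF-IDF and then apply TruncatedSVD to perform latent semantic analysis. Finally, we inspect the LSA output and use a scatter plot to show how the documents are distributed in the latent semantic space.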
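Since information retrieval is one of the applications of LSA, here is a minimal sketch of how a query could be matched against the documents in the latent space. The query string is a made-up example, and ranking by cosine similarity is one common choice rather than the only option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Natural language processing is a field of artificial intelligence.",
    "Text analysis involves processing and understanding written language.",
    "Machine learning algorithms are used in natural language processing.",
    "Topic modeling is a technique in text analysis.",
    "Latent semantic analysis is a type of topic modeling.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)

# Hypothetical query; it must go through the same fitted vectorizer and LSA model
query = "topic modeling for text"
query_vector = lsa.transform(vectorizer.transform([query]))

# Cosine similarity between the query and every document in the latent space
similarities = cosine_similarity(query_vector, doc_vectors)[0]
ranking = similarities.argsort()[::-1]  # most similar document first

for idx in ranking:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")
```

Note that the query is transformed with `transform`, not `fit_transform`, so that it lands in the same space as the documents.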

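Adjusting the n_components parameter changes the dimensionality of the latent semantic space. Choosing it is a trade-off: fewer components give a more compact representation but discard more information from the TF-IDF matrix, so the value should be tuned to the specific problem and dataset.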
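One rough heuristic for this choice is to look at the cumulative explained variance ratio and pick the smallest number of components that reaches a target. The sketch below assumes a target of 0.8 and an upper bound of 4 components, both arbitrary values chosen for this five-document example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "Natural language processing is a field of artificial intelligence.",
    "Text analysis involves processing and understanding written language.",
    "Machine learning algorithms are used in natural language processing.",
    "Topic modeling is a technique in text analysis.",
    "Latent semantic analysis is a type of topic modeling.",
]
X = TfidfVectorizer().fit_transform(documents)

# Fit with more components than we expect to keep (4 is an assumed upper bound)
lsa = TruncatedSVD(n_components=4, random_state=0)
lsa.fit(X)

cumulative = np.cumsum(lsa.explained_variance_ratio_)
for k, value in enumerate(cumulative, start=1):
    print(f"{k} component(s): {value:.3f} of the variance explained")

threshold = 0.8  # assumed target, adjust for your data
reached = cumulative >= threshold
# Smallest k that reaches the threshold, or all components if none does
best_k = int(np.argmax(reached)) + 1 if reached.any() else len(cumulative)
print("Suggested n_components:", best_k)
```

On real corpora the number of components is usually much larger than here, and the final choice is better validated against the downstream task than against explained variance alone.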
