An Introduction to UMAP with Code Examples

Installation

pip install umap-learn
pip install umap-learn[plot]

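UMAP includes a subpackage, umap.plot, for plotting the results of UMAP embeddings. It has to be imported separately because it has extra dependencies (matplotlib, datashader, and holoviews). It allows quick and simple plotting and tries to make sensible choices to avoid overplotting and other pitfalls.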

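Basic concepts:

Uniform Manifold Approximation and Projection (UMAP)
**Manifold:** A manifold is a space that locally has the properties of Euclidean space. Examples include curves and surfaces of any dimension, such as a sphere or a curved plane. Locally a manifold is homeomorphic to Euclidean space, so a small neighborhood can be treated as Euclidean, which makes it easier to study.
**Riemannian manifold:** A differentiable manifold on which a Euclidean inner product is assigned to the tangent space at every point in a smooth way.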
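To make "locally Euclidean" concrete, here is a minimal sketch that is only an illustration and not part of UMAP itself. It compares the straight-line (chord) distance between two points on a unit circle with the distance measured along the circle (arc). The helper name `chord_to_arc_ratio` is hypothetical.

```python
import math

def chord_to_arc_ratio(angle):
    """Ratio of Euclidean (chord) distance to on-manifold (arc) distance
    for two points on the unit circle separated by `angle` radians."""
    chord = 2.0 * math.sin(angle / 2.0)  # straight line through the plane
    arc = angle                          # distance measured along the circle
    return chord / arc

for angle in (0.01, 0.1, 1.0, math.pi):
    print(f"angle={angle:.2f}  chord/arc={chord_to_arc_ratio(angle):.4f}")
```

For nearby points the ratio is very close to 1, so Euclidean distance is a good stand-in for distance on the manifold. For points on opposite sides of the circle the two distances clearly disagree.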

Differences from PCA and t-SNE:

https://pair-code.github.io/understanding-umap/

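The algorithm is based on three assumptions about the data:

  1. The data is uniformly distributed on a Riemannian manifold;
  2. The Riemannian metric is locally constant (or can be approximated as such);
  3. The manifold is locally connected.

UMAP can be divided into two main steps:

  1. Learn the manifold structure in the high-dimensional space;
  2. Find a low-dimensional representation of that manifold.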

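Step 1: Learning the manifold structure
1. Find the nearest neighbors: the Nearest-Neighbor-Descent algorithm
**Hyperparameter:** n_neighbors specifies how many nearest neighbors to use.
A small n_neighbors value asks for a very local interpretation that accurately captures fine detail in the structure. A larger n_neighbors value bases the estimate on larger regions, so it is more broadly accurate across the manifold as a whole, at the cost of fine detail.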
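As a rough illustration of what this step produces, the sketch below finds the k nearest neighbors of every point by brute force. This is a deliberate simplification: UMAP uses the approximate Nearest-Neighbor-Descent algorithm rather than an exhaustive search. The function name `knn_brute_force` and the toy data are assumptions made for this example.

```python
import math

def knn_brute_force(points, k):
    """Return, for each point, a list of (distance, index) pairs for its
    k nearest neighbors (excluding the point itself), closest first."""
    result = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append(dists[:k])
    return result

# Two well-separated groups of points on a line
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
for k in (1, 3):
    neighbors = knn_brute_force(points, k)
    print(k, [[j for _, j in row] for row in neighbors])
```

With k=1 every neighborhood stays inside its own group. With k=3 the neighborhoods reach across the gap, which is a small-scale version of the local versus global trade-off controlled by n_neighbors.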

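2. Build a graph: connect each point to the nearest neighbors found in the previous step.
**Hyperparameter:** local_connectivity (default = 1) means that every point in the high-dimensional space is assumed to be connected to at least one other point.

One way to understand these two parameters is as a lower bound and an upper bound:
local_connectivity (default 1): each point is connected to at least one other point with 100% certainty (the lower bound on the number of connections).
n_neighbors (default 15): the chance that a point is directly connected to its 16th or any more distant neighbor is 0%, because those points fall outside the local region UMAP uses when building the graph.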
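The lower and upper bounds can be seen in a simplified sketch of how edge weights are assigned, based on the description in the UMAP paper. It assumes local_connectivity = 1 and one point's sorted neighbor distances as input. The function names `membership_weights` and `fuzzy_union` are hypothetical, and the real umap-learn implementation handles edge cases (zero distances, non-integer local_connectivity) that this sketch ignores.

```python
import math

def membership_weights(knn_dists, n_iter=64):
    """Turn one point's sorted k-NN distances into edge weights in (0, 1].

    rho is the distance to the nearest neighbor (local_connectivity = 1),
    so that neighbor always gets weight 1. sigma is found by bisection so
    that the weights sum to log2(k).
    """
    k = len(knn_dists)
    rho = knn_dists[0]
    target = math.log2(k)

    def weights(sigma):
        return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_dists]

    lo, hi = 1e-12, 1.0
    while sum(weights(hi)) < target:   # grow the bracket until it contains sigma
        hi *= 2.0
    for _ in range(n_iter):            # bisection on sigma
        mid = (lo + hi) / 2.0
        if sum(weights(mid)) < target:
            lo = mid
        else:
            hi = mid
    return weights(hi)

def fuzzy_union(w_ij, w_ji):
    """Symmetrize two directed weights (the set_op_mix_ratio = 1.0 case)."""
    return w_ij + w_ji - w_ij * w_ji

w = membership_weights([0.5, 1.0, 2.0, 4.0])
print([round(x, 3) for x in w])
```

The nearest neighbor always receives weight 1 (the guaranteed connection), the weights decay with distance, and any point outside the k supplied neighbors receives no edge at all.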

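Step 2: Finding the low-dimensional representation
Hyperparameter: min_dist (default = 0.1) sets the minimum distance between embedded points.
Cross-entropy: UMAP looks for the low-dimensional layout whose edge weights best match those of the high-dimensional graph. The best match is obtained by minimizing the cross-entropy between the two sets of edge weights, and this minimization can be carried out with stochastic gradient descent.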
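The objective can be sketched as follows, using the fuzzy-set cross-entropy and the low-dimensional similarity curve 1 / (1 + a * d^(2b)) described in the UMAP paper. The values a = 1 and b = 1 are placeholders chosen for readability (UMAP derives them from min_dist and spread), and the function names are hypothetical. The sketch only evaluates the loss; it does not perform the stochastic gradient descent.

```python
import math

def low_dim_weight(d, a=1.0, b=1.0):
    """Similarity of two embedded points at distance d: 1 / (1 + a * d^(2b))."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def fuzzy_cross_entropy(v, w, eps=1e-12):
    """Cross-entropy between high-dimensional edge weights v and
    low-dimensional edge weights w (equal-length lists of values in [0, 1])."""
    total = 0.0
    for v_e, w_e in zip(v, w):
        w_e = min(max(w_e, eps), 1.0 - eps)  # keep the logarithms finite
        if v_e > 0.0:
            # attractive term: penalizes strong edges that are drawn too long
            total += v_e * math.log(v_e / w_e)
        if v_e < 1.0:
            # repulsive term: penalizes weak edges that are drawn too short
            total += (1.0 - v_e) * math.log((1.0 - v_e) / (1.0 - w_e))
    return total

v = [0.9, 0.1]                                      # one strong edge, one weak edge
good = [low_dim_weight(0.3), low_dim_weight(3.0)]   # strong edge short, weak edge long
bad = [low_dim_weight(3.0), low_dim_weight(0.3)]    # the reverse layout
print(fuzzy_cross_entropy(v, good), fuzzy_cross_entropy(v, bad))
```

A layout that keeps strongly connected points close and weakly connected points apart yields a much lower loss than the reversed layout, which is the direction the optimizer pushes the embedding.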

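At this point UMAP is done, and the result is an array containing the coordinates of every data point in the specified low-dimensional space.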

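Example 1:

Separate handwritten digits with UMAP and display them in two dimensions (the code below uses scikit-learn's Digits dataset, a small MNIST-style dataset):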

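import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # 8x8 handwritten digit images and labels
reducer = umap.UMAP(random_state=42)
X_trans = reducer.fit_transform(X)
print(X_trans.shape)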

Plotting

import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.datasets import load_digits

digits = load_digits()
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(digits.data)
print(embedding.shape)

# Color each embedded point by its digit label (0-9)
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset')
plt.show()

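Parameter settings

n_components
Controls the number of dimensions after projection; the default is 2. When the data has many features, 2D may not be enough to preserve the underlying topology. Try values between 2 and 20 in steps of 5 and evaluate a baseline model on each embedding to see how accuracy changes.
n_neighbors
Determines the number of neighboring points used in the local approximation of the manifold structure. Larger values preserve more global structure at the expense of detailed local structure. In general this parameter should be between 5 and 50, with 10 to 15 being a reasonable default.
min_dist
Controls how tightly the embedding is allowed to pack points together. Larger values make the embedded points more evenly spread out, while smaller values let the algorithm optimize more accurately for local structure. Sensible values range from 0.001 to 0.5, with 0.1 being a reasonable default.
metric
The formula used to compute the distance between points; the default is euclidean. This determines how distance is measured in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.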
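To see how min_dist and spread shape the embedding, the sketch below evaluates the piecewise target curve that, as far as I understand umap-learn, the low-dimensional similarity parameters a and b are fitted to: similarity 1 up to min_dist, then exponential decay on the scale of spread. The function name `target_curve` is hypothetical and the curve fitting itself is left out.

```python
import math

def target_curve(d, min_dist=0.1, spread=1.0):
    """Desired similarity of two embedded points at distance d:
    1 up to min_dist, then exponential decay on the scale of `spread`."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

for min_dist in (0.001, 0.1, 0.5):
    row = [round(target_curve(d, min_dist=min_dist), 3) for d in (0.05, 0.25, 1.0)]
    print(min_dist, row)
```

With a small min_dist the similarity starts to drop almost immediately, so only points that are packed very tightly count as fully similar. With a larger min_dist, points can sit further apart and still count as fully similar, which leads to a more evenly spread embedding.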

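UMAP can consume a lot of memory, especially while fitting and while building graphs such as the connectivity graph. If memory is a problem, set low_memory to True.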

import umap

reducer = umap.UMAP(
    n_neighbors=100,  # default 15, The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
    n_components=3,  # default 2, The dimension of the space to embed into.
    metric='euclidean',  # default 'euclidean', The metric to use to compute distances in high dimensional space.
    n_epochs=1000,  # default None, The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
    learning_rate=1.0,  # default 1.0, The initial learning rate for the embedding optimization.
    init='spectral',  # default 'spectral', How to initialize the low dimensional embedding. Options are: {'spectral', 'random', A numpy array of initial embedding positions}.
    min_dist=0.1,  # default 0.1, The effective minimum distance between embedded points.
    spread=1.0,  # default 1.0, The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
    low_memory=False,  # default False in older releases (recent umap-learn versions default to True), For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
    set_op_mix_ratio=1.0,  # default 1.0, The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
    local_connectivity=1,  # default 1, The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
    repulsion_strength=1.0,  # default 1.0, Weighting applied to negative samples in low dimensional embedding optimization.
    negative_sample_rate=5,  # default 5, Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
    transform_queue_size=4.0,  # default 4.0, Larger values will result in slower performance but more accurate nearest neighbor evaluation.
    a=None,  # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    b=None,  # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    random_state=42,  # default None, If int, random_state is the seed used by the random number generator.
    metric_kwds=None,  # default None, Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
    angular_rp_forest=False,  # default False, Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
    target_n_neighbors=-1,  # default -1, The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the ``n_neighbors`` value.
    # target_metric='categorical',  # default 'categorical', The metric used to measure distance for a target array when using supervised dimension reduction. By default this is 'categorical' which will measure distance in terms of whether categories match or are different.
    # target_metric_kwds=None,  # dict, default None, Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
    # target_weight=0.5,  # default 0.5, Weighting factor between data topology and target topology.
    transform_seed=42,  # default 42, Random seed used for the stochastic aspects of the transform operation.
    verbose=False,  # default False, Controls verbosity of logging.
    unique=False,  # default False, Controls if the rows of your data should be uniqued before being embedded.
)

Drawing a 3D plot with plotly

import numpy as np
import pandas as pd
import plotly.express as px

def chart_plotly(X, y):
    # --------------------------------------------------------------------------#
    # This section is not mandatory as its purpose is to sort the data by label
    # so, we can maintain consistent colors for digits across multiple graphs

    # Concatenate X and y arrays
    arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
    # Create a Pandas dataframe using the above array
    df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
    # Convert label data type from float to integer
    df['label'] = df['label'].astype(int)
    # Finally, sort the dataframe by label
    df.sort_values(by='label', axis=0, ascending=True, inplace=True)
    # --------------------------------------------------------------------------#

    # Create a 3D graph
    fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)

    # Update chart looks
    fig.update_layout(title_text='UMAP',
                      showlegend=True,
                      legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
                      scene_camera=dict(up=dict(x=0, y=0, z=1),
                                        center=dict(x=0, y=0, z=-0.1),
                                        eye=dict(x=1.5, y=-1.4, z=0.5)),
                      margin=dict(l=0, r=0, b=0, t=0),
                      scene=dict(xaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 yaxis=dict(backgroundcolor='white',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            ),
                                 zaxis=dict(backgroundcolor='lightgrey',
                                            color='black',
                                            gridcolor='#f0f0f0',
                                            title_font=dict(size=10),
                                            tickfont=dict(size=10),
                                            )))
    # Update marker size
    fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))

    fig.show()

# Use a reducer created with n_components=3, e.g. umap.UMAP(n_components=3, random_state=42)
X_trans = reducer.fit_transform(X)
# Check the shape of the new data
print('Shape of X_trans: ', X_trans.shape)
chart_plotly(X_trans, y)