Scientific Plotting: the t-SNE Figure

t-SNE (t-Distributed Stochastic Neighbor Embedding) is an algorithm for dimensionality reduction and visualization. It maps high-dimensional data into two or three dimensions while preserving the local relationships between data points as much as possible. t-SNE is particularly well suited to exploring the internal structure and patterns of a dataset, and is commonly used for cluster analysis and for discovering groupings in the data.

In machine learning and data analysis, t-SNE plots are typically used to show how data points are distributed after dimensionality reduction, making it easy to inspect similarities and differences between points. From a t-SNE plot you can tell at a glance whether the points form distinct clusters or show a tendency to separate, which helps in understanding the structure and characteristics of the data.
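Before the two case studies, here is a minimal, self-contained sketch of the idea: embed two synthetic 50-dimensional Gaussian clusters into 2-D with scikit-learn's TSNE and check the output shape. The cluster sizes and parameter values here are illustrative, not taken from the case studies below.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated 50-dimensional Gaussian clusters, 30 points each
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
cluster_b = rng.normal(loc=8.0, scale=1.0, size=(30, 50))
X = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples (60 here)
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (60, 2)
```

A scatter plot of `X_2d` would show the two clusters as two well-separated groups of points, which is exactly how the figures below are read.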

Below are two concrete examples of producing t-SNE figures in Python.

 

Table of Contents

1. Plotting a t-SNE Figure for the RTE Dataset

1.1 Loading the Data

1.2 Loading the Model and Extracting Features

1.3 Saving the Features

1.4 Drawing the t-SNE Figure

2. Plotting a t-SNE Figure for the HWU Dataset

2.1 Loading the Data

2.2 Loading the Model and Saving the Feature Files

2.3 Drawing the t-SNE Figure


 

1. Plotting a t-SNE Figure for the RTE Dataset

1.1 Loading the Data

from transformers import RobertaTokenizer, RobertaModel
import torch
import numpy as np
import pandas as pd
import random
# Read the TSV file
df = pd.read_csv("RTE/train.tsv", sep="\t", header=None)  # replace with your file path
# Randomly select 500 sentences
# sentences = df.sample(5000)[1]  # the sentence is in column 2
random_samples = df.sample(500)

# Concatenate the sentences from columns 2 and 3, separated by a space
sentences = random_samples[1] + " " + random_samples[2]

Printed output of sentences:


129     The company, whose registered auditors are Del...
791     So far Sony BMG has not released a list of how...
1684    Analysts expected the company to earn $1.42 a ...
109     The University has also apologized for the inc...
1723    The nation got its long-awaited first look at ...
                              ...                        
2039    The International Olympic Committee has leapt ...
1381    Martha Stewart, 64, is back, after serving fiv...
570     President George W. Bush, who gave the keynote...
1519    A former philosophy teacher whose best-known n...
1318    Alan Mulally, Boeing's head of the unit, said ...
Length: 500, dtype: object

Note: run the data-loading step above only once when producing the t-SNE figure, so that the same randomly sampled sentences are fed to every model variant.
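One way to make that guarantee mechanical is to pin `df.sample` with a fixed `random_state`, so re-running the script always selects the same 500 rows. A sketch with a stand-in DataFrame (the column contents and the seed value 42 are arbitrary choices, not from the original data):

```python
import pandas as pd

# Stand-in for the RTE train.tsv DataFrame (two sentence columns)
df = pd.DataFrame({1: [f"premise {i}" for i in range(1000)],
                   2: [f"hypothesis {i}" for i in range(1000)]})

# A fixed random_state makes the sample reproducible across runs
random_samples = df.sample(500, random_state=42)
sentences = random_samples[1] + " " + random_samples[2]

# The same seed always selects the same rows
assert df.sample(500, random_state=42).index.equals(random_samples.index)
```

With a fixed seed there is no need to be careful about running the cell only once; every run embeds the identical sample.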

1.2 Loading the Model and Extracting Features

# Define the model
model = RobertaModel.from_pretrained("/roberta-large")     # load the pre-trained weights
# model.load_state_dict(torch.load('CoLA68.9_best_model_Original.pt'), strict=False) # PLM + softmax fine-tuning
# model.load_state_dict(torch.load('CoLA72.9_best_model_calculate_weight_diff.pt'), strict=False) # ours impl

# model.load_state_dict(torch.load('RTE_best_model_original.pt'), strict=False) # PLM + softmax fine-tuning
model.load_state_dict(torch.load('RTE89.3_calculate_weight_diff_model.pt'), strict=False) # ours impl

tokenizer = RobertaTokenizer.from_pretrained("/roberta-large")

# Extract the input features
input_feature_list = list()

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding='max_length', truncation=True, max_length=128)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    input_feature = model(input_ids, attention_mask)[1]    # pooled output, shape (1, 1024)
    input_feature_list.append(input_feature.detach().cpu().numpy())

# Stack the input features
orign = np.vstack(input_feature_list)

orign.shape 


(500, 1024)
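The loop above calls the model one sentence at a time and without `torch.no_grad()`, so it builds an autograd graph that is never used. A hedged sketch of a batched, gradient-free variant of the same loop, using an `nn.Linear` as a lightweight stand-in for RoBERTa's pooled output (substitute the real model and tokenizer from above):

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(128, 1024)        # stand-in for RobertaModel's pooled output
dummy_inputs = torch.randn(500, 128)  # stand-in for the 500 tokenized sentences

features = []
batch_size = 32
with torch.no_grad():                 # feature extraction needs no gradients
    for start in range(0, len(dummy_inputs), batch_size):
        batch = dummy_inputs[start:start + batch_size]
        features.append(encoder(batch).cpu().numpy())

orign = np.vstack(features)
print(orign.shape)  # (500, 1024)
```

Batching and `torch.no_grad()` cut both runtime and memory; the stacked result has the same (500, 1024) shape either way.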

1.3 Saving the Features

# np.save('npy/rte_orign_roberta_1.npy',orign)
# np.save('npy/rte_orign_roberta_2.npy',orign)
np.save('npy/rte_roberta_AgDn_500.npy', orign)

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns

# For standardising the data
from sklearn.preprocessing import StandardScaler

# t-SNE
from sklearn.manifold import TSNE

# roberta_data = np.load('npy/cola_orign_roberta.npy')
roberta_data1 = np.load('npy/rte_roberta_FT_500.npy')
roberta_data2 = np.load('npy/rte_roberta_PT_500.npy')
roberta_data3 = np.load('npy/rte_roberta_AgDn_500.npy')

# Combine all the data so t-SNE can process it jointly
combined_data = np.vstack([roberta_data1, roberta_data2, roberta_data3])

labels1 = [f'n{i}' for i in range(1, len(roberta_data1) + 1)]
labels2 = [f'n{i}' for i in range(1, len(roberta_data2) + 1)]
labels3 = [f'n{i}' for i in range(1, len(roberta_data3) + 1)]

combined_labels = labels1 + labels2 + labels3

Note: depending on your needs, save several .npy files (one per model variant); they are reloaded for the t-SNE plotting below.
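The .npy round-trip is lossless, so the features reloaded for plotting are bit-identical to what was saved. A tiny sketch (using a temporary directory and a hypothetical file name rather than the npy/ directory above):

```python
import os
import tempfile
import numpy as np

features = np.random.rand(500, 1024).astype(np.float32)  # stand-in for `orign`
path = os.path.join(tempfile.mkdtemp(), "demo_features.npy")

np.save(path, features)  # one file per model variant in the real workflow
loaded = np.load(path)

assert np.array_equal(features, loaded)  # lossless round-trip
```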

Preview of combined_data:


array([[-0.47165307,  0.5584102 , -0.3025265 , ...,  0.21222334,
         0.41210824, -0.48724136],
       [-0.433249  ,  0.45918941, -0.37311822, ...,  0.17242172,
         0.30651495, -0.53964025],
       [-0.6322939 ,  0.36898744, -0.20495233, ..., -0.16366315,
         0.45747212, -0.2154128 ],
       ...,
       [-0.05081458,  0.4187796 ,  0.19485527, ..., -0.8376309 ,
         0.20942022,  0.57146496],
       [ 0.70825064, -0.51524955,  0.516124  , ..., -0.53912795,
         0.7632115 , -0.71091646],
       [-0.10236336,  0.4357162 ,  0.01013089, ..., -0.78527933,
         0.20695496,  0.5769015 ]], dtype=float32)

len(combined_data): 1500

1.4 Drawing the t-SNE Figure

# Apply t-SNE
tsne = TSNE(n_components=2,
            random_state=42,
            perplexity=19.5,
            n_iter=550,
            early_exaggeration=350)

tsne_results = tsne.fit_transform(combined_data)

# Set the plotting style
# sns.set(style='white', context='notebook', rc={'figure.figsize': (5, 5), 'axes.edgecolor': 'gray', 'axes.linewidth': 0.5})
# Set the plotting style, removing the axes frame entirely
sns.set(style='white', context='notebook', rc={'axes.edgecolor': 'none', 'axes.linewidth': 0})

# Create a figure without a border
plt.figure(figsize=(5, 5), edgecolor='none')  # edgecolor='none' removes the figure border

# Choose a different color for each group
colors = ['red', 'blue', 'green']

# Group 1: pre-trained PLM -- red
plt.scatter(tsne_results[:500, 0], tsne_results[:500, 1], c=colors[0], label='Pre-Training', marker='o', s=5)

# Group 2: PLM + softmax fine-tuning -- blue
plt.scatter(tsne_results[500:1000, 0], tsne_results[500:1000, 1], c=colors[1], label='Fine-Tuning', marker='^', s=5)

# Group 3: our implementation (AgDn) -- green
plt.scatter(tsne_results[1000:, 0], tsne_results[1000:, 1], c=colors[2], label='AgDn', marker='s', s=5)


# Customize the legend border
legend = plt.legend(fontsize=8, loc='upper left', bbox_to_anchor=(0.0, 1.0), frameon=True, fancybox=True, edgecolor='black')
# legend = plt.legend(fontsize=10, loc='upper left',  frameon=True, fancybox=True, edgecolor='black')

# Get the legend frame and adjust its line width
frame = legend.get_frame()
frame.set_edgecolor('black')  # make sure the border is black
frame.set_linewidth(0.2)      # set the border line width to 0.2

# Equal aspect ratio; hide the tick labels
plt.gca().set_aspect('equal', 'datalim')
plt.tick_params(axis='both', which='both', labelbottom=False, labelleft=False)
# Add the title
# plt.title('RTE dataset', fontsize=12)
plt.suptitle('RTE dataset', fontsize=12, y=0.02)
plt.savefig('npy/tsne_RTE.png', dpi=300)
# Show the figure
plt.show()

 

2. Plotting a t-SNE Figure for the HWU Dataset

2.1 Loading the Data

from transformers import RobertaTokenizer, RobertaModel
import torch
import numpy as np
import pandas as pd
import random
# Read the TSV file
df = pd.read_csv("HWU/HWU_train_data.tsv", sep="\t", header=None).dropna()  # replace with your file path
# Randomly select 500 sentences
# sentences = df.sample(500)[1]  # the sentence is in column 2
random_samples = df.sample(500)

# The sentence is in the first column
sentences = random_samples[0]

 


15261                      Does Vocelli's do takeaway?
19685      play track one from my david bowie playlist
10380              yeap excellent response to command.
18334                                     that is all.
6228                               that's all, cancel.
                             ...                      
495                           Please remove this alarm
4901                    can you read me news on Trump?
5124                       that is not so clear to me.
10647                 Set my alarm for Tuesday at 6pm.
5639     I am no longer going to Wyatts birthday party
Name: 0, Length: 500, dtype: object

2.2 Loading the Model and Saving the Feature Files

# Define the model
model = RobertaModel.from_pretrained("./Roberta-large/")     # load the pre-trained weights
# model.load_state_dict(torch.load('CoLA68.9_best_model_Original.pt'), strict=False) # PLM + softmax fine-tuning
# model.load_state_dict(torch.load('CoLA72.9_best_model_calculate_weight_diff.pt'), strict=False) # ours impl

# model.load_state_dict(torch.load('RTE_best_model_original.pt'), strict=False) # PLM + softmax fine-tuning
# model.load_state_dict(torch.load('RTE89.3_calculate_weight_diff_model.pt'), strict=False) # ours impl

model.load_state_dict(torch.load('HWU_Original_model.pt'), strict=False) # PLM + softmax fine-tuning
# model.load_state_dict(torch.load('HWU_AgDn_model.pt'), strict=False) # ours impl
# model.load_state_dict(torch.load('HWU_wacl_model.pt'), strict=False) # ours impl

tokenizer = RobertaTokenizer.from_pretrained("./Roberta-large/")

# Extract the input features
input_feature_list = list()

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding='max_length', truncation=True, max_length=64)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    input_feature = model(input_ids, attention_mask)[1]
    input_feature_list.append(input_feature.detach().cpu().numpy())

# Stack the input features and save them
orign = np.vstack(input_feature_list)
# np.save('npy/HWU_roberta_FT_500.npy',orign)
np.save('npy/HWU_roberta_PT_500.npy', orign)
# np.save('npy/HWU_roberta_wacl_500.npy',orign)

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns

# For standardising the data
from sklearn.preprocessing import StandardScaler

# t-SNE
from sklearn.manifold import TSNE

# roberta_data = np.load('npy/cola_orign_roberta.npy')
roberta_data1 = np.load('npy/HWU_roberta_FT_500.npy')
roberta_data2 = np.load('npy/HWU_roberta_PT_500.npy')
# roberta_data3 = np.load('npy/HWU_roberta_AgDn_500.npy')
roberta_data3 = np.load('npy/HWU_roberta_wacl_500.npy')

# Combine all the data so t-SNE can process it jointly
combined_data = np.vstack([roberta_data1, roberta_data2, roberta_data3])

labels1 = [f'n{i}' for i in range(1, len(roberta_data1) + 1)]
labels2 = [f'n{i}' for i in range(1, len(roberta_data2) + 1)]
labels3 = [f'n{i}' for i in range(1, len(roberta_data3) + 1)]

combined_labels = labels1 + labels2 + labels3

2.3 Drawing the t-SNE Figure

# Apply t-SNE
tsne = TSNE(n_components=2,         # reduce the data to 2 dimensions
            random_state=42,        # fix the random seed for reproducibility
            perplexity=15,          # balances local vs. global structure; lower values tend to split the data into more clusters
            n_iter=380,             # number of iterations, trading optimization quality for time
            early_exaggeration=50)  # inflates inter-cluster distances early on, helping preserve local structure in the first iterations

tsne_results = tsne.fit_transform(combined_data)

# Set the plotting style, removing the axes frame entirely
sns.set(style='white', context='notebook', rc={'axes.edgecolor': 'none', 'axes.linewidth': 0})

# Create a figure without a border
plt.figure(figsize=(6, 6), edgecolor='none')  # edgecolor='none' removes the figure border

# Choose a different color for each group
colors = ['red', 'blue', 'green']

# Group 1: pre-trained PLM -- red
plt.scatter(tsne_results[:500, 0], tsne_results[:500, 1], c=colors[0], label='Pre-Training', marker='o', s=5)

# Group 2: PLM + softmax fine-tuning -- blue
plt.scatter(tsne_results[500:1000, 0], tsne_results[500:1000, 1], c=colors[1], label='Fine-Tuning', marker='^', s=5)

# Group 3: our implementation (AgDn) -- green
plt.scatter(tsne_results[1000:, 0], tsne_results[1000:, 1], c=colors[2], label='AgDn', marker='s', s=5)

# Customize the legend border
# legend = plt.legend(fontsize=8, loc='upper left', bbox_to_anchor=(0.0, 1.0), frameon=True, fancybox=True, edgecolor='black')
legend = plt.legend(fontsize=8, loc='upper left',  frameon=True, fancybox=True, edgecolor='black')

# Get the legend frame and adjust its line width
frame = legend.get_frame()
frame.set_edgecolor('black')  # make sure the border is black
frame.set_linewidth(0.2)      # set the border line width to 0.2

# Equal aspect ratio; hide the tick labels
plt.gca().set_aspect('equal', 'datalim')
plt.tick_params(axis='both', which='both', labelbottom=False, labelleft=False)

# Add the title
plt.suptitle('HWU dataset', fontsize=15, y=0.02)

# Adjust the layout so the saved image is not clipped
plt.tight_layout()

# Save the image; a higher DPI gives better quality
# plt.savefig('npy/tsne_HWU_2.png', dpi=300)

# Show the figure
plt.show()

 
