1. Introduction
The difference between feature learning and feature transformation lies in the parametric assumptions made when creating new features. This section covers feature learning from the following angles:
- Parametric assumptions about the data
- Restricted Boltzmann machines (RBM)
- The Bernoulli RBM (BernoulliRBM)
- Extracting RBM features from MNIST
- Using RBMs in a machine learning pipeline
- Learning text features: word vectors
2. Fundamentals
2.1 Parametric Assumptions About the Data
A parametric assumption is a basic assumption an algorithm makes about the shape of the data. When we explored PCA, we saw that its output is a set of principal components that transform the data through matrix multiplication. The assumption was that the original data can be decomposed (an eigenvalue decomposition) and represented by a single linear transformation (a matrix operation). But what if that assumption does not hold? PCA and LDA are both built on predetermined equations and produce exactly the same features every time they run, which is why we treat them both as linear transformations.
Feature learning algorithms try to drop this parametric assumption and thereby get around the problem. They make no assumptions about the shape of the input data and instead rely on stochastic learning. Rather than producing the same output on every run, they examine the data points epoch by epoch to find the best features to extract, and converge to a solution (which may differ from run to run).
Note that a non-parametric model does not mean the model makes no assumptions at all about the data during training.
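To see this distinction concretely, here is a minimal sketch (the toy data and component counts are made up for illustration) contrasting deterministic PCA with a stochastic learner such as the RBM introduced below: fitting PCA twice gives identical components, while RBMs fitted with different random seeds generally learn different components.

# Deterministic PCA vs. stochastic RBM (toy binary data, illustrative only)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
X = (rng.rand(100, 10) > 0.5).astype(float)  # hypothetical binary dataset

# PCA is a fixed linear transformation: two fits on the same data agree exactly
pca_a = PCA(n_components=3).fit(X)
pca_b = PCA(n_components=3).fit(X)
print(np.allclose(pca_a.components_, pca_b.components_))  # expected: True

# An RBM learns stochastically: different seeds generally give different components
rbm_a = BernoulliRBM(n_components=3, random_state=0).fit(X)
rbm_b = BernoulliRBM(n_components=3, random_state=1).fit(X)
print(np.allclose(rbm_a.components_, rbm_b.components_))  # generally False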
2.2 Restricted Boltzmann Machines
An RBM is an unsupervised feature learning algorithm that uses a probabilistic model to learn new features. Like PCA and LDA, RBMs can extract a new feature set from raw data to augment a machine learning pipeline. Linear models (linear regression, logistic regression, perceptrons, and so on) tend to perform best on features extracted by an RBM.
Put simply, an RBM is a two-layer neural network. The first layer is the visible layer, with as many nodes as there are input features; the second layer is the hidden layer, whose node count is chosen by hand and represents the number of features we want to learn. An RBM can learn fewer or more features than the original input has.
The restriction is that no communication is allowed within a layer. Each node therefore develops its weights and bias independently, and the hope is that the resulting features are independent as well.
# A single data point
import numpy as np
import math

# The sigmoid activation function
def activation(x):
    return 1 / (1 + math.exp(-x))

inputs = np.array([1, 2, 3, 4])
weights = np.array([0.2, 0.324, 0.1, .001])
bias = 1.5
a = activation(np.dot(inputs.T, weights) + bias)
print(a)
0.9341341524806636
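Each hidden node of an RBM computes this same kind of activation from the entire visible layer, just with its own weight vector and bias. The sketch below (the weight matrix and bias values are made up) computes a hidden layer of three nodes at once; notice that the nodes never consult one another, which is exactly the "restricted" structure described above.

# A whole hidden layer: each row of W holds one hidden node's weights (made-up values)
W = np.array([[0.2, 0.324, 0.1, 0.001],
              [-0.5, 0.1, 0.3, 0.2],
              [0.05, -0.2, 0.4, -0.1]])
hidden_bias = np.array([1.5, -0.3, 0.1])

# Vectorized sigmoid over the whole layer
def layer_activation(x):
    return 1 / (1 + np.exp(-x))

hidden_values = layer_activation(np.dot(W, inputs) + hidden_bias)
print(hidden_values)  # three independent activations, one per hidden node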
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model, datasets, metrics
# scikit-learn's RBM
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
imgs_path = '/home/kesci/input/mnist9118/mnist_train.csv'
images = np.genfromtxt(imgs_path,delimiter=',')
# 6,000 images, 785 columns: 28 x 28 pixels + 1 label
images.shape
(6000, 785)
# Separate the features from the label column
images_X, images_y = images[:,1:], images[:,0]
# The values are large, but scikit-learn's RBM expects inputs scaled to the 0-1 range
np.min(images_X), np.max(images_X)
(0.0, 255.0)
plt.imshow(images_X[0].reshape(28, 28), cmap=plt.cm.gray_r)
images_y[0]
5.0
The only RBM implementation in scikit-learn is BernoulliRBM, which places a constraint on the range of the raw data. A Bernoulli distribution requires values between 0 and 1. The scikit-learn documentation states that the model assumes the inputs are either binary values or values between 0 and 1. This restriction means a node's value represents the probability that the node is activated, which allows the feature set to be learned more quickly.
We will modify the original dataset to consider only hard-coded black-and-white pixel intensities, so that each pixel value becomes 0 or 1 (white or black), which makes learning more robust. We do this in two steps:
- Scale the pixel values to between 0 and 1
- Set each value to true if it exceeds 0.5, and false otherwise
# Scale images_X to the 0-1 range
images_X = images_X / 255.
# Binarize the pixels (black or white)
images_X = (images_X > 0.5).astype(float)
np.min(images_X), np.max(images_X)
(0.0, 1.0)
# View the transformed digit 5
plt.imshow(images_X[0].reshape(28, 28), cmap=plt.cm.gray_r)
images_y[0]
5.0
2.3 Extracting PCA Components from MNIST
Before introducing the RBM, let's see what PCA does with this dataset.
from sklearn.decomposition import PCA
# 100 principal components ("eigen-digits")
pca = PCA(n_components=100)
pca.fit(images_X)
# Plot the 100 components
plt.figure(figsize=(10, 10))
for i, comp in enumerate(pca.components_):
    plt.subplot(10, 10, i + 1)
    plt.imshow(comp.reshape((28, 28)), cmap=plt.cm.gray_r)
    plt.xticks(())
    plt.yticks(())
plt.suptitle('100 components extracted by PCA')
plt.show()
# The first 30 components capture about 64% of the variance
pca.explained_variance_ratio_[:30].sum()
0.6374141412928034
# Scree plot of the cumulative explained variance
# PCA with all 784 components
full_pca = PCA(n_components=784)
full_pca.fit(images_X)
plt.plot(np.cumsum(full_pca.explained_variance_ratio_))
# About 100 components already cover roughly 90% of the variance
# Transform the first image with the fitted 100-component PCA object to extract 100 new features
pca.transform(images_X[:1])
# Now the same thing as an explicit matrix multiplication
np.dot(images_X[:1]-images_X.mean(axis=0), pca.components_.T)
array([[ 0.61090568, 1.36377972, 0.42170385, -2.19662828, -0.45181077, -1.320495 , 0.79434681, 0.30551117, 1.22978992, -0.72096718, 0.08168364, -1.91375605, -2.54647342, -1.62440748, 0.67107218, 0.15635569, 0.91831014, -0.18981947, 1.30140645, 1.57929175, 0.99052162, 0.11279707, 1.07343911, 0.70139728, -0.35907112, 0.16659764, 0.99307648, -0.73119403, 0.86974122, -0.18633666, -0.7250392 , 0.11264209, 0.16107565, 0.07307468, 0.11752422, -0.73010951, -0.29687482, 0.17337988, 0.29979024, 2.32445854, -0.20399058, -0.85348351, 0.67707697, 0.34738999, 0.33946718, -0.42206712, -0.20693081, 0.39358505, -0.3124686 , 0.3859772 , 0.06706821, 0.074536 , 0.63300683, 0.79854186, -0.41586582, 0.03372533, -0.17687751, 0.16532837, -0.52249017, -0.36700282, -0.39567849, -0.477975 , 0.49988619, 0.30935125, 0.61148885, 0.55619022, -0.64494891, 0.45321478, -0.23637851, -0.00662123, 0.14325621, 0.54515238, 0.52678601, 0.37044453, -0.31541273, 0.34044621, 0.77076222, 0.00492655, 0.87014935, -0.07367147, 0.17362486, 0.26993403, 0.13470299, -0.00564083, -0.31819501, 0.03795214, -0.27455859, 0.38342616, -0.58323348, -0.11355913, -0.29675462, -0.23951216, 0.01684053, 0.33190497, 0.21656397, -0.25877682, -0.0136806 , 0.75206504, 0.25004643, 0.07487138]])
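To confirm that the two results above really are the same computation, a quick check (a sketch reusing the fitted pca object and images_X from above) compares them with np.allclose:

# PCA.transform is exactly centering followed by a dot product with the components
manual = np.dot(images_X[:1] - images_X.mean(axis=0), pca.components_.T)
print(np.allclose(pca.transform(images_X[:1]), manual))  # expected: True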
2.4 Extracting RBM Features from MNIST
We set verbose to True to watch the training process, random_state to 0 to make the results reproducible, and the number of iterations n_iter to 20. As with PCA and LDA, n_components is the number of features we want to create; it can be any integer, smaller than, equal to, or larger than the number of original features.
# Instantiate a BernoulliRBM
rbm = BernoulliRBM(random_state=0, verbose=True, n_iter=20, n_components=100)
rbm.fit(images_X)
[BernoulliRBM] Iteration 1, pseudo-likelihood = -138.59, time = 12.98s
[BernoulliRBM] Iteration 2, pseudo-likelihood = -120.25, time = 15.79s
[BernoulliRBM] Iteration 3, pseudo-likelihood = -116.46, time = 13.71s
[BernoulliRBM] Iteration 4, pseudo-likelihood = -117.87, time = 17.40s
[BernoulliRBM] Iteration 5, pseudo-likelihood = -113.16, time = 16.31s
[BernoulliRBM] Iteration 6, pseudo-likelihood = -114.22, time = 14.30s
[BernoulliRBM] Iteration 7, pseudo-likelihood = -119.82, time = 15.30s
[BernoulliRBM] Iteration 8, pseudo-likelihood = -111.19, time = 17.10s
[BernoulliRBM] Iteration 9, pseudo-likelihood = -113.71, time = 13.50s
[BernoulliRBM] Iteration 10, pseudo-likelihood = -115.86, time = 14.51s
[BernoulliRBM] Iteration 11, pseudo-likelihood = -114.38, time = 15.90s
[BernoulliRBM] Iteration 12, pseudo-likelihood = -110.26, time = 14.10s
[BernoulliRBM] Iteration 13, pseudo-likelihood = -112.14, time = 14.40s
[BernoulliRBM] Iteration 14, pseudo-likelihood = -110.77, time = 14.01s
[BernoulliRBM] Iteration 15, pseudo-likelihood = -106.87, time = 16.50s
[BernoulliRBM] Iteration 16, pseudo-likelihood = -104.23, time = 14.30s
[BernoulliRBM] Iteration 17, pseudo-likelihood = -108.45, time = 15.29s
[BernoulliRBM] Iteration 18, pseudo-likelihood = -103.26, time = 14.90s
[BernoulliRBM] Iteration 19, pseudo-likelihood = -109.38, time = 16.89s
[BernoulliRBM] Iteration 20, pseudo-likelihood = -106.87, time = 13.70s
BernoulliRBM(batch_size=10, learning_rate=0.1, n_components=100, n_iter=20, random_state=0, verbose=True)
# The RBM also has a components_ attribute
len(rbm.components_)
100
Let's visualize the RBM components to see how they differ from the PCA "eigen-digits".
# Plot the RBM components
plt.figure(figsize=(10, 10))
for i, comp in enumerate(rbm.components_):
    plt.subplot(10, 10, i + 1)
    plt.imshow(comp.reshape((28, 28)), cmap=plt.cm.gray_r)
    plt.xticks(())
    plt.yticks(())
plt.suptitle('100 components extracted by RBM')
plt.show()
# Check how many unique components there are (via the mean of each component)
np.unique(rbm.components_.mean(axis=1)).shape
# (100,)
# Transform the digit 5 with the Boltzmann machine
image_new_features = rbm.transform(images_X[:1]).reshape(100,)
image_new_features
array([9.04806201e-15, 4.25261391e-14, 4.69087648e-05, 1.68428220e-02, 5.95051987e-18, 1.34782575e-14, 4.52512943e-17, 1.28015630e-09, 1.11125152e-20, 3.83648382e-09, 1.38021428e-08, 1.37172961e-06, 1.31685475e-26, 5.53977483e-14, 1.00000000e+00, 9.34069615e-02, 1.24033689e-14, 5.28600093e-08, 9.97293230e-01, 2.13254351e-09, 2.52626842e-09, 1.00000000e+00, 5.79273457e-17, 1.07430660e-02, 4.96027750e-17, 2.92280200e-17, 9.81245308e-01, 8.34198047e-01, 9.99928297e-01, 9.99999982e-01, 2.73320707e-07, 1.40508771e-08, 4.35774079e-16, 4.62801900e-09, 1.00000000e+00, 3.35616713e-22, 3.47339674e-17, 1.58520535e-08, 8.03244940e-01, 1.95555092e-17, 7.16742583e-17, 9.81163514e-01, 3.65995376e-11, 2.35462769e-09, 2.43220804e-06, 2.03318113e-06, 5.61256679e-13, 1.84163684e-24, 4.63236725e-06, 1.00000000e+00, 4.40340863e-24, 5.72529024e-21, 1.76111922e-15, 2.78984337e-13, 4.60583502e-23, 4.17534954e-11, 7.14833229e-02, 1.61164039e-16, 1.09626822e-06, 2.13767202e-02, 2.63478778e-05, 6.04435799e-11, 1.00000000e+00, 1.00000000e+00, 1.07632748e-07, 1.34491758e-14, 4.31547352e-12, 3.78797593e-09, 6.32884318e-07, 1.00000000e+00, 3.14742453e-11, 1.58234227e-15, 2.43038360e-22, 3.83152734e-13, 1.43106583e-14, 2.36047035e-11, 2.62844341e-20, 3.63141113e-01, 4.45892434e-02, 5.33948722e-04, 8.56653653e-09, 3.49230511e-11, 9.99999941e-01, 9.99999080e-01, 1.67476238e-25, 1.83981563e-09, 1.03922248e-13, 3.15661949e-17, 2.11101071e-27, 2.73325179e-14, 4.66370296e-14, 1.22410245e-17, 9.99999974e-01, 1.51586882e-13, 5.51350845e-02, 7.76781730e-10, 2.74845645e-14, 1.37149623e-23, 1.45293390e-13, 4.18920878e-15])
# This is not a simple matrix multiplication
# The features come from a neural-network architecture (several matrix operations)
np.dot(images_X[:1]-images_X.mean(axis=0), rbm.components_.T)
array([[ -7.84883708, -9.24901523, -0.24300729, 11.22097549, -6.36937942, -5.87977032, -3.34421256, -25.63801016, -18.41078569, -7.86321757, 2.56149134, 10.90528349, -28.27772895, -7.39843538, 11.07968836, 9.15194147, -2.13812893, -4.32515292, 13.01301527, -4.57295565, -6.04525171, 33.6646876 , -21.49420638, 6.44414599, -5.52527261, -1.26409558, 10.0739347 , 22.97793606, 25.06919405, 11.71771946, -2.52061473, -4.05601807, -23.32665781, 7.54916569, 34.78210224, -22.3332268 , -15.95673423, 14.24952729, 6.42067126, -47.87279723, -25.78985135, 24.04354395, -1.53611797, -5.70439573, 1.32972941, -1.90404979, -6.52857159, -29.96843601, -10.33692856, 36.29618049, -28.02664707, -18.12166495, -8.69766639, -13.21067521, -17.01153533, -0.06326304, 9.45917953, -18.03586765, -1.5379177 , 16.33113261, 0.4428705 , -1.17017591, 9.60074144, 10.98844452, 9.75663099, -7.46002834, -0.75971514, -2.4144897 , 4.78355954, 10.92039555, 5.2660994 , -11.36897506, -23.61334808, -11.21622455, -6.3203475 , -1.31611174, -18.05876116, 5.84813278, 7.47267173, 5.15278762, -0.72055053, -3.09949544, 17.41025048, 25.90890057, -22.6361083 , -2.14960101, -4.48803644, -5.52635608, -33.40471764, -2.65985589, -2.2121072 , -22.7636142 , 38.82026469, -25.28712325, 17.09279627, 7.8128556 , -3.2251757 , -20.72228169, -9.5593031 , -6.0955524 ]])
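Under the hood, the transform computes the conditional probability that each hidden node is on: a sigmoid of the weighted input plus the hidden bias. The sketch below (reusing the fitted rbm from above) reproduces scikit-learn's transform by hand with scipy's expit:

# BernoulliRBM.transform returns P(h = 1 | v) = sigmoid(v . components_^T + intercept_hidden_)
from scipy.special import expit  # the logistic sigmoid

manual_rbm_features = expit(np.dot(images_X[:1], rbm.components_.T) + rbm.intercept_hidden_)
print(np.allclose(manual_rbm_features, rbm.transform(images_X[:1])))  # expected: True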
# Extract the 20 most representative features for the first image (the digit 5)
# Note: argsort sorts in ascending order and returns the corresponding indices
top_features = image_new_features.argsort()[-20:][::-1]
print(top_features)
image_new_features[top_features]
[63 69 14 62 49 34 21 29 92 82 83 28 18 26 41 27 38 77 15 56]
array([1. , 1. , 1. , 1. , 1. , 1. , 1. , 0.99999998, 0.99999997, 0.99999994, 0.99999908, 0.9999283 , 0.99729323, 0.98124531, 0.98116351, 0.83419805, 0.80324494, 0.36314111, 0.09340696, 0.07148332])
# Plot the most representative RBM components (the new features)
plt.figure(figsize=(25, 25))
for i, comp in enumerate(top_features):
    plt.subplot(5, 4, i + 1)
    plt.imshow(rbm.components_[comp].reshape((28, 28)), cmap=plt.cm.gray_r)
    plt.title("Component {}, feature value: {}".format(comp, round(image_new_features[comp], 2)), fontsize=20)
plt.suptitle('Top 20 components extracted by RBM for first digit', fontsize=30)
plt.show()
# The least representative features
bottom_features = image_new_features.argsort()[:20]
plt.figure(figsize=(25, 25))
for i, comp in enumerate(bottom_features):
    plt.subplot(5, 4, i + 1)
    plt.imshow(rbm.components_[comp].reshape((28, 28)), cmap=plt.cm.gray_r)
    plt.title("Component {}, feature value: {}".format(comp, round(image_new_features[comp], 2)), fontsize=20)
plt.suptitle('Bottom 20 components extracted by RBM for first digit', fontsize=30)
plt.show()
2.5 Using RBMs in a Machine Learning Pipeline
We will create and run three pipelines:
- A logistic regression model on the raw pixel intensities
- Logistic regression on the PCA components
- Logistic regression on the RBM features

Each pipeline grid-searches over the number of components (for PCA and RBM) and over the logistic regression regularization parameter C.
A linear model on the raw pixel values
# Import logistic regression and grid search
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Create a logistic regression model
lr = LogisticRegression()
params = {'C': [1e-2, 1e-1, 1e0, 1e1, 1e2]}
# Instantiate the grid search
grid = GridSearchCV(lr, params)
# Fit the data
grid.fit(images_X, images_y)
# Best parameters and score
grid.best_params_, grid.best_score_
({'C': 0.1}, 0.8908333333333334)
A linear model on the extracted PCA components
# Extract features with PCA
lr = LogisticRegression()
pca = PCA()
params = {'clf__C': [1e-1, 1e0, 1e1],
          'pca__n_components': [10, 100, 200]}
# Create the pipeline
pipeline = Pipeline([('pca', pca), ('clf', lr)])
# Instantiate the grid search
grid = GridSearchCV(pipeline, params)
# Fit the data
grid.fit(images_X, images_y)
# Best parameters and score
grid.best_params_, grid.best_score_
({'clf__C': 1.0, 'pca__n_components': 100}, 0.8878333333333334)
A linear model on the extracted RBM features
# Learn features with an RBM
rbm = BernoulliRBM(random_state=0)
params = {'clf__C': [1e-1, 1e0, 1e1],
          'rbm__n_components': [100, 200]}
# The pipeline
pipeline = Pipeline([('rbm', rbm), ('clf', lr)])
# Grid search
grid = GridSearchCV(pipeline, params)
# Fit the data
grid.fit(images_X, images_y)
# Best parameters and score
grid.best_params_, grid.best_score_
({'clf__C': 1.0, 'rbm__n_components': 200}, 0.9191666666666667)
The best number of components found is 200, the top of the grid we searched, which suggests we could try extracting more than 200 features and perhaps get even better performance.
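As a possible follow-up (a sketch only; the larger component counts below are an assumption, and this search is computationally expensive), the grid could simply be widened:

# Hypothetical wider search over the number of RBM components (not run in the original text)
params = {'clf__C': [1e-1, 1e0, 1e1],
          'rbm__n_components': [200, 300, 400]}
pipeline = Pipeline([('rbm', BernoulliRBM(random_state=0)), ('clf', LogisticRegression())])
grid = GridSearchCV(pipeline, params)
# grid.fit(images_X, images_y)  # uncomment to run; expect long training times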
2.6 Learning Text Features: Word Vectors
When machines learn to read and write, they face one big problem: handling context.
Word embeddings
Word embeddings are one way to help machines understand context. A word embedding is a vectorization of a word in an n-dimensional feature space, where n represents the number of latent features of the word.
Things to keep in mind about word embeddings:
- Context varies from corpus to corpus, and so do word meanings, so a static word embedding is not necessarily the most useful one
- Word embeddings depend on the corpus they are learned from
# A toy example of word embeddings
king = np.array([.2, -.5, .7, .2, -.9])
man = np.array([-.5, .2, -.2, .3, 0.])
woman = np.array([.7, -.3, .3, .6, .1])
queen = np.array([ 1.4, -1. , 1.2, 0.5, -0.8])
np.array_equal((king - man + woman), queen)
# True
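In a real embedding space this exact equality almost never holds; similarity is usually measured with cosine similarity, which is also what gensim's most_similar ranks by. Below is a minimal sketch, reusing the toy vectors above:

# Cosine similarity between the analogy result and the 'queen' vector (toy values)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king - man + woman, queen))  # approximately 1.0 for these toy vectors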
Two approaches to word embeddings: Word2vec and GloVe
To learn and extract word embeddings, Word2vec trains another shallow neural network. This time, instead of feeding the data in wholesale, we deliberately construct the inputs so that the right embeddings are learned. As with the RBM, there is a visible input layer and a hidden layer. The input layer has one node per word in the vocabulary we want to learn, and the hidden layer has one node per feature we want to learn for each word. Note that the output layer has the same number of nodes as the input layer. The embedding model predicts neighboring words from the presence (or absence) of a reference word.
# Import the gensim package
import gensim
import logging
from gensim.models import word2vec, Word2Vec
# Logger
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Corpus
text8_path = '/home/kesci/input/text88918/text8'
sentences = word2vec.Text8Corpus(text8_path)
# Instantiate the gensim model
# min_count ignores words that occur fewer than this many times
# size is the number of dimensions to learn per word
model = gensim.models.Word2Vec(sentences, min_count=1, size=20)
# The embedding of a single word
model.wv['king']
array([-0.7017266, -1.0679547, 10.292624 , -2.2882755, -0.2178137, -1.3349842, -0.4876493, 6.348319 , -0.5251462, 1.4976662, -0.5753518, 2.4307735, -0.6643898, 4.0458083, 1.3584578, 1.605897 , 2.0763857, -1.5584726, 0.5769929, -1.3227795], dtype=float32)
# woman + king - man = queen
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
2024-09-25 09:08:02,059 : INFO : precomputing L2-norms of word weight vectors
[('emperor', 0.8965556621551514), ('prince', 0.8764557242393494), ('elector', 0.8737660646438599), ('pope', 0.8731228113174438), ('consul', 0.8638008832931519), ('empress', 0.8611356019973755), ('viii', 0.856552243232727), ('judah', 0.8555375933647156), ('throne', 0.8463213443756104), ('vetriano', 0.8377453684806824)]
# London is to England as Paris is to ____
model.wv.most_similar(positive=['Paris', 'England'], negative=['London'], topn=1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-34-25abb07ef297> in <module>
      1 # London is to England as Paris is to ____
----> 2 model.wv.most_similar(positive=['Paris', 'England'], negative=['London'], topn=1)

/opt/conda/lib/python3.6/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    550                 mean.append(weight * word)
    551             else:
--> 552                 mean.append(weight * self.word_vec(word, use_norm=True))
    553             if word in self.vocab:
    554                 all_words.add(self.vocab[word].index)

/opt/conda/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    465             return result
    466         else:
--> 467             raise KeyError("word '%s' not in vocabulary" % word)
    468
    469     def get_vector(self, word):

KeyError: "word 'Paris' not in vocabulary"
The word Paris was never learned because it is not in this model's vocabulary (the text8 corpus is lowercased, so the capitalized form never appears). We can already see the limitations of this approach: a word embedding is constrained by the chosen corpus and by the machine that computed it.
The book uses GoogleNews-vectors-negative300.bin, which contains 3 million words learned from Google News data, with 300 dimensions per word.
import gensim
word_path = '../data/GoogleNews-vectors-negative300.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(word_path, binary=True)
# 3 million words
len(model.wv.vocab)
# woman + king - man = queen
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
# London is to England as Paris is to ____
model.wv.most_similar(positive=['Paris', 'England'], negative=['London'], topn=1)
# Find the word that doesn't belong with the others
model.wv.doesnt_match("duck bear cat tree".split())
# A similarity score (cosine similarity)
# How similar 'woman' and 'man' are: fairly similar
model.wv.similarity('woman', 'man')
# Similarity between 'tree' and 'man': not very similar
model.wv.similarity('tree', 'man')
An application of word embeddings: information retrieval
# Look up a word embedding; return None if the word is not in the vocabulary
def get_embedding(string):
    try:
        return model.wv[string]
    except KeyError:
        return None
# Original sentences
sentences = [
    "this is about a dog",
    "this is about a cat",
    "this is about nothing"
]
import numpy as np
from functools import reduce

# A 3 x 300 matrix of zeros
vectorized_sentences = np.zeros((len(sentences), 300))
# For each sentence
for i, sentence in enumerate(sentences):
    # Tokenize
    words = sentence.split(' ')
    # Look up each word's embedding, dropping words that are not in the vocabulary
    embedded_words = [get_embedding(w) for w in words]
    embedded_words = [w for w in embedded_words if w is not None]
    # Vectorize the sentence as the mean of its word embeddings
    vectorized_sentence = reduce(lambda x, y: x + y, embedded_words) / len(embedded_words)
    # Store the sentence vector in the corresponding row
    vectorized_sentences[i] = vectorized_sentence
vectorized_sentences.shape
# The sentence most related to 'dog'
reference_word = 'dog'
# Dot product between the reference word's embedding and each sentence vector
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]
# The most relevant sentence
sentences[best_sentence_idx]
reference_word = 'cat'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]
sentences[best_sentence_idx]
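The raw dot product used above favors sentences whose vectors happen to be longer; a common refinement (a sketch only, not part of the original code) is to normalize everything and rank by cosine similarity instead:

# Rank sentences by cosine similarity to the reference word (reuses the objects defined above)
def rank_by_cosine(reference_word, sentence_matrix):
    query = get_embedding(reference_word)
    query = query / np.linalg.norm(query)
    norms = np.linalg.norm(sentence_matrix, axis=1)
    scores = np.dot(sentence_matrix, query) / norms
    return scores.argsort()[::-1]

print([sentences[i] for i in rank_by_cosine('dog', vectorized_sentences)][:1])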
sentences = """How to Sound Like a Data Scientist
Types of Data
The Five Steps of Data Science
Basic Mathematics
A Gentle Introduction to Probability
Advanced Probability
Basic Statistics
Advanced Statistics
Communicating Data
Machine Learning Essentials
Beyond the Essentials
Case Studies """.split('\n')
# A 12 x 300 matrix of zeros (one row per chapter title)
vectorized_sentences = np.zeros((len(sentences), 300))
# For each title
for i, sentence in enumerate(sentences):
    # Tokenize
    words = sentence.split(' ')
    # Look up each word's embedding, dropping words that are not in the vocabulary
    embedded_words = [get_embedding(w) for w in words]
    embedded_words = [w for w in embedded_words if w is not None]
    # Vectorize the title as the mean of its word embeddings
    vectorized_sentence = reduce(lambda x, y: x + y, embedded_words) / len(embedded_words)
    # Store the title vector in the corresponding row
    vectorized_sentences[i] = vectorized_sentence
vectorized_sentences.shape
reference_word = 'math'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]
[sentences[b] for b in best_sentence_idx]
# Chapters about giving a talk on data
reference_word = 'talk'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]
[sentences[b] for b in best_sentence_idx]
# Chapters about AI
reference_word = 'AI'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]
[sentences[b] for b in best_sentence_idx]
3. Summary
This section focused on two powerful feature learning tools: restricted Boltzmann machines (RBMs) and word embeddings. Both are widely used in machine learning and can learn useful features directly from raw data. The main takeaways are:
Restricted Boltzmann machines (RBMs): an unsupervised learning algorithm for learning latent feature representations of data. It builds a probabilistic model that discovers features automatically, making it well suited to dimensionality reduction and feature extraction.
Word embeddings: a technique that maps words to vectors in an n-dimensional feature space while preserving the semantic relationships between words. In natural language processing (NLP) tasks, word embeddings are commonly used to represent the features of text data.