Machine Learning, Part 3: More Fun With Word Vectors

cunyan

Category: Machine Learning. Tags: python, data structures and algorithms, artificial intelligence

Article link: https://blog.csdn.net/cunyan/article/details/84822452

Part 3: More Fun With Word Vectors

Code

The code for Part 3 of this tutorial lives here.

https://github.com/wendykan/DeepLearningMovies/blob/master/Word2Vec_BagOfCentroids.py

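Numeric Representations of Words

Now that we have a trained model with some semantic understanding of words, how should we use it? If you look beneath the hood, the Word2Vec model trained in Part 2 consists of a feature vector for each word in the vocabulary, stored in a numpy array called "syn0":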

>>> # Load the model that we created in Part 2
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("300features_40minwords_10context")
2014-08-03 14:50:15,126 : INFO : loading Word2Vec object from 300features_40min_word_count_10context
2014-08-03 14:50:15,777 : INFO : setting ignored attribute syn0norm to None

>>> type(model.syn0)
<type 'numpy.ndarray'>

>>> model.syn0.shape
(16492, 300)

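The number of rows in syn0 is the number of words in the model's vocabulary, and the number of columns corresponds to the size of the feature vector, which we set in Part 2. Setting the minimum word count to 40 gave us a total vocabulary of 16,492 words with 300 features apiece. Individual word vectors can be accessed in the following way: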

>>> model["flower"]

... which returns a 1x300 numpy array.
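To make the layout concrete, here is a minimal sketch that uses a tiny hand-made matrix in place of the real model. The names `toy_vocab` and `toy_vectors` are hypothetical stand-ins for `model.index2word` and `model.syn0`, and the numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical stand-ins for model.index2word and model.syn0
toy_vocab = ["flower", "movie", "actor"]
toy_vectors = np.array([[0.1, 0.2, 0.3, 0.4],
                        [0.5, 0.6, 0.7, 0.8],
                        [0.9, 1.0, 1.1, 1.2]], dtype="float32")

# One row per vocabulary word, one column per feature
n_words, n_features = toy_vectors.shape

# Looking up a word means finding its row index, then taking that row
row = toy_vocab.index("flower")
flower_vec = toy_vectors[row]
print(n_words, n_features, flower_vec.shape)
```

With the real model, the same picture holds at a larger scale: 16,492 rows and 300 columns.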

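From Words To Paragraphs, Attempt 1: Vector Averaging

One challenge with the IMDB dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review (for this purpose, we removed stop words, which would just add noise).

The following code averages the feature vectors, building on our code from Part 2.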

import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average
    # (skipped when no word was in the vocabulary, to avoid dividing by zero)
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter (an integer, since it is used as an array index)
    counter = 0
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    #
    # Loop through the reviews
    for review in reviews:
        #
        # Print a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        #
        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model,
                                                    num_features)
        #
        # Increment the counter
        counter = counter + 1
    return reviewFeatureVecs
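To see what the averaging step does to a single review, here is a small sketch on invented data. The dictionary `toy_model` and the helper `average_vector` are hypothetical: a plain dict stands in for the Word2Vec model, and the vectors have 2 features instead of 300.

```python
import numpy as np

# Hypothetical 2-feature "model": a plain dict instead of a Word2Vec object
toy_model = {
    "great": np.array([1.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0], dtype="float32"),
    "fun":   np.array([1.0, 1.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    # Keep only in-vocabulary words, mirroring the index2word_set check
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known words: return zeros rather than dividing by zero
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

review = ["great", "fun", "movie", "zzz"]  # "zzz" is out of vocabulary
avg = average_vector(review, toy_model, 2)
print(avg)
```

However long the review is, the result always has `num_features` entries, which is what lets reviews of different lengths share one feature matrix.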

Now, we can call these functions to create average vectors for each paragraph. The following operations will take a few minutes:

# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_wordlist(review,
                                                  remove_stopwords=True))

trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_wordlist(review,
                                                 remove_stopwords=True))

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

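Next, use the average paragraph vectors to train a random forest. Note that, as in Part 1, we can only use the labeled training reviews to train the model.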

# Fit a random forest to the training data, using 100 trees
import pandas as pd  # Needed for the output DataFrame below
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)

print("Fitting a random forest to labeled training data...")
forest = forest.fit(trainDataVecs, train["sentiment"])

# Test & extract results
result = forest.predict(testDataVecs)

# Write the test results
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("Word2Vec_AverageVectors.csv", index=False, quoting=3)

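We found that this produced results much better than chance, but underperformed Bag of Words by a few percentage points.

Since the element-wise average of the vectors didn't produce spectacular results, perhaps we could do it in a more intelligent way? A standard way of weighting word vectors is to apply "tf-idf" weights, which measure how important a given word is within a given set of documents. One way to extract tf-idf weights in Python is by using scikit-learn's TfidfVectorizer, which has an interface similar to the CountVectorizer that we used in Part 1. However, when we tried weighting our word vectors in this way, we found no substantial improvement in performance.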
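For readers who want to try the weighting idea, the following is a rough sketch of a tf-idf weighted average. It is not the code used for the experiment described above. It computes idf by hand rather than calling TfidfVectorizer, using the smoothed formula ln((1 + n) / (1 + df)) + 1 described in scikit-learn's documentation. The data, the `vectors` dict, and the function name `tfidf_weighted_average` are all assumptions made for illustration.

```python
import math
from collections import Counter
import numpy as np

# Hypothetical tokenized reviews and 2-feature word vectors
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vectors = {
    "good":  np.array([1.0, 0.0]),
    "bad":   np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "plot":  np.array([0.0, 0.5]),
}

# Document frequency: number of reviews containing each word
df = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0 for w in df}

def tfidf_weighted_average(doc):
    # Term frequency times idf gives each word's weight in this review
    counts = Counter(w for w in doc if w in vectors and w in idf)
    weights = {w: counts[w] * idf[w] for w in counts}
    total = sum(weights.values())
    if total == 0:
        return np.zeros(2)
    return sum(weights[w] * vectors[w] for w in weights) / total

print(tfidf_weighted_average(docs[1]))
```

In the second toy review, "bad" appears in fewer reviews than "movie", so it receives the larger weight and pulls the average toward its own vector.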

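From Words to Paragraphs, Attempt 2: Clustering

Word2Vec creates clusters of semantically related words, so another possible approach is to exploit the similarity of words within a cluster. Grouping vectors in this way is known as "vector quantization." To accomplish this, we first need to find the centers of the word clusters, which we can do by using a clustering algorithm such as K-Means.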

http://scikit-learn.org/stable/modules/clustering.html

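In K-Means, the one parameter we need to set is "K," or the number of clusters. How should we decide how many clusters to create? Trial and error suggested that small clusters, with an average of only 5 words or so per cluster, gave better results than large clusters with many words. Clustering code is given below. We use scikit-learn to perform our K-Means.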

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
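For intuition about what K-Means does, here is a bare-bones sketch of the standard assign-then-update iteration (Lloyd's algorithm) in plain numpy. It is not scikit-learn's implementation, which adds smarter initialization and convergence checks. The function name `toy_kmeans`, the six 2-D points, and the fixed starting centroids are all assumptions chosen for illustration.

```python
import numpy as np

def toy_kmeans(points, init_centroids, n_iter=10):
    centroids = init_centroids.astype("float64").copy()
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# Hypothetical starting centroids: the first and last point
labels, centroids = toy_kmeans(points, points[[0, 5]])
print(labels)
print(centroids)
```

Each point ends up labeled with the index of its nearest centroid, which is the same kind of per-word cluster index that the tutorial code obtains from `fit_predict`.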

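K-Means clustering with large K can be very slow; the following code took more than 40 minutes on my computer. Below, we set a timer around the K-Means function to see how long it takes.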

from sklearn.cluster import KMeans
import time

start = time.time()  # Start time

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster. Integer division keeps k a whole number.
word_vectors = model.syn0
num_clusters = word_vectors.shape[0] // 5

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans(n_clusters=num_clusters)
idx = kmeans_clustering.fit_predict(word_vectors)

# Get the end time and print how long the process took
end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: %.1f seconds." % elapsed)

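The cluster assignment for each word is now stored in idx, and the vocabulary from our original Word2Vec model is still stored in model.index2word. For convenience, we zip these into one dictionary as follows: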

# Create a Word / Index dictionary, mapping each vocabulary word to
# a cluster number
word_centroid_map = dict(zip(model.index2word, idx))
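The sketch below shows the same zip construction on a handful of invented entries, and then inverts the dictionary so that each cluster number maps to its list of words. The names `toy_index2word`, `toy_idx`, and `cluster_to_words` are hypothetical, and the cluster numbers are made up.

```python
from collections import defaultdict

# Hypothetical stand-ins for model.index2word and idx
toy_index2word = ["fest", "flick", "lobster", "deer", "limp"]
toy_idx = [4, 4, 5, 5, 6]

# Same construction as above: word -> cluster number
toy_word_centroid_map = dict(zip(toy_index2word, toy_idx))

# Invert it in a single pass: cluster number -> list of words
cluster_to_words = defaultdict(list)
for word, cluster in toy_word_centroid_map.items():
    cluster_to_words[cluster].append(word)

print(toy_word_centroid_map["deer"])
print(sorted(cluster_to_words[4]))
```

Inverting once is a reasonable choice when many clusters need to be inspected, since it avoids rescanning the whole vocabulary for each cluster.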

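This is a little abstract, so let's take a closer look at what our clusters contain. Your clusters may differ, as Word2Vec relies on a random number seed. Here is a loop that prints out the words for clusters 0 through 9: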

# For the first 10 clusters
for cluster in range(0, 10):
    #
    # Print the cluster number
    print("\nCluster %d" % cluster)
    #
    # Find all of the words for that cluster number, and print them out
    words = [word for word, centroid in word_centroid_map.items()
             if centroid == cluster]
    print(words)

The results are very interesting:

Cluster 0
 [u'passport', u'penthouse', u'suite', u'seattle', u'apple']

Cluster 1
 [u'unnoticed']

Cluster 2
 [u'midst', u'forming', u'forefront', u'feud', u'bonds', u'merge', u'collide', u'dispute', u'rivalry', u'hostile', u'torn', u'advancing', u'aftermath', u'clans', u'ongoing', u'paths', u'opposing', u'sexes', u'factions', u'journeys']

Cluster 3
 [u'lori', u'denholm', u'sheffer', u'howell', u'elton', u'gladys', u'menjou', u'caroline', u'polly', u'isabella', u'rossi', u'nora', u'bailey', u'mackenzie', u'bobbie', u'kathleen', u'bianca', u'jacqueline', u'reid', u'joyce', u'bennett', u'fay', u'alexis', u'jayne', u'roland', u'davenport', u'linden', u'trevor', u'seymour', u'craig', u'windsor', u'fletcher', u'barrie', u'deborah', u'hayward', u'samantha', u'debra', u'frances', u'hildy', u'rhonda', u'archer', u'lesley', u'dolores', u'elsie', u'harper', u'carlson', u'ella', u'preston', u'allison', u'sutton', u'yvonne', u'jo', u'bellamy', u'conte', u'stella', u'edmund', u'cuthbert', u'maude', u'ellen', u'hilary', u'phyllis', u'wray', u'darren', u'morton', u'withers', u'bain', u'keller', u'martha', u'henderson', u'madeline', u'kay', u'lacey', u'topper', u'wilding', u'jessie', u'theresa', u'auteuil', u'dane', u'jeanne', u'kathryn', u'bentley', u'valerie', u'suzanne', u'abigail']

Cluster 4
 [u'fest', u'flick']

Cluster 5
 [u'lobster', u'deer']

Cluster 6
 [u'humorless', u'dopey', u'limp']

Cluster 7
 [u'enlightening', u'truthful']

Cluster 8
 [u'dominates', u'showcases', u'electrifying', u'powerhouse', u'standout', u'versatility', u'astounding']

Cluster 9
 [u'succumbs', u'comatose', u'humiliating', u'temper', u'looses', u'leans']

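We can see that the clusters are of varying quality. Some make sense: Cluster 3 mostly contains names, and Clusters 6-8 contain related adjectives (Cluster 6 is my favorite). On the other hand, Cluster 5 is a little mystifying: what do a lobster and a deer have in common (besides being two animals)? Cluster 0 is even worse: penthouses and suites seem to belong together, but they don't seem to belong with apples and passports. Cluster 2 contains ... maybe war-related words? Perhaps our algorithm works best on adjectives.

At any rate, now we have a cluster (or "centroid") assignment for each word, and we can define a function to convert reviews into bags-of-centroids. This works just like Bag of Words but uses semantically related clusters instead of individual words: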

def create_bag_of_centroids(wordlist, word_centroid_map):
    #
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max(word_centroid_map.values()) + 1
    #
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros(num_centroids, dtype="float32")
    #
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    #
    # Return the "bag of centroids"
    return bag_of_centroids
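As a quick illustration of the output, here is a compact restatement of the same counting logic applied to an invented five-word map. The name `toy_map`, its cluster numbers, and the sample review are hypothetical.

```python
import numpy as np

# Hypothetical word -> cluster map with three clusters (0, 1, 2)
toy_map = {"good": 0, "great": 0, "bad": 1, "awful": 1, "movie": 2}

def bag_of_centroids(wordlist, word_centroid_map):
    # One slot per cluster; the highest cluster index determines the length
    num_centroids = max(word_centroid_map.values()) + 1
    bag = np.zeros(num_centroids, dtype="float32")
    for word in wordlist:
        # Out-of-vocabulary words are simply skipped
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    return bag

review = ["good", "great", "movie", "unknownword", "good"]
bag = bag_of_centroids(review, toy_map)
print(bag)
```

Here "good" (twice) and "great" all land in cluster 0, so synonyms reinforce the same feature instead of being counted separately as they would be in Bag of Words.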

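The above function will give us a numpy array for each review, each with a number of features equal to the number of clusters. Finally, we create bags of centroids for our training and test set, then train a random forest and extract results: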

# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros((train["review"].size, num_clusters),
                           dtype="float32")

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids(review,
                                                       word_centroid_map)
    counter += 1

# Repeat for test reviews
test_centroids = np.zeros((test["review"].size, num_clusters),
                          dtype="float32")

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids(review,
                                                      word_centroid_map)
    counter += 1

# Fit a random forest and extract predictions
forest = RandomForestClassifier(n_estimators=100)

# Fitting the forest may take a few minutes
print("Fitting a random forest to labeled training data...")
forest = forest.fit(train_centroids, train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("BagOfCentroids.csv", index=False, quoting=3)