Kaggle: Classic NLP for Beginners (Part 1): Bag of Words (Dataset + Code)

猫爱吃鱼the

Article link: https://blog.csdn.net/qq_39783265/article/details/105012991

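Introduction

This post covers the **bag-of-words model**. The code and dataset both come from Kaggle's introductory competition Bag of Words Meets Bags of Popcorn. I reproduced the tutorial and explain each part of the work below.

Preparation

1. Download the dataset:
Link: https://pan.baidu.com/s/1ZV1IY8O1ypJDig06sWedIw
Extraction code: ghck
2. Set up the environment. The following packages are needed: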

pandas
numpy
scipy
scikit-learn
Beautiful Soup
NLTK
Cython
gensim
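A trick for installing these packages is to point pip at a mirror index:
pip install -i https://pypi.douban.com/simple pandas
Replace pandas in the command above with any of the other package names.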

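The dataset
The labeled dataset consists of 50,000 IMDB movie reviews selected specifically for sentiment analysis. 0 is the negative class and 1 is the positive class. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

File descriptions:
labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an ID, sentiment, and text for each review.
testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an ID and text for each review. Your task is to predict the sentiment of each review.
unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an ID and text for each review.
sampleSubmission - A comma-delimited sample submission file in the correct format.
Data fields:
id - Unique ID of each review
sentiment - 1 for positive reviews, 0 for negative reviews
review - Text of the review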

import pandas as pd     
train = pd.read_csv("./data/labeledTrainData.tsv", header=0,delimiter="\t", quoting=3)

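Here, header=0 indicates that the first line of the file contains the column names, delimiter="\t" indicates that the fields are separated by tabs, and quoting=3 tells pandas to ignore double quotes; otherwise you may run into errors when reading the file.
Shape of train: (25000, 3)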
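To make the quoting=3 setting less opaque: 3 is the value of the standard library constant csv.QUOTE_NONE, which the pandas quoting parameter accepts as well. The sketch below uses only the csv module on two made-up rows (the ID and review text are illustrative, not taken from the dataset) to show that, with quoting disabled, double quotes are kept as ordinary characters and fields are split on tabs alone.

```python
import csv

# Two made-up tab-separated lines; the review contains stray double quotes.
lines = [
    'id\tsentiment\treview',
    '"5814_8"\t1\t"A "classic" film, really."',
]

# QUOTE_NONE (numeric value 3) disables all special handling of quote characters.
rows = list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))
header, first = rows

print(csv.QUOTE_NONE)  # the numeric value passed as quoting=3
print(header)
print(first)
```

Because the quotes survive as literal characters, they remain in the loaded text; the later cleaning step that keeps only letters removes them.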

Data cleaning and text preprocessing (assuming all the required packages are already installed; I ran everything in Jupyter)

# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review     
example1 = BeautifulSoup(train["review"][0], "html.parser")  # name the parser explicitly to avoid a warning
example1

BeautifulSoup is used here to remove the HTML tags.

import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words
words

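1. Regular expression: replace every character that is not an uppercase or lowercase letter with a space
2. Convert the text to lowercase
3. Split it into a list of words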
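The three steps can be traced on a short string. This is a minimal sketch with a made-up sentence (not a review from the dataset); it shows that digits and punctuation become spaces, so a contraction such as "Don't" ends up as two tokens.

```python
import re

raw_text = "Don't miss it, 10/10!"  # made-up example text

# 1. Replace every non-letter character with a single space.
letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
# 2. Lowercase the result.
lower_case = letters_only.lower()
# 3. Split on whitespace; runs of spaces collapse automatically.
words = lower_case.split()

print(words)
```

This is why single-letter leftovers such as "t", "m", or "ve" can show up in the cleaned word lists.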

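Stop word filtering

Finally, we need to decide how to deal with frequently occurring words that carry little meaning. Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the". Conveniently, there are Python packages with built-in stop word lists. Let's import a stop word list from the Python Natural Language Toolkit (NLTK). If the library is not yet installed on your computer, you will need to install it.

Downloading the data directly from Python is very slow. That approach looks like this: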

import nltk
nltk.download()  # Download text data sets, including stop words

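I do not recommend the method above because it is too slow.

The following method can be used instead.

You can download my copy directly from Baidu Netdisk:
Link: https://pan.baidu.com/s/1DLG7-Xwyuc4_dOLvIty7Fw
Extraction code: up7m
Once the download finishes, import the data by following this tutorial:
https://www.cnblogs.com/ouyxy/articles/9973864.html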

import nltk
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))
# Remove stop words from "words"
words = [w for w in words if w not in stopwords.words("english")]
print(words)

After these steps, the printed result is:
[u'stuff', u'going', u'moment', u'mj', u've', u'started', u'listening', u'music', u'watching', u'odd', u'documentary', u'watched', u'wiz', u'watched', u'moonwalker', u'maybe', u'want', u'get', u'certain', u'insight', u'guy', u'thought', u'really', u'cool', u'eighties', u'maybe', u'make', u'mind', u'whether', u'guilty', u'innocent', u'moonwalker', u'part', u'biography', u'part', u'feature', u'film', u'remember', u'going', u'see', u'cinema', u'originally', u'released', u'subtle', u'messages', u'mj', u'feeling', u'towards', u'press', u'also', u'obvious', u'message', u'drugs', u'bad', u'm', u'kay',…]
Don't worry about the "u" before each word; it only indicates that Python represents each word internally as a Unicode string. The prefix is printed under Python 2 only; under Python 3 the same words appear without it.

Next, wrap the steps into a function

def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))

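The function simply combines the individual steps walked through above.
Inside the function, meaningful_words is a list; to make later processing easier, it is joined into a single string before being returned.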

num_reviews = train["review"].size
# Initialize an empty list to hold the clean reviews
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(num_reviews):
    # If the index is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print("Review %d of %d" % ( i+1, num_reviews ))                                                                    
    clean_train_reviews.append( review_to_words( train["review"][i] ))

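There are two new elements in review_to_words. First, we convert the stop word list to a different data type, a set. This is for speed: since we will be calling the function tens of thousands of times, it needs to be fast, and searching a set in Python is much faster than searching a list.

Second, we join the words back into one paragraph. This makes the output easier to use in the bag of words below. Once the function is defined, you can call it on a single review as a quick check.
The loop above then collects the cleaned reviews into a list.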
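The speed argument for the set can be checked with a rough timing sketch. Everything below is illustrative: the stop list is a small hand-written stand-in padded with dummy entries to roughly mimic the size of a real stop word list, not NLTK's actual list, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for a stop word list, padded with dummy entries.
stop_list = ["a", "and", "is", "the", "of", "it"] + [f"filler{i}" for i in range(150)]
stop_set = set(stop_list)

words = "the plot of the film is thin and it drags".split()

# Both containers give the same filtered result...
filtered_with_list = [w for w in words if w not in stop_list]
filtered_with_set = [w for w in words if w not in stop_set]

# ...but a list is scanned item by item, while a set uses a hash lookup.
list_seconds = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=2000)
set_seconds = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=2000)

print(filtered_with_set)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

Building the set once and reusing it across reviews, rather than rebuilding it inside every call, would save further time, but that is a design choice beyond what the original function does.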

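Creating features from a bag of words (using scikit-learn)

Now that we have tidied up the training reviews, how do we convert them into some kind of numeric representation for machine learning? One common approach is called a bag of words. The bag-of-words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

Sentence 1: "The cat sat on the hat"

Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }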
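The two feature vectors can be reproduced in a few lines of plain Python. This is a minimal sketch of the counting idea only; it builds the vocabulary in order of first appearance so that the columns line up with the vocabulary listed above, which is not how scikit-learn orders its features.

```python
from collections import Counter

sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]
tokenized = [s.lower().split() for s in sentences]

# Learn the vocabulary in order of first appearance across all sentences.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Count how often each vocabulary word occurs in each sentence.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```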

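In the IMDB data we have a very large number of reviews, which gives us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 5000 most frequent words (remembering that stop words have already been removed).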

print( "Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_featuress = train_data_features.toarray()

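CountVectorizer is applied here to the whole training set of 25,000 reviews, counting word frequencies for each review, with the maximum number of features set to 5000. fit_transform does two things: it fits the model by learning the vocabulary, and it transforms the training data into feature vectors. toarray then converts the result into an array, so the final training feature matrix has shape (25000, 5000):

25,000 rows and 5000 features (one for each vocabulary word).

Note that CountVectorizer comes with its own options to automatically do preprocessing, tokenization, and stop word removal. For each of these, instead of specifying "None", we could use a built-in method or specify our own function. See the function documentation for more details. However, we wanted to write our own data cleaning function in this tutorial to show you how it is done step by step.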
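To see what capping the vocabulary means, here is a simplified pure-Python analogue of max_features. It is a sketch of the idea (keep only the words with the highest corpus-wide counts), not scikit-learn's implementation; the three short documents and the cap of 2 are made up for illustration.

```python
from collections import Counter

docs = ["great film great cast", "great film", "dull film great plot"]  # made-up corpus
max_features = 2  # illustrative cap; the tutorial uses 5000

# Count every word across the whole corpus.
corpus_counts = Counter(word for doc in docs for word in doc.split())

# Keep only the most frequent words, then sort them for a stable column order.
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# Each document becomes a vector with one count per kept word.
features = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
print(features)
```

Words outside the kept vocabulary ("cast", "dull", "plot") simply do not get a column, which is how the feature matrix stays at a fixed width.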

print("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_featuress, train["sentiment"] )

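A simple random forest is used for training here; once training finishes, we make predictions.

Note that when we use the bag of words for the test set, we only call "transform", not "fit_transform" as we did for the training set. In machine learning, you should not use the test set to fit your model, otherwise you run the risk of overfitting. For this reason, we keep the test set off-limits until we are ready to make predictions.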
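The difference between the two calls shows up clearly on a toy example. The sketch below assumes scikit-learn is installed and uses made-up two- and three-word documents: the vocabulary is learned from the training documents only, so a word that appears only in the test document ("film") gets no column, and the test matrix has the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]  # made-up training documents
test_docs = ["good good film"]            # made-up test document

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents and encodes them.
train_features = vectorizer.fit_transform(train_docs).toarray()

# transform reuses that vocabulary; nothing is learned from the test documents.
test_features = vectorizer.transform(test_docs).toarray()

# Recover the learned words in column order from the fitted vocabulary mapping.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(train_features.tolist())
print(test_features.tolist())
```

Keeping the columns identical is also what lets a classifier trained on the training features accept the test features.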

test = pd.read_csv("./data/testData.tsv", header=0,delimiter="\t", quoting=3)
test_num = test["review"].size
# Initialize an empty list to hold the clean reviews
print("Cleaning and parsing the test set movie reviews...\n")
clean_test_reviews = []
for i in range(test_num):
    # If the index is evenly divisible by 2000, print a message
    if( (i+1)%2000 == 0 ):
        print("Review %d of %d" % ( i+1, test_num ))
    clean_test_reviews.append( review_to_words( test["review"][i] ))
# Printed once, after the loop has cleaned every test review
print( "Creating the bag of testing words...\n")
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
result = forest.predict(test_data_features)
# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )

Finally, the predictions are saved in the format required for submission.
