Preprocessing text data in Python
This post is the second of three sequential articles on the steps to build a sentiment classifier. Following our exploratory text analysis in the first post, it’s time to preprocess our text data. Simply put, preprocessing text data means applying a series of operations that convert the text into tabular numeric data. In this post, we will look at three approaches of varying complexity for preprocessing text into a tf-idf matrix as preparation for a model. If you are unsure what tf-idf is, this post explains it with a simple example.
Before we dive in, let’s take a step back for a quick look at the bigger picture. The CRISP-DM methodology outlines the process flow for a successful data science project. Preprocessing data is one of the key tasks in the data preparation stage.
0. Python setup
This post assumes that the reader (👀 yes, you!) has access to and is familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.
I have tested the scripts in Python 3.7.1 in Jupyter Notebook.
Let’s make sure you have the following libraries installed before we start:
◼️ Data manipulation/analysis: numpy, pandas
◼️ Data partitioning: sklearn
◼️ Text preprocessing/analysis: nltk
◼️ Spelling checker: spellchecker (pyspellchecker when installing)
Once you have nltk installed, please make sure you have downloaded the ‘stopwords’ and ‘wordnet’ corpora from nltk with the script below:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
If you have already downloaded them, running this will simply tell you so.
Now, we are ready to import all the packages:
# Setting random seed
seed = 123

# Measuring run time
from time import time

# Data manipulation/analysis
import numpy as np
import pandas as pd

# Data partitioning
from sklearn.model_selection import train_test_split

# Text preprocessing/analysis
import re, random
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spellchecker import SpellChecker
1. Data 📦
We will use the IMDB movie reviews dataset. You can download the dataset here and save it in your working directory. Once saved, let’s import it to Python:
sample = pd.read_csv('IMDB Dataset.csv')
print(f"{sample.shape[0]} rows and {sample.shape[1]} columns")
sample.head()
Let’s look at the split between sentiments:
sample['sentiment'].value_counts()
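Since we imported train_test_split above, a natural next step is to hold out a test set, stratified on sentiment so both partitions keep the same class balance. A minimal sketch, using a toy stand-in DataFrame with the same column names as the IMDB sample (review, sentiment):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 123

# Toy stand-in for the IMDB sample DataFrame (same column names)
sample = pd.DataFrame({
    'review': ['great film', 'awful plot', 'loved it',
               'boring', 'superb', 'bad acting'],
    'sentiment': ['positive', 'negative', 'positive',
                  'negative', 'positive', 'negative']
})

# stratify keeps the positive/negative ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    sample['review'], sample['sentiment'],
    test_size=1/3, random_state=seed, stratify=sample['sentiment'])

print(len(X_train), len(X_test))  # 4 2
```

On the real dataset you would pass the same column names; only the row counts change.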