Preprocessing Text in Python

This post is the second of three sequential articles on steps to build a sentiment classifier. Following our exploratory text analysis in the first post, it’s time to preprocess our text data. Simply put, preprocessing text data means performing a series of operations to convert the text into tabular numeric data. In this post, we will look at 3 ways of varying complexity to preprocess text into a tf-idf matrix in preparation for a model. If you are unsure what tf-idf is, this post explains it with a simple example.

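To make this concrete, here is a minimal sketch (my illustration, not part of the original series) of that conversion using scikit-learn’s TfidfVectorizer on two toy sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, purely for illustration
toy_corpus = ["the movie was great", "the movie was terrible"]

vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(toy_corpus)

print(vectoriser.vocabulary_)  # maps each term to its column index
print(tfidf.toarray())         # one row per document, one tf-idf weight per term

Each row of the resulting matrix is a document and each column a term, which is exactly the tabular numeric form a model can consume.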

Before we dive in, let’s take a step back and quickly look at the bigger picture. The CRISP-DM methodology outlines the process flow for a successful data science project. Preprocessing data is one of the key tasks in the data preparation stage.


[Image: Extract from the CRISP-DM process flow]

0. Python setup

This post assumes that the reader (👀 yes, you!) has access to and is familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.


I have tested the scripts with Python 3.7.1 in Jupyter Notebook.


Let’s make sure you have the following libraries installed before we start:
◼️ Data manipulation/analysis: numpy, pandas
◼️ Data partitioning: sklearn
◼️ Text preprocessing/analysis: nltk
◼️ Spelling checker: spellchecker (pyspellchecker when installing)

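If any of these are missing, they can all be installed in one go from the command line (note that sklearn ships as scikit-learn, and spellchecker as pyspellchecker):

pip install numpy pandas scikit-learn nltk pyspellchecker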

Once you have nltk installed, please make sure you have downloaded the ‘stopwords’ and ‘wordnet’ corpora from nltk with the script below:


import nltk
nltk.download('stopwords')
nltk.download('wordnet')

If you have already downloaded them, running this will simply notify you so.

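As a quick sanity check (my addition, not in the original post), you can confirm the two corpora are usable: stopwords supplies the common words we will filter out, and wordnet backs the lemmatizer:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

print(stopwords.words('english')[:5])           # first few English stop words
print(WordNetLemmatizer().lemmatize('movies'))  # 'movie', courtesy of wordnet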

Now, we are ready to import all the packages:


# Setting random seed
seed = 123

# Measuring run time
from time import time

# Data manipulation/analysis
import numpy as np
import pandas as pd

# Data partitioning
from sklearn.model_selection import train_test_split

# Text preprocessing/analysis
import re, random
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spellchecker import SpellChecker

1. Data 📦

We will use the IMDB movie reviews dataset. You can download the dataset here and save it in your working directory. Once saved, let’s import it into Python:


sample = pd.read_csv('IMDB Dataset.csv')
print(f"{sample.shape[0]} rows and {sample.shape[1]} columns")
sample.head()
[Output: first five rows of the dataset]

Let’s look at the split between sentiments:


sample['sentiment'].value_counts()
[Output: review counts by sentiment]
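If you are working with the standard 50,000-review version of this dataset, the split should be perfectly balanced at 25,000 positive and 25,000 negative reviews. Since train_test_split was imported earlier, a natural next step is to partition the data before fitting anything; the sketch below is my own illustration, and the 80/20 stratified split is an assumption rather than something prescribed in this post:

# Sketch only: the 80/20 stratified split is an assumed choice, not the post's
X_train, X_test, y_train, y_test = train_test_split(
    sample['review'], sample['sentiment'],
    test_size=0.2, random_state=seed, stratify=sample['sentiment'])
print(f"Train: {X_train.shape[0]} rows; Test: {X_test.shape[0]} rows")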