Preprocessing Text in Python


This post is the second of three sequential articles on the steps to build a sentiment classifier. Following our exploratory text analysis in the first post, it’s time to preprocess our text data. Simply put, preprocessing text data means performing a series of operations that convert the text into tabular numeric data. In this post, we will look at 3 ways of varying complexity to preprocess text into a tf-idf matrix as preparation for a model. If you are unsure what tf-idf is, this post explains it with a simple example.

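If tf-idf is new to you, here is a minimal sketch of the end product we are aiming for, using sklearn’s TfidfVectorizer on a made-up two-sentence corpus (the sentences are purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made-up sentences, purely for illustration)
corpus = ["the movie was great",
          "the movie was terrible"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)    # sparse document-term matrix

# Vocabulary and the resulting tf-idf weights, one row per document
print(vectorizer.get_feature_names_out())   # use get_feature_names() on scikit-learn < 1.0
print(tfidf.toarray().round(2))

Each document becomes a row of numeric weights: terms shared by both sentences (‘the’, ‘movie’, ‘was’) receive lower weights than the discriminating terms (‘great’, ‘terrible’).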

Before we dive in, let’s take a step back and quickly look at the bigger picture. The CRISP-DM methodology outlines the process flow for a successful data science project, and preprocessing data is one of the key tasks in its data preparation stage.


[Image: Extract from CRISP-DM process flow]

0. Python setup

This post assumes that the reader (👀 yes, you!) has access to and is familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.


I have tested the scripts with Python 3.7.1 in Jupyter Notebook.


Let’s make sure you have the following libraries installed before we start:
◼️ Data manipulation/analysis: numpy, pandas
◼️ Data partitioning: sklearn
◼️ Text preprocessing/analysis: nltk
◼️ Spelling checker: spellchecker (pyspellchecker when installing)

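If you want to confirm everything is in place, importing the packages is itself a quick check that they are installed (a minimal, optional sketch; the version printout just helps with reproducibility):

# A successful import confirms each package is installed
import numpy, pandas, sklearn, nltk, spellchecker
print(numpy.__version__, pandas.__version__, sklearn.__version__, nltk.__version__)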

Once you have nltk installed, please make sure you have downloaded the ‘stopwords’ and ‘wordnet’ corpora from nltk with the script below:


import nltk

# Corpora needed later for stop word removal ('stopwords') and lemmatization ('wordnet')
nltk.download('stopwords')
nltk.download('wordnet')

If you have already downloaded them, running this will simply notify you that the corpora are already up to date.


Now, we are ready to import all the packages:


# Setting random seed
seed = 123

# Measuring run time
from time import time

# Data manipulation/analysis
import numpy as np
import pandas as pd

# Data partitioning
from sklearn.model_selection import train_test_split

# Text preprocessing/analysis
import re, random
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from spellchecker import SpellChecker
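
Before we touch the real data, here is a quick, optional sanity check that the imported tools behave as expected. The sample sentence, the tokenizer pattern and the misspelt word below are illustrative assumptions, not part of the pipeline itself:

# Made-up sentence to exercise the imported tools
text = "The movies were absolutely GREAT!"

tokenizer = RegexpTokenizer(r'[a-z]+')        # keep alphabetic tokens only
tokens = tokenizer.tokenize(text.lower())
stop_words = set(stopwords.words('english'))  # needs the 'stopwords' corpus
lemmatizer = WordNetLemmatizer()              # needs the 'wordnet' corpus
print([lemmatizer.lemmatize(t) for t in tokens if t not in stop_words])
# ['movie', 'absolutely', 'great']

spell = SpellChecker()
print(spell.correction('grreat'))             # expected: 'great'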

1. Data 📦

We will use the IMDB movie reviews dataset. You can download the dataset here and save it in your working directory. Once saved, let’s import it to Python:


sample = pd.read_csv('IMDB Dataset.csv')
print(f"{sample.shape[0]} rows and {sample.shape[1]} columns")
sample.head()
[Image: first five rows of the dataset from sample.head()]

Let’s look at the split between sentiments:


sample['sentiment'].value_counts()
[Image: counts of positive and negative reviews from value_counts()]
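
The class balance will also matter when we partition the data. As a small taste of that step, here is a minimal sketch using the train_test_split imported earlier; the ‘review’ column name comes from the dataset, while the 70/30 ratio is an assumption for illustration:

# Hold out a test set before fitting any preprocessing;
# test_size=0.3 is an assumed ratio, stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    sample['review'], sample['sentiment'],
    test_size=0.3, random_state=seed, stratify=sample['sentiment'])
print(f"{len(X_train)} training and {len(X_test)} test reviews")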