Python - Data Preprocessing

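The following notes are drawn from 《Python数据科学指南》.
Data preprocessing:
Imputing data, random sampling, scaling data, standardizing data, tokenization, removing stop words, removing punctuation, stemming, lemmatization, bag-of-words model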

1. Imputing data: handling data that is incomplete or has missing values.

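Module used: from sklearn.impute import SimpleImputer
(older scikit-learn releases exposed this as sklearn.preprocessing.Imputer, which was removed in version 0.22)

# Method 1: impute from a statistic of the data itself
# imputer = SimpleImputer(missing_values=..., strategy=...)
# missing_values names the value that marks a missing cell; strategy says how to fill it
# Strategies: mean, median, most_frequent (the most common value), or constant (a fixed fill_value)

# Cells equal to 0 are replaced by the mean of their column
imputer = SimpleImputer(missing_values=0, strategy="mean")
x_imputed = imputer.fit_transform(x)  # fit on the data, then return the imputed copy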


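# Method 2: impute based on the class label
missing_y = y[2]
x_missing = np.where(y == missing_y)[0]  # row indices of every sample whose label equals missing_y
print(np.mean(x[x_missing, :], axis=0))    # column means over those rows
print(np.median(x[x_missing, :], axis=0))  # column medians over those rows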
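
Method 2 above only prints the per-class statistics; it does not write them back into the data. The following is a minimal sketch of the filling step, written in plain Python so it runs without extra packages. The toy data, the use of None as the missing marker, and the helper name impute_by_class are all assumptions made for illustration, not part of the book's code.

```python
from statistics import mean

# Toy data (made up for illustration): None marks a missing cell.
x = [
    [1.0, 10.0],
    [3.0, None],
    [5.0, 30.0],
    [None, 100.0],
    [8.0, 200.0],
]
y = ["a", "a", "a", "b", "b"]


def impute_by_class(rows, labels):
    """Replace each None with the mean of that column within the row's class."""
    filled = [list(row) for row in rows]  # copy so the input is untouched
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        for col in range(len(rows[0])):
            observed = [rows[i][col] for i in idx if rows[i][col] is not None]
            if not observed:
                continue  # nothing to learn from; leave the gaps as they are
            col_mean = mean(observed)
            for i in idx:
                if filled[i][col] is None:
                    filled[i][col] = col_mean
    return filled


x_filled = impute_by_class(x, y)
print(x_filled)
```

Swapping mean for statistics.median gives the median strategy with no other changes.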

2. Random sampling

Module used: import numpy as np

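# Randomly sample 10 records from the dataset x
no_records = 10
x_sample_indx = np.random.choice(range(x.shape[0]), no_records)
print(x[x_sample_indx, :])
# choice takes a replace parameter. With replace=False, sampling is without replacement, so no row is drawn twice; with the default replace=True the same row can be drawn more than once. Neither setting modifies the original dataset.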
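
To make the difference between the two sampling modes concrete, here is a small sketch using only the standard library's random module as a stand-in for np.random.choice. The seed value and the 100-element index list are arbitrary choices for the demonstration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
records = list(range(100))  # stand-in for the row indices of a dataset
no_records = 10

# With replacement: the same index may be drawn more than once.
with_repl = rng.choices(records, k=no_records)

# Without replacement: every drawn index is distinct.
without_repl = rng.sample(records, k=no_records)

print(with_repl)
print(without_repl)
print(len(records))  # the source list is unchanged either way
```

Sampling without replacement fails if more records are requested than exist, which is one reason sampling with replacement is used for bootstrapping.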

3. Scaling data: min-max scaling, which maps the values into the interval [0,1].
Module used: from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler(feature_range=(0.0,1.0))
x_t = minmax.fit_transform(x)
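
Min-max scaling applies x' = lo + (x - min) / (max - min) * (hi - lo) to each column separately. The sketch below spells that formula out in plain Python. The function name min_max_scale and the sample rows are made up for illustration, and sending a constant column to the lower bound is a choice made in this sketch.

```python
def min_max_scale(rows, feature_range=(0.0, 1.0)):
    """Scale each column of a list of rows into feature_range."""
    lo, hi = feature_range
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        col_min, col_max = min(col), max(col)
        span = col_max - col_min
        if span == 0:
            # Constant column: map every value to the lower bound.
            scaled_columns.append([lo for _ in col])
        else:
            scaled_columns.append(
                [lo + (v - col_min) / span * (hi - lo) for v in col]
            )
    return [list(row) for row in zip(*scaled_columns)]


x = [[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]]
x_t = min_max_scale(x)
print(x_t)
```

Because the minimum and maximum define the mapping, a single outlier compresses every other value toward one end of the range.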

4. Standardizing data: transforming the input values so that they have mean 0 and standard deviation 1.

Module used: from sklearn.preprocessing import scale

# with_mean centers each feature at zero
# with_std scales each feature to unit variance, putting features with different units on a common scale
x_centered_std = scale(x, with_mean=True, with_std=True)
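
Standardization computes z = (x - mean) / std for each feature. The following sketch works through one column by hand. It uses the population standard deviation (statistics.pstdev), which matches the divide-by-n convention scale uses. The column values and the helper name standardize are illustrative.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: only centering applies
    return [(v - mu) / sigma for v in column]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
print(z)
print(mean(z), pstdev(z))
```

After the transform, the column has mean 0 and standard deviation 1 up to floating-point error, whatever units it started in.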

5. Tokenization: splitting text into sentences and words.

Module used: from nltk.tokenize import sent_tokenize,word_tokenize,line_tokenize

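from collections import defaultdict

sentence = "Alvin is me. Welcome to show your code for communication."
sent_list = sent_tokenize(sentence)  # split into sentences

# Split each sentence into words, keyed by sentence index
word_dict = defaultdict(list)
for i, sent in enumerate(sent_list):
    word_dict[i].extend(word_tokenize(sent))

# Split into lines (paragraphs); this only works if the paragraphs in the source text are separated by newline characters \n
line_list = line_tokenize(sentence)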

6. Removing stop words: dropping very common words.

Module used: from nltk.corpus import stopwords

stop_words = stopwords.words('english')
words = [w for w in words if w not in stop_words]

7. Removing punctuation
Module used: import string

words = [w for w in words if w not in string.punctuation]
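
The two filters from sections 6 and 7 can be chained on one token list. The sketch below assumes a tiny hand-picked stop list purely for illustration (NLTK's English list is much longer) and a made-up token list.

```python
import string

# Tiny hand-picked stop list for illustration only; NLTK's English list is longer.
stop_words = {"is", "to", "your", "for", "me"}

tokens = ["Alvin", "is", "me", ".", "Welcome", "to", "show", "your", "code", "!"]

# Compare in lowercase so "Is" and "is" are treated alike.
no_stops = [t for t in tokens if t.lower() not in stop_words]

# string.punctuation is a str of ASCII punctuation; a set tests whole tokens.
punct = set(string.punctuation)
clean = [t for t in no_stops if t not in punct]

print(clean)
```

Converting string.punctuation to a set is a small design choice: membership in a str is a substring test, while membership in a set compares the whole token.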

8. Stemming: reducing words to a root form by heuristically stripping their suffixes.

Module used: from nltk import stem

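# Porter - the Porter stemmer, the most widely used; it is not very aggressive when reducing words to their root form
# Snowball - the Snowball stemmer, an improved version of Porter that takes less time
# Lancaster - the Lancaster stemmer, the most aggressive; output from the first two stays fairly readable, while Lancaster's is often unreadable, though it is the fastest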

input_words = ['alvin','ai']
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]

lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]

snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
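
To show what "heuristically stripping suffixes" means, here is a toy stemmer. It is not the Porter, Snowball, or Lancaster algorithm. The suffix list, the minimum stem length, and the name naive_stem are all invented for this sketch.

```python
# Toy suffix list, longest first; real stemmers use many more rules.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered


for w in ["jumping", "jumped", "jumps", "quickly", "is", "running"]:
    print(w, "->", naive_stem(w))
```

The last word comes out as "runn" rather than "run". Real stemmers carry extra rules for cases like doubled consonants, and even they can return stems that are not dictionary words.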

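9. Lemmatization: using morphological analysis and a vocabulary to obtain a word's lemma. Only inflectional endings are transformed, and the base form of the word is looked up in a dictionary.

Module used: from nltk import stem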

wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]

# By default the lemmatizer treats its input as a noun; for verbs and other parts of speech, pass a POS tag
wordnet_lemm.lemmatize('running', 'v')  # returns 'run'
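
The role of the POS tag is easier to see with a toy dictionary lookup. The miniature lexicon and the function toy_lemmatize below are hypothetical stand-ins for WordNet, kept only large enough to show the noun default.

```python
# Hypothetical miniature lexicon keyed by (surface form, POS tag).
LEXICON = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up under the given POS; fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), word.lower())


print(toy_lemmatize("running"))       # looked up as a noun: unchanged
print(toy_lemmatize("running", "v"))  # looked up as a verb
print(toy_lemmatize("geese"))
```

Unlike a stemmer, a lookup like this can map irregular forms ("ran", "geese") to their lemmas, but only for words the dictionary contains.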

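10. Bag-of-words model: represents each document as a vector whose columns are the words themselves. The words are the features, and each value is binary, a frequency, or a TF-IDF weight.
Module used: from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer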

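# ngram_tdm is a sparse matrix. ngram_range=(1,1) keeps single words only; (1,2) builds both single words and two-word sequences
# CountVectorizer's binary parameter defaults to False; if set to True, the matrix holds 1 or 0 for whether a term occurs in the document rather than its count
# lowercase defaults to True, so the text is converted to lowercase before it is mapped to feature indices
# fit_transform expects an iterable of documents, so pass the sentence list rather than a single string
count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1,2))
ngram_tdm = count_v_ngram.fit_transform(sent_list)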
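
The term-document matrix can be built by hand to see what ngram_range=(1,2) produces. This is a plain-Python sketch with two made-up documents. It returns a dense list of lists rather than a sparse matrix and skips stop-word filtering.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


docs = ["the cat sat", "the cat ran"]
tokenized = [d.lower().split() for d in docs]

# Vocabulary: every distinct unigram and bigram, sorted for stable columns.
vocab = sorted({g for toks in tokenized for g in ngrams(toks)})

# One count row per document, one column per vocabulary entry.
matrix = []
for toks in tokenized:
    counts = Counter(ngrams(toks))
    matrix.append([counts[g] for g in vocab])

print(vocab)
print(matrix)
```

Replacing each count with min(count, 1) gives the binary variant described for binary=True.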

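# IDF, inverse document frequency = total number of documents / number of documents containing the term
# TF, term frequency = occurrences of the term / total number of terms in the document
# TFIDF = TF * IDF
# (TfidfTransformer itself uses a smoothed, logarithmic IDF and L2-normalizes each row by default)
count_v = CountVectorizer(stop_words=stop_words)
tdm = count_v.fit_transform(sent_list)
tfidf = TfidfTransformer()
tdm_tfidf = tfidf.fit_transform(tdm)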
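
The TF and IDF formulas in the comments above can be checked with a short hand computation. The sketch follows those textbook definitions literally, with the IDF as a plain ratio and no logarithm, so its numbers will not match TfidfTransformer's output. The three toy documents are invented.

```python
from collections import Counter

docs = [
    ["data", "science", "is", "fun"],
    ["data", "cleaning", "is", "work"],
    ["science", "needs", "data"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """TF-IDF per term, using the unlogged ratio n_docs / df as the IDF."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * (n_docs / df[term])
        for term, count in counts.items()
    }


weights = [tf_idf(doc) for doc in docs]
for w in weights:
    print(w)
```

In the first document, "data" appears in all three documents and gets the lowest weight, while "fun" appears only there and gets the highest, which is the effect TF-IDF is meant to have.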