Feature Construction (3)

1. Introduction

In this section we build brand-new features from the existing ones, covering the following topics:

  • Inspecting the dataset
  • Imputing categorical features
  • Encoding categorical variables
  • Extending numerical features
  • Text-specific feature construction

2. Fundamentals

2.1 The Dataset

We create our own dataset to illustrate the different levels and types of data.

import pandas as pd

X = pd.DataFrame({'city':['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'], 
                  'boolean':['yes', 'no', None, 'no', 'no', 'yes'], 
                  'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'], 
                  'quantitative_column':[1, 11, -.5, 10, None, 20]})

print(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1           None      no           like                 11.0
2         london    None  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0
  • boolean: binary categorical data (yes/no), at the nominal level
  • city: categorical data, also at the nominal level
  • ordinal_column: ordinal data, at the ordinal level
  • quantitative_column: integers, at the ratio level

2.2 Imputing Categorical Features

Use the isnull and sum methods to count the missing values.

X.isnull().sum()
city                   1
boolean                1
ordinal_column         0
quantitative_column    1
dtype: int64

scikit-learn's (older) Imputer class has a most_frequent strategy that can be used on qualitative data, but it can only handle categorical data that has already been encoded as integers.
For numerical data, missing values can be filled with the column mean; for categorical data, we can compute the most common category and use it as the fill value.

# find the most common value in the city column
most_frequent = X['city'].value_counts().index[0]
most_frequent

'tokyo'

# fill the city column with the most common value
X['city'].fillna(most_frequent)
0            tokyo
1            tokyo
2           london
3          seattle
4    san francisco
5            tokyo
Name: city, dtype: object

A custom categorical imputer

A machine learning pipeline:

  • lets us apply a sequence of transformations followed by a final predictor
  • requires the intermediate steps to be transformers, meaning they must implement both fit and transform methods
  • only requires the final predictor to implement fit

The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Once we have built a custom transformer for each column that needs imputing, we can pass them into a pipeline and transform the data in one go.
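The interface a pipeline step has to satisfy is small, since scikit-learn simply duck-types it. Below is a minimal sketch of a hypothetical, do-nothing transformer just to show the shape of that contract; inheriting from BaseEstimator as well gives get_params/set_params for free, which GridSearchCV needs when it tunes a step's parameters.

# minimal pass-through transformer (illustrative sketch, not used later)
from sklearn.base import BaseEstimator, TransformerMixin

class PassThrough(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self   # learn nothing, but always return self

    def transform(self, X):
        return X      # hand the data on unchanged

# TransformerMixin supplies fit_transform for free once fit and transform exist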

# create a custom categorical imputer
from sklearn.base import TransformerMixin

class CustomCategoryImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols  # the columns to impute
        
    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            # fill missing values with the column's most frequent value
            X[col].fillna(X[col].value_counts().index[0], inplace=True)
        return X
    
    def fit(self, *_):
        return self  # nothing to learn





# apply the custom categorical imputer to the chosen columns
cci = CustomCategoryImputer(cols=['city', 'boolean'])
cci.fit_transform(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1          tokyo      no           like                 11.0
2         london      no  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0

A custom quantitative imputer

# custom quantitative imputer
from sklearn.impute import SimpleImputer
from sklearn.base import TransformerMixin

class CustomQuantitativeImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols          # the columns to impute
        self.strategy = strategy  # imputation strategy, e.g. 'mean'
        
    def transform(self, df):
        X = df.copy()
        impute = SimpleImputer(strategy=self.strategy)
        for col in self.cols:
            # SimpleImputer expects 2-D input, hence the double brackets
            X[col] = impute.fit_transform(X[[col]])
        return X
    
    def fit(self, *_):
        return self





cqi = CustomQuantitativeImputer(cols=['quantitative_column'], strategy='mean')

cqi.fit_transform(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1           None      no           like                 11.0
2         london    None  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  8.3
5          tokyo     yes        dislike                 20.0
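One caveat, noted here rather than in the original text: because CustomQuantitativeImputer fits its SimpleImputer inside transform, the fill value is recomputed from whatever data is passed in, so test data would be filled with its own mean instead of the training mean. A variant that learns the statistics during fit might look like this sketch:

# sketch of a variant that learns the column statistics in fit (hypothetical, not used below)
from sklearn.base import TransformerMixin
from sklearn.impute import SimpleImputer

class FittedQuantitativeImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def fit(self, df, *_):
        # learn one imputer per column on the training data
        self.imputers_ = {col: SimpleImputer(strategy=self.strategy).fit(df[[col]])
                          for col in self.cols}
        return self

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            X[col] = self.imputers_[col].transform(X[[col]])  # reuse the fitted statistics
        return X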

Calling the pipeline

# import Pipeline from sklearn
from sklearn.pipeline import Pipeline

imputer = Pipeline([('quant', cqi), ('category', cci)]) 
imputer.fit_transform(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1          tokyo      no           like                 11.0
2         london      no  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  8.3
5          tokyo     yes        dislike                 20.0
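For comparison (this is not part of the original walkthrough), scikit-learn 0.20+ can express the same per-column imputation without custom classes by combining ColumnTransformer with SimpleImputer. A rough sketch; note that SimpleImputer looks for np.nan, so the object columns have their None values converted first, and the result comes back as a plain array whose column order follows the transformers:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# replace None with np.nan so SimpleImputer recognises the missing entries
X_nan = X.where(X.notnull(), np.nan)

col_imputer = ColumnTransformer(
    transformers=[
        ('quant', SimpleImputer(strategy='mean'), ['quantitative_column']),
        ('category', SimpleImputer(strategy='most_frequent'), ['city', 'boolean'])],
    remainder='passthrough')  # leave ordinal_column untouched

col_imputer.fit_transform(X_nan)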

2.3 Encoding Categorical Variables

Convert categorical data into numerical data so that machine learning models can use it.

Encoding at the nominal level

Transform the categorical data into dummy variables:

  • use pandas to find the categorical variables and encode them automatically
  • build a custom dummy-variable encoder that can work inside a pipeline
pd.get_dummies(X, 
               columns = ['city', 'boolean'],  # the columns to dummify
               prefix_sep='__') # separator between the prefix (column name) and the cell value
  ordinal_column  quantitative_column  city__london  city__san francisco  city__seattle  city__tokyo  boolean__no  boolean__yes
0  somewhat like                  1.0             0                    0              0            1            0             1
1           like                 11.0             0                    0              0            0            1             0
2  somewhat like                 -0.5             1                    0              0            0            0             0
3           like                 10.0             0                    0              0            0            1             0
4  somewhat like                  NaN             0                    1              0            0            1             0
5        dislike                 20.0             0                    0              0            1            0             1
# custom dummy-variable encoder
class CustomDummifier(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols
        
    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)
    
    def fit(self, *_):
        return self




cd = CustomDummifier(cols=['boolean', 'city'])

cd.fit_transform(X)
  ordinal_column  quantitative_column  boolean_no  boolean_yes  city_london  city_san francisco  city_seattle  city_tokyo
0  somewhat like                  1.0           0            1            0                   0             0           1
1           like                 11.0           1            0            0                   0             0           0
2  somewhat like                 -0.5           0            0            1                   0             0           0
3           like                 10.0           1            0            0                   0             1           0
4  somewhat like                  NaN           1            0            0                   1             0           0
5        dislike                 20.0           0            1            0                   0             0           1

Encoding at the ordinal level

To preserve the order, we use a label encoder. By label encoder we mean that each label in the ordinal data is associated with a numerical value. In our example, this means the values of the ordinal column (dislike, somewhat like and like) are represented by 0, 1 and 2.

# create a list in which the position of each ordinal value is its code
ordering = ['dislike', 'somewhat like', 'like']  # 0 is dislike, 1 is somewhat like, 2 is like
# before mapping the ordering onto the ordinal column, take a look at the column

print(X['ordinal_column'])
0    somewhat like
1             like
2    somewhat like
3             like
4    somewhat like
5          dislike
Name: ordinal_column, dtype: object
# map the ordering onto the ordinal column
print(X['ordinal_column'].map(lambda x: ordering.index(x)))
0    1
1    2
2    1
3    2
4    1
5    0
Name: ordinal_column, dtype: int64
# a custom label encoder that can go into the pipeline
class CustomEncoder(TransformerMixin):
    def __init__(self, col, ordering=None):
        self.ordering = ordering
        self.col = col
        
    def transform(self, df):
        X = df.copy()
        X[self.col] = X[self.col].map(lambda x: self.ordering.index(x))
        return X
    
    def fit(self, *_):
        return self





ce = CustomEncoder(col='ordinal_column', ordering = ['dislike', 'somewhat like', 'like'])

ce.fit_transform(X)
            city boolean  ordinal_column  quantitative_column
0          tokyo     yes               1                  1.0
1           None      no               2                 11.0
2         london    None               1                 -0.5
3        seattle      no               2                 10.0
4  san francisco      no               1                  NaN
5          tokyo     yes               0                 20.0
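As a side note (mine, not the original text's): newer versions of scikit-learn also ship an OrdinalEncoder whose categories parameter fixes an explicit order, achieving the same mapping without a custom class. A brief sketch:

from sklearn.preprocessing import OrdinalEncoder

# categories fixes the mapping: dislike -> 0, somewhat like -> 1, like -> 2
oe = OrdinalEncoder(categories=[['dislike', 'somewhat like', 'like']])
oe.fit_transform(X[['ordinal_column']])  # returns a 2-D float array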

Binning continuous features

Use the cut function to bin the data (binning, also called bucketing); that is, it creates ranges over the data.

# the default category names are the bin intervals
pd.cut(X['quantitative_column'], bins=3)
0     (-0.52, 6.333]
1    (6.333, 13.167]
2     (-0.52, 6.333]
3    (6.333, 13.167]
4                NaN
5     (13.167, 20.0]
Name: quantitative_column, dtype: category
Categories (3, interval[float64]): [(-0.52, 6.333] < (6.333, 13.167] < (13.167, 20.0]]
# without labels
pd.cut(X['quantitative_column'], bins=3, labels=False)
0    0.0
1    1.0
2    0.0
3    1.0
4    NaN
5    2.0
Name: quantitative_column, dtype: float64
class CustomCutter(TransformerMixin):
    def __init__(self, col, bins, labels=False):
        self.labels = labels
        self.bins = bins
        self.col = col
        
    def transform(self, df):
        X = df.copy()
        X[self.col] = pd.cut(X[self.col], bins=self.bins, labels=self.labels)
        return X
    
    def fit(self, *_):
        return self




cc = CustomCutter(col='quantitative_column', bins=3)

cc.fit_transform(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  0.0
1           None      no           like                  1.0
2         london    None  somewhat like                  0.0
3        seattle      no           like                  1.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                  2.0

Building the pipeline

The pipeline applies the steps in this order:

  1. fill the missing values with the imputer
  2. one-hot encode the categorical columns with dummy variables
  3. encode ordinal_column
  4. bin quantitative_column
from sklearn.pipeline import Pipeline

pipe = Pipeline([("imputer", imputer), ('dummify', cd), ('encode', ce), ('cut', cc)])



# the data before it enters the pipeline
print(X)
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1           None      no           like                 11.0
2         london    None  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0
# fit the pipeline
pipe.fit(X)
Pipeline(memory=None,
         steps=[('imputer',
                 Pipeline(memory=None,
                          steps=[('quant',
                                  <__main__.CustomQuantitativeImputer object at 0x7f2e8d747048>),
                                 ('category',
                                  <__main__.CustomCategoryImputer object at 0x7f2eda0e4320>)],
                          verbose=False)),
                ('dummify',
                 <__main__.CustomDummifier object at 0x7f2e87e65080>),
                ('encode', <__main__.CustomEncoder object at 0x7f2e87e7a358>),
                ('cut', <__main__.CustomCutter object at 0x7f2e87e7acf8>)],
         verbose=False)
pipe.transform(X)
   ordinal_column  quantitative_column  boolean_no  boolean_yes  city_london  city_san francisco  city_seattle  city_tokyo
0               1                    0           0            1            0                   0             0           1
1               2                    1           1            0            0                   0             0           1
2               1                    0           1            0            1                   0             0           0
3               2                    1           1            0            0                   0             1           0
4               1                    1           1            0            0                   1             0           0
5               0                    2           0            1            0                   0             0           1

2.4 Extending Numerical Features

A dataset for recognising activities from a chest-mounted accelerometer.

The data is split by participant and contains:

  • a sequence number
  • x-axis acceleration
  • y-axis acceleration
  • z-axis acceleration
  • a label. The label is a number, and each number stands for an activity: 1 working at a computer; 2 standing up, walking and going up/down stairs; 3 standing; 4 walking; 5 going up/down stairs; 6 walking and talking with someone; 7 talking while standing.
path = '/home/kesci/input/Chest_accelerat3744/activity_recognizer.csv'
df = pd.read_csv(path, header=None)
df.columns = ['index', 'x', 'y', 'z', 'activity']
df.head()
   index     x     y     z  activity
0    0.0  1502  2215  2153         1
1    1.0  1667  2072  2047         1
2    2.0  1611  1957  1906         1
3    3.0  1601  1939  1831         1
4    4.0  1643  1965  1879         1

Check the null accuracy

df['activity'].value_counts(normalize=True)
7    0.515369
1    0.207242
4    0.165291
3    0.068793
5    0.019637
6    0.017951
2    0.005711
0    0.000006
Name: activity, dtype: float64

The null accuracy is 51.54%, which means that if we always guessed 7 (talking while standing) we would already be right more than half the time.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

X = df[['x', 'y', 'z']]  # feature matrix (the response variable is left out)
y = df['activity']       # response variable

# variables and instances needed for the grid search

# KNN model parameters to try
knn_params = {'n_neighbors':[3, 4, 5, 6]}

knn = KNeighborsClassifier()
grid = GridSearchCV(knn, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.720752487676999 {'n_neighbors': 5}

With 5 neighbours, the KNN model reaches 72.08% accuracy, far better than the 51.54% null accuracy.

Polynomial features

Use PolynomialFeatures to create new columns that are products of the existing columns, in order to capture feature interactions. (A toy illustration of these parameters follows the list below.)

  • degree is the degree of the polynomial features; the default is 2.
  • interaction_only is a boolean: if True (the default is False), only interaction features are produced, i.e. products of distinct features, with no powers of a single feature.
  • include_bias is also a boolean: if True (the default), a bias column of degree 0 is added, i.e. a column of all 1s.
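To make these parameters concrete, here is a tiny illustration on a hypothetical two-feature row (not the accelerometer data). With degree=2 and include_bias=False, the columns [a, b] expand to [a, b, a^2, a*b, b^2]; the same arithmetic is why the three accelerometer columns expand to 3 + 3 squares + 3 pairwise products = 9 features below.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2., 3.]])  # one row with two features: a=2, b=3

# full degree-2 expansion: a, b, a^2, a*b, b^2
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(toy))
# [[2. 3. 4. 6. 9.]]

# interaction_only=True keeps only products of distinct features: a, b, a*b
print(PolynomialFeatures(degree=2, include_bias=False, interaction_only=True).fit_transform(toy))
# [[2. 3. 6.]]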
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)

X_poly = poly.fit_transform(X)
X_poly.shape

(162501, 9)

pd.DataFrame(X_poly, columns=poly.get_feature_names()).head()
       x0      x1      x2       x0^2      x0 x1      x0 x2       x1^2      x1 x2       x2^2
0  1502.0  2215.0  2153.0  2256004.0  3326930.0  3233806.0  4906225.0  4768895.0  4635409.0
1  1667.0  2072.0  2047.0  2778889.0  3454024.0  3412349.0  4293184.0  4241384.0  4190209.0
2  1611.0  1957.0  1906.0  2595321.0  3152727.0  3070566.0  3829849.0  3730042.0  3632836.0
3  1601.0  1939.0  1831.0  2563201.0  3104339.0  2931431.0  3759721.0  3550309.0  3352561.0
4  1643.0  1965.0  1879.0  2699449.0  3228495.0  3087197.0  3861225.0  3692235.0  3530641.0
Exploratory data analysis
%matplotlib inline
import seaborn as sns

sns.heatmap(pd.DataFrame(X_poly, columns=poly.get_feature_names()).corr())

# now with interaction_only set to True
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True) 
X_poly = poly.fit_transform(X) 
print(X_poly.shape)

(162501, 6)

pd.DataFrame(X_poly, columns=poly.get_feature_names()).head()
       x0      x1      x2      x0 x1      x0 x2      x1 x2
0  1502.0  2215.0  2153.0  3326930.0  3233806.0  4768895.0
1  1667.0  2072.0  2047.0  3454024.0  3412349.0  4241384.0
2  1611.0  1957.0  1906.0  3152727.0  3070566.0  3730042.0
3  1601.0  1939.0  1831.0  3104339.0  2931431.0  3550309.0
4  1643.0  1965.0  1879.0  3228495.0  3087197.0  3692235.0
sns.heatmap(pd.DataFrame(X_poly, columns=poly.get_feature_names()).corr())

# pipeline
from sklearn.pipeline import Pipeline

pipe_params = {'poly_features__degree':[1, 2, 3], 'poly_features__interaction_only':[True, False], 'classify__n_neighbors':[3, 4, 5, 6]}

pipe = Pipeline([('poly_features', poly), ('classify', knn)])

grid = GridSearchCV(pipe, pipe_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.7211894080651812 {'classify__n_neighbors': 5, 'poly_features__degree': 2, 'poly_features__interaction_only': True}

2.5 Text-Specific Feature Construction

The bag-of-words approach

Describe a document by the words that occur in it, ignoring entirely where each word appears in the document. The three steps of bag-of-words are (a toy example follows this list):

  • tokenizing
  • counting
  • normalizing
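A minimal sketch of the first two steps on a made-up two-document corpus (not the Twitter data); normalization is what the TF-IDF vectorizer adds later in this section:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog sat']  # hypothetical toy corpus
cv = CountVectorizer()
counts = cv.fit_transform(docs)    # tokenize, then count
print(cv.get_feature_names())      # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts.toarray())            # [[1 0 1 1 1 2]
                                   #  [0 1 0 0 1 1]]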
twitter_path = '/home/kesci/input/Twitter8140/twitter_sentiment.csv'
tweets = pd.read_csv(twitter_path, encoding='latin1')

tweets.head()
   ItemID  Sentiment                             SentimentText
0       1          0              is so sad for my APL frie...
1       2          0            I missed the New Moon trail...
2       3          1                   omg its already 7:30 :O
3       4          0    .. Omgaga. Im sooo im gunna CRy. I'...
4       5          0   i think mi bf is cheating on me!!!  ...
del tweets['ItemID']

tweets.head()
   Sentiment                             SentimentText
0          0              is so sad for my APL frie...
1          0            I missed the New Moon trail...
2          1                   omg its already 7:30 :O
3          0    .. Omgaga. Im sooo im gunna CRy. I'...
4          0   i think mi bf is cheating on me!!!  ...
from sklearn.feature_extraction.text import CountVectorizer

X = tweets['SentimentText']
y = tweets['Sentiment']

vect = CountVectorizer()
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105849)

CountVectorizer parameters

CountVectorizer converts the text column into a matrix in which the columns are terms and the cell values are the number of occurrences of each term in each document.

  • stop_words: the stop words to remove
  • min_df: ignore terms whose document frequency is below the threshold, reducing the number of features
  • max_df: ignore terms whose document frequency is above the threshold, reducing the number of features
  • ngram_range: a tuple giving the lower and upper bounds of the sizes of n-grams to extract
  • analyzer: determines whether the features are words or phrases; the default is words
vect = CountVectorizer(stop_words='english')  # remove English stop words (if, a, the, etc.)
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105545)

vect = CountVectorizer(min_df=.05)  # keep only words that appear in at least 5% of the documents
# this reduces the number of features
_ = vect.fit_transform(X)
print(_.shape)

(99989, 31)

vect = CountVectorizer(max_df=.8)  # keep only words that appear in at most 80% of the documents
# a way of "inferring" stop words
_ = vect.fit_transform(X)
print(_.shape)
(99989, 105849)
vect = CountVectorizer(ngram_range=(1, 5))  # include phrases of up to 5 words
_ = vect.fit_transform(X)
print(_.shape)  # the number of features explodes

(99989, 3219557)

vect = CountVectorizer(analyzer='word')  # the default analyzer, which splits into words
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105849)

Stemming is a common natural language processing technique that extracts the stem of each word, i.e. converts a word to its root, thereby shrinking the vocabulary.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmer.stem('interesting')

'interest'

# a function that turns text into word roots
def word_tokenize(text, how='lemma'):  # the how parameter is unused here
    words = text.split(' ')  # tokenize by splitting on spaces
    return [stemmer.stem(word) for word in words]

word_tokenize("hello you are very interesting")

['hello', 'you', 'are', 'veri', 'interest']

vect = CountVectorizer(analyzer=word_tokenize)
_ = vect.fit_transform(X)
print(_.shape)  # stems are shorter, but the naive space split keeps punctuation and case, so the vocabulary is actually larger here

(99989, 154397)

TF-IDF vectorization

  • TF (term frequency): measures how frequently a term occurs in a document. Because documents differ in length, a term may appear far more often in a long document than in a short one, so the term frequency is usually normalized by dividing it by the document length (the total number of terms in the document).

  • IDF (inverse document frequency): measures how important a term is. When computing term frequency, all terms are treated as equally important, but some terms (such as is, of and that) may occur very often while carrying little information, so we down-weight common terms and up-weight rare ones.

To reiterate, TfidfVectorizer builds features from terms just as CountVectorizer does, but it goes one step further and normalizes the term counts by how frequently each term occurs across the corpus.
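For reference, these are scikit-learn's defaults rather than something spelled out in the original text: with smooth_idf=True, the idf is ln((1 + n) / (1 + df(t))) + 1, it multiplies the raw term count, and each document row is then L2-normalized, which is also why the per-row mean below is smaller than for plain counts. A toy check on a hypothetical corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog sat']  # hypothetical toy corpus
counts = CountVectorizer().fit_transform(docs).toarray()

n = counts.shape[0]
df_t = (counts > 0).sum(axis=0)              # document frequency of each term
idf = np.log((1 + n) / (1 + df_t)) + 1       # smooth idf, sklearn's default
tfidf_manual = counts * idf                  # tf-idf = raw count * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalize each row

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))  # True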

from sklearn.feature_extraction.text import TfidfVectorizer

# document-term matrix with CountVectorizer
vect = CountVectorizer()
_ = vect.fit_transform(X)
print(_.shape, _[0,:].mean())

(99989, 105849) 6.613194267305311e-05

# TfidfVectorizer
vect = TfidfVectorizer()
_ = vect.fit_transform(X)
print(_.shape, _[0,:].mean())  # same shape, different contents

(99989, 105849) 2.1863060975751186e-05

Using text in a machine learning pipeline

from sklearn.naive_bayes import MultinomialNB

# the null accuracy
y.value_counts(normalize=True)
1    0.564632
0    0.435368
Name: Sentiment, dtype: float64

We want the accuracy to beat 56.5%. We build the pipeline in two steps:

  • turn the tweets into features with CountVectorizer
  • classify positive vs. negative sentiment with the MultinomialNB naive Bayes model
# set the pipeline parameters
pipe_params = {'vect__ngram_range':[(1, 1), (1, 2)], 'vect__max_features':[1000, 10000], 'vect__stop_words':[None, 'english']}
 
# instantiate the pipeline
pipe = Pipeline([('vect', CountVectorizer()), ('classify', MultinomialNB())])
 
# instantiate the grid search
grid = GridSearchCV(pipe, pipe_params)
# fit the grid search object
grid.fit(X, y)
 
# get the results
print(grid.best_score_, grid.best_params_)

0.7557531328446129 {'vect__max_features': 10000, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}

scikit-learn provides a FeatureUnion module that lines features up horizontally (side by side), so several kinds of text featurizers can be used within a single pipeline.

from sklearn.pipeline import FeatureUnion

# a standalone featurizer object
featurizer = FeatureUnion([('tfidf_vect', TfidfVectorizer()), 
                           ('count_vect', CountVectorizer())])

_ = featurizer.fit_transform(X)
print(_.shape)  # same number of rows, but twice as many columns
(99989, 211698)
# change the featurizer's parameters and see the effect
featurizer.set_params(tfidf_vect__max_features=100, count_vect__ngram_range=(1, 2), count_vect__max_features=300)
# TfidfVectorizer keeps only 100 words, while CountVectorizer keeps 300 phrases of 1-2 words
_ = featurizer.fit_transform(X)
print(_.shape)  # same number of rows; the columns are now 100 + 300 = 400

(99989, 400)

# the full pipeline
pipe_params = {'featurizer__count_vect__ngram_range':[(1, 1), (1, 2)], 
               'featurizer__count_vect__max_features':[1000, 10000], 
               'featurizer__count_vect__stop_words':[None, 'english'],
               'featurizer__tfidf_vect__ngram_range':[(1, 1), (1, 2)],
               'featurizer__tfidf_vect__max_features':[1000, 10000], 
               'featurizer__tfidf_vect__stop_words':[None, 'english']} 

pipe = Pipeline([('featurizer', featurizer), ('classify', MultinomialNB())])
grid = GridSearchCV(pipe, pipe_params)

grid.fit(X, y)
print(grid.best_score_, grid.best_params_)
0.7584334276770445 {'featurizer__count_vect__max_features': 10000, 'featurizer__count_vect__ngram_range': (1, 2), 'featurizer__count_vect__stop_words': None, 'featurizer__tfidf_vect__max_features': 10000, 'featurizer__tfidf_vect__ngram_range': (1, 1), 'featurizer__tfidf_vect__stop_words': 'english'}