[Mastering Feature Engineering] Study Notes (Part 3)

Janet_zyh

Article link: https://blog.csdn.net/JanetHULAHA/article/details/104448555


[Mastering Feature Engineering] Study Notes: Day 3, 2.13, Chapter 4, pp. 52-64

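4. The Effects of Feature Scaling: From Bag-of-Words to tf-idf

4.1 tf-idf: A Simple Extension of Bag-of-Words

tf-idf: term frequency - inverse document frequency
Instead of the raw count of each word in each document, tf-idf computes a normalized count, in which each word's count is divided by the number of documents the word appears in.

bow(w, d) = number of times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (number of documents in which word w appears)
where N is the total number of documents in the dataset.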
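
As a minimal sketch of the two formulas above, the snippet below computes bow and this simple tf-idf for a tiny made-up corpus. The corpus and the names `bow`, `doc_freq`, and `tf_idf` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens
docs = [
    "the food is great".split(),
    "the bar is loud".split(),
    "great great cocktails".split(),
    "the service is slow".split(),
]
N = len(docs)  # total number of documents

# bow(w, d): raw count of word w in document d
bow = [Counter(d) for d in docs]

# Number of documents in which each word appears
doc_freq = Counter(w for d in docs for w in set(d))

def tf_idf(w, d_index):
    """Simple tf-idf from the notes: bow(w, d) * N / (document frequency of w)."""
    return bow[d_index][w] * N / doc_freq[w]

print(tf_idf("the", 0))        # common word: 1 * 4 / 3
print(tf_idf("cocktails", 2))  # rare word:   1 * 4 / 1
print(tf_idf("great", 2))      # repeated:    2 * 4 / 2
```

A word that occurs in most documents ("the") gets a factor close to 1, while a word that occurs in a single document ("cocktails") is multiplied by N.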

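4.2 Putting tf-idf to the Test

tf-idf transforms the word count features by multiplying each one by a constant (a different constant for each word). It is therefore a form of feature scaling.

Step 1: Load and clean the Yelp reviews dataset in Python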

>>> import json
>>> import pandas as pd
# Load the Yelp business data
>>> biz_f = open('yelp_academic_dataset_business.json')
>>> biz_df = pd.DataFrame([json.loads(x) for x in biz_f.readlines()])
>>> biz_f.close()
# Load the Yelp review data
>>> review_file = open('yelp_academic_dataset_review.json')
>>> review_df = pd.DataFrame([json.loads(x) for x in review_file.readlines()])
>>> review_file.close()
# Select the nightlife venues and the restaurants
>>> two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or
...                                         'Restaurants' in x['categories'],
...                               axis=1)]
# Join with the reviews to get all reviews for the two types of business
>>> twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')
# Drop the features we do not need
>>> twobiz_reviews = twobiz_reviews[['business_id',
...                                  'name',
...                                  'stars_y',
...                                  'text',
...                                  'categories']]
# Create the target column: True for nightlife businesses, False otherwise
>>> twobiz_reviews['target'] = \
...     twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'],
...                          axis=1)

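4.2.1 Creating a Classification Dataset

The Yelp reviews dataset is class-imbalanced, so it is processed as follows:

Take a random 10% sample of the nightlife reviews and a random 2.1% sample of the restaurant reviews (these percentages are chosen so that the two classes end up with roughly the same number of reviews).
Split this dataset 70/30 into a training set and a test set. In this example, the training set has 29,264 reviews and the test set has 12,542 reviews.
The training data contains 46,924 unique words, which is the number of features in the bag-of-words representation.

Step 2: Create a balanced classification dataset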

# Create a class-balanced subsample to practice on
>>> import sklearn.model_selection as modsel
>>> nightlife = \
...     twobiz_reviews[twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'],
...                                         axis=1)]
>>> restaurants = \
...     twobiz_reviews[twobiz_reviews.apply(lambda x: 'Restaurants' in x['categories'],
...                                         axis=1)]
>>> nightlife_subset = nightlife.sample(frac=0.1, random_state=123)
>>> restaurant_subset = restaurants.sample(frac=0.021, random_state=123)
>>> combined = pd.concat([nightlife_subset, restaurant_subset])
# Split into a training set and a test set
>>> training_data, test_data = modsel.train_test_split(combined,
...                                                    train_size=0.7,
...                                                    random_state=123)
>>> training_data.shape
(29264, 5)
>>> test_data.shape
(12542, 5)
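
The sampling fractions in Step 2 (10% and 2.1%) follow from the class sizes: the larger class is sampled at a proportionally smaller rate. The sketch below shows that calculation on synthetic data; the DataFrame, its column names, and the class sizes are all assumptions made up for the example, not values from the Yelp dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (hypothetical): 1,000 'a' rows and 5,000 'b' rows
df = pd.DataFrame({
    "label": ["a"] * 1000 + ["b"] * 5000,
    "value": range(6000),
})

frac_a = 0.10
n_a = int((df["label"] == "a").sum())
n_b = int((df["label"] == "b").sum())
# Pick the fraction for the larger class so both subsamples have similar size
frac_b = frac_a * n_a / n_b

subset_a = df[df["label"] == "a"].sample(frac=frac_a, random_state=123)
subset_b = df[df["label"] == "b"].sample(frac=frac_b, random_state=123)
combined = pd.concat([subset_a, subset_b])

# 70/30 split into training and test sets
train, test = train_test_split(combined, train_size=0.7, random_state=123)

print(frac_b)
print(combined["label"].value_counts().to_dict())
print(train.shape, test.shape)
```

With these made-up sizes the second class is five times larger, so it is sampled at one fifth of the rate.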

4.2.2 Scaling Bag-of-Words with the tf-idf Transformation

Step 3: Transform the features

>>> from sklearn.feature_extraction import text
>>> import sklearn.preprocessing as preproc
# Represent the review text as a bag-of-words
>>> bow_transform = text.CountVectorizer()
>>> X_tr_bow = bow_transform.fit_transform(training_data['text'])
>>> X_te_bow = bow_transform.transform(test_data['text'])
>>> len(bow_transform.vocabulary_)
46924
>>> y_tr = training_data['target']
>>> y_te = test_data['target']
# Create the tf-idf representation from the bag-of-words matrix
>>> tfidf_trfm = text.TfidfTransformer(norm=None)
>>> X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
>>> X_te_tfidf = tfidf_trfm.transform(X_te_bow)
# For practice only, l2-normalize the bag-of-words representation
# (axis=0 normalizes each column, i.e. each word feature)
>>> X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
>>> X_te_l2 = preproc.normalize(X_te_bow, axis=0)
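
One detail worth checking: scikit-learn's TfidfTransformer does not use the raw ratio N / df from section 4.1. With its default setting smooth_idf=True it uses the smoothed logarithmic weight ln((1 + N) / (1 + df)) + 1. The sketch below verifies this on a toy corpus; the corpus and the variable `expected_idf` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical)
docs = ["the food is great", "the bar is loud",
        "great great cocktails", "the service is slow"]

bow_transform = CountVectorizer()
X_bow = bow_transform.fit_transform(docs)      # sparse matrix of word counts

tfidf_trfm = TfidfTransformer(norm=None)       # no row normalization afterwards
X_tfidf = tfidf_trfm.fit_transform(X_bow)

counts = X_bow.toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each word

# Smoothed idf used by scikit-learn when smooth_idf=True (the default)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(tfidf_trfm.idf_, expected_idf))
# With norm=None, each column of counts is simply multiplied by its idf
print(np.allclose(X_tfidf.toarray(), counts * expected_idf))
```

Either way, each word's column is multiplied by one constant, so the transformation is still a column scaling.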

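Note (feature scaling on the test set): A subtle point about feature scaling is that it requires feature statistics that we most likely do not know in practice, such as the mean, variance, document frequency, and l2 norm. To compute the tf-idf representation, we have to compute the inverse document frequencies from the training data and then use those statistics to scale both the training data and the test data. In scikit-learn, fitting a feature transformer on the training data amounts to collecting the relevant statistics. The fitted transformer can then be applied to the test data.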
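
The fit-on-train, apply-to-test pattern described in this note can be sketched on a tiny example. The two lists of reviews are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["good food", "good bar", "bad service"]   # hypothetical training reviews
test_docs = ["good service", "amazing rooftop bar"]     # hypothetical test reviews

bow = CountVectorizer()
tfidf = TfidfTransformer(norm=None)

# fit_transform collects the statistics (vocabulary, document frequencies)
# from the training data
X_tr = tfidf.fit_transform(bow.fit_transform(train_docs))
# transform reuses those statistics; nothing is re-estimated from the test data
X_te = tfidf.transform(bow.transform(test_docs))

print(sorted(bow.vocabulary_))   # only words seen in the training data
print(X_tr.shape, X_te.shape)    # both matrices have the same columns

# The weight of "good" in the first test review is the idf learned from training
good_col = bow.vocabulary_["good"]
print(X_te[0, good_col], tfidf.idf_[good_col])
```

Words that appear only in the test reviews ("amazing", "rooftop") have no column and are dropped, which is one practical consequence of taking all statistics from the training set.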

4.2.3 Classification with Logistic Regression

Step 4: Train logistic regression classifiers with default parameters

>>> from sklearn.linear_model import LogisticRegression
>>> def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
...     # Helper function: train a logistic regression classifier
...     # and score it on the test data
...     m = LogisticRegression(C=_C).fit(X_tr, y_tr)
...     s = m.score(X_test, y_test)
...     print('Test score with', description, 'features:', s)
...     return m
>>> m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
>>> m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
>>> m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')

Test score with bow features: 0.775873066497
Test score with l2-normalized features: 0.763514590974
Test score with tf-idf features: 0.743182905438

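The results show that the most accurate classifier is the one using the plain bag-of-words features. The actual reason is that the classifiers have not been well "tuned": all three were trained with the default regularization setting.

4.2.4 Tuning Logistic Regression with Regularization

scikit-learn's GridSearchCV runs a grid search with cross-validation.
Step 5: Tune logistic regression with a grid search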

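>>> import sklearn.model_selection as modsel
# Specify a search grid, then run a 5-fold grid search for each feature set
>>> param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
# Tune the classifier for the bag-of-words representation
# (return_train_score=True is needed in recent scikit-learn releases
# to get the train scores shown in the output below)
>>> bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                  param_grid=param_grid_,
...                                  return_train_score=True)
>>> bow_search.fit(X_tr_bow, y_tr)
# Tune the classifier for the l2-normalized word vectors
>>> l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                 param_grid=param_grid_)
>>> l2_search.fit(X_tr_l2, y_tr)
# Tune the classifier for tf-idf
>>> tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5,
...                                    param_grid=param_grid_)
>>> tfidf_search.fit(X_tr_tfidf, y_tr)
# Check one of the grid search outputs to see how it ran
>>> bow_search.cv_results_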
{'mean_fit_time': array([  0.43648252,   0.94630651,
         5.64090128,  15.31248307,  31.47010217,  42.44257565]),
'mean_score_time': array([ 0.00080056,  0.00392466,  0.00864897,  0.00784755,
         0.01192751,  0.0072515 ]),
'mean_test_score': array([ 0.57897075,  0.7518111 ,  0.78283898,  0.77381766,
         0.75515992,  0.73937261]),
'mean_train_score': array([ 0.5792185 ,  0.76731652,  0.87697341,  0.94629064,
         0.98357195,  0.99441294]),
'param_C': masked_array(data = [1e-05 0.001 0.1 1.0 10.0 100.0],
              mask = [False False False False False False],
        fill_value = ?),
'params': ({'C': 1e-05},
  {'C': 0.001},
  {'C': 0.1},
  {'C': 1.0},
  {'C': 10.0},
  {'C': 100.0}),
'rank_test_score': array([6, 4, 1, 2, 3, 5]),
'split0_test_score': array([ 0.58028698,  0.75025624,  0.7799795 ,  0.7726341 ,
         0.75247694,  0.74086095]),
'split0_train_score': array([ 0.57923964,  0.76860316,  0.87560871,  0.94434003,
         0.9819308 ,  0.99470312]),
'split1_test_score': array([ 0.5786776 ,  0.74628396,  0.77669571,  0.76627371,
         0.74867589,  0.73176149]),
'split1_train_score': array([ 0.57917218,  0.7684849 ,  0.87945837,  0.94822946,
         0.98504976,  0.99538678]),
'split2_test_score': array([ 0.57816504,  0.75533914,  0.78472578,  0.76832394,
         0.74799248,  0.7356911 ]),
'split2_train_score': array([ 0.57977019,  0.76613558,  0.87689548,  0.94566657,
         0.98368288,  0.99397719]),
'split3_test_score': array([ 0.57894737,  0.75051265,  0.78332194,  0.77682843,
         0.75768968,  0.73855092]),
'split3_train_score': array([ 0.57914745,  0.76678626,  0.87634546,  0.94558346,
         0.98385443,  0.99474628]),
'split4_test_score': array([ 0.57877649,  0.75666439,  0.78947368,  0.78503076,
         0.76896787,  0.75      ]),
'split4_train_score': array([ 0.57876303,  0.7665727 ,  0.87655903,  0.94763369,
         0.98334188,  0.99325132]),
'std_fit_time': array([ 0.03874582,  0.02297261,  1.18862097,  1.83901079,
         4.21516797,  2.93444269]),
'std_score_time': array([ 0.00160112,  0.00605009,  0.00623053,  0.00698687,
         0.00713112,  0.00570195]),
'std_test_score': array([ 0.00070799,  0.00375907,  0.00432957,  0.00668246,
         0.00612049]),
'std_train_score': array([ 0.00032232,  0.00102466,  0.00131222,  0.00143229,
         0.00100223,  0.00073252])}
# Plot the cross-validation results in a box plot
# to compare classifier performance visually
>>> search_results = pd.DataFrame.from_dict({
...     'bow': bow_search.cv_results_['mean_test_score'],
...     'tfidf': tfidf_search.cv_results_['mean_test_score'],
...     'l2': l2_search.cv_results_['mean_test_score']
... })
# The usual matplotlib setup;
# seaborn is used to make the plot look nicer
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> ax = sns.boxplot(data=search_results, width=0.4)
>>> ax.set_ylabel('Accuracy', size=14)
>>> ax.tick_params(labelsize=14)
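
The grid-search pattern from Step 5 can be tried quickly on a small synthetic problem before running it on the full review data. The sketch below is only an illustration under stated assumptions: the dataset comes from make_classification, and the grid values and max_iter=1000 are arbitrary choices, not settings from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem (hypothetical stand-in for the text features)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# C is the inverse regularization strength: smaller C means stronger regularization
param_grid = {"C": [1e-3, 1e-1, 1e0, 1e1]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid=param_grid,
                      cv=5,
                      return_train_score=True)
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])   # one entry per value of C
print(search.cv_results_["mean_train_score"])  # present because return_train_score=True

# With refit=True (the default), best_estimator_ is retrained on all of X
# using the best value of C
print(search.best_estimator_.C == search.best_params_["C"])
```

Comparing the mean train scores with the mean test scores across the grid is a simple way to see where a model starts to overfit as regularization weakens.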

Step 6: Final training and testing to compare the different feature sets

# Using the best hyperparameter settings found above,
# train a final model on the entire training set
# and measure its accuracy on the test set
>>> m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow',
...                               _C=bow_search.best_params_['C'])
>>> m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized',
...                               _C=l2_search.best_params_['C'])
>>> m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf',
...                               _C=tfidf_search.best_params_['C'])
Test score with bow features: 0.78360708021
Test score with l2-normalized features: 0.780178599904
Test score with tf-idf features: 0.788470738319

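4.3 Deep Dive: What Is Happening?

tf-idf = column scaling
Both tf-idf and l2 normalization are column operations on the data matrix.
The right feature scaling helps with classification. It accentuates informative words and downweights common words. It can also reduce the condition number of the data matrix. The right scaling is not necessarily uniform column scaling.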
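
To make "column scaling" concrete: scaling each column of a data matrix by its own constant is the same as multiplying the matrix on the right by a diagonal matrix. The sketch below illustrates this with NumPy and compares condition numbers before and after scaling. The random matrix and its column scales are assumptions made up for the example, and the improvement shown is specific to this example, not a general guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: 3 features on very different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])

# Column scaling = right-multiplication by a diagonal matrix
col_norms = np.linalg.norm(X, axis=0)   # l2 norm of each column
D = np.diag(1.0 / col_norms)
X_scaled = X @ D

# Same result as dividing each column by its norm
print(np.allclose(X_scaled, X / col_norms))

# Condition number (ratio of largest to smallest singular value) before and after
print(np.linalg.cond(X), np.linalg.cond(X_scaled))
```

Here the unscaled columns differ by several orders of magnitude, so the unscaled matrix is badly conditioned; after l2-normalizing the columns the condition number drops sharply.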

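Reference: Feature Engineering for Machine Learning (《精通特征工程》), Alice Zheng and Amanda Casari

Study notes on feature engineering for machine learning:
[Mastering Feature Engineering] Study Notes (Part 1)
[Mastering Feature Engineering] Study Notes (Part 2)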
