1.好书推荐:
北大出版社,人工智能原理与实践 人工智能和数据科学从入门到精通 详解机器学习深度学习算法原理
人工智能原理与实践 全面涵盖人工智能和数据科学各个重要体系经典
2. 引言
ColumnTransformer是数据科学项目中最有用的数据转换工具之一。在这里,我们展示了一个完整的例子,应用columnTransformer作为预处理工具,然后应用RandomForrest回归模型预测在线文章的搜索份额(SOV)。
3. 数据介绍
数据中特征已经创建,理解这些特征名称就可以极大地帮助我们理解影响在线文章排名的因素。
数据的具体构成如下:
Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
Attribute Information:
0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
14. data_channel_is_entertainment: Is data channel ‘Entertainment’?
15. data_channel_is_bus: Is data channel ‘Business’?
16. data_channel_is_socmed: Is data channel ‘Social Media’?
17. data_channel_is_tech: Is data channel ‘Tech’?
18. data_channel_is_world: Is data channel ‘World’?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
4. 加载库
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from matplotlib import pyplot as plt
5. import data
df = pd.read_csv('data/OnlineNewsPopularity.csv')
df
url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | http://mashable.com/2013/01/07/amazon-instant-... | 731.0 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | 4.0 | 2.0 | 1.0 | ... | 0.100000 | 0.70 | -0.350000 | -0.600 | -0.200000 | 0.500000 | -0.187500 | 0.000000 | 0.187500 | 593 |
1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731.0 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | 3.0 | 1.0 | 1.0 | ... | 0.033333 | 0.70 | -0.118750 | -0.125 | -0.100000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 711 |
2 | http://mashable.com/2013/01/07/apple-40-billio... | 731.0 | 9.0 | 211.0 | 0.575130 | 1.0 | 0.663866 | 3.0 | 1.0 | 1.0 | ... | 0.100000 | 1.00 | -0.466667 | -0.800 | -0.133333 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 1500 |
3 | http://mashable.com/2013/01/07/astronaut-notre... | 731.0 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 0.80 | -0.369697 | -0.600 | -0.166667 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 1200 |
4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731.0 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.540890 | 19.0 | 19.0 | 20.0 | ... | 0.033333 | 1.00 | -0.220192 | -0.500 | -0.050000 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
39639 | http://mashable.com/2014/12/27/samsung-app-aut... | 8.0 | 11.0 | 346.0 | 0.529052 | 1.0 | 0.684783 | 9.0 | 7.0 | 1.0 | ... | 0.100000 | 0.75 | -0.260000 | -0.500 | -0.125000 | 0.100000 | 0.000000 | 0.400000 | 0.000000 | 1800 |
39640 | http://mashable.com/2014/12/27/seth-rogen-jame... | 8.0 | 12.0 | 328.0 | 0.696296 | 1.0 | 0.885057 | 9.0 | 7.0 | 3.0 | ... | 0.136364 | 0.70 | -0.211111 | -0.400 | -0.100000 | 0.300000 | 1.000000 | 0.200000 | 1.000000 | 1900 |
39641 | http://mashable.com/2014/12/27/son-pays-off-mo... | 8.0 | 10.0 | 442.0 | 0.516355 | 1.0 | 0.644128 | 24.0 | 1.0 | 12.0 | ... | 0.136364 | 0.50 | -0.356439 | -0.800 | -0.166667 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 1900 |
39642 | http://mashable.com/2014/12/27/ukraine-blasts/ | 8.0 | 6.0 | 682.0 | 0.539493 | 1.0 | 0.692661 | 10.0 | 1.0 | 1.0 | ... | 0.062500 | 0.50 | -0.205246 | -0.500 | -0.012500 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 1100 |
39643 | http://mashable.com/2014/12/27/youtube-channel... | 8.0 | 10.0 | 157.0 | 0.701987 | 1.0 | 0.846154 | 1.0 | 1.0 | 0.0 | ... | 0.100000 | 0.50 | -0.200000 | -0.200 | -0.200000 | 0.333333 | 0.250000 | 0.166667 | 0.250000 | 1300 |
39644 rows × 61 columns
6. 特征和目标变量定义
# be careful the features has a prefix of white space
numeric_features = [' n_tokens_title', ' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos', ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle', ' data_channel_is_entertainment', ' data_channel_is_bus', ' data_channel_is_socmed', ' data_channel_is_tech', ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min', ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg', ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares', ' self_reference_max_shares', ' self_reference_avg_sharess', ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday', ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday', ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02', ' LDA_03', ' LDA_04', ' global_subjectivity', ' global_sentiment_polarity', ' global_rate_positive_words', ' global_rate_negative_words', ' rate_positive_words', ' rate_negative_words', ' avg_positive_polarity', ' min_positive_polarity', ' max_positive_polarity', ' avg_negative_polarity', ' min_negative_polarity', ' max_negative_polarity', ' title_subjectivity', ' title_sentiment_polarity', ' abs_title_subjectivity', ' abs_title_sentiment_polarity']
categorical_features =[]
target = ' shares'
X = df.drop(target,axis=1)
y = df[target]
7. 训练和测试数据分割
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
8. 定义转换器
# steps to handle numeric features
numeric_transformer = Pipeline(steps=[
('num_imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
# steps to handel categorical features
categorical_transformer = Pipeline(steps = [
("cat_imputer", SimpleImputer(strategy='constant', fill_value = "missing")),
("encoder", OneHotEncoder(handle_unknown='ignore'))
])
# TFIDF vectorizer to handel the url only
tfidfvectorizer = TfidfVectorizer(max_features = 100, stop_words='english')
# a customize transformer to be used in the ColumnTransformer
# which takes a list of one or string features, and return the word counts for the list of features
class Counter(BaseEstimator, TransformerMixin):
def fit(self, x, y=None):
return self
def transform(self, _X):
X = pd.DataFrame(_X)
for col in X.columns:
X[col]= X[col].apply(lambda x: len(x.split('-')))
return X
counter = Counter()
# integrate a preprocessor to handel different types of features
# notice in the transformer list of the ColumnTransformer,the parameter is 'url', a string type
# while parameters for other transformers are list type
preprocessor = ColumnTransformer(
transformers=[
('url', tfidfvectorizer, 'url'),
('num', numeric_transformer, numeric_features),
('count', counter, ['url']),
#('cat', categorical_transformer, categorical_features)
])
# check the format of the training data after preprocessing steps
# preprocessor.fit_transform(X_train)
9. 定义包括模型的整个pipeline
pipeline = Pipeline(
steps =[
('preprocessor', preprocessor),
('model',RandomForestRegressor(verbose=0, n_jobs=-1))
])
pipeline.fit(X_train,y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('url',
TfidfVectorizer(max_features=100,
stop_words='english'),
'url'),
('num',
Pipeline(steps=[('num_imputer',
SimpleImputer()),
('scaler',
StandardScaler())]),
[' n_tokens_title',
' n_tokens_content',
' n_unique_tokens',
' n_non_stop_words',
' n_non_stop_unique_tokens',
' num_hrefs',
' num_self_hrefs',
' n...
' data_channel_is_tech',
' data_channel_is_world',
' kw_min_min', ' kw_max_min',
' kw_avg_min', ' kw_min_max',
' kw_max_max', ' kw_avg_max',
' kw_min_avg', ' kw_max_avg',
' kw_avg_avg',
' self_reference_min_shares',
' self_reference_max_shares',
' '
'self_reference_avg_sharess',
' weekday_is_monday', ...]),
('count', Counter(),
['url'])])),
('model', RandomForestRegressor(n_jobs=-1))])
10. 回归模型lift plot可视化
def plot_predicted_vs_actuals(y_true, y_predicted, title = "Pred Vs Actual"):
rmse= mean_squared_error(y_true= y_true, y_pred=y_predicted,squared=False)
print("RMSE:", rmse)
predVsActual = pd.DataFrame( {'y_true': y_true,'y_pred': y_predicted})
predVsActual["pred_deciles"] = pd.qcut(predVsActual["y_pred"],q=10 ,labels=False, duplicates = 'drop')
aggregations = {
'y_true':'mean',
'y_pred': 'mean'
}
counts = predVsActual.groupby("pred_deciles").count()["y_pred"]
predVsActual= predVsActual.groupby("pred_deciles").agg(aggregations)
predVsActual["counts"] = counts
predVsActual[["y_true","y_pred"]].plot(title=title)
plt.show()
print(predVsActual)
test_preds = pipeline.predict(X_test)
plot_predicted_vs_actuals(y_test, test_preds)
RMSE: 11240.043547942365
y_true y_pred counts
pred_deciles
0 1410.976040 1226.474250 793
1 2334.482976 1623.177175 793
2 2066.532156 1958.356570 793
3 2311.552333 2279.694124 793
4 2859.727617 2645.796381 793
5 2926.430556 3046.564318 792
6 3554.585120 3543.738184 793
7 4165.051702 4249.105233 793
8 4935.030265 5450.838789 793
9 6709.374527 11123.519130 793
11. 全部代码和数据下载
参考下面英文版末尾有github 链接:
datasciencebyexample