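Morning
Understanding entropy:
Entropy measures how disordered a system is internally: the more distinct classes it contains, the greater the uncertainty.
The larger the entropy, the less predictable the information and the less stable the state.
Formula: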
$H(X) = -\sum_{i} p_i \log(p_i)$
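- The more classes there are, the more terms the sum has, so the entropy tends to be larger
- The smaller a probability p_i, the more information that outcome carries, so its term -log(p_i) is larger
Set A = [1,1,1,1,1,1,1,1,1]
Set B = [1,2,3,4,7,5,4,3,4,5]
The entropy of B is certainly greater than that of A: A holds a single value, so its entropy is 0.
- Information gain: how much the entropy drops when a given split is applied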
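The comparison between the two sets can be checked numerically. The following is a minimal sketch that assumes a base-2 logarithm (the notes do not fix the base); the helper names `entropy` and `information_gain` are made up for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2, an assumption) of the empirical distribution of `values`."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

A = [1, 1, 1, 1, 1, 1, 1, 1, 1]
B = [1, 2, 3, 4, 7, 5, 4, 3, 4, 5]
print("entropy(A) =", entropy(A))
print("entropy(B) =", entropy(B))

# A perfectly separating split removes all of the parent's entropy.
parent = [0, 0, 1, 1]
gain = information_gain(parent, [[0, 0], [1, 1]])
print("gain =", gain)
```

A set with a single value has zero entropy, so any set with more than one distinct value, such as B, scores higher.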
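Building a decision tree
ID3 algorithm
- Choosing the root node: based on information gain (ID3)
- The feature with the largest information gain becomes the root node; the same procedure is then repeated to choose each following node
- Problem: if an ID column is used as the root, every child node holds exactly one sample, so its entropy is 0 and the information gain is maximal, yet such a split does not help prediction at all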
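C4.5 algorithm
Addresses this ID3 problem by taking the entropy of the feature itself into account: it ranks splits by the information gain ratio (gain divided by the feature's own entropy)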
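The ID-column problem and the gain-ratio remedy can be seen on a toy table. This is a sketch under stated assumptions: the data (`labels`, `row_id`, `windy`) and the helper names are invented for illustration, and the logarithm is base 2.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def split_by(feature, labels):
    """Group the labels by the value of the feature."""
    groups = defaultdict(list)
    for f, y in zip(feature, labels):
        groups[f].append(y)
    return list(groups.values())

def info_gain(feature, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_by(feature, labels))
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    split_info = entropy(feature)  # entropy of the feature itself
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: row_id is unique per row, windy is a real feature.
labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
row_id = [1, 2, 3, 4, 5, 6, 7, 8]
windy = ["y", "y", "n", "n", "n", "y", "n", "n"]

for name, col in [("row_id", row_id), ("windy", windy)]:
    print(name, round(info_gain(col, labels), 3), round(gain_ratio(col, labels), 3))
```

On this data, plain information gain prefers `row_id`, while the gain ratio prefers `windy`, because the ID column's own entropy is large and divides its gain down.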
CART: uses the Gini index as the splitting criterion
$Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$
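The Gini index can be computed directly from class frequencies. A minimal sketch, with `gini` as a made-up helper name and the two sets from the entropy example reused as inputs:

```python
from collections import Counter

def gini(values):
    """Gini impurity: sum over classes of p_k * (1 - p_k)."""
    n = len(values)
    return sum((c / n) * (1 - c / n) for c in Counter(values).values())

A = [1, 1, 1, 1, 1, 1, 1, 1, 1]
B = [1, 2, 3, 4, 7, 5, 4, 3, 4, 5]
print("gini(A) =", gini(A))
print("gini(B) =", gini(B))
```

As with entropy, a pure set scores 0 and a more mixed set scores higher.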
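What to do with continuous values
- Discretize the continuous values: split them in two at a threshold
- A greedy search over the candidate thresholds can be used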
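One common way to realize the two points above is to sort the values, try the midpoint between each pair of neighbours as a threshold, and greedily keep the one with the highest information gain. The sketch below assumes that approach; `best_threshold` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (threshold, gain) for the best binary split x <= t versus x > t."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical data: a continuous feature and a binary label.
ages = [22, 25, 30, 35, 40, 52]
bought = [0, 0, 0, 1, 1, 1]
t, g = best_threshold(ages, bought)
print("threshold =", t, "gain =", g)
```

The search is greedy in the sense that it picks the best single cut for the current node only; a tree applies it again inside each child node.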
Decision tree pruning strategies
Example: decision tree learning
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
- Load the California housing data with fetch_california_housing
from sklearn.datasets import fetch_california_housing
housing=fetch_california_housing()
print(housing.DESCR)
California housing dataset.
The original database is available from StatLib
http://lib.stat.cmu.edu/datasets/
The data contains 20,640 observations on 9 variables.
This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.
References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.
housing.data.shape
(20640, 8)
housing.data[0]
array([ 8.3252 , 41. , 6.98412698, 1.02380952,
322. , 2.55555556, 37.88 , -122.23 ])
from sklearn import tree
dtr=tree.DecisionTreeRegressor(max_depth=3)
dtr.fit(housing.data[:,[6,7]],housing.target)
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
dot_data=tree.export_graphviz(
dtr,out_file=None,
feature_names=housing.feature_names[6:8],
filled=True,
impurity=False,
rounded=True)
import pydotplus
graph=pydotplus.graph_from_dot_data(dot_data)
#graph.get_nodes()[7].set_fillcolor("#FFF2DD")
from IPython.display import Image
Image(graph.create_png())
from sklearn.model_selection import train_test_split
data_train,data_test,target_train,target_test=train_test_split(housing.data,housing.target,test_size=0.1,random_state=42)
dtr=tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train,target_train)
dtr.score(data_test,target_test)
0.637318351331017
from sklearn.model_selection import GridSearchCV
from sklearn import tree
# Candidate values for the two hyperparameters being searched
parameters = {"min_samples_split": [3, 6, 9], "max_depth": [10, 50, 500]}
grid=GridSearchCV(tree.ExtraTreeRegressor(),param_grid=parameters,cv=5)
grid.fit(data_train,target_train)
# grid_scores_ exists only in scikit-learn before 0.20; later releases expose grid.cv_results_ instead
grid.grid_scores_,grid.best_params_
([mean: 0.61296, std: 0.03027, params: {'max_depth': 10, 'min_samples_split': 3},
mean: 0.59972, std: 0.02221, params: {'max_depth': 10, 'min_samples_split': 6},
mean: 0.61384, std: 0.01006, params: {'max_depth': 10, 'min_samples_split': 9},
mean: 0.56797, std: 0.03443, params: {'max_depth': 50, 'min_samples_split': 3},
mean: 0.62359, std: 0.01998, params: {'max_depth': 50, 'min_samples_split': 6},
mean: 0.64330, std: 0.03146, params: {'max_depth': 50, 'min_samples_split': 9},
mean: 0.59926, std: 0.02974, params: {'max_depth': 500, 'min_samples_split': 3},
mean: 0.60422, std: 0.01359, params: {'max_depth': 500, 'min_samples_split': 6},
mean: 0.64961, std: 0.02670, params: {'max_depth': 500, 'min_samples_split': 9}],
{'max_depth': 500, 'min_samples_split': 9})
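Ensemble methods:
- Goal: better learning performance
- bagging: train several learners in parallel and average their results
- boosting
- stacking
Bagging model:
An upgrade of the decision tree
* The typical representative is the random forest (randomness is what keeps the trees different from one another)
* Random sampling of the data, random selection of the features
* Many decision trees are trained side by side, and their outputs are combined at the end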
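The bagging idea can be sketched by hand: draw a bootstrap sample for each tree, fit the trees independently, and average their predictions. This is an illustration on synthetic data, not the random forest implementation itself; a random forest additionally samples features at each split, which a one-feature example cannot show. The names `fit_bagged_trees` and `predict_bagged` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Fit each tree on its own bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Average the predictions of all trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)

trees = fit_bagged_trees(X, y)
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
pred = predict_bagged(trees, X_new)
print(pred[:5])
```

Because each tree only depends on its own sample, the loop in `fit_bagged_trees` could run in parallel without changing the result.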
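Afternoon
Advantages of random forests
- Can handle high-dimensional data without a separate feature selection step
- After training, it can report which features are more important
- Easy to parallelize
- Results are easy to visualize, which helps analysis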
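Two of the points above, feature importance and parallel training, are visible directly in scikit-learn's `RandomForestClassifier` through `feature_importances_` and `n_jobs`. The sketch below uses synthetic data in which only the first two columns influence the label; the column names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on columns 0 and 1 only (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# n_jobs=-1 asks scikit-learn to train the trees on all available cores.
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1).fit(X, y)

names = ["f0", "f1", "noise0", "noise1"]
ranked = sorted(zip(names, forest.feature_importances_), key=lambda p: -p[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The importances are normalized to sum to 1, so they read as relative shares rather than absolute effect sizes.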
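Boosting: start from a weak learner and strengthen it, training with sample weights
Weak learners are chained in sequence, and each step improves on the previous result
Typical representatives: AdaBoost, XGBoost
AdaBoost adjusts the sample weights according to the previous round's classification results: a sample that was misclassified gets a larger weight in the next round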
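The reweighting step can be written out for a single round. This sketch follows the classic binary AdaBoost formulation (labels in {-1, +1}, weighted error strictly between 0 and 0.5); it is an illustration, not the exact update used by any particular library, and `adaboost_reweight` is a hypothetical name.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: raise the weights of misclassified samples, then renormalize."""
    miss = y_true != y_pred
    err = weights[miss].sum()                 # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - err) / err)     # the learner's vote in the final ensemble
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_w / new_w.sum(), alpha

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])         # only the last sample is wrong
w0 = np.full(5, 0.2)                          # uniform starting weights
w1, alpha = adaboost_reweight(w0, y_true, y_pred)
print(w1, alpha)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.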
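Stacking:
Combines multiple classifiers or regression models.
Stacking is a brute-force approach: take a pile of models and use them all
Any kind of classifier can be stacked. There are usually two stages, and the second stage takes the first stage's results as its input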
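The two-stage structure maps onto scikit-learn's `StackingClassifier` (available from version 0.22): the `estimators` form the first stage, and `final_estimator` is trained on their cross-validated predictions. The choice of base models and the synthetic data below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

stack = StackingClassifier(
    # Stage 1: heterogeneous base classifiers.
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    # Stage 2: a model fitted on the stage-1 predictions.
    final_estimator=LogisticRegression(),
    cv=5,
)

scores = cross_val_score(stack, X, y, cv=3)
print(scores.mean())
```

The inner `cv=5` matters: the second stage is trained on out-of-fold predictions, which keeps it from simply trusting base models that memorized the training rows.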
Example: code comparing several machine learning methods on the Titanic data
import pandas
titanic=pandas.read_csv('titanic_train.csv')
titanic.head()
print (titanic.describe())
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
Age has missing values (its count is 714 out of 891), so fill them with the median of Age
titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())
print (titanic.describe())
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.361582 0.523008
std 257.353842 0.486592 0.836071 13.019697 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 22.000000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 35.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
Sex is either "male" or "female", so it has to be converted to a numeric value
len(titanic[titanic['Embarked']=='S']) # count the passengers who embarked at port S
644
print (titanic["Sex"].unique())
titanic.loc[titanic["Sex"]=="male","Sex"]=0
titanic.loc[titanic["Sex"]=="female","Sex"]=1
['male' 'female']
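Fill the missing values of the embarkation port "Embarked" with the most frequent value, S
==Could they be filled with a different letter instead, e.g. X? The code below tries this and encodes X as a separate category, 3.==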
print (titanic["Embarked"].unique())
titanic['Embarked']=titanic['Embarked'].fillna('X')
titanic.loc[titanic["Embarked"]=="C","Embarked"]=1
titanic.loc[titanic["Embarked"]=="S","Embarked"]=0
titanic.loc[titanic["Embarked"]=="Q","Embarked"]=2
titanic.loc[titanic["Embarked"]=="X","Embarked"]=3
['S' 'C' 'Q' nan]
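The two ways of handling the missing port, filling with the most frequent value or with a placeholder category, can be compared side by side on a small frame. This is a sketch with made-up rows; only the column name `Embarked` and the code values come from the notes.

```python
import pandas as pd

# Hypothetical miniature of the Embarked column with one missing value.
df = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S", "S"]})
codes = {"S": 0, "C": 1, "Q": 2, "X": 3}

# Option 1: fill with the most frequent port.
most_common = df["Embarked"].mode()[0]
filled_mode = df["Embarked"].fillna(most_common).map(codes).tolist()

# Option 2: fill with a placeholder letter that becomes its own category.
filled_flag = df["Embarked"].fillna("X").map(codes).tolist()

print(filled_mode)
print(filled_flag)
```

Either choice gives the models a fully numeric column. The placeholder keeps the fact that the value was missing, at the cost of one extra category with very few rows.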
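Below, linear regression is used to predict survival. The recorded run printed about 0.26, but that figure comes from an expression that sums only the correctly predicted survivors, so it is not the accuracy.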
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.model_selection import KFold
# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. kf.split returns the row indices corresponding to train and test.
# Folds are taken in row order without shuffling, so we get the same splits every time we run this.
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
import numpy as np
# The predictions are in three separate numpy arrays. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
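# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
# Accuracy is the fraction of predictions that equal the true labels.
# The earlier expression, sum(predictions[predictions == titanic["Survived"]]) / len(predictions),
# added up only the correctly predicted 1s; the recorded run printed 0.261503928171 for it.
accuracy = (predictions == titanic["Survived"].values).mean()
print(accuracy)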
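An expression of the form `sum(predictions[predictions == labels]) / len(predictions)` adds up the values of the matching predictions, so correctly predicted 0s contribute nothing and the result is the share of correctly predicted 1s, not the accuracy. The toy arrays below are made up to show the difference.

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1], dtype=float)

# Sums the matching predictions themselves: only the correct 1s add anything.
summed_matches = sum(y_pred[y_pred == y_true]) / len(y_pred)

# Fraction of positions where prediction and label agree.
accuracy = (y_pred == y_true).mean()

print(summed_matches, accuracy)
```

Five of the six predictions are right here, yet the first expression reports one third, because only two of the five matches are 1s.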
Logistic regression on the same features; the mean cross-validated accuracy is about 79%
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.789001122334
Prediction with a random forest
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.783389450056
Accuracy after changing the random forest parameters
from sklearn.model_selection import KFold, cross_val_score
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.819304152637
Add some attributes (features) that may be related to survival, then test the results again
# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
import re
# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())
# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v
# Verify that we converted everything.
print(titles.value_counts())
# Add in the title column.
titanic["Title"] = titles
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Col 2
Mlle 2
Major 2
Mme 1
Capt 1
Jonkheer 1
Countess 1
Don 1
Lady 1
Ms 1
Sir 1
Name: Name, dtype: int64
1 517
2 183
3 125
4 40
5 7
6 6
7 5
10 3
8 3
9 2
Name: Name, dtype: int64
Check how important each feature is for the classification; this is also a form of feature selection
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title",]],
[LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]
# Initialize the cross validation folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)
# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
# Compute accuracy as the fraction of predictions that equal the true labels.
# The earlier expression, sum(predictions[predictions == titanic["Survived"]]) / len(predictions),
# added up only the correctly predicted 1s; the recorded run printed 0.279461279461 for it.
accuracy = (predictions == titanic["Survived"].values).mean()
print(accuracy)