2018-05-10 Study Notes

Morning

Understanding entropy:

Entropy measures the degree of disorder inside a system: the more distinct kinds of elements it contains, the stronger the uncertainty.
A larger entropy means greater uncertainty: the state is less predictable.
Formula:

H(x) = -\sum_i p_i \log(p_i)

  • The more kinds of values there are, the more terms in the sum, and the larger the entropy.
  • The smaller a probability is, the more information that outcome carries, and the larger the entropy.

    Set A = [1,1,1,1,1,1,1,1,1,1]
    Set B = [1,2,3,4,7,5,4,3,4,5]
    The entropy of B is certainly larger than that of A (see the sketch below).

    • Information gain: how much the entropy drops when a splitting strategy is applied.
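
As a quick check, here is a minimal sketch (my own helper, not from the original notes) that computes the entropy of the two sets above; B's entropy comes out larger than A's:

import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy: H(x) = -sum(p_i * log2(p_i)) over the empirical distribution.
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * np.log2(c / total) for c in counts.values())

A = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
B = [1, 2, 3, 4, 7, 5, 4, 3, 4, 5]
print(entropy(A))  # -0.0 (zero): only one kind of element, no uncertainty
print(entropy(B))  # about 2.45 bits: many kinds of elements, high uncertainty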

Constructing a decision tree

ID3 algorithm

  • Root node selection: pick the feature by information gain (the ID3 criterion).
  • The feature with the largest information gain becomes the root node; once the root is fixed, the same procedure is repeated to choose each subsequent node (a small information-gain computation is sketched after this list).
  • Problem: if a unique ID column is used as the root node, every branch contains a single fully determined sample, so each branch's entropy is 0 and the information gain is maximal, yet such a split is useless for prediction.
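
A minimal information-gain sketch on a hypothetical toy split (the data and helpers below are illustrative, not from the notes): information gain is the entropy before the split minus the size-weighted entropy after it.

import numpy as np
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * np.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent node minus the size-weighted entropy of its children.
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']   # a perfect split
print(information_gain(parent, [left, right]))            # 1.0 bit: entropy drops from 1 to 0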

C4.5 algorithm

Fixes the ID3 problem by also taking the splitting feature's own entropy into account (the gain ratio), which penalizes features with many distinct values.

CART: uses the Gini index as its splitting criterion

Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k)
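
A minimal sketch of the Gini index (my own helper, not from the notes); like entropy, it is 0 for a pure node and grows as the classes become more mixed:

from collections import Counter

def gini(labels):
    # Gini(p) = sum_k p_k * (1 - p_k) = 1 - sum_k p_k^2
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(['yes'] * 6))               # 0.0 -- a pure node
print(gini(['yes'] * 3 + ['no'] * 3))  # 0.5 -- maximally mixed for two classes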

Handling continuous values

  • Discretize the continuous value: split it in two at a threshold.
  • A greedy search over candidate thresholds can be used (see the sketch below).
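
A minimal sketch of the greedy threshold search for one continuous feature; the usual convention of taking candidate thresholds at the midpoints between adjacent sorted values is assumed here, it is not spelled out in the notes:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    # Greedily try every midpoint between adjacent sorted values and keep the
    # threshold with the lowest size-weighted Gini of the two resulting halves.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    candidates = (x_sorted[:-1] + x_sorted[1:]) / 2
    best_t, best_score = None, np.inf
    for t in np.unique(candidates):
        left, right = y_sorted[x_sorted <= t], y_sorted[x_sorted > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))  # 6.5 -- the split point between the two clusters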

Decision tree pruning strategies

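The notes only give the heading here; as an illustration, one common strategy is pre-pruning, i.e. constraining the tree while it is grown via the standard scikit-learn constructor parameters (the same idea as the max_depth=3 used in the example below). The parameter values are only examples:

from sklearn import tree
import numpy as np

X = np.random.rand(200, 2)
y = X[:, 0] + 0.1 * np.random.randn(200)

# Pre-pruning: cap the depth and require a minimum number of samples per split/leaf.
dtr = tree.DecisionTreeRegressor(max_depth=5, min_samples_split=10, min_samples_leaf=5)
dtr.fit(X, y)
print(dtr.tree_.max_depth)  # never exceeds 5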

Example: decision tree learning

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
  • Download the fetch_california_housing data
from sklearn.datasets.california_housing import fetch_california_housing
housing=fetch_california_housing()
print(housing.DESCR)
California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.
housing.data.shape
(20640, 8)
housing.data[0]
array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
from sklearn import tree
dtr=tree.DecisionTreeRegressor(max_depth=3)
dtr.fit(housing.data[:,[6,7]],housing.target)
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
dot_data=tree.export_graphviz(
    dtr,out_file=None,
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True)
import pydotplus
graph=pydotplus.graph_from_dot_data(dot_data)
#graph.get_nodes()[7].set_fillcolor("#FFF2DD")
from IPython.display import Image
Image(graph.create_png())

(figure: the regression tree exported with graphviz, rendered as a PNG)

from sklearn.model_selection import train_test_split

data_train,data_test,target_train,target_test=train_test_split(housing.data,housing.target,test_size=0.1,random_state=42)
dtr=tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train,target_train)
dtr.score(data_test,target_test)
0.637318351331017
from sklearn.grid_search import GridSearchCV  # note: GridSearchCV lives in sklearn.model_selection in newer scikit-learn versions
from sklearn import svm, datasets
from sklearn import tree
parameters = {"min_samples_split":list((3,6,9)),'max_depth':list((10,50,500))}
grid=GridSearchCV(tree.ExtraTreeRegressor(),param_grid=parameters,cv=5)
grid.fit(data_train,target_train)
grid.grid_scores_,grid.best_params_
([mean: 0.61296, std: 0.03027, params: {'max_depth': 10, 'min_samples_split': 3},
  mean: 0.59972, std: 0.02221, params: {'max_depth': 10, 'min_samples_split': 6},
  mean: 0.61384, std: 0.01006, params: {'max_depth': 10, 'min_samples_split': 9},
  mean: 0.56797, std: 0.03443, params: {'max_depth': 50, 'min_samples_split': 3},
  mean: 0.62359, std: 0.01998, params: {'max_depth': 50, 'min_samples_split': 6},
  mean: 0.64330, std: 0.03146, params: {'max_depth': 50, 'min_samples_split': 9},
  mean: 0.59926, std: 0.02974, params: {'max_depth': 500, 'min_samples_split': 3},
  mean: 0.60422, std: 0.01359, params: {'max_depth': 500, 'min_samples_split': 6},
  mean: 0.64961, std: 0.02670, params: {'max_depth': 500, 'min_samples_split': 9}],
 {'max_depth': 500, 'min_samples_split': 9})

Ensemble algorithms:

  • Goal: better overall performance than a single model
    1. bagging: train several classifiers and average them (the models are trained in parallel and their outputs are averaged)
    2. boosting
    3. stacking

Bagging models:

An upgrade of the single decision tree (see the sketch after this list):
* The typical representative is the random forest (the randomness is what guarantees diversity between the trees).
* Both the data samples and the features are drawn at random.
* Many decision trees are built in parallel and their predictions are combined at the end.
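
A minimal bagging/random-forest sketch, assuming the housing data and the data_train/data_test split from the cells above are still in scope (RandomForestRegressor is the standard scikit-learn implementation):

from sklearn.ensemble import RandomForestRegressor

# Each tree is trained on a bootstrap sample of the rows and considers a random
# subset of the features at every split; the trees' predictions are averaged.
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
print(rfr.score(data_test, target_test))  # usually noticeably higher than the single tree's 0.637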

Afternoon

Advantages of random forests (see the sketch after this list)
  1. They can handle high-dimensional data without explicit feature selection.
  2. After training, they can report which features are most important.
  3. They are easy to parallelize.
  4. The results are easy to visualize and analyze.
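
A minimal sketch of point 2, assuming the rfr model fitted in the sketch above; feature_importances_ is the standard scikit-learn attribute reporting each feature's contribution:

# Rank the housing features by the importance the forest assigned to them.
for name, importance in sorted(zip(housing.feature_names, rfr.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print("{}: {:.3f}".format(name, importance))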

Boosting: starting from weak learners, strengthen the model step by step by reweighting during training.

The weak learners are combined sequentially, each step improving on the previous result.

Typical representatives: AdaBoost, XGBoost.

AdaBoost adjusts the sample weights according to the previous round's performance: a sample that was misclassified gets a larger weight in the next round (see the sketch below).
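
A minimal AdaBoost sketch on hypothetical toy data (not part of the original notes); each round fits a shallow tree and then increases the weights of the samples that tree misclassified:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# base_estimator was renamed to estimator in recent scikit-learn versions.
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=1)
ada.fit(X, y)
print(ada.score(X, y))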

Stacking:

Aggregates several classifiers or regression models.
Stacking is brute force: just take a pile of models and throw them all at the problem.
Any kind of classifier can be stacked; it is usually done in two stages, with the second stage taking the first stage's predictions as its input (see the sketch below).
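
A minimal two-stage stacking sketch on hypothetical toy data (hand-rolled for clarity; newer scikit-learn versions also ship a ready-made StackingClassifier): the stage-1 models produce out-of-fold predictions, and a stage-2 meta-model is trained on those predictions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Stage 1: base models; their out-of-fold predictions become the stage-2 features.
base_models = [RandomForestClassifier(n_estimators=50, random_state=1),
               LogisticRegression()]
stage1_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

# Stage 2: a simple meta-model trained on the stage-1 outputs.
meta = LogisticRegression()
meta.fit(stage1_train, y_tr)

# At prediction time the base models are first refit on the full training set.
stage1_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models])
print(meta.score(stage1_test, y_te))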

Example: comparing several machine-learning methods on the Titanic data

import pandas
titanic=pandas.read_csv('titanic_train.csv')
titanic.head()
print (titanic.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

Age has missing values, so they can be filled with the median of Age.

titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())
print (titanic.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.361582    0.523008   
std     257.353842    0.486592    0.836071   13.019697    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   22.000000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   35.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

Sex is "male" or "female", so it needs to be converted into a numeric value.

len(titanic[titanic['Embarked']=='S'])  # count the passengers who embarked at port S
644
print (titanic["Sex"].unique())

titanic.loc[titanic["Sex"]=="male","Sex"]=0
titanic.loc[titanic["Sex"]=="female","Sex"]=1
['male' 'female']

Fill the missing values of the embarkation column "Embarked" with S, the most frequent value.

==Could a different letter, for example X, be used instead?== (The code below actually fills the missing values with 'X'.)

print (titanic["Embarked"].unique())
titanic['Embarked']=titanic['Embarked'].fillna('X')
titanic.loc[titanic["Embarked"]=="C","Embarked"]=1
titanic.loc[titanic["Embarked"]=="S","Embarked"]=0
titanic.loc[titanic["Embarked"]=="Q","Embarked"]=2
titanic.loc[titanic["Embarked"]=="X","Embarked"]=3
['S' 'C' 'Q' nan]

Below, linear regression is used to analyze and predict the data; the reported accuracy is only about 26%.

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold  # note: sklearn.cross_validation was later replaced by sklearn.model_selection

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It return the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
# Note: this sums the prediction values at positions where they match "Survived",
# so it effectively counts only the correctly predicted survivors (the 1s).
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
print(accuracy)
0.261503928171

Using logistic regression, the accuracy improves to roughly 79%.

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.789001122334

Prediction with a random forest

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default paramters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.783389450056

Accuracy after adjusting the random forest's parameters

alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], 3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.819304152637

Add a few potentially relevant attributes (features), then evaluate the results again.

# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Mlle          2
Major         2
Mme           1
Capt          1
Jonkheer      1
Countess      1
Don           1
Lady          1
Ms            1
Sir           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64

Check how important each feature is for the classification; this is also one way of doing feature selection.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)

(figure: bar chart of the feature scores computed above)

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title",]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
# Note: as before, this sums prediction values where they match "Survived",
# so it counts only the correctly predicted survivors.
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
print(accuracy)
0.279461279461