2018-5-10 Study Notes Summary

Clumsy__Boy

Category: Study Summary. Tags: May 2018

Article link: https://blog.csdn.net/glq_big89/article/details/80272033

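Morning

Understanding entropy:

Entropy measures how disordered a system is internally: the more distinct categories it contains, the greater the uncertainty.
Higher entropy means the information is less predictable and the state is less stable.
Formula: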

$H(X)=-\sum_{i}p_{i}\log(p_{i})$

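The more categories there are, the more terms the sum has, so the entropy grows.
The smaller an outcome's probability, the more information it carries, since $-\log(p_{i})$ is larger.

Set A=[1,1,1,1,1,1,1,1,1]
Set B=[1,2,3,4,7,5,4,3,4,5]
The entropy of B is necessarily greater than that of A, because A contains a single value and so has zero entropy.
- Information gain: how much the entropy drops when a given split is applied.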
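As a quick check of the comparison between A and B, the following minimal sketch computes the entropy of both sets from their empirical frequencies. The helper name `entropy` and the choice of base-2 logarithms are assumptions made for illustration; the notes do not fix a logarithm base.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    # Each category contributes -p * log2(p), where p is its relative frequency
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

A = [1, 1, 1, 1, 1, 1, 1, 1, 1]
B = [1, 2, 3, 4, 7, 5, 4, 3, 4, 5]

print(entropy(A))  # one category only, so there is no uncertainty
print(entropy(B))  # six categories with mixed frequencies, so the entropy is positive
```

Information gain is then the entropy of the labels before a split minus the size-weighted entropy of the groups after it.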

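Building a decision tree

ID3 algorithm

Choosing the root node: based on information gain (the ID3 algorithm)
The feature with the largest information gain becomes the root node; once the root is found, the same method is applied to choose each subsequent node
Problem: if an ID column is used as the root node, every child node holds exactly one sample, so its entropy is 0 and the information gain is maximal, yet such a split does not help solve the problem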

C4.5 algorithm

Fixes the ID3 problem by also taking the feature's own entropy into account (the information gain ratio)
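To make the ID3 problem and the C4.5 correction concrete, here is a small sketch on a made-up eight-row dataset. The rows, the feature names `ids` and `windy`, and the helper names are all hypothetical. It computes the information gain of each feature, then divides by the feature's own entropy (the split information) to obtain a gain ratio.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the size-weighted entropy after splitting on `feature`."""
    groups = defaultdict(list)
    for f, y in zip(feature, labels):
        groups[f].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: an ID column and one genuinely informative feature
labels = ["yes"] * 4 + ["no"] * 4
ids = list(range(1, 9))
windy = ["f", "f", "f", "t", "t", "t", "t", "t"]

for name, feature in [("ids", ids), ("windy", windy)]:
    print(name, information_gain(feature, labels), gain_ratio(feature, labels))
```

On this toy data plain information gain ranks the ID column first, while the gain ratio ranks `windy` first, because the ID column's own entropy is large.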

CART: uses the Gini index as the splitting criterion

$\mathrm{Gini}(p)=\sum_{k=1}^{K}p_{k}(1-p_{k})=1-\sum_{k=1}^{K}p_{k}^{2}$
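A minimal sketch of the Gini index, evaluated on a one-valued list and a mixed list like the two sets used in the entropy discussion. The function name `gini` is an assumption for illustration.

```python
from collections import Counter

def gini(values):
    """Gini impurity: sum over categories of p_k * (1 - p_k)."""
    n = len(values)
    return sum((c / n) * (1 - c / n) for c in Counter(values).values())

A = [1, 1, 1, 1, 1, 1, 1, 1, 1]
B = [1, 2, 3, 4, 7, 5, 4, 3, 4, 5]

print(gini(A))  # a pure set has zero impurity
print(gini(B))  # a mixed set has positive impurity
```

Like entropy, the Gini index is zero for a pure node and grows as the categories become more evenly mixed.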

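How to handle continuous values

Discretize continuous values with a binary split
A greedy algorithm can be used to pick the split point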
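One way to read the greedy binary split is sketched below: sort the values, try the midpoint between each pair of adjacent distinct values as a threshold, and keep the one with the lowest weighted Gini impurity. The function names and the tiny `ages` / `survived` data are hypothetical, and the Gini index is only one possible scoring choice.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split of a continuous feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        threshold = (low + high) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data
ages = [22, 25, 30, 35, 40, 52]
survived = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_threshold(ages, survived)
print(threshold, impurity)
```

The search is greedy in the sense that it picks the locally best threshold for the current node only, without looking ahead to later splits.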

Decision tree pruning strategies

[Figure: pruning strategies]

Example: learning a decision tree

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

Load the California housing data with fetch_california_housing

from sklearn.datasets import fetch_california_housing
housing=fetch_california_housing()
print(housing.DESCR)

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.

housing.data.shape

(20640, 8)

housing.data[0]

array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])

from sklearn import tree
dtr=tree.DecisionTreeRegressor(max_depth=3)
dtr.fit(housing.data[:,[6,7]],housing.target)

DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

dot_data=tree.export_graphviz(
    dtr,out_file=None,
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True)

import pydotplus
graph=pydotplus.graph_from_dot_data(dot_data)
#graph.get_nodes()[7].set_fillcolor("#FFF2DD")
from IPython.display import Image
Image(graph.create_png())

[Figure: the fitted depth-3 regression tree on latitude and longitude, rendered with Graphviz]

from sklearn.model_selection import train_test_split

data_train,data_test,target_train,target_test=train_test_split(housing.data,housing.target,test_size=0.1,random_state=42)
dtr=tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train,target_train)
dtr.score(data_test,target_test)

0.637318351331017

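from sklearn.model_selection import GridSearchCV
from sklearn import tree
# Candidate values for the two hyperparameters being searched
parameters = {"min_samples_split": [3, 6, 9], "max_depth": [10, 50, 500]}
# Note: the search is run on ExtraTreeRegressor (a single extremely randomized tree)
grid = GridSearchCV(tree.ExtraTreeRegressor(), param_grid=parameters, cv=5)
grid.fit(data_train, target_train)
# grid_scores_ (which produced the listing below) was removed in scikit-learn 0.20;
# cv_results_ carries the same per-setting mean and std test scores
grid.cv_results_["mean_test_score"], grid.best_params_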

([mean: 0.61296, std: 0.03027, params: {'max_depth': 10, 'min_samples_split': 3},
  mean: 0.59972, std: 0.02221, params: {'max_depth': 10, 'min_samples_split': 6},
  mean: 0.61384, std: 0.01006, params: {'max_depth': 10, 'min_samples_split': 9},
  mean: 0.56797, std: 0.03443, params: {'max_depth': 50, 'min_samples_split': 3},
  mean: 0.62359, std: 0.01998, params: {'max_depth': 50, 'min_samples_split': 6},
  mean: 0.64330, std: 0.03146, params: {'max_depth': 50, 'min_samples_split': 9},
  mean: 0.59926, std: 0.02974, params: {'max_depth': 500, 'min_samples_split': 3},
  mean: 0.60422, std: 0.01359, params: {'max_depth': 500, 'min_samples_split': 6},
  mean: 0.64961, std: 0.02670, params: {'max_depth': 500, 'min_samples_split': 9}],
 {'max_depth': 500, 'min_samples_split': 9})

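Ensemble methods:

Goal: better learning performance
1. bagging: train multiple classifiers and average them (train in parallel, then average the results)
2. boosting
3. stacking

The bagging model:

An upgrade of the decision tree
* The typical representative is the random forest (randomness is what keeps the trees different from one another)
* Rows are sampled at random, and features are selected at random
* Many decision trees are built in parallel, and their outputs are combined at the end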
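The two sources of randomness and the final aggregation can be sketched with the standard library alone. All function names below are hypothetical, and drawing one feature subset per tree is a simplification: library implementations such as scikit-learn's random forest typically draw the candidate features at each split.

```python
import random

def bootstrap_sample(rows, rng):
    """Random data sampling: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Random feature selection: pick k distinct features."""
    return rng.sample(feature_names, k)

def bagged_prediction(predictions):
    """Aggregate the per-tree regression outputs by averaging them."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)
rows = list(range(10))  # stand-in row indices
features = ["Pclass", "Sex", "Age", "Fare"]

# Each tree would be trained on its own bootstrap sample and feature subset
for tree_index in range(3):
    sample = bootstrap_sample(rows, rng)
    subset = random_feature_subset(features, 2, rng)
    print(tree_index, sorted(sample), subset)

# Averaging three hypothetical per-tree outputs for one test row
print(bagged_prediction([0.2, 0.4, 0.9]))
```

Because each tree sees different rows and features, the trees make different errors, and averaging them reduces the variance of the combined prediction.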

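Afternoon

Advantages of random forests

Can handle high-dimensional data without requiring feature selection
After training, can report which features are more important
Easy to parallelize
Results are easy to visualize, which helps analysis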
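As an illustration of the feature importance and parallelism points, the sketch below fits a scikit-learn random forest on synthetic data and prints its `feature_importances_`. The synthetic dataset from `make_classification` and all parameter values are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which only 2 are informative
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

# n_jobs=-1 asks scikit-learn to build the trees in parallel on all cores
forest = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
forest.fit(X, y)

# One importance value per feature; the values are normalized to sum to 1
for index, importance in enumerate(forest.feature_importances_):
    print(f"feature {index}: {importance:.3f}")
```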

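Boosting starts from weak learners and strengthens them, training with weighted samples

Weak learners are chained in sequence, and each step improves on the previous result

Typical examples: AdaBoost, XGBoost

AdaBoost adjusts the sample weights according to the previous round's classification results: if a sample was misclassified, it receives a larger weight in the next round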
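The reweighting step can be sketched for a single round of the classic binary AdaBoost update. The function name and the five-sample example are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: return (alpha, new normalized sample weights)."""
    # Weighted error of the current weak learner (assumed to be in (0, 1))
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Weight of this weak learner in the final vote
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weight of misclassified samples, decrease the others
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(alpha, new_weights)
```

After the update the single misclassified sample carries more weight than any correctly classified one, so the next weak learner is pushed to get it right.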

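stacking:

Combines multiple classifiers or regression models.
Stacking is brute force: take a pile of models and throw them all at the problem
Any kind of classifier can be stacked. It usually has two stages, and the second stage takes the outputs of the first stage as its input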
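The two-stage idea can be sketched with scikit-learn as follows. The synthetic data, the choice of first-stage models, and the use of out-of-fold probabilities from `cross_val_predict` are all assumptions for illustration, not the method used later in these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

first_stage = [
    RandomForestClassifier(n_estimators=30, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Stage 1: out-of-fold probabilities, so the second stage is never trained
# on predictions a model made for rows it was fitted on
train_meta = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=3, method="predict_proba")[:, 1]
    for model in first_stage
])
# Refit each first-stage model on the full training set for test-time inputs
test_meta = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in first_stage
])

# Stage 2: a simple model whose inputs are the first-stage outputs
second_stage = LogisticRegression()
second_stage.fit(train_meta, y_train)
stacked_accuracy = second_stage.score(test_meta, y_test)
print(stacked_accuracy)
```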

Example: code comparing several machine learning methods on the Titanic data

import pandas
titanic=pandas.read_csv('titanic_train.csv')
titanic.head()
print (titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200

Age has missing values (714 of 891 rows are present), so they can be filled with the median of Age

titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())
print (titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.361582    0.523008   
std     257.353842    0.486592    0.836071   13.019697    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   22.000000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   35.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200

Sex is either "male" or "female", so it needs to be converted to numeric values

len(titanic[titanic['Embarked']=='S']) # count the passengers who embarked at port S

print (titanic["Sex"].unique())

titanic.loc[titanic["Sex"]=="male","Sex"]=0
titanic.loc[titanic["Sex"]=="female","Sex"]=1

['male' 'female']

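Fill the missing values of the embarkation port "Embarked" with S, the most frequent port

==Could they be filled with a different letter instead, for example X? (The code below tries this and encodes X as 3.)==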

print (titanic["Embarked"].unique())
titanic['Embarked']=titanic['Embarked'].fillna('X')
titanic.loc[titanic["Embarked"]=="C","Embarked"]=1
titanic.loc[titanic["Embarked"]=="S","Embarked"]=0
titanic.loc[titanic["Embarked"]=="Q","Embarked"]=2
titanic.loc[titanic["Embarked"]=="X","Embarked"]=3

['S' 'C' 'Q' nan]

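Below, linear regression is used to predict survival. The printed score is only about 26%, but note that the accuracy expression in the code counts only the correctly predicted survivors, so it understates the true accuracy.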

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# KFold generates cross-validation folds; it lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic dataset.  split() returns the row indices for the train and test folds.
# Without shuffling, the folds are contiguous blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

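# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
# Caution: this sums the values of the correct predictions, so only correctly predicted 1s are counted.
# It is therefore not the overall accuracy; (predictions == titanic["Survived"]).mean() would give that.
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
print(accuracy)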

0.261503928171
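The accuracy expression above selects the predictions that match the labels and then sums their values, so a correct 0 contributes nothing. The plain-Python sketch below mirrors that boolean-mask logic on a made-up five-row example (the two lists are hypothetical) and contrasts it with the fraction of matching rows.

```python
# Hypothetical predictions and true labels for five passengers
predictions = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 0, 0]

# Keep only the predictions that agree with the true label
matched = [p for p, a in zip(predictions, actual) if p == a]

# Summing the matched values counts only the correctly predicted 1s
summed_score = sum(matched) / len(predictions)

# Counting the matched rows gives the overall accuracy
accuracy = len(matched) / len(predictions)

print(summed_score, accuracy)
```

Three of the five rows are predicted correctly, but only one of them is a predicted 1, which is why the summed score is much lower than the accuracy.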

Using logistic regression for prediction, the reported accuracy is about 79%

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.789001122334

Prediction with a random forest

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.783389450056

Accuracy after changing the random forest parameters

from sklearn.model_selection import KFold, cross_val_score
# More trees, and larger minimum sizes for splits and leaves to limit overfitting
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.819304152637

Add some features that may be related to survival, then test the result again

# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    # A raw string keeps the backslash in "\." from being read as a string escape.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(titles.value_counts())

# Add in the title column.
titanic["Title"] = titles

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Mlle          2
Major         2
Mme           1
Capt          1
Jonkheer      1
Countess      1
Don           1
Lady          1
Ms            1
Sir           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64

Check how important each feature is for the classification, which is also a form of feature selection

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)

[Figure: bar chart of the -log10(p-value) score for each predictor]

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title",]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

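# Compare the predictions to the true labels.
# Caution: this sums the values of the correct predictions, so only correctly predicted 1s are counted.
# It is therefore not the overall accuracy; (predictions == titanic["Survived"]).mean() would give that.
accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
print(accuracy)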

0.279461279461
