20221007 Class Notes



import warnings


from sklearn import datasets
boston = datasets.load_boston()
 Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.
 'filename': 'D:\\Anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv'}

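# Split into training and test sets: sklearn.model_selection.train_test_split(

*arrays : the data objects to split, all of the same length
    Accepted formats are lists, numpy arrays, scipy sparse matrices, or pandas DataFrames. For supervised models, X and y must be split together in the same way.
test_size = 0.25 : float, int, or None, the proportion of samples held out to validate the model, between 0 and 1; when None it is set to the complement of train_size
train_size = None : float, int, or None, the proportion of samples used to train the model, between 0 and 1; when None it is computed automatically from test_size
random_state = None : random seed (any integer will do)
shuffle = True : whether to shuffle the samples before splitting
stratify = None : array-like or None, whether to stratify the split using the given class labels
) Returns: a list holding the splits of the input objects, length = 2 * len(arrays)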
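
The parameters above can be tried out on a small example. This is a minimal sketch on hypothetical toy arrays (`X_demo` and `y_demo` are made up for illustration and are not part of the notes); it shows that X and y are split together, that the result holds 2 * len(arrays) objects, and how `stratify` is passed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 3 features, two balanced classes.
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.array([0] * 10 + [1] * 10)

# X and y are split together; the result holds 2 * len(arrays) = 4 objects.
parts = train_test_split(X_demo, y_demo, test_size=0.25,
                         random_state=0, stratify=y_demo)
X_tr, X_te, y_tr, y_te = parts

print(len(parts))                # number of returned objects
print(X_tr.shape, X_te.shape)    # 15 training rows and 5 test rows
print(np.bincount(y_tr), np.bincount(y_te))  # class counts in each part
```

With `stratify` set, the class proportions in the two parts stay close to those of the full data.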

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                           test_size= 0.3)
len(x_train), len(x_test), len(y_train), len(y_test)
(354, 152, 354, 152)

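s-fold cross-validation: first randomly partition the given data into s disjoint subsets of equal size, then train the model on s-1 of the subsets and test it on the remaining one; repeat this for each of the s possible choices of held-out subset. Finally, select the model with the smallest average test error over the s evaluations.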
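
The procedure can be written out by hand to make the steps concrete. The following is a minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and the choice s = 5 are assumptions for illustration); it evaluates a single model, and comparing several models would mean repeating the loop for each one and keeping the one with the smallest mean error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic regression data (not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

s = 5
# Randomly partition the sample indices into s disjoint, equal-sized subsets.
folds = np.array_split(rng.permutation(len(y_demo)), s)

errors = []
for i in range(s):
    test_idx = folds[i]
    # Train on the other s - 1 subsets.
    train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # Test on the held-out subset.
    pred = model.predict(X_demo[test_idx])
    errors.append(mean_squared_error(y_demo[test_idx], pred))

mean_error = np.mean(errors)
print(errors, mean_error)
```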


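sklearn.model_selection includes cross_val_score (combines splitting and scoring in a single call)
cross_validate (uses several evaluation metrics at once)
cross_val_predict (makes predictions with the cross-validated models)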



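sklearn.model_selection.cross_val_score(

estimator : the estimator object used to fit the data
X : array-like, the data to fit the model on
y = None : array-like, the dependent variable for supervised models
groups = None : array-like of shape (n_samples,), group labels used when splitting the samples
scoring = None : string or callable, the method for scoring the model
cv = None : int, sets the sample-splitting strategy for cross-validation
n_jobs = 1, verbose = 0, fit_params = None
pre_dispatch = '2*n_jobs'
)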


from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
scores = cross_val_score(reg, boston.data, boston.target,cv = 10)
array([ 0.73376082,  0.4730725 , -1.00631454,  0.64113984,  0.54766046,
        0.73640292,  0.37828386, -0.12922703, -0.76843243,  0.4189435 ])
scores.mean(), scores.std()
(0.20252899006055367, 0.5952960169512383)
scores = cross_val_score(reg, boston.data, boston.target, scoring = 'explained_variance', cv = 10)
array([ 0.74784412,  0.5381936 , -0.80757662,  0.66844779,  0.5586898 ,
        0.74128804,  0.41981565, -0.11666214, -0.44561819,  0.42197365])



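Splitters such as KFold have a built-in shuffle parameter that randomly reorders the data indices before splitting (but it defaults to False).
Functions such as cross_val_score have no such parameter, so shuffle the data first when necessary.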
import numpy as np
X, y = boston.data, boston.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)  # shuffle the row indices in place
X,y = X[indices],y[indices]
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv = 10)
array([0.68525792, 0.86167017, 0.43360162, 0.77029655, 0.82813619,
       0.67159777, 0.43783055, 0.68626901, 0.66072544, 0.84392753])
scores.mean(), scores.std()
(0.6879312757654554, 0.14464172569122236)
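
As an alternative to shuffling the arrays by hand, `cv` also accepts a splitter object, so a `KFold` with `shuffle=True` can be passed straight to `cross_val_score`. A minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and the seed are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic regression data (not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

# The splitter shuffles the indices itself; random_state makes the folds repeatable.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf)

print(scores_demo.mean(), scores_demo.std())
```

This leaves the original arrays in their original order, which is convenient when the row positions are needed later.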



cross_validate takes essentially the same parameters as cross_val_score, with the following functional extensions:

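sklearn.model_selection.cross_validate(

estimator : the estimator object used to fit the data
X : array-like, the data to fit the model on
y = None : array-like, the dependent variable for supervised models
groups = None : array-like of shape (n_samples,), group labels used when splitting the samples
scoring = None : string, callable, list/tuple, dict or None
    The method(s) for scoring the model; supply a list/dict to use several metrics
cv = None : int, sets the sample-splitting strategy for cross-validation
    A splitter object or an iterable can also be used to define the splits
n_jobs = 1, verbose = 0, fit_params = None
pre_dispatch = '2*n_jobs'
return_train_score = True : boolean, whether to return the training-set scores
) Returns: a dict of the scores from each round, each array of shape (n_splits,)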

from sklearn.model_selection import cross_validate
scoring = ['r2','explained_variance']
scores = cross_validate(reg,X, y,cv = 10, scoring = scoring,return_train_score = False)
{'fit_time': array([0.00144815, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.00400066, 0.        , 0.00400066]),
 'score_time': array([0.        , 0.        , 0.00400949, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]),
 'test_r2': array([0.68525792, 0.86167017, 0.43360162, 0.77029655, 0.82813619,
        0.67159777, 0.43783055, 0.68626901, 0.66072544, 0.84392753]),
 'test_explained_variance': array([0.68627925, 0.86996345, 0.43698251, 0.77496606, 0.83089913,
        0.67990516, 0.47111897, 0.7236839 , 0.66077957, 0.8462367 ])}
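
The returned dict holds one array per key, so a per-metric summary is a short loop. A minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and the choice of metrics are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Hypothetical synthetic regression data (not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=120)

results = cross_validate(LinearRegression(), X_demo, y_demo, cv=5,
                         scoring=['r2', 'neg_mean_squared_error'],
                         return_train_score=False)

# Each entry is an array with one value per fold; test scores are keyed 'test_<metric>'.
for key in sorted(results):
    print(key, results[key].mean())
```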
from sklearn.model_selection import cross_val_predict
pred = cross_val_predict(reg, X, y, cv = 10)

array([16.64007022, 34.46615482, 19.90505228, 36.18559111, 26.33976946,
       33.22900938, 16.98796546, 37.06187884, 23.11126718, 25.48051526])
from sklearn.metrics import r2_score
r2_score(y, pred)



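class sklearn.tree.DecisionTreeClassifier(

criterion = 'gini' : the measure of split quality, {'gini', 'entropy'}
splitter = 'best' : the strategy for choosing the split at each node
    'best' chooses the best split, 'random' chooses the best random split
max_depth = None : the maximum depth (height) the tree may grow to
min_samples_split = 2 : the minimum number of samples a node needs before it may be split further
min_samples_leaf = 1 : the minimum number of samples at a leaf node
min_weight_fraction_leaf = 0.0 : when samples are weighted, the minimum fraction of the total weight at a leaf node
max_features = None : int/float/string/None, the number of features considered when searching for a split
    'auto'/'sqrt', max_features = sqrt(n_features)
    'log2', max_features = log2(n_features)
    None, max_features = n_features
random_state = None
max_leaf_nodes = None : the maximum number of leaf nodes
min_impurity_decrease = 0.0 : the minimum impurity decrease required for a split
class_weight = None, presort = False
)

Attributes:
classes_ : array of shape = [n_classes] or a list of such arrays
feature_importances_ : array of shape = [n_features], the feature importances, which sum to 1; also known as Gini importance
max_features_ : int
n_classes_ : int or list
n_features_ : int
n_outputs_ : int
tree_ : Tree object
Note: tree models can also predict numeric targets; the corresponding class is sklearn.tree.DecisionTreeRegressor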
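
For a numeric target, the regressor is used in the same way as the classifier. A minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and `max_depth=3` are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-feature data following a sine curve.
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y_demo = np.sin(X_demo).ravel()

# A shallow tree fits a piecewise-constant approximation of the curve.
reg_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
reg_tree.fit(X_demo, y_demo)

pred_demo = reg_tree.predict([[1.0], [4.0]])
print(pred_demo)
print(reg_tree.get_depth(), reg_tree.get_n_leaves())
```

Each leaf predicts the mean target of the training samples that fall into it, so a deeper tree gives a finer step function.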

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
iris = load_iris()
ct = DecisionTreeClassifier()
ct.fit(iris.data, iris.target)
ct.feature_importances_
array([0.01333333, 0.        , 0.06405596, 0.92261071])
ct.predict(iris.data)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
from sklearn.metrics import classification_report
print(classification_report(iris.target, ct.predict(iris.data)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150
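
The report above scores the tree on the same 150 samples it was trained on, which is why every figure is 1.00. A held-out split usually gives a more realistic estimate. A minimal sketch, assuming a 70/30 stratified split and fixed seeds (`iris_demo`, `ct_demo`, and these settings are illustrative choices, not part of the notes):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
# Hold out 30% of the samples, keeping the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.3,
    random_state=0, stratify=iris_demo.target)

# Fit on the training part only, then score on the unseen test part.
ct_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
report = classification_report(y_te, ct_demo.predict(X_te))
test_accuracy = ct_demo.score(X_te, y_te)

print(report)
print(test_accuracy)
```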


Download the Graphviz installer from http://www.graphviz.org (an MSI package is available).


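sklearn.tree.export_graphviz(

decision_tree, out_file = "tree.dot"
max_depth = None, feature_names = None, class_names = None
label = 'all' : {'all', 'root', 'none'}, where to show labels such as the impurity measure
filled = False : whether to fill nodes with color for emphasis
leaves_parallel = False : whether to draw all leaf nodes at the bottom of the tree
impurity = True, node_ids = False
proportion = False : whether to show each node's share of samples instead of the sample count
rotate = False : whether to draw the tree from left to right
rounded = False : whether to draw boxes with rounded rather than square corners
special_characters = False : whether to ignore special characters for PostScript compatibility
precision = 3
)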

from sklearn.tree import export_graphviz
export_graphviz(ct, out_file = 'tree.dot',
               feature_names= iris.feature_names)
export_graphviz(ct, out_file = 'tree.dot1',
               feature_names= iris.feature_names,
               rounded = True,
               filled = True)
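
The exported file is plain DOT text. With `out_file=None`, `export_graphviz` returns that text as a string instead of writing a file, which makes it easy to inspect. A minimal sketch (`iris_demo`, `ct_demo`, and `max_depth=2` are illustrative choices; the shell command in the final comment assumes Graphviz is installed and on the PATH):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris_demo = load_iris()
ct_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
ct_demo.fit(iris_demo.data, iris_demo.target)

# With out_file=None the DOT source is returned as a string.
dot_source = export_graphviz(ct_demo, out_file=None,
                             feature_names=iris_demo.feature_names,
                             class_names=iris_demo.target_names,
                             filled=True, rounded=True)
print(dot_source[:200])

# A saved .dot file can be rendered to an image from a shell, for example:
#   dot -Tpng tree.dot -o tree.png
```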

