NLP: Machine Learning (4)

soobinnim

Tags: machine learning, natural language processing, python

Article link: https://blog.csdn.net/soobinnim/article/details/124496856

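Case study: Titanic
Feature values and target values
1. Load the data
2. Data processing: handle missing values, convert feature values to dict records
3. Select the feature values and the target value
4. Split the dataset
5. Feature engineering: dictionary feature extraction
6. Decision tree estimator workflow
7. Model evaluation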


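1. Random forest (well suited to large datasets and high-dimensional input samples)
(An ensemble learning method: it contains many decision trees, and the predicted class is the mode of the trees' votes.)
Ensemble learning: build several models and combine them to solve a single prediction problem.
"Random" refers to two sources of randomness: random features (draw m features at random from the M available, with M >> m, which also reduces dimensionality) and random training samples (bootstrap, i.e. random sampling with replacement).
Random forest classifier:
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
Note: n_estimators defaulted to 10 in older scikit-learn releases; since version 0.22 the default is 100.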
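
To make the parameters above concrete, here is a minimal sketch that trains a random forest on synthetic data. The dataset from make_classification and all parameter values are assumptions chosen for illustration; they are not from the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=22
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    criterion="gini",     # split quality measure
    max_features="sqrt",  # random features: each split considers sqrt(M) features
    bootstrap=True,       # random samples: each tree sees a bootstrap sample
    random_state=22,
)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print("number of trees:", len(forest.estimators_))
print("test accuracy:", accuracy)
```

The two sources of randomness map to max_features (random features) and bootstrap (random training samples).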


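2. Linear regression (house price prediction, sales forecasting, finance)
Regression problem: the target value is continuous.
Linear regression: with a single independent variable it is simple (univariate) regression; with more than one it is multiple regression.

A linear relationship is always described by a linear model, but a linear model does not necessarily describe a linear relationship (the model only has to be linear in its parameters, so the features themselves may be nonlinear).

Goal: find the model parameters that make predictions accurate.
Loss function (also called cost function or objective function; for linear regression it is the least-squares loss): J
Minimizing the loss: 1. the normal equation (a direct, closed-form solution); 2. gradient descent (the important one; Andrew Ng's course explains it very well).
sklearn.linear_model.LinearRegression(fit_intercept=True)
fit_intercept: whether to fit the bias (intercept)

sklearn.linear_model.SGDRegressor(loss='squared_loss', fit_intercept=True, learning_rate='invscaling', eta0=0.01)
Note: loss='squared_loss' was renamed to 'squared_error' in scikit-learn 1.0, and the old name was removed in 1.2.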
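
The two ways of minimizing the loss can be compared side by side. The sketch below uses plain NumPy on synthetic data; the true weights, the noise level, and the learning rate are assumptions chosen for illustration, and the gradient descent shown is the simple batch variant.

```python
import numpy as np

rng = np.random.default_rng(22)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights
true_b = 4.0                         # hypothetical ground-truth bias
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the bias is learned as the last weight.
Xb = np.hstack([X, np.ones((200, 1))])

# 1. Normal equation: solve (X^T X) w = X^T y directly.
w_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# 2. Batch gradient descent on J(w) = mean((Xb @ w - y) ** 2).
w_gd = np.zeros(4)
learning_rate = 0.1
for _ in range(1000):
    gradient = 2 * Xb.T @ (Xb @ w_gd - y) / len(y)
    w_gd -= learning_rate * gradient

print("normal equation: ", w_normal)
print("gradient descent:", w_gd)
```

Both approaches should land on nearly the same weights here. The normal equation needs a matrix solve, which gets expensive with many features, while gradient descent only needs repeated matrix-vector products.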

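Case study: Boston house price prediction
Workflow: 1. Load the dataset
          2. Split the dataset
          3. Feature engineering: scaling (standardization)
          4. Estimator workflow: fit() produces the model; read coef_ and intercept_
          5. Model evaluation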

Regression performance metric: mean squared error (MSE)
sklearn.metrics.mean_squared_error(y_true, y_pred)
y_true: true values
y_pred: predicted values
return: a float
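
As a quick check of what mean_squared_error computes, this small sketch compares it with the formula written out by hand. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = mean of squared differences: (0.25 + 0.25 + 0 + 1) / 4
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both 0.375
```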

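Small datasets: LinearRegression (does nothing about overfitting) or ridge regression
Large datasets: SGDRegressor
Gradient descent variants: GD (batch gradient descent), SGD (stochastic gradient descent), SAG (stochastic average gradient)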


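3. Underfitting and overfitting
Overfitting: the model performs well on the training set but poorly on the test set.
Cause: the model is too complex and tries to fit every training point. Solution: regularization (L1 or L2; L2 is more common).
L2 regularization: loss function + penalty term (sum of squared weights w); this is ridge regression, which shrinks the influence of individual features.
L1 regularization: loss function + penalty term (sum of absolute values of w); this is lasso, which can drive a weight to exactly zero and so remove the feature's influence.

Underfitting: too few features were learned, so the model performs poorly on both the training set and the test set.
Solution: increase the number of features.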
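
The difference between the L2 and L1 penalties described in this section shows up in the fitted coefficients. The sketch below is an illustration on synthetic data where only two of five features matter; the data and the alpha values are assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(22)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks weights
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: can zero weights out

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
print("non-zero lasso weights:", np.count_nonzero(lasso.coef_))
```

Ridge keeps a small non-zero weight on the noise features, while lasso sets them to exactly zero, which is why L1 is often described as performing feature selection.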


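4. Improving linear regression with ridge regression (L2 regularization) to counter overfitting
(Ridge is roughly equivalent to SGDRegressor, whose default penalty is also L2.)
sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver='auto', normalize=False)
alpha: regularization strength, i.e. the penalty coefficient; typical values are in the ranges 0~1 and 1~10
solver: 'auto' picks an optimization method based on the data; if both the dataset and the number of features are large, choose 'sag'
normalize: removed in scikit-learn 1.2; standardize with StandardScaler before fitting instead
Ridge.coef_: regression weights
Ridge.intercept_: regression bias

sklearn.linear_model.RidgeCV(_BaseRidgeCV, RegressorMixin)
Ridge regression with built-in cross-validation; coef_: regression coefficients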
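
RidgeCV can pick alpha automatically from a list of candidates. The following is a minimal sketch on synthetic data; the candidate alphas and the data-generating weights are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(22)
X = rng.normal(size=(100, 4))
# Hypothetical true weights, used only to generate the data.
y = X @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=0.5, size=100)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]
model = RidgeCV(alphas=candidate_alphas).fit(X, y)

print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

After fitting, alpha_ holds the selected regularization strength, and coef_ and intercept_ are read the same way as for Ridge.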


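5. Classification: logistic regression and binary classification

Logistic regression: a workhorse for binary classification (positive class / negative class)
Principle: the input to logistic regression is the output of a linear regression.
Activation function: sigmoid. The regression output is passed through it and mapped into the interval (0, 1); the default threshold is 0.5.
Loss function: log-likelihood loss (log loss)
Optimization: gradient descent, to reduce the value of the loss function.
Logistic regression: sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
penalty: type of regularization
C: inverse of the regularization strength (smaller C means stronger regularization)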
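
The pipeline described above (linear output, sigmoid, 0.5 threshold, log loss) can be written out in a few lines of NumPy. The weights, bias, and samples below are hypothetical values chosen for illustration; a real model would learn the weights by gradient descent.

```python
import numpy as np


def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y_true, p):
    """Log-likelihood loss averaged over the samples."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


weights = np.array([0.8, -0.4])  # hypothetical learned weights
bias = 0.1                       # hypothetical learned bias
X = np.array([[2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]])
y_true = np.array([1, 0, 1])

z = X @ weights + bias                           # linear regression output
probabilities = sigmoid(z)                       # values in (0, 1)
predictions = (probabilities >= 0.5).astype(int)  # apply the 0.5 threshold
loss = log_loss(y_true, probabilities)

print("probabilities:", probabilities)
print("predictions:", predictions)
print("log loss:", loss)
```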

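Case study: cancer classification
Workflow: 1. Load the data: pass names when reading; 2. Data processing: handle missing values; 3. Split the dataset;
4. Feature engineering: scaling (standardization); 5. Logistic regression estimator; 6. Model evaluation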


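Evaluating classifiers (examples: cancer screening, factory quality inspection)
Precision: of the samples predicted positive, the fraction that are truly positive
Recall: of the samples that are truly positive, the fraction predicted positive (how completely the positives are found)
F1-score: the harmonic mean of precision and recall; it reflects the robustness of the model

sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None)
labels: the numeric values of the classes to report
target_names: display names for those classes
return: precision and recall for each class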
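
The definitions of precision, recall, and F1 can be verified by counting by hand and comparing with scikit-learn. The ten labels below are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = positive (e.g. malignant), 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many are real
recall = tp / (tp + fn)     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("manual: ", precision, recall, f1)
print("sklearn:", precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

In a cancer-screening setting, recall usually matters most, because a missed malignant case (a false negative) is more costly than a false alarm.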


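ROC curve and AUC:
1. ROC curve: plots TPR (true positive rate, the recall of the positive class) against FPR (false positive rate, the fraction of actual negatives that are predicted positive)
2. AUC: the closer to 1 the better; close to 0.5 means the model is no better than random guessing

from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score)
Computes the area under the ROC curve, i.e. the AUC
y_true: the true class of each sample; must be labelled 0 (negative) or 1 (positive)
y_score: predicted score, e.g. the estimated probability of the positive class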
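
One way to build intuition for AUC is that it equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below checks this against roc_auc_score on four made-up samples.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for four samples.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Manual check: count positive/negative pairs that are ranked correctly.
positives = [s for s, t in zip(y_score, y_true) if t == 1]
negatives = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in positives for n in negatives]
auc_manual = sum(p > n for p, n in pairs) / len(pairs)

print(auc, auc_manual)  # both 0.75
```

Only the pair (0.35, 0.4) is ranked the wrong way round, so three of the four pairs are correct.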


6. Saving and loading models
import joblib  (sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly)
Save: joblib.dump(rf, 'test.pkl')
Load: estimator = joblib.load('test.pkl')
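
A round trip through joblib can be checked by comparing predictions before and after loading. This is a minimal sketch; the tiny Ridge model and the temporary directory are assumptions for illustration, so nothing is left behind on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# A tiny model trained on made-up data, just to have something to save.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Ridge(alpha=1.0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.pkl")
    joblib.dump(model, path)       # save
    restored = joblib.load(path)   # load

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("restored model predicts identically:", same_predictions)
```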


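7. k-means (unsupervised learning): there is no target value
Advantages: an iterative algorithm that is intuitive and practical
Disadvantage: it easily converges to a local optimum (mitigate by running the clustering several times)
Use cases: data without target values; clustering is usually done before classification

Clustering: K-means
Dimensionality reduction: PCA
k is a hyperparameter: 1. set it from the requirements; 2. otherwise tune it

sklearn.cluster.KMeans(n_clusters=8, init='k-means++')
n_clusters: number of initial cluster centers
init: initialization method
labels_: the cluster label assigned to each sample; it can be compared with the true classes (by grouping, not by label value)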
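
To see what labels_ looks like, the sketch below runs KMeans on three well-separated synthetic blobs. The blob centers and the n_init and random_state values are assumptions for illustration; n_init controls the repeated runs that guard against a poor local optimum.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(22)
# Three made-up, well-separated cluster centers with 50 points around each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=22)
kmeans.fit(X)

labels = kmeans.labels_
print("labels of the first blob:", set(labels[:50].tolist()))
print("cluster centers:\n", kmeans.cluster_centers_)
```

The numeric label given to each blob is arbitrary, which is why cluster labels are compared with true classes by grouping rather than by value.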


Case study: instacart
Input: the data after dimensionality reduction
1. Estimator workflow
from sklearn.cluster import KMeans
estimator = KMeans(n_clusters=3)
estimator.fit(data_new)
2. Inspect the result
3. Model evaluation
y_predict = estimator.predict(data_new)


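Checking clustering quality (model evaluation): high cohesion within clusters, low coupling between clusters
Silhouette coefficient
sklearn.metrics.silhouette_score(X, labels)
Computes the mean silhouette coefficient over all samples
X: feature values
labels: the cluster labels assigned to the samples

# Model evaluation
from sklearn.metrics import silhouette_score
silhouette_score(data_new, y_predict)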
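
The silhouette coefficient is also a practical way to tune k. The following sketch tries several values of k on synthetic blobs and keeps the one with the highest score; the data and the candidate values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(22)
# Three made-up, well-separated blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=22).fit_predict(X)
    # Mean silhouette coefficient over all samples, in the range [-1, 1].
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("silhouette score per k:", scores)
print("best k:", best_k)
```

Values close to 1 indicate tight, well-separated clusters; values near 0 or below suggest overlapping or misassigned clusters.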

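# Case study: Titanic
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def taitan():
    # 1. Load the data
    titanic = pd.read_csv('titanic.csv')
    # 2. Data processing: handle missing values, feature values --> dict records
    # 3. Select the feature values and the target value
    x = titanic[['pclass', 'age', 'sex']].copy()  # copy to avoid SettingWithCopyWarning
    y = titanic['survived']
    # age has missing values: fill them with the mean age
    x['age'] = x['age'].fillna(x['age'].mean())
    # Convert each row to a dict
    x = x.to_dict(orient='records')
    # 4. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

    # 5. Feature engineering: dictionary feature extraction
    tran = DictVectorizer()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # 6. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion='entropy')
    estimator.fit(x_train, y_train)

    # 7. Model evaluation
    # Method 1: compare predictions with the true values directly
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('True values vs. predictions:\n', y_predict == y_test)

    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print('Accuracy:\n', score)

    # Visualize the decision tree
    export_graphviz(estimator, out_file='titanic_tree.dot', feature_names=tran.get_feature_names_out())
    return None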

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def forest():
    # 1. Load the data
    titanic = pd.read_csv('titanic.csv')
    # 2. Data processing: handle missing values, feature values --> dict records
    # 3. Select the feature values and the target value
    x = titanic[['pclass', 'age', 'sex']].copy()  # copy to avoid SettingWithCopyWarning
    y = titanic['survived']
    # age has missing values: fill them with the mean age
    x['age'] = x['age'].fillna(x['age'].mean())
    # Convert each row to a dict
    x = x.to_dict(orient='records')
    # 4. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

    # 5. Feature engineering: dictionary feature extraction
    tran = DictVectorizer()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # 6. Random forest estimator with grid search and 3-fold cross-validation
    estimator = RandomForestClassifier()

    param_dict = {'n_estimators': [120, 200, 300, 500, 800, 1200], 'max_depth': [5, 8, 15, 25, 30]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
    estimator.fit(x_train, y_train)

    # 7. Model evaluation
    # Method 1: compare predictions with the true values directly
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Direct comparison:\n', y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print('score:\n', score)

    print('Best parameters:\n', estimator.best_params_)
    print('Best cross-validation score:\n', estimator.best_score_)
    print('Best estimator:\n', estimator.best_estimator_)
    print('Cross-validation results:\n', estimator.cv_results_)


if __name__ == '__main__':
    # taitan()
    forest()

Titanic case dataset: https://pan.baidu.com/s/19aAtgc6VJ_BtRNsw2QVZoA (extraction code: u28d)

# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,SGDRegressor,Ridge
from sklearn.metrics import mean_squared_error
import joblib


def linear1():
    # Optimization via the normal equation
    # 1. Load the dataset
    boston = load_boston()

    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3. Feature engineering: scaling (standardization)
    tran = StandardScaler()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # 4. Estimator workflow: fit() --> model; read coef_ and intercept_
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    print('Normal equation weights:\n', estimator.coef_)
    print('Normal equation bias:\n', estimator.intercept_)

    # 5. Model evaluation
    y_predict = estimator.predict(x_test)
    print('Predicted house prices:\n', y_predict)
    error = mean_squared_error(y_test, y_predict)
    print('Normal equation mean squared error:\n', error)


def linear2():
    # Optimization via gradient descent
    # 1. Load the dataset
    boston = load_boston()
    print('Data shape (samples, features):\n', boston.data.shape)

    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3. Feature engineering: scaling (standardization)
    tran = StandardScaler()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # 4. Estimator workflow: fit() --> model; read coef_ and intercept_
    estimator = SGDRegressor()
    estimator.fit(x_train, y_train)
    print('Gradient descent weights:\n', estimator.coef_)
    print('Gradient descent bias:\n', estimator.intercept_)

    # 5. Model evaluation
    y_predict = estimator.predict(x_test)
    print('Predicted house prices:\n', y_predict)
    error = mean_squared_error(y_test, y_predict)
    print('Gradient descent mean squared error:\n', error)


def linear3():
    # Ridge regression
    # 1. Load the dataset
    boston = load_boston()
    print('Data shape (samples, features):\n', boston.data.shape)

    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # 3. Feature engineering: scaling (standardization)
    tran = StandardScaler()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # 4. Estimator workflow: fit() --> model; read coef_ and intercept_
    estimator = Ridge(max_iter=10000)  # parameters can be tuned here
    estimator.fit(x_train, y_train)

    # Save the model
    joblib.dump(estimator, 'my_ridge.pkl')
    # Load the model back to check that saving worked
    estimator = joblib.load('my_ridge.pkl')

    print('Ridge regression weights:\n', estimator.coef_)
    print('Ridge regression bias:\n', estimator.intercept_)

    # 5. Model evaluation
    y_predict = estimator.predict(x_test)
    print('Predicted house prices:\n', y_predict)
    error = mean_squared_error(y_test, y_predict)
    print('Ridge regression mean squared error:\n', error)




import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score


def cancer():
    column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
    data = pd.read_csv('breast-cancer-wisconsin.data', names=column_name)
    # print('data', data)

    # Handle missing values
    # 1. Replace '?' with np.nan
    data = data.replace(to_replace='?', value=np.nan)
    # 2. Drop the samples with missing values
    data.dropna(inplace=True)
    # print('data', data.isnull().any())  # verify that no missing values remain

    # Select the feature values and the target value, then split the dataset
    x = data.iloc[:, 1:-1]
    y = data['Class']
    x_train, x_test, y_train, y_test = train_test_split(x, y)

    # Feature engineering: standardization
    tran = StandardScaler()
    x_train = tran.fit_transform(x_train)
    x_test = tran.transform(x_test)

    # Estimator workflow
    estimator = LogisticRegression()
    estimator.fit(x_train, y_train)
    # Logistic regression model parameters
    print('Logistic regression weights:\n', estimator.coef_)
    print('Logistic regression bias:\n', estimator.intercept_)

    # Model evaluation
    # 1. Compare predictions with the true values directly
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Direct comparison:\n', y_test == y_predict)
    # 2. Compute the accuracy
    score = estimator.score(x_test, y_test)
    print('score:\n', score)

    # Precision, recall and F1-score (class 2 = benign, class 4 = malignant)
    report = classification_report(y_test, y_predict, labels=[2, 4], target_names=['benign', 'malignant'])
    print('report:\n', report)

    # roc_auc_score needs 0/1 labels, so map y_test: 4 (malignant) --> 1, 2 (benign) --> 0
    y_true = np.where(y_test > 3, 1, 0)
    roc_auc = roc_auc_score(y_true, y_predict)
    print('roc_auc:\n', roc_auc)


if __name__ == '__main__':
    # linear1()
    # linear2()
    linear3()
    # cancer()

Cancer prediction case dataset: https://pan.baidu.com/s/1qgyWYVOqNKXziL41-wBHQw (extraction code: oqxw)

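I have finished the three-day introductory course from Heima (黑马) and now have a rough sense of direction. Next I will keep following Andrew Ng's course to understand the theory; his explanations of concepts and principles are really easy to follow. I have also been learning PR and PS along the way, and I plan to study numpy and pandas as well as data mining. I hope to make real progress over the next few months!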
