ML Study Notes - 2021-08-24 - Classification Algorithms - Model Selection and Tuning

燥栋

Article link: https://blog.csdn.net/qq_45363979/article/details/119889540

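3. Model Selection and Tuning

3.1 Cross-Validation

Definition: the training data is split into k folds; each fold is used once as the validation set while the remaining folds are used for training, and the k scores are averaged.
Purpose: to make the model's accuracy estimate more reliable.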
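
To make the definition concrete, here is a minimal sketch of 5-fold cross-validation using scikit-learn's cross_val_score. The choice of the iris data set, n_neighbors=5 and cv=5 is an assumption made for illustration and is not taken from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data set (150 samples, 3 classes).
X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is used once for validation
# while the other four folds are used for training.
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

The mean of the per-fold scores is the figure to report; it depends less on one lucky or unlucky split than a single hold-out score does.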

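3.2 Hyperparameter Search (Grid Search)

Grid search selects the value of K by looping over every candidate, e.g. k = [1, 2, 3, 4, 5, 6], and evaluating each one with cross-validation.
API: sklearn.model_selection.GridSearchCV(estimator, param_grid, cv)
Parameter 1 (estimator): the estimator to tune.
Parameter 2 (param_grid): the hyperparameter values to try, as a dict of the form {"hyperparameter name": [list of values]}.
Parameter 3 (cv): the number of cross-validation folds.
Result: after fit(), the search object exposes best_params_, best_score_, best_estimator_ and cv_results_.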
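
The loop that grid search automates can be written by hand. The sketch below is only an illustration of the idea, not GridSearchCV's actual implementation; the iris data and cv=5 are assumptions, and the names candidate_ks, mean_scores and best_k are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of K, as in the notes.
candidate_ks = [1, 2, 3, 4, 5, 6]

# Evaluate every candidate with 5-fold cross-validation
# and remember its mean validation accuracy.
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# The best K is the one with the highest mean score.
best_k = max(mean_scores, key=mean_scores.get)
print("Mean score per k:", mean_scores)
print("Best k:", best_k)
```

GridSearchCV does the same bookkeeping for any number of hyperparameters and, by default (refit=True), retrains the best configuration on the full training set.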

3.3 Iris Example with K Tuning

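from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def KNN_optimal():  # model selection and tuning
    # Grid search with cross-validation.
    # The original notes call a load_data() helper that is not shown in this section;
    # the iris split below is an assumed stand-in for it.
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)
    # KNN uses the Minkowski distance: p=1 is Manhattan, p=2 is Euclidean; the default is p=2.
    estimator = KNeighborsClassifier()
    # Start tuning.
    # Argument 1: the estimator.
    # Argument 2: the parameter grid, a dict with parameter names (strings) as keys and
    #             lists of candidate values as values, or a list of such dicts.
    # Argument 3: cv=K for K-fold cross-validation.
    # https://www.cnblogs.com/dblsha/p/10161798.html
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
    # End tuning.
    estimator.fit(x_train, y_train)

    # Predict on the test set with the tuned estimator.
    y_predict = estimator.predict(x_test)
    print("Predicted:", y_predict, "\nActual:", y_test, "\nMatch:", y_test == y_predict)
    score = estimator.score(x_test, y_test)
    print("Accuracy: ", score)
    # ------------------
    print("Best parameters:\n", estimator.best_params_)
    print("Best cross-validation score:\n", estimator.best_score_)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Cross-validation results:\n", estimator.cv_results_)
    # ----------------- The values above are the automatically selected tuning results.
    return None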
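
Printing cv_results_ directly gives a large dict that is hard to read. One possible way to inspect it, sketched here under the assumption that pandas is available, is to load it into a DataFrame and keep only the score columns. For brevity the search is fitted on the whole iris data set rather than on a training split; the names search, results and summary are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Same grid and fold count as in the function above.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=10,
)
search.fit(X, y)

# cv_results_ is a dict of arrays, one entry per candidate,
# so it converts directly into a table.
results = pd.DataFrame(search.cv_results_)
columns = ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

The row with rank_test_score equal to 1 corresponds to best_params_, and its mean_test_score equals best_score_.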

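3.4 Predicting Facebook Check-in Locations

Dataset overview
Workflow analysis
1) Load the data
2) Process the data:
Feature matrix: x (the coordinate, accuracy and time-derived columns)
Target: y (place_id)
a. Narrow the coordinate range: 2 < x < 2.5, 1.0 < y < 1.5
b. time -> year, month, day, hour, minute, second
c. Filter out places with few check-ins
3) Feature engineering: feature extraction, feature preprocessing (standardization), dimensionality reduction
4) Model training: the KNN estimator workflow
5) Model evaluation: model selection and tuning
6) Application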
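
Steps 2a to 2c can be tried on a tiny table before running them on the full file. The sketch below uses a hypothetical eight-row DataFrame (toy) with the same column names as the real data; the values and the threshold min_checkins = 3 are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature check-in table; the values are made up.
toy = pd.DataFrame({
    "row_id": range(8),
    "x": [2.1, 2.2, 2.3, 2.4, 2.2, 9.0, 2.3, 2.1],
    "y": [1.1, 1.2, 1.3, 1.4, 1.2, 1.2, 5.0, 1.3],
    "accuracy": [10, 20, 30, 40, 50, 60, 70, 80],
    "time": [0, 3600, 86400, 90000, 172800, 200000, 300000, 400000],
    "place_id": [111, 111, 111, 222, 222, 333, 111, 111],
})

# a. Narrow the coordinate range.
window = toy.query("x > 2 & x < 2.5 & y > 1.0 & y < 1.5").copy()

# b. Convert the time column (seconds) into day / weekday / hour features.
timestamps = pd.to_datetime(window["time"], unit="s")
window["day"] = timestamps.dt.day
window["weekday"] = timestamps.dt.weekday
window["hour"] = timestamps.dt.hour

# c. Keep only places with more than min_checkins check-ins.
min_checkins = 3  # illustrative threshold for this toy table
counts = window.groupby("place_id")["row_id"].count()
frequent = counts[counts > min_checkins].index
filtered = window[window["place_id"].isin(frequent)]

print(filtered)
```

Only the four rows of place 111 inside the window survive: place 222 has two check-ins there, and the rows with x = 9.0 or y = 5.0 fall outside the window.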

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

def load_data():
    data = pd.read_csv("../../resources/FBlocation/train.csv")
    # Optionally narrow the coordinate range with query():
    # data = data.query("x<2.5 & x>1 & y<1.5 & y>1.0")
    data = data.copy()
    # Process the time feature.
    time_value = pd.to_datetime(data["time"], unit="s")  # seconds -> datetime values
    date = pd.DatetimeIndex(time_value)  # exposes day / weekday / hour attributes
    data["day"] = date.day
    data["weekday"] = date.weekday
    data["hour"] = date.hour
    # Filter out places with few check-ins.
    # After groupby().count(), every column holds the number of rows per place_id.
    print("Counts per place_id:\n", data.groupby("place_id").count())
    # Any column would do here; row_id is used to get the check-ins per place.
    place_count = data.groupby("place_id").count()["row_id"]
    print("Check-ins per place:\n", place_count)

    # Keep only places with more than 10 check-ins.
    frequent_places = place_count[place_count > 10]
    print("Places with more than 10 check-ins:\n", frequent_places)

    # Boolean mask: True for rows whose place_id is a frequent place.
    mask = data["place_id"].isin(frequent_places.index.values)
    print("Boolean mask:\n", mask)

    final_data = data[mask]  # select rows with the boolean mask
    print("Processed data:\n", final_data)

    return final_data


def implement():
    data = load_data()  # load once; calling load_data() twice would reread the CSV
    used_data_x = data[["x", "y", "accuracy", "day", "weekday", "hour"]]
    used_data_y = data["place_id"]

    x_train, x_test, y_train, y_test = \
        train_test_split(used_data_x, used_data_y)
    # Standardize: fit on the training set only, then apply to the test set.
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = KNeighborsClassifier()
    param_dict = {"n_neighbors": [5, 10, 15, 20]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=4)
    estimator.fit(x_train, y_train)  # training features and training targets

    # Predict on the test set with the tuned estimator.
    y_predict = estimator.predict(x_test)
    print("Predicted:", y_predict, "\nActual:", y_test, "\nMatch:", y_test == y_predict)
    score = estimator.score(x_test, y_test)
    print("Accuracy: ", score)
    # ------------------
    print("Best parameters:\n", estimator.best_params_)
    print("Best cross-validation score:\n", estimator.best_score_)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Cross-validation results:\n", estimator.cv_results_)
    # ----------------- The values above are the automatically selected tuning results.

    return None


if __name__ == '__main__':
    implement()
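
One design choice worth noting: in the code above the scaler is fitted on the whole training set before grid search, so each cross-validation fold is scaled with statistics that include its own validation rows. A possible alternative, sketched here on the iris data as a stand-in for the Facebook file (an assumption so the example runs without the CSV), is to wrap the scaler and KNN in a Pipeline so that scaling is refitted inside every fold. The step names "scaler" and "knn" and random_state=22 are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=22)

# The pipeline standardizes the data and then applies KNN.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as "<step name>__<parameter name>".
param_dict = {"knn__n_neighbors": [5, 10, 15, 20]}
search = GridSearchCV(pipe, param_grid=param_dict, cv=4)
search.fit(x_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(x_test, y_test))
```

The raw x_test can be passed to score() or predict() directly, because the fitted pipeline applies the scaler itself.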
