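Workflow analysis
1) Load the data
2) Process the data
   Goal: obtain
       feature values x
       target values y
   a. Narrow the data range (on the x and y coordinate columns)
       2 < x < 2.5
       1.0 < y < 1.5
   b. time -> year, month, day, hour, minute, second
   c. Filter out places with few check-ins
   Split the dataset
3) Feature engineering: standardization
4) KNN estimator workflow
5) Model selection and tuning
6) Model evaluation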
Code implementation
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
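The loading step itself is not shown in these notes. A minimal sketch, assuming the check-ins are stored in a CSV file with a header row naming the six columns that appear in the `info()` summary (row_id, x, y, accuracy, time, place_id). The function name `load_checkins` is hypothetical.

```python
import pandas as pd


def load_checkins(source):
    """Read the check-in table from a CSV path or file-like object.

    Assumption: the file has a header row with the columns
    row_id, x, y, accuracy, time, place_id.
    """
    data = pd.read_csv(source)
    # Print the column summary (dtypes, memory usage) as a quick sanity check.
    data.info()
    return data
```

It could be called as `data = load_checkins("train.csv")`, where `train.csv` stands in for wherever the file actually lives.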
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29118021 entries, 0 to 29118020
Data columns (total 6 columns):
 #   Column    Dtype
---  ------    -----
 0   row_id    int64
 1   x         float64
 2   y         float64
 3   accuracy  int64
 4   time      int64
 5   place_id  int64
dtypes: float64(2), int64(4)
memory usage: 1.3 GB
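Step a of the outline (narrowing the data to 2 < x < 2.5 and 1.0 < y < 1.5) can be sketched with a boolean mask. The helper name `narrow_range` and its keyword arguments are illustrative and not taken from the original code. The summary that follows, with 83197 entries and a non-contiguous index, is presumably the result of a filter of this kind.

```python
import pandas as pd


def narrow_range(data, x_min=2.0, x_max=2.5, y_min=1.0, y_max=1.5):
    """Keep only check-ins whose coordinates fall strictly inside the window."""
    mask = (
        (data["x"] > x_min) & (data["x"] < x_max)
        & (data["y"] > y_min) & (data["y"] < y_max)
    )
    # Boolean indexing keeps the original row labels, so the filtered
    # frame's index is no longer a contiguous RangeIndex.
    return data[mask]
```

Working on a small window like this keeps KNN tractable, since the full table has about 29 million rows.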
<class 'pandas.core.frame.DataFrame'>
Int64Index: 83197 entries, 112 to 29117493
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   row_id    83197 non-null  int64
 1   x         83197 non-null  float64
 2   y         83197 non-null  float64
 3   accuracy  83197 non-null  int64
 4   time      83197 non-null  int64
 5   place_id  83197 non-null  int64
dtypes: float64(2), int64(4)
memory usage: 4.4 MB
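The code for steps b and c and for the dataset split is not shown, yet `x_train`, `x_test`, `y_train` and `y_test` must exist before the standardization step. The sketch below is one way to produce them. The threshold of 3 check-ins, the chosen feature columns, the reading of `time` as a count of seconds, and all helper names are assumptions made for illustration, not values from the original.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def add_time_features(data):
    """Step b: derive calendar fields from the integer `time` column.

    Assumption: `time` counts seconds from an arbitrary origin.
    """
    stamps = pd.to_datetime(data["time"], unit="s")
    out = data.copy()
    out["day"] = stamps.dt.day
    out["weekday"] = stamps.dt.weekday  # Monday is 0
    out["hour"] = stamps.dt.hour
    return out


def drop_rare_places(data, min_checkins=3):
    """Step c: keep only places with more than `min_checkins` check-ins."""
    counts = data["place_id"].value_counts()
    frequent = counts[counts > min_checkins].index
    return data[data["place_id"].isin(frequent)]


def make_split(data,
               feature_cols=("x", "y", "accuracy", "day", "weekday", "hour"),
               seed=None):
    """Select features x and target y, then split into train and test sets."""
    x = data[list(feature_cols)]
    y = data["place_id"]
    # With no test_size given, train_test_split holds out 25% of the rows.
    return train_test_split(x, y, random_state=seed)
```

Chained together, this would read `x_train, x_test, y_train, y_test = make_split(drop_rare_places(add_time_features(data)))`, with `data` being the range-filtered frame.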
# 3) Feature engineering: standardization
transfer = StandardScaler()
# Fit the scaler on the training set only, then reuse its mean and std on the test set
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# 4) KNN estimator
estimator = KNeighborsClassifier()
# 5) Model selection and tuning: grid search with cross-validation
# Candidate values for n_neighbors
param_dict = {"n_neighbors": [3, 5, 7, 9]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)
# 6) Model evaluation
# Method 1: compare the true and predicted values directly
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)
# Method 2: compute the accuracy on the test set
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)
# Best parameters: best_params_
print("Best parameters:\n", estimator.best_params_)
# Best score: best_score_ (mean cross-validated accuracy of the best candidate)
print("Best score:\n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:\n", estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:\n", estimator.cv_results_)
y_predict:
[1188605085 7547051259 4861093827 ... 3948427562 4712992402 8790355618]
Direct comparison of true and predicted values:
9410437     False
18527230    False
16723729    False
12847470    False
19728191    False
            ...
14490700    False
22043724    False
12880367     True
15500322    False
16762280     True
Name: place_id, Length: 20228, dtype: bool
Accuracy:
0.3592050622898952
Best parameters:
{'n_neighbors': 5}
Best score:
0.336475349258919
Best estimator:
KNeighborsClassifier()
Cross-validation results:
{'mean_fit_time': array([0.11700066, 0.10891525, 0.11783703, 0.10671488]),
 'std_fit_time': array([0.02167838, 0.00449742, 0.02246615, 0.00293658]),
 'mean_score_time': array([1.11003582, 1.2267855 , 1.43521214, 1.32364019]),
 'std_score_time': array([0.03227009, 0.00453402, 0.2423346 , 0.04249626]),
 'param_n_neighbors': masked_array(data=[3, 5, 7, 9],
                                   mask=[False, False, False, False],
                                   fill_value='?',
                                   dtype=object),
 'params': [{'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 9}],
 'split0_test_score': array([0.32875222, 0.33933162, 0.33527783, 0.33433854]),
 'split1_test_score': array([0.32298413, 0.33440451, 0.33484946, 0.33030108]),
 'split2_test_score': array([0.32545607, 0.33568992, 0.33573936, 0.33163593]),
 'mean_test_score': array([0.32573081, 0.33647535, 0.33528888, 0.33209185]),
 'std_test_score': array([0.00236281, 0.00208675, 0.00036338, 0.00167952]),
 'rank_test_score': array([4, 1, 2, 3])}
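The raw `cv_results_` dictionary is hard to read when printed. One possible way to inspect it is to load it into a DataFrame and keep only the score columns, sorted by rank. The helper name `summarize_cv` is hypothetical; the column names are the keys visible in the output above.

```python
import pandas as pd


def summarize_cv(cv_results):
    """Turn a GridSearchCV cv_results_ dict into a compact ranked table."""
    table = pd.DataFrame(cv_results)
    cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    # The best candidate (rank 1) ends up in the first row.
    return table[cols].sort_values("rank_test_score").reset_index(drop=True)
```

Calling `summarize_cv(estimator.cv_results_)` on the fitted grid search would list the four `n_neighbors` candidates from best to worst by mean cross-validated accuracy.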