Machine Learning Applications (1)

1. Boston Housing Price Prediction

This is a regression problem.
Using the Boston housing dataset, we standardize the data, fit a regression model, and compare several models against each other.
The code is as follows:

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# 1. Data preparation (506 rows x 14 columns including the target, no missing values)
boston = load_boston()
print(boston.DESCR)

x = boston.data
y = boston.target
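
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, a minimal alternative (the workaround suggested by
# scikit-learn's own removal notice) is to read the original data from the
# CMU StatLib source instead, for example:
#   data_url = "http://lib.stat.cmu.edu/datasets/boston"
#   raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
#   x = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 506 x 13 feature matrix
#   y = raw_df.values[1::2, 2]                                       # MEDV target values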

# 2. Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

# 3. Inspect the data
df = pd.DataFrame(np.hstack((x, y.reshape(-1, 1))))
df.describe()

# 4. Standardization
ss_x = StandardScaler()
ss_y = StandardScaler()

x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = ss_y.transform(y_test.reshape(-1, 1)).ravel()

print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

# 5. Regression
rfr = RandomForestRegressor()  # instantiate a RandomForestRegressor
rfr.fit(x_train, y_train)  # fit
rfr_y_predict = rfr.predict(x_test)  # predict

# 6. Performance evaluation
print('Built-in score of the model:', rfr.score(x_test, y_test))
print('R-squared:', r2_score(y_test, rfr_y_predict))
print('MSE:', mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                                 ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
print('MAE:', mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                                  ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))

# Multi-model comparison
estimators = {"svr kernel=linear regressor": SVR(kernel="linear"),
              "svr kernel=rbf regressor": SVR(kernel="rbf"),
              "svr kernel=poly regressor": SVR(kernel="poly"),
              "knr weights=uniform regressor": KNeighborsRegressor(weights='uniform'),
              "knr weights=distance regressor": KNeighborsRegressor(weights='distance'),
              "dtr regressor": DecisionTreeRegressor(),
              "randomforest regressor": RandomForestRegressor(),
              "GradientBoostingRegressor": GradientBoostingRegressor(),
              "lr": LinearRegression(),
              "sgdr": SGDRegressor()}

for key, estimator in estimators.items():
    estimator.fit(x_train, y_train)
    y_predict = estimator.predict(x_test)
    print(key, "R-squared:", r2_score(y_test, y_predict))
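
Because the comparison above relies on a single random train/test split, the ranking can change from run to run. A sketch of a more stable comparison uses 5-fold cross-validation on the training data (the same cross_val_score utility used in the Titanic section below); the cv and scoring values are illustrative choices:

from sklearn.model_selection import cross_val_score

# Cross-validated R-squared for each candidate regressor
for key, estimator in estimators.items():
    scores = cross_val_score(estimator, x_train, y_train, cv=5, scoring='r2')
    print(key, "mean R-squared over 5 folds:", scores.mean())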

 
 
2. Titanic Survival Prediction
This is a classification problem.
We prepare the data with feature selection, missing-value handling, and feature vectorization, then predict survival with decision tree, random forest, and other models.
The code is as follows:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# 1. Data preparation
train = pd.read_csv('./basic-ml/data/titanic/train.csv')
test = pd.read_csv('./basic-ml/data/titanic/test.csv')

print(train.info())  # Age, Cabin, Embarked contain missing values
print(test.info())  # Age, Fare, Cabin contain missing values

# Feature selection
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']

# 2. Handle missing values
x_train = x_train.copy()
x_test = x_test.copy()
print(x_train['Embarked'].value_counts())  # 'S' is the most frequent value

x_train['Embarked'] = x_train['Embarked'].fillna('S')
x_train['Age'] = x_train['Age'].fillna(x_train['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x_test['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x_test['Fare'].mean())

# Re-check that no missing values remain
x_train.info()
x_test.info()

# 3. Vectorize categorical features
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)

# 4. Training
estimators = {"DecisionTree": DecisionTreeClassifier(),
              "RandomForest": RandomForestClassifier(),
              "GradientBoosting": GradientBoostingClassifier(),
              "XGBC": XGBClassifier()}

for key, estimator in estimators.items():
    print(key, ':', cross_val_score(estimator, x_train, y_train, cv=5).mean())

# Predict on the test set with GradientBoosting
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
y_predict = gbc.predict(x_test)
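
If the goal is a Kaggle-style submission, the predictions can be written out together with the PassengerId column from the original test set. A minimal sketch, assuming the standard Kaggle Titanic file layout (the output file name is arbitrary):

# Assemble a submission file in the PassengerId / Survived format
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': y_predict})
submission.to_csv('./gbc_submission.csv', index=False)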

