Machine Learning Applications (1)

1. Boston Housing Price Prediction

This is a regression problem.
Using the Boston housing dataset, we standardize the data, fit a regression model, and compare several models against each other.
The code is as follows:

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# 1. Data preparation (506 rows x 14 columns including the target, no missing values)
boston = load_boston()
print(boston.DESCR)

x = boston.data
y = boston.target
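
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, a minimal alternative (the workaround suggested by
# scikit-learn's own removal notice) is to read the original data from the
# CMU StatLib source instead, for example:
#   data_url = "http://lib.stat.cmu.edu/datasets/boston"
#   raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
#   x = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 506 x 13 feature matrix
#   y = raw_df.values[1::2, 2]                                       # MEDV target values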

# 2. Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

# 3. Inspect the data
df = pd.DataFrame(np.hstack((x, y.reshape(-1, 1))))
df.describe()

# 4. Standardization
ss_x = StandardScaler()
ss_y = StandardScaler()

x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = ss_y.transform(y_test.reshape(-1, 1)).ravel()

print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

# 5. Regression
rfr = RandomForestRegressor()  # instantiate a RandomForestRegressor
rfr.fit(x_train, y_train)  # fit
rfr_y_predict = rfr.predict(x_test)  # predict

# 6. Performance evaluation
print('Built-in score of the model:', rfr.score(x_test, y_test))
print('R-squared:', r2_score(y_test, rfr_y_predict))
print('MSE:', mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                                 ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
print('MAE:', mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                                  ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))

# Multi-model comparison
estimators = {"svr kernel=linear regressor": SVR(kernel="linear"),
              "svr kernel=rbf regressor": SVR(kernel="rbf"),
              "svr kernel=poly regressor": SVR(kernel="poly"),
              "knr weights=uniform regressor": KNeighborsRegressor(weights='uniform'),
              "knr weights=distance regressor": KNeighborsRegressor(weights='distance'),
              "dtr regressor": DecisionTreeRegressor(),
              "randomforest regressor": RandomForestRegressor(),
              "GradientBoostingRegressor": GradientBoostingRegressor(),
              "lr": LinearRegression(),
              "sgdr": SGDRegressor()}

for key, estimator in estimators.items():
    estimator.fit(x_train, y_train)
    y_predict = estimator.predict(x_test)
    print(key, "R-squared:", r2_score(y_test, y_predict))
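
Because the comparison above relies on a single random train/test split, the ranking can change from run to run. A sketch of a more stable comparison uses 5-fold cross-validation on the training data (the same cross_val_score utility used in the Titanic section below); the cv and scoring values are illustrative choices:

from sklearn.model_selection import cross_val_score

# Cross-validated R-squared for each candidate regressor
for key, estimator in estimators.items():
    scores = cross_val_score(estimator, x_train, y_train, cv=5, scoring='r2')
    print(key, "mean R-squared over 5 folds:", scores.mean())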

 
 
2. Titanic Survival Prediction
This is a classification problem.
We prepare the data with feature selection, missing-value handling, and feature vectorization, then predict survival with decision tree, random forest, and other models.
The code is as follows:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# 1. Data preparation
train = pd.read_csv('./basic-ml/data/titanic/train.csv')
test = pd.read_csv('./basic-ml/data/titanic/test.csv')

print(train.info())  # Age, Cabin, Embarked contain missing values
print(test.info())  # Age, Fare, Cabin contain missing values

# Feature selection
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']

# 2. Handle missing values
x_train = x_train.copy()
x_test = x_test.copy()
print(x_train['Embarked'].value_counts())  # 'S' is the most frequent value

x_train['Embarked'] = x_train['Embarked'].fillna('S')
x_train['Age'] = x_train['Age'].fillna(x_train['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x_test['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x_test['Fare'].mean())

# Re-check that no missing values remain
x_train.info()
x_test.info()

# 3. Vectorize categorical features
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)

# 4. Training
estimators = {"DecisionTree": DecisionTreeClassifier(),
              "RandomForest": RandomForestClassifier(),
              "GradientBoosting": GradientBoostingClassifier(),
              "XGBC": XGBClassifier()}

for key, estimator in estimators.items():
    print(key, ':', cross_val_score(estimator, x_train, y_train, cv=5).mean())

# Predict on the test set with GradientBoosting
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
y_predict = gbc.predict(x_test)
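
If the goal is a Kaggle-style submission, the predictions can be written out together with the PassengerId column from the original test set. A minimal sketch, assuming the standard Kaggle Titanic file layout (the output file name is arbitrary):

# Assemble a submission file in the PassengerId / Survived format
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': y_predict})
submission.to_csv('./gbc_submission.csv', index=False)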

