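Read the dataset from a CSV file, then fit a model to it.
The scatter plot of the dataset is shown below (x-axis: number of dolls; y-axis: production cost):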
# See GitUploading/ML/linear regression
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
print(reg.coef_)
print(reg.intercept_)
import numpy as np
# dataset = pd.loadtxt('simple_example.csv')
dataset= pd.read_csv('simple_example.csv')
# X = dataset[list(dataset.columns)[:-1]]
X = dataset['x']
print(type(X))
# X = dataset[1:2]
y = dataset['y']
# print("size:",len(dataset))
print(X)
print(y)
print('===============')
print(dataset)
plt.plot(X, y, 'k.')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.25)  # split the samples into training and test sets
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train = np.array(X_train).reshape(-1, 1)
# y = [7, 9, 13, 17.5, 18]
# Fit the model on the training data
model.fit(X_train, y_train)  # fit the model by ordinary least squares
print('Coefficients: \n', model.coef_)  # show the regression coefficient (slope)
print('Intercept: \n', model.intercept_)  # show the intercept
X = np.array(X).reshape(-1, 1)
y_pred = model.predict(X)
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.show()
print(X_train)
from sklearn.metrics import mean_squared_error, r2_score
print('======== Evaluation: mean squared error ========')
X_test = np.array(X_test).reshape(-1, 1)
diabetes_y_pred = model.predict(X_test)
print("Mean squared error: %.2f"
% mean_squared_error(y_test, diabetes_y_pred))
print('Coefficient of determination (R^2): %.2f' % r2_score(y_test, diabetes_y_pred))
Output (note the original dataset, and how X and y are read from it as separate columns):
Coefficients:
[ 1.03705579]
intercept:
-1.19837809917
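The coefficient and intercept printed above define the fitted line, cost = coefficient * dolls + intercept. As a minimal sketch, a prediction can be reproduced by hand from those two numbers. The helper name `predict_cost` and the input value of 10 dolls are illustrative assumptions and are not part of the original script.

```python
# Values copied from the run output above.
COEF = 1.03705579
INTERCEPT = -1.19837809917


def predict_cost(n_dolls):
    """Hypothetical helper: evaluate the fitted line at n_dolls."""
    return COEF * n_dolls + INTERCEPT


# Estimated production cost for 10 dolls (an arbitrary example input).
print(predict_cost(10))
```

For a fitted single-feature model, `model.predict([[10]])` should give the same value up to floating-point rounding.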
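Definition of mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Here $y_i$ is the actual production cost and $\hat{y}_i$ the predicted cost of the $i$-th test sample. The MSE is the average squared error made when this model is used to estimate production cost; the script prints it under the label "Mean squared error".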
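To make the mean squared error concrete, here is a small sketch that computes it by hand as the mean of the squared differences between actual and predicted values. The function name `mean_squared_error_manual` and the sample numbers are made up for illustration; they are not taken from the dataset.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]
print(mean_squared_error_manual(y_true, y_pred))
```

On the same inputs this should agree with `sklearn.metrics.mean_squared_error`, which the script uses.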
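Coefficient of determination (note: it must be computed on the test data):
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \quad SS_{res} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \quad SS_{tot} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
where $\bar{y}$ is the mean of the actual test values. A value of 0.92 means that 92% of the variation in cost can be explained by the model.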
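The coefficient of determination can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the targets are not all identical (otherwise the total sum of squares is zero); the function name `r2_score_manual` and the sample numbers are illustrative only.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, assuming y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the targets around their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(r2_score_manual(y_true, y_pred), 2))  # 0.98
```

A score of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.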