Processed dataset
    Global_active_power  Global_reactive_power  Global_intensity    b
1                 5.360                  0.436              23.0  1.0
2                 5.374                  0.498              23.0  1.0
3                 5.388                  0.502              23.0  1.0
4                 3.666                  0.528              15.8  1.0
5                 3.520                  0.522              15.0  1.0
6                 3.702                  0.520              15.8  1.0
7                 3.700                  0.520              15.8  1.0
8                 3.668                  0.510              15.8  1.0
9                 3.662                  0.510              15.8  1.0
10                4.448                  0.498              19.6  1.0
11                5.412                  0.470              23.2  1.0
12                5.224                  0.478              22.4  1.0
13                5.268                  0.398              22.6  1.0
14                4.054                  0.422              17.6  1.0
15                3.384                  0.282              14.2  1.0
16                3.270                  0.152              13.8  1.0
17                3.430                  0.156              14.4  1.0
18                3.266                  0.000              13.8  1.0
19                3.728                  0.000              16.4  1.0
20                5.894                  0.000              25.4  1.0
Code and comments
import numpy as np
import pandas as pd
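# Raw string so the backslashes in the Windows path are kept literally
path = r'E:\PHP\Pycharmprojects\python_ML\ML_Two\datas\household_power_consumption_1000.txt'
df = pd.read_csv(path, sep=';')  # load the semicolon-separated file with pandas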
## Add a bias column and drop the columns that are not needed
b = np.ones(len(df))  # array of ones, named b, used as the bias term
df['b'] = b  # attach the bias term to the original DataFrame
# Keep Global_active_power, Global_reactive_power, Global_intensity and the bias column b
a = df[['Global_active_power','Global_reactive_power','Global_intensity','b']]
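## Handle invalid data
a = a.replace('?', np.nan).dropna()  # drop any row in which a feature is missing
print(a.head(20))
# info() prints its summary itself and returns None, hence the trailing "None" in the output
print(a.info())
# Convert the string column to float
a['Global_active_power'] = a['Global_active_power'].astype('float64')
print(a.info())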
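## Split into X and Y
# Use "Global_active_power", "Global_reactive_power" and "b" as X
X = a[['Global_active_power','Global_reactive_power','b']]
Y = a[['Global_intensity']]  # use Global_intensity as Y
## Convert the DataFrames to NumPy arrays
# np.asarray replaces np.mat, which is no longer available in NumPy 2.0
X = np.asarray(X)  # X as a 2-D NumPy array
Y = np.asarray(Y)  # Y as a 2-D NumPy array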
## Solve for the model parameters with the normal equation: theta = (X^T X)^-1 X^T Y
theta = np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,Y))
print(theta)
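The steps above can also be sketched on a small in-memory sample, which makes them easy to try without the original file. This is only an illustration under a few assumptions: the sample values are made up, `na_values='?'` is used so that pandas parses the placeholder as NaN while reading (instead of the `replace` and `astype` calls above), and `np.linalg.solve` is swapped in for the explicit inverse, with `np.linalg.lstsq` as a cross-check.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the same ';'-separated layout. The values are made up
# for illustration, and '?' marks a missing reading.
raw = """Global_active_power;Global_reactive_power;Global_intensity
4.0;0.4;17.4
?;?;?
2.0;0.1;8.7
3.0;0.5;13.3
1.0;0.2;4.6
5.0;0.3;21.5
"""

# na_values='?' turns the placeholder into NaN while parsing, so every column
# is read as float64 and no astype() call is needed afterwards.
df = pd.read_csv(io.StringIO(raw), sep=';', na_values='?').dropna()
df['b'] = 1.0  # bias column

X = df[['Global_active_power', 'Global_reactive_power', 'b']].to_numpy()
Y = df[['Global_intensity']].to_numpy()

# Normal equation (X^T X) theta = X^T Y, solved without forming the inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver as a cross-check on the same X and Y.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_solve)
print(np.allclose(theta_solve, theta_lstsq))
```

Solving the linear system gives the same parameters as multiplying by the inverse when X^T X is well conditioned, and it tends to behave better when the features are nearly collinear.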
Output
    Global_active_power  Global_reactive_power  Global_intensity    b
1                 5.360                  0.436              23.0  1.0
2                 5.374                  0.498              23.0  1.0
3                 5.388                  0.502              23.0  1.0
4                 3.666                  0.528              15.8  1.0
5                 3.520                  0.522              15.0  1.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 1 to 999
Data columns (total 4 columns):
Global_active_power 999 non-null object
Global_reactive_power 999 non-null float64
Global_intensity 999 non-null float64
b 999 non-null float64
dtypes: float64(3), object(1)
memory usage: 39.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 1 to 999
Data columns (total 4 columns):
Global_active_power 999 non-null float64
Global_reactive_power 999 non-null float64
Global_intensity 999 non-null float64
b 999 non-null float64
dtypes: float64(4)
memory usage: 39.0 KB
None
[[4.10551185]
[0.68820978]
[0.35884791]]
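The three printed values follow the column order of X, so they read as the weights for Global_active_power and Global_reactive_power, followed by the bias. As a rough sketch of how they might be used, the snippet below applies them to the first three rows of the processed dataset and compares the predictions with the recorded Global_intensity. The names `X_sample`, `y_true`, `y_pred` and `residual` are introduced here purely for illustration.

```python
import numpy as np

# Parameters copied from the printed theta, in the column order of X:
# Global_active_power, Global_reactive_power, b.
theta = np.array([[4.10551185], [0.68820978], [0.35884791]])

# First three rows of the processed dataset, with the bias column appended.
X_sample = np.array([
    [5.360, 0.436, 1.0],
    [5.374, 0.498, 1.0],
    [5.388, 0.502, 1.0],
])
y_true = np.array([[23.0], [23.0], [23.0]])  # recorded Global_intensity

y_pred = X_sample @ theta    # predicted Global_intensity
residual = y_true - y_pred   # recorded value minus prediction

# One row per sample: prediction, then residual.
print(np.hstack([y_pred, residual]))
```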