Machine Learning Project
- Data Preprocessing
1.1 Visualization
Import libraries:
“
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # statistical data visualization
from sklearn.preprocessing import StandardScaler
from scipy.stats import norm
from scipy import stats  # statistics
import warnings
warnings.filterwarnings('ignore')
”
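Histogram: inspect the distribution of a variable:
“
# df_train is the training DataFrame, assumed to be loaded already (with a 'SalePrice' column)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(df_train['SalePrice'], kde=True)
”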
Box plot: inspect the distribution of SalePrice for each category:
“
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))  # subplots creates a figure and a set of subplots
sns.boxplot(x=var, y='SalePrice', data=data, ax=ax)
ax.set_ylim(0, 800000)
”
Scatter plot: inspect the relationship between two variables:
“
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
”
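1.2 Data Cleaning
1. Handling missing values:
1) Ignore the tuple (drop the row):
2) Fill in the missing values manually:
3) Fill in the missing values automatically: a global constant, a measure of central tendency of the attribute (mean or median), or the most probable value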
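The three strategies above can be sketched with pandas roughly as follows. This is only a minimal illustration: the toy DataFrame `df_demo`, its column names, and the fill values are all hypothetical, and the column mode is used as a simple stand-in for "the most probable value".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df_demo = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 90.0],
    "rooms": [2.0, 3.0, np.nan, 3.0],
    "city": ["A", None, "B", "A"],
})

# 1) Ignore the tuple: drop every row that has at least one missing value
dropped = df_demo.dropna()

# 2) Manual fill: set one specific cell by hand (60.0 is an arbitrary value)
manual_filled = df_demo.copy()
manual_filled.loc[1, "area"] = 60.0

# 3a) Automatic fill with a global constant per column
constant_filled = df_demo.fillna({"area": 0.0, "rooms": 0.0, "city": "Unknown"})

# 3b) Automatic fill with a central measure (median) for numeric columns
num_cols = ["area", "rooms"]
median_filled = df_demo.copy()
median_filled[num_cols] = median_filled[num_cols].fillna(median_filled[num_cols].median())

# 3c) Automatic fill with the most frequent value (mode) for a categorical column
mode_filled = df_demo.copy()
mode_filled["city"] = mode_filled["city"].fillna(mode_filled["city"].mode()[0])
```

Dropping rows is the simplest option but discards data; which fill strategy is appropriate depends on the attribute and on how much is missing.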
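Count the missing values of each attribute:
“
# Number of missing values per column, largest first
total = df.isnull().sum().sort_values(ascending=False)
# Fraction of missing values per column (count() on the boolean frame gives the row count)
percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
”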
Drop the tuples whose attributes are all null:
“
df_cleaned = df.dropna(how='all')
df_cleaned
”
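Fill missing values with the column mean:
“
from sklearn.impute import SimpleImputer
# The 'mean' strategy only works on numeric columns
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df_cleaned)
df_cleaned_1 = imp.transform(df_cleaned)  # returns a NumPy array, not a DataFrame
df_cleaned_1
”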
- Feature Engineering
Numerical data:
– Normalization (min-max scaling):
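Relevant method and import:
“
# Import the normalization method
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = [[180, 75, 25], [175, 80, 19], [159, 50, 40], [160, 60, 32]]
# Instantiate the scaler; the default target range is (0, 1)
scaler = MinMaxScaler()
# To scale into a different range, pass feature_range instead:
# scaler = MinMaxScaler(feature_range=(0, 2))
# Fit and transform the data; the return value is a NumPy array
result = scaler.fit_transform(data)
”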
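As a cross-check of what the scaler computes, min-max scaling can be written by hand with NumPy: each column is mapped with x' = (x - min) / (max - min), and then stretched to a target range [a, b] with x' * (b - a) + a. The sketch below reuses the sample data from above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[180, 75, 25], [175, 80, 19], [159, 50, 40], [160, 60, 32]], dtype=float)

# Column-wise minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Scale each column into [0, 1]
manual_01 = (data - col_min) / (col_max - col_min)

# Stretch into an arbitrary range [low, high], here (0, 2)
low, high = 0, 2
manual_02 = manual_01 * (high - low) + low

# Results from scikit-learn for comparison
sk_01 = MinMaxScaler().fit_transform(data)
sk_02 = MinMaxScaler(feature_range=(0, 2)).fit_transform(data)

print(np.allclose(manual_01, sk_01), np.allclose(manual_02, sk_02))
```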
– Standardization (z-score):
“
# Import the standardization method
from sklearn.preprocessing import StandardScaler
# Sample data
data = [[180, 75, 25], [175, 80, 19], [159, 50, 40], [160, 60, 32]]
# Instantiate the scaler
scaler = StandardScaler()
# Fit and transform the data; the return value is a NumPy array
result = scaler.fit_transform(data)
”
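Standardization can likewise be reproduced by hand: each column is transformed with z = (x - mean) / std, so that it ends up with mean 0 and standard deviation 1. A minimal sketch on the same sample data follows; note that `StandardScaler` uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[180, 75, 25], [175, 80, 19], [159, 50, 40], [160, 60, 32]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# z-score for every value
manual_z = (data - col_mean) / col_std

# Result from scikit-learn for comparison
sk_z = StandardScaler().fit_transform(data)

print(np.allclose(manual_z, sk_z))
```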
Categorical data:
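The notes do not say which encoding is intended here. As one common possibility, a categorical column can be one-hot encoded with pandas or mapped to integer codes. The column name `color` and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
df_cat = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df_cat, columns=["color"], dtype=int)

# Integer codes: each category gets an integer (categories are sorted alphabetically)
codes = df_cat["color"].astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```

One-hot encoding avoids implying an order between categories, while integer codes are more compact and may suit ordinal attributes or tree-based models.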
Time data:
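The notes leave this item empty. One typical way to handle a time column, shown here only as an assumption, is to parse it with `pd.to_datetime` and derive numeric features such as year, month, and day of week. The column name `sold_at` and the dates are hypothetical.

```python
import pandas as pd

# Hypothetical date column stored as strings
df_time = pd.DataFrame({"sold_at": ["2023-01-15", "2023-06-30", "2024-12-01"]})

# Parse the strings into datetime values
df_time["sold_at"] = pd.to_datetime(df_time["sold_at"])

# Derive numeric features from the datetime column
df_time["year"] = df_time["sold_at"].dt.year
df_time["month"] = df_time["sold_at"].dt.month
df_time["dayofweek"] = df_time["sold_at"].dt.dayofweek  # Monday = 0, Sunday = 6

print(df_time)
```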
Correlation analysis:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Create a sample dataset
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [5, 4, 3, 2, 1],
        'Feature3': [2, 3, 1, 5, 4]}
df = pd.DataFrame(data)
# Compute the correlation matrix (Pearson by default)
correlation_matrix = df.corr()
# Plot the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
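To make the numbers in the heatmap less opaque, the Pearson coefficient that `df.corr()` computes by default can be reproduced by hand as the sum of products of deviations divided by the square root of the product of the summed squared deviations. The sketch below uses the same sample values; the names `df_corr`, `r_manual`, and `r_pandas` are illustrative.

```python
import numpy as np
import pandas as pd

df_corr = pd.DataFrame({
    "Feature1": [1, 2, 3, 4, 5],
    "Feature2": [5, 4, 3, 2, 1],
    "Feature3": [2, 3, 1, 5, 4],
})

x = df_corr["Feature1"]
y = df_corr["Feature3"]

# Deviations from the column means
dx = x - x.mean()
dy = y - y.mean()

# Pearson correlation coefficient computed by hand
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient from pandas
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

In this sample, Feature1 and Feature2 are perfectly negatively correlated because one is the exact reverse of the other, which is why that cell of the heatmap shows -1.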