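Dataset pima.arff:
【1】Pregnant: number of pregnancies
【2】plasma-glucose: plasma glucose concentration
【3】diastolic-blood-pressure: diastolic blood pressure
【4】Triceps-skin-fold-pressure: triceps skin fold thickness
【5】2-Hour-serum-Insulin: 2-hour serum insulin
【6】Body-mass-index: body mass index, weight / height^2
【7】Diabetes-pedigree-function: diabetes pedigree function
【8】Age: age
【9】class: class label {0, 1}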
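Step 1: Read the dataset
Import the arff module from scipy.io and store the data in a pandas DataFrame:
from scipy.io import arff
import numpy as np
import pandas as pd
import time

# Display options for wide output
pd.set_option('display.max_columns', 10)
# pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)
data, meta = arff.loadarff('./pima.arff')
df = pd.DataFrame(data)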
Step 2: Compute the mean and variance of each feature
The mean (formula omitted) is computed as follows:
num_rows = df.shape[0]
num_cols = df.shape[1] - 1  # exclude the class label column
# Mean
totals = num_cols * [0.0]
time_begin = time.time()
for row in data:
    for i in range(num_cols):
        totals[i] += row[i]
mean = [total / num_rows for total in totals]
time_end = time.time()
# print(totals)
for i in range(num_cols):
    print("Feature {0} mean: {1:.5f}".format(df.columns[i], mean[i]))
print("Elapsed: {0}s \r\n".format(time_end - time_begin))
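The variance formula and code are as follows (population variance, dividing by n):
$$\mathrm{Var}[X] = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
# Variance
demos = num_cols * [0.0]
time_begin = time.time()
for row in data:
    for i in range(num_cols):
        demos[i] += (row[i] - mean[i]) ** 2
var = [demo / num_rows for demo in demos]
time_end = time.time()
for i in range(num_cols):
    print("Feature {0} variance: {1:.5f}".format(df.columns[i], var[i]))
print("Elapsed: {0}s \r\n".format(time_end - time_begin))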
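Computing the mean and the variance as above takes one scan of the dataset each, so two passes are needed to get both results. To obtain the mean and the variance in a single pass, use the equivalent form of the variance:
$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)^2$$
# Single pass over the dataset
print("Single pass over the dataset:")
nums1 = num_cols * [0.0]  # running sums of x
nums2 = num_cols * [0.0]  # running sums of x^2
time_begin = time.time()
for row in data:
    for i in range(num_cols):
        nums1[i] += row[i]
        nums2[i] += (row[i]) ** 2
mean2 = [num1 / num_rows for num1 in nums1]
var2 = [num2 / num_rows - (num1) ** 2 / (num_rows) ** 2 for num1, num2 in zip(nums1, nums2)]
time_end = time.time()
for i in range(num_cols):
    print("Feature {0} mean: {1:.5f}".format(df.columns[i], mean2[i]))
for i in range(num_cols):
    print("Feature {0} variance: {1:.5f}".format(df.columns[i], var2[i]))
print("Total elapsed: {0}s \r\n".format(time_end - time_begin))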
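The shortcut E[X^2] - (E[X])^2 subtracts two potentially large, nearly equal numbers, so it can lose precision when the mean is large relative to the spread. One commonly used alternative that still needs only one pass is Welford's algorithm. The sketch below is not part of the original code; the function name `welford` and the sample values are made up for illustration.

```python
import math
import statistics


def welford(values):
    """Return (mean, population variance) of values in a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    if count == 0:
        raise ValueError("welford() requires at least one value")
    return mean, m2 / count


# Hypothetical sample values, not taken from pima.arff
sample = [6.0, 1.0, 8.0, 1.0, 0.0, 5.0]
mean_w, var_w = welford(sample)
print(mean_w, var_w)
# Cross-check against the standard library's population variance
print(math.isclose(var_w, statistics.pvariance(sample)))
```

Dividing `m2` by `count` gives the population variance, which matches the division by `num_rows` used above.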
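Step 3: Compute the pairwise correlation between features (excluding the class label column)
The correlation coefficient formula and code are as follows, where Cov(X, Y) is the covariance of X and Y, Var[X] is the variance of X, and Var[Y] is the variance of Y:
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$
or
$$r(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
def cov(x, mean_x, y, mean_y):
    total = 0.0
    for index in range(num_rows):
        total += (x[index] - mean_x) * (y[index] - mean_y)
    cov_xy: float = total / num_rows
    # print(cov_xy)
    return cov_xy


r = np.ones((num_cols, num_cols))  # the diagonal stays 1
for i in range(0, num_cols):
    for j in range(i + 1, num_cols):
        r[i][j] = cov(df.values[:, i], mean[i], df.values[:, j], mean[j]) / ((var[i] * var[j]) ** 0.5)
        r[j][i] = r[i][j]  # the matrix is symmetric, so mirror the value
print(r)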
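A correlation matrix is symmetric with ones on the diagonal, so each coefficient only needs to be computed once and can then be mirrored. The following is a small self-contained sketch of that idea in plain Python; the helper names `pearson` and `correlation_matrix` and the toy columns are assumptions for illustration, not taken from the original scripts.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov_xy / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build the full symmetric matrix from a list of columns."""
    k = len(columns)
    r = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            r[i][j] = r[j][i] = pearson(columns[i], columns[j])
    return r


# Hypothetical toy columns: the second is 2x the first, the third is reversed
cols = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
matrix = correlation_matrix(cols)
for line in matrix:
    print(line)
```

Because the same divisor n appears in the covariance and in both variances, it cancels out, so the result does not depend on whether n or n-1 is used.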
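Step 4: Verify the results with library functions
# Verification
features = df.iloc[:, :num_cols]  # feature columns only, class label excluded
print(features.mean())
# pandas var() defaults to ddof=1 (sample variance); ddof=0 matches the population variance above
print(features.var(axis=0, ddof=0))
# NumPy var() defaults to ddof=0; convert to an ndarray so NumPy's own implementation is used
print(np.var(features.to_numpy(dtype=float), axis=0))
# Alternatively:
# df.loc[len(df)] = df.mean()
# df.loc[len(df)] = df[:len(df)].var()
print(features.corr())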
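The ddof ("delta degrees of freedom") setting decides whether the sum of squared deviations is divided by n (ddof=0, population variance) or by n-1 (ddof=1, sample variance). As a rough illustration that needs neither pandas nor NumPy, the standard library exposes both variants; the sample values below are made up.

```python
import math
import statistics

# Hypothetical sample values, not taken from pima.arff
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

pop_var = statistics.pvariance(sample)  # divides by n, like ddof=0
samp_var = statistics.variance(sample)  # divides by n - 1, like ddof=1
print(pop_var, samp_var)

# The two differ only by the factor (n - 1) / n
print(math.isclose(samp_var * (n - 1) / n, pop_var))
```

With several hundred rows the two values are close, but they will not match to five decimal places, so the ddof used for the check has to agree with the divisor used in the hand-written code.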
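Finally, some commonly used DataFrame operations:
Get the column names: .columns
Select columns by name x, y, ...: .loc[:, ['x', 'y', ...]]
Select a column by integer position x: .iloc[:, x]
The code is saved in these .py files: 1001_2.py 1004.py 1006.py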
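The three accessors can be tried out on a small frame. This is a minimal sketch with a made-up DataFrame named `toy`; note that selecting several columns by name requires passing a list to .loc.

```python
import pandas as pd

# Hypothetical toy frame, not the pima data
toy = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

names = list(toy.columns)     # column names
xy = toy.loc[:, ["x", "y"]]   # columns selected by name (a list of labels)
second = toy.iloc[:, 1]       # column selected by integer position

print(names)
print(xy)
print(second)
```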