Python Data Preprocessing: Data Discretization
Discretization:
Put simply, discretization converts continuous data into discrete data.
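Benefits of discretization
1. Some algorithms need discrete values for their computations
2. It is required for binarization
3. It can make data processing more efficient
4. It can make data processing more accurate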
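Discretization methods:
1. Equal-width discretization
2. K-means clustering
3. Quartile-based discretization
4. Equal-frequency discretization
5. Binarization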
Code implementation:
Required packages:
from sklearn.cluster import KMeans
from sklearn import preprocessing
import numpy as np
import pandas as pd
Load the data:
df = pd.read_csv('https://raw.githubusercontent.com/ffzs/dataset/master/Mall_Customers.csv',
usecols=['Age','Annual Income (k$)','Spending Score (1-100)'])
df.columns = ['Age','Income','Spend']
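Equal-width discretization
# Split Age into 4 bins of equal width, labelled 0 to 3
df['Age_discretized'] = pd.cut(df.Age,4,labels=range(4))
# Count the rows that fall into each bin
df.groupby('Age_discretized').count()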
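To see what equal-width binning does under the hood, the sketch below asks pd.cut to return the bin edges as well. The ages are made-up values (not taken from Mall_Customers.csv) so that the example runs without a download.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only (not from the Mall_Customers dataset)
ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 59, 63, 70])

# retbins=True also returns the bin edges that pd.cut computed
codes, edges = pd.cut(ages, 4, labels=range(4), retbins=True)

# Width of each bin; the lowest edge is moved slightly below the minimum
# so that the minimum value itself lands in the first bin
widths = np.diff(edges)

# Rows per bin: equal width does not mean equal counts
counts = codes.value_counts().sort_index()
print(edges)
print(counts.tolist())  # [4, 2, 1, 3]
```

The bins cover equal ranges of age, but the number of rows per bin depends on how the data is distributed.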
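Discretization by clustering
# KMeans expects a 2-D array, so reshape Income into a single column
data = np.array(df['Income'])
data_re = data.reshape((data.size,1))
# Create the K-means model
km_model = KMeans(n_clusters=4,random_state=2018)
# Fit the model and get a cluster label for each row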
result = km_model.fit_predict(data_re)
result
df['Income_discretized'] = result
df.groupby('Income_discretized').count()
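One thing to watch with the clustering approach: the label numbers returned by KMeans are arbitrary, so cluster 0 is not necessarily the lowest income group. The sketch below shows one possible way to renumber the labels by cluster center so they read as ordered levels. The income values are made up, and n_init=10 is an assumption added here to keep the result stable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up incomes, sorted ascending, for illustration only
income = np.array([15, 16, 18, 40, 42, 45, 70, 72, 75, 120, 125, 130],
                  dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=4, n_init=10, random_state=2018)
raw_labels = km.fit_predict(income)

# Cluster ids sorted from the lowest center to the highest center
order = np.argsort(km.cluster_centers_.ravel())

# rank[cluster_id] = position of that cluster when sorted by center
rank = np.empty_like(order)
rank[order] = np.arange(order.size)

# Relabel so that 0 is the lowest income group and 3 the highest
ordered_labels = rank[raw_labels]
print(ordered_labels)
```

After relabeling, the codes can be compared or sorted like the output of pd.cut.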
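Quartile-based discretization
# Split Spend at its quartiles; labels run from the lowest bin (C) to the highest (S)
df['Spend_discretized'] = pd.qcut(df['Spend'],4,labels=['C','B','A','S'])
df.groupby('Spend_discretized').count()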
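As a small check on what pd.qcut does, the sketch below returns the quartile edges alongside the grades. The scores are made-up values chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up spending scores for illustration only
scores = pd.Series([3, 10, 17, 25, 33, 41, 48, 55, 62, 70, 78, 99])

# retbins=True also returns the quartile values used as bin edges
grades, edges = pd.qcut(scores, 4, labels=['C', 'B', 'A', 'S'], retbins=True)

# Each quartile bin holds the same number of rows here
counts = grades.value_counts().sort_index()
print(edges.tolist())   # [3.0, 23.0, 44.5, 64.0, 99.0]
print(counts.tolist())  # [3, 3, 3, 3]
```

Unlike equal-width bins, the edges are unevenly spaced, because they follow the data rather than the value range.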
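Equal-frequency discretization
k = 4  # number of bins
data = df['Age']  # column to discretize
# Quantile levels: [0.0, 0.25, 0.5, 0.75, 1.0]
w = [1.0*i/k for i in range(k + 1)]
# Bin edges are the values of the data at those levels.
# Series.quantile accepts w directly, so there is no need to slice describe() output
ws = data.quantile(w).to_numpy()
# Cut on the edges ws, not on the fractions w; include_lowest keeps the minimum value
df['Age2'] = pd.cut(data,ws,labels = range(k),include_lowest=True)
df.groupby('Age2').count()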
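The manual route (compute quantile edges, then cut on them) should give the same bins as pd.qcut. The sketch below compares the two on made-up ages; it illustrates the idea and is not a replacement for the code above.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([19, 21, 20, 23, 31, 22, 35, 64, 30, 67, 58, 24])
k = 4

# Manual route: quantile levels -> bin edges -> pd.cut
levels = np.linspace(0, 1, k + 1)
edges = ages.quantile(levels).to_numpy()
manual = pd.cut(ages, edges, labels=range(k), include_lowest=True)

# Shortcut: pd.qcut computes the quantile edges internally
auto = pd.qcut(ages, k, labels=range(k))

# Both routes assign every row to the same bin
same = bool((manual.astype(int) == auto.astype(int)).all())
print(same)  # True
```

Note that both routes raise an error if two quantile edges coincide, which can happen when a column has many repeated values.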
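Binarization
data = np.array(df['Income'])
# Values above the mean map to 1, all others to 0
binarizer_scaler = preprocessing.Binarizer(threshold=data.mean())
result = binarizer_scaler.fit_transform(data.reshape(-1,1))
# fit_transform returns a single-column 2-D array; flatten it before assigning
df['Income2'] = result.ravel()
df.groupby('Income2').count()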
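Binarizer is a thin wrapper around a simple comparison: values strictly greater than the threshold become 1, everything else becomes 0. The sketch below checks this on made-up incomes by comparing the result with a plain NumPy expression.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up incomes for illustration only; their mean is 66
income = np.array([15.0, 40.0, 60.0, 61.0, 90.0, 130.0])
threshold = income.mean()

# Plain comparison: 1 where the value is strictly above the threshold
by_comparison = (income > threshold).astype(int)

# Same operation through scikit-learn's Binarizer
by_sklearn = (
    Binarizer(threshold=threshold)
    .fit_transform(income.reshape(-1, 1))
    .ravel()
    .astype(int)
)

print(by_comparison.tolist())  # [0, 0, 0, 0, 1, 1]
```

For a single column the NumPy comparison is enough; Binarizer is mainly convenient inside a scikit-learn pipeline.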