Reading the data
import pandas as pd
data = pd.read_csv('C://Users//TD//Desktop//hosptdata.csv')
data1 = pd.read_csv('C://Users//TD//Desktop//adjestdata.csv')  # purely numeric data
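Since data1 is meant to hold only numeric columns, it can help to check the column types right after loading. The sketch below is only an illustration: it reads a small in-memory CSV with hypothetical columns (age, sex, days) instead of the real files, and uses select_dtypes as one possible way to keep just the numeric part.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file on disk
csv_text = "age,sex,days\n34,F,5\n51,M,12\n47,F,8\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred column types
print(toy.dtypes)

# Keep only the numeric columns (drops the text column "sex")
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```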
Data normalization
(1) Min-max normalization
newdata=(data1 - data1.min())/(data1.max() - data1.min())
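A small worked example of the min-max formula may make the result easier to picture. The toy frame and the handling of constant columns are assumptions added for illustration: a column whose maximum equals its minimum has a range of zero, which would otherwise produce NaN.

```python
import pandas as pd

# Hypothetical toy frame standing in for data1; column "c" is constant
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 40.0],
    "c": [7.0, 7.0, 7.0],
})

col_min = toy.min()
col_range = toy.max() - col_min

# Assumption: treat a zero range as 1 so constant columns become 0, not NaN
safe_range = col_range.where(col_range != 0, 1.0)

scaled = (toy - col_min) / safe_range
print(scaled)
```

Every non-constant column now runs from 0 to 1; whether constant columns should be mapped to 0 or dropped is a design choice that depends on the later model.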
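(2) Mean and standard deviation normalization (z-score)
import numpy as np
# data1 is used here because it holds only numeric columns
arr_mean = np.mean(data1, axis=0)  # per-column mean
arr_std = np.std(data1, axis=0, ddof=1)  # per-column sample standard deviation
newdata = (data1 - arr_mean) / arr_std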
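As a sanity check, each standardized column should end up with mean 0 and sample standard deviation 1. The sketch below verifies this on a hypothetical toy frame using the pandas methods directly. Note that pandas' std defaults to ddof=1 while np.std defaults to ddof=0, so the ddof argument matters when the two are mixed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a purely numeric data frame
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 40.0, 50.0],
})

# Column-wise z-score with the sample standard deviation (ddof=1)
z = (toy - toy.mean()) / toy.std(ddof=1)

col_means = z.mean()
col_stds = z.std(ddof=1)
print(col_means.round(12).tolist(), col_stds.round(12).tolist())
```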
Merging data frames
(1) Concatenating data frames directly
data = pd.concat([data1, data2, data3])  # stacks the frames row-wise
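One detail worth knowing about row-wise concatenation is that each frame keeps its own index labels, so labels can repeat in the result. A minimal sketch with two hypothetical toy frames; ignore_index=True is the pd.concat option that renumbers the rows.

```python
import pandas as pd

# Hypothetical toy frames with identical columns
d1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
d2 = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

stacked = pd.concat([d1, d2])                        # index: 0, 1, 0, 1
renumbered = pd.concat([d1, d2], ignore_index=True)  # index: 0, 1, 2, 3

print(stacked)
print(renumbered)
```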
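(2) Concatenating selected rows and columns of the data frames
data = pd.concat([
    data1.iloc[:, [0, 1]],
    data2.iloc[:, [1, 2]],
    data3.iloc[:, [0, 2]]
])  # select columns by position, then concatenate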
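When the selected column subsets differ between frames, row-wise concatenation produces the union of all columns and fills the gaps with NaN. If the goal is to place the subsets side by side, axis=1 is the relevant option. The sketch below shows both cases on hypothetical toy frames, selecting columns by position with iloc.

```python
import pandas as pd

# Hypothetical toy frames sharing the columns x, y, z
d1 = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
d2 = pd.DataFrame({"x": [7, 8], "y": [9, 10], "z": [11, 12]})

# Row-wise: the result has the union of columns, NaN where a frame lacks one
rows = pd.concat([d1.iloc[:, [0, 1]], d2.iloc[:, [1, 2]]])

# Column-wise: subsets are placed side by side, aligned on the index
cols = pd.concat([d1.iloc[:, [0]], d2.iloc[:, [2]]], axis=1)

print(rows)
print(cols)
```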
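Defining X and Y
# Feature columns: age, sex, disease type, days to diagnosis, agreement between outpatient and admission diagnoses, infection (yes/no), stay longer than 30 days (yes/no), lab test duration, examination duration
X = data[['年龄', '性别', '病种','确诊天数','门诊与入院诊断符合情况','是否感染','是否住院超30天','检验时长','检查时长']]  # select the listed columns (in general x1, x2, x3, ...) as X
Y = data[['住院天数']]  # select the target column (length of stay in days) as Y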
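The double brackets matter here: data[['col']] returns a one-column DataFrame, while data['col'] returns a Series. Some scikit-learn estimators expect a 1-D target, so the Series form may be the more convenient one. A minimal sketch with hypothetical English column names:

```python
import pandas as pd

# Hypothetical toy frame; the column names are placeholders
toy = pd.DataFrame({
    "age": [34, 51, 47],
    "infected": [0, 1, 0],
    "stay_days": [5, 12, 8],
})

X = toy[["age", "infected"]]   # DataFrame with two feature columns
Y_frame = toy[["stay_days"]]   # DataFrame, shape (3, 1)
y_series = toy["stay_days"]    # Series, shape (3,)

print(type(Y_frame).__name__, Y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```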
Splitting X and Y separately into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
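To see what the call returns, the sketch below runs the same split on hypothetical toy arrays. It illustrates three points: test_size=0.2 on 10 samples gives an 8/2 split, a fixed random_state makes the split reproducible, and each row of X stays paired with its entry of y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of X is [2*i, 2*i + 1], and y[i] = i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```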
Splitting the whole dataset into training and test sets
from sklearn.model_selection import train_test_split  # data is the full dataset
train, test = train_test_split(data, test_size = 0.1)
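Splitting the whole frame keeps features and target together, so they can be separated afterwards. The sketch below uses a hypothetical toy frame and column names. The random_state argument is an addition for reproducibility: without it, as in the call above, the split differs from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 20 rows; column names are placeholders
toy = pd.DataFrame({"age": range(20, 40), "stay_days": range(1, 21)})

# Assumption: random_state is fixed here only to make the example repeatable
train, test = train_test_split(toy, test_size=0.1, random_state=0)

# Separate features and target after the split
X_train = train.drop(columns=["stay_days"])
y_train = train["stay_days"]

print(len(train), len(test))
```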