Hepatitis Case Study (Cleaning + Modeling)

Latest recommended article published 2021-05-26 10:38:27

Best_CLW

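Category: Method Summaries

Article link: https://blog.csdn.net/CLW1218/article/details/109294974

This article walks through a hepatitis data analysis project, focusing on the cleaning steps in the preprocessing stage and the subsequent analysis with statistical models. By understanding and processing the data in depth, it identifies risk factors associated with hepatitis and so offers support for clinical decision-making.

Summary generated automatically by CSDN.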

## Import the required libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
# !pip install regex
import regex as re
import matplotlib.pyplot as plt
import seaborn as sns
## Read the CSV files
train_data=pd.read_csv(r'C:\Users\hp\Desktop\tr.csv')
test_data = pd.read_csv(r'C:\Users\hp\Desktop\te.csv')
true_data = pd.read_csv(r'C:\Users\hp\Desktop\true_data.csv',encoding='gbk')

train_data.shape,test_data.shape

# Get the column names
dcols = list(train_data.columns)

# Drop training rows whose target y ('肝炎', hepatitis) is missing
train_data = train_data[train_data['肝炎'].notnull()]
train_data.shape

ntr = len(train_data)

# Placeholder target for the test set (note: this is the string 'nan', not a real missing value)
test_data['肝炎'] = 'nan'
test_data.shape

# Combine the training and test sets
d0 = pd.concat([train_data[dcols],test_data[dcols]]).reset_index(drop=True)
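Storing the training row count (`ntr`) before concatenating presumably lets the combined frame be split back into training and test parts once the shared cleaning is done. A minimal sketch of that round trip, using hypothetical toy frames and names (`train`, `test`, `n_train`) rather than the article's data:

```python
import pandas as pd

# Hypothetical miniature train/test frames (not the article's data)
train = pd.DataFrame({"ID": [1, 2, 3], "x": [0.1, 0.2, 0.3], "y": [1.0, 0.0, None]})
test = pd.DataFrame({"ID": [4, 5], "x": [0.4, 0.5]})

# Drop training rows with a missing target, then remember how many remain
train = train[train["y"].notnull()]
n_train = len(train)

# Give the test set a placeholder target so both frames share the same columns
test = test.assign(y=float("nan"))
cols = list(train.columns)

# Stack the two sets; reset_index makes positions run 0..n-1,
# so slicing by n_train recovers each part
full = pd.concat([train[cols], test[cols]]).reset_index(drop=True)

# ... shared cleaning and encoding would happen here ...

# Split back by position
train_clean = full.iloc[:n_train]
test_clean = full.iloc[n_train:]
```

This only works if no rows are dropped or reordered between the concatenation and the split; otherwise an explicit indicator column is the safer choice.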

d0.drop('ID',axis= 1,inplace=True)

d0['护理来源'].unique()

## Encode sex ('性别'): 'M' -> 0, anything else -> 1
def gender(x):
    if x=='M':
        return 0
    else:
        return 1
d0['性别']=d0['性别'].apply(gender)
## Encode region ('区域'): east, south, north -> 1, 2, 3; anything else -> 4
def district(x):
    if x=='east':
        return 1
    elif x=='south':
        return 2
    elif x=='north':
        return 3
    else:
        return 4
d0['区域']=d0['区域'].apply(district)
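## Encode source of care ('护理来源')
def care(x):
    # The spellings below match the category labels as they appear in the data
    if x=='Governament Hospital':
        return 1
    if x=='Never Counsulted':
        return 2
    # Blank entries (' ') are grouped with private hospitals
    if x=='Private Hospital' or x==' ':
        return 3
    if x=='clinic':
        return 4
    # Any other value (including NaN) falls through and becomes None
d0['护理来源']=d0['护理来源'].apply(care)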
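The three encoders above are lookup tables written as functions. As an alternative sketch (not the author's approach), `Series.map` with a dictionary expresses the same mapping more compactly. The sample values and the name `care_codes` here are hypothetical:

```python
import pandas as pd

# Hypothetical sample values; spellings mirror the labels that care() checks for
s = pd.Series(["Governament Hospital", "clinic", " ",
               "Never Counsulted", "Private Hospital", None])

care_codes = {
    "Governament Hospital": 1,
    "Never Counsulted": 2,
    "Private Hospital": 3,
    " ": 3,  # blank entries grouped with private hospitals, as in care()
    "clinic": 4,
}

# Values absent from the dictionary (including missing ones) become NaN
encoded = s.map(care_codes)
print(encoded.tolist())
```

One difference to keep in mind: `map` leaves unmatched values as NaN, whereas `gender` and `district` send every unmatched value to a default code, so those two would need an extra `fillna` to behave identically.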

d_org = d0.copy()

d0 = d_org.copy()
d0.isnull().sum()

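# Fill missing blood-pressure values
# ('最高血压' = systolic, '最低血压' = diastolic, '高血压' = hypertension flag)
# Each comparison is parenthesized because & binds more tightly than ==
sbp_missing = d0['最高血压'].isnull()
d0.loc[sbp_missing & (d0['高血压']==1), '最高血压'] = 141
d0.loc[sbp_missing & (d0['高血压']==0), '最高血压'] = 119
d0.loc[d0['最低血压'].isnull(), '最低血压'] = 70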

d0.isnull().sum()
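When boolean masks are combined, each comparison needs its own parentheses, because Python's `&` binds more tightly than `==`. The sketch below shows the difference on hypothetical toy data (`sbp` stands in for systolic pressure, `htn` for the hypertension flag), and uses `.loc` so the assignment does not go through chained indexing:

```python
import pandas as pd

# Hypothetical toy data, not the article's
df = pd.DataFrame({"sbp": [130.0, None, None, 150.0], "htn": [0, 1, 0, 1]})

missing = df["sbp"].isnull()

# Without parentheses this is parsed as (missing & df["htn"]) == 0
unparenthesized = missing & df["htn"] == 0
# With parentheses the comparison is evaluated first, as intended
intended = missing & (df["htn"] == 0)

print(unparenthesized.tolist())
print(intended.tolist())

# .loc with a row mask and a column label assigns in place
df.loc[missing & (df["htn"] == 1), "sbp"] = 141
df.loc[missing & (df["htn"] == 0), "sbp"] = 119
```

The unparenthesized mask also selects rows whose value is not missing, so using it for a fill would overwrite valid readings.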

# 'out' marks rows flagged by the checks below
d0['out'] = 0
import numpy as np
# Weight / height / BMI: two or three of the three are missing,
# so the values can be neither checked nor derived
tmp = d0[['体重','身高','体重指数']].isnull().sum(axis=1)
d0.loc[tmp>=2, 'out'] = 1
# Cholesterol (good / bad / total): two or three of the three are missing,
# so the values can be neither checked nor derived
tmp = d0[['好胆固醇','坏胆固醇','总胆固醇']].isnull().sum(axis=1)
d0.loc[tmp>=2, 'out'] = 3
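The flags rest on the identity BMI = weight / (height / 100)², with weight in kg and height in cm: any one of the three values can be recovered from the other two, but not when two or more are missing. A vectorized sketch of both the flagging and the fill, as an alternative to processing rows one at a time. The English column names and toy rows are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: weight in kg, height in cm
df = pd.DataFrame({
    "weight": [70.0, np.nan, 80.0, np.nan],
    "height": [175.0, 160.0, np.nan, np.nan],
    "bmi":    [np.nan, 25.0, 27.68, 22.0],
})

# Number of missing fields per row among the three related columns
n_missing = df[["weight", "height", "bmi"]].isnull().sum(axis=1)

# Rows with two or more gaps cannot be reconstructed, so flag them
df["out"] = 0
df.loc[n_missing >= 2, "out"] = 1

# A single gap can be derived from the other two values
h_m = df["height"] / 100  # height in metres, taken before any filling
df["bmi"] = df["bmi"].fillna(df["weight"] / h_m ** 2)
df["weight"] = df["weight"].fillna(df["bmi"] * h_m ** 2)
df["height"] = df["height"].fillna(np.sqrt(df["weight"] / df["bmi"]) * 100)
```

Rows flagged with two or more gaps simply stay NaN, since every formula involving a missing operand yields NaN.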
# Process rows one at a time to fill gaps and flag anomalies
for i in range(len(d0)):
    # Fill a missing weight, height, or BMI from the other two
    l1 = np.isnan(d0.loc[i,'体重'])
    l2 = np.isnan(d0.loc[i,'身高'])
    l3 = np.isnan(d0.loc[i,'体重指数'])
    if l1:
        d0.loc[i,'体重'] = d0.loc[i,'体重指数']*(d0.loc[i,'身高']/100)**2
    if l2:
        d0.loc[i,'身高'] = np.sqrt(d0.loc[i,'体重']/d0.loc[i,'体重指数'] )*100
    if l3:
        d0.loc[i,'体重指数'] = d0.loc[i,'体重']/(d0.loc[i,'身高']/100)**2
    # Detect rows whose weight, height, and BMI are inconsistent
    if abs(d0.loc[i,'体重']/(d0.loc[i,'身高']/100)**2-d0.loc[i,'体重指数'])>= 0.005 :
        d0.loc[i,'out']