Python - SciPy - The Dreaded Table 1? A Summary of Common Statistical Description and Inference Methods in Clinical Research

I. Overview

In clinical research, Table 1 of most studies presents descriptive statistics for the basic characteristics of the enrolled sample. Depending on the aim of the study, the sample sometimes also needs to be split into groups; in machine learning, for example, it is usually split into a training set and a validation set. Statistical inference is then needed to test whether these baseline characteristics differ between groups. Data can generally be divided into numeric variables, ordered categorical variables, and unordered categorical variables. The tables below summarize the descriptive and inferential methods appropriate for each type of variable.

| Variable | Descriptive format |
|---|---|
| Numeric variable, normally distributed | mean±SD (min-max) |
| Numeric variable, not normally distributed | median (P25-P75) |
| Unordered categorical variable | frequency (percentage) |
| Ordered categorical variable | frequency (percentage) |

| Variable | Inference, 2 groups | Inference, >2 groups |
|---|---|---|
| Numeric variable meeting normality, independence, and homogeneity of variance | t-test | ANOVA |
| Other numeric variables | Mann-Whitney U test | Kruskal-Wallis H test |
| Unordered categorical variable | Chi-square test or Fisher's exact test | Chi-square test |
| Ordered categorical variable | Mann-Whitney U test | Kruskal-Wallis H test |

II. Numeric Variables

Numeric variables should first be tested for normality, typically with the Shapiro-Wilk test; P > 0.05 suggests the data are consistent with a normal distribution.

import scipy.stats

#draw a sample of 500 from a normal distribution with population mean 5 and population SD 1
a = scipy.stats.norm.rvs(loc = 5, scale = 1, size = 500)
statistic, p = scipy.stats.shapiro(a)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 0.9985348582267761
p: 0.9555099606513977
'''

1. Statistical description

For normally distributed variables, the descriptive format is usually mean±SD (min-max). One Python detail: when computing the sample standard deviation, you must pass ddof=1 explicitly, otherwise NumPy computes the population standard deviation.

import scipy.stats
import numpy as np

a = scipy.stats.norm.rvs(loc = 5, scale = 1, size = 500)
mean = np.mean(a)
std = np.std(a, ddof = 1)
maximum = np.max(a)
minimum = np.min(a)
print('a: %.2f±%.2f (%.2f-%.2f)' % (mean, std, minimum, maximum))
'''
a: 4.99±0.93 (1.52-7.62)
'''
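The difference is easy to verify on a small sample: NumPy's default `np.std` divides by n (population SD), while `ddof=1` divides by n-1 (sample SD), matching `statistics.stdev` from the standard library.

```python
import numpy as np
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_sd = np.std(data)             #divides by n,   sqrt(32/8) = 2.0
sample_sd = np.std(data, ddof=1)  #divides by n-1, sqrt(32/7) ≈ 2.14
print(pop_sd, sample_sd, statistics.stdev(data))
```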

For variables that are not normally distributed, the descriptive format is usually median (P25-P75).

import scipy.stats
import numpy as np

t = scipy.stats.t.rvs(df = 10, loc = 10, scale = 1, size = 1000)
statistic, p = scipy.stats.shapiro(t)
print('statistic:', statistic)
print('p:', p)

median = np.percentile(t, 50, method = 'lower')  #'interpolation=' was renamed to 'method=' in NumPy 1.22
p25 = np.percentile(t, 25, method = 'lower')
p75 = np.percentile(t, 75, method = 'lower')
print('t: %.2f (%.2f-%.2f)' % (median, p25, p75))
'''
statistic: 0.9941165447235107
p: 0.0005829539732076228
t: 10.08 (9.30-10.74)
'''

2. Statistical inference

Before inference, test the grouped variables for homogeneity of variance, typically with Levene's test; P > 0.05 suggests the variances are equal.

import scipy.stats

a = scipy.stats.norm.rvs(loc = 14, scale = 1, size = 500)
b = scipy.stats.norm.rvs(loc = 15, scale = 1, size = 500)
c = scipy.stats.norm.rvs(loc = 16, scale = 1, size = 500)
statistic, p = scipy.stats.levene(a, b, c)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 1.7860593248859278
p: 0.16797640180524823
'''

For variables meeting normality, independence, and homogeneity of variance, use the t-test (two groups) or analysis of variance (ANOVA, more than two groups).

import scipy.stats

a = scipy.stats.norm.rvs(loc = 14, scale = 1, size = 500)
b = scipy.stats.norm.rvs(loc = 15, scale = 1, size = 500)
c = scipy.stats.norm.rvs(loc = 16, scale = 1, size = 500)
statistic, p = scipy.stats.ttest_ind(a, b)
print('t-statistic:', statistic)
print('t-p:', p)
statistic,p = scipy.stats.f_oneway(a, b, c)
print('F-statistic:', statistic)
print('F-p:', p)
'''
t-statistic: -14.1443043435967
t-p: 1.562697280872398e-41
F-statistic: 518.4924180778447
F-p: 8.057769374547428e-172
'''

If any of these assumptions is violated, use the Mann-Whitney U test or the Kruskal-Wallis H test instead.

import scipy.stats

#draw three samples from t distributions
t1 = scipy.stats.t.rvs(df = 10, loc = 0, scale = 5, size = 100)
t2 = scipy.stats.t.rvs(df = 10, loc = 5, scale = 4, size = 100)
t3 = scipy.stats.t.rvs(df = 10, loc = 10, scale = 3, size = 100)

statistic, p = scipy.stats.mannwhitneyu(t1, t2)
print('statistic:', statistic)
print('p:', p)

statistic, p = scipy.stats.kruskal(t1, t2, t3)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 2038.0
p: 4.617838670739567e-13
statistic: 138.61406777408638
p: 7.949459951836258e-31
'''

III. Unordered Categorical Variables

1. Statistical description

import scipy.stats

group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 5 + ['B'] * 10 + ['C'] * 10 + ['D'] * 5

obs_frequency = scipy.stats.contingency.crosstab(group, value)
print(obs_frequency)
'''
((array(['group1', 'group2'], dtype='<U6'), 
array(['A', 'B', 'C', 'D'], dtype='<U1')), 
array([[5, 5, 5, 0], 
	[0, 5, 5, 5]]))
'''

The output is equivalent to this table:

|        | A | B | C | D |
|--------|---|---|---|---|
| group1 | 5 | 5 | 5 | 0 |
| group2 | 0 | 5 | 5 | 5 |

To compute percentages and format the output more conveniently, split the data by group first and then describe each group separately.

import scipy.stats

group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 5 + ['B'] * 10 + ['C'] * 10 + ['D'] * 5

group_names = list(set(group))  #note: set() does not preserve order
print(group_names)
levels = ['A', 'B', 'C', 'D']

for group_name in group_names:
	subgroup = [value[i] for i in range(len(group)) if group[i] == group_name]
	print(subgroup)
	obs_frequency = scipy.stats.contingency.crosstab(subgroup, levels = (levels,))
	array = obs_frequency[1]
	fre = array/array.sum()*100
	describe = ['{} ({}%)'.format(round(i, 2), round(j, 2)) for i, j in zip(array, fre)]
	print('%s: %s' % (group_name, describe))
'''
['group2', 'group1']
['B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
group2: ['0 (0.0%)', '5 (33.33%)', '5 (33.33%)', '5 (33.33%)']
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
group1: ['5 (33.33%)', '5 (33.33%)', '5 (33.33%)', '0 (0.0%)']
'''

2. Statistical inference

For inference on an unordered categorical variable across two groups, if the variable has only two levels, choose Fisher's exact test when the sample size is less than 40 or the minimum expected frequency is less than 1.

import scipy.stats

group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 15 + ['B'] * 15

obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)

oddsratio, p = scipy.stats.fisher_exact(obs_frequency)
print('oddsratio:', oddsratio)
print('p:', p)
'''
sample_size: 30
expected frequency:
 [[7.5 7.5]
 [7.5 7.5]]
oddsratio: 4.0
p: 0.1431109780507063
'''

For a two-group, two-level variable with a sample size of at least 40 but a minimum expected frequency of at least 1 and less than 5, use the Pearson chi-square test with continuity (Yates) correction.

import scipy.stats

group = ['group1'] * 10 + ['group2'] * 60
value = ['A'] * 5 + ['B'] * 5 + ['A'] * 5 + ['B'] * 55

obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)

statistic, p, dof, expctd = scipy.stats.chi2_contingency(obs_frequency, correction = True)
print('statistic:', statistic)
print('p:', p)
print('degree of freedom:', dof)
'''
sample_size: 70
expected frequency:
 [[ 1.42857143  8.57142857]
 [ 8.57142857 51.42857143]]
statistic: 8.988194444444444
p: 0.002717293527058299
degree of freedom: 1
'''

In all other cases, including variables with more groups or more levels, the Pearson chi-square test needs no correction.

import scipy.stats

group = ['group1'] * 500 + ['group2'] * 500 + ['group3'] * 500
value = ['A'] * 10 + ['B'] * 490 + ['A'] * 250 + ['B'] * 250 + ['A'] * 490 + ['B'] * 10

obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)

statistic, p, dof, expctd = scipy.stats.chi2_contingency(obs_frequency)
print('statistic:', statistic)
print('p:', p)
print('degree of freedom:', dof)
'''
sample_size: 1500
expected frequency:
 [[250. 250.]
 [250. 250.]
 [250. 250.]]
statistic: 921.6
p: 7.53533802560768e-201
degree of freedom: 2
'''
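The three rules above can be collected into a small helper for 2×2 tables. The function name `choose_chi2_test` and its return labels are illustrative, not part of SciPy; the function only encodes the textbook decision rules described above.

```python
import numpy as np
import scipy.stats

def choose_chi2_test(obs):
    """Pick a test for a 2x2 contingency table:
    n < 40 or any expected < 1 -> Fisher's exact test
    any expected in [1, 5)     -> Yates-corrected chi-square
    otherwise                  -> plain Pearson chi-square
    """
    obs = np.asarray(obs)
    n = obs.sum()
    expected = scipy.stats.contingency.expected_freq(obs)
    if n < 40 or expected.min() < 1:
        _, p = scipy.stats.fisher_exact(obs)
        return 'fisher', p
    correction = expected.min() < 5
    stat, p, dof, _ = scipy.stats.chi2_contingency(obs, correction = correction)
    return ('corrected chi2' if correction else 'chi2'), p

print(choose_chi2_test([[10, 5], [5, 10]]))  #n = 30 < 40 -> Fisher
print(choose_chi2_test([[5, 5], [5, 55]]))   #min expected ~1.43 -> corrected chi2
```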

IV. Ordered Categorical Variables

1. Statistical description

Ordered categorical variables are described the same way as unordered categorical variables: frequency (percentage).
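As a minimal sketch (the variable and its levels are made up for illustration), the same frequency (percentage) description applied to an ordered variable, iterating the levels in their natural order:

```python
from collections import Counter

t_grade = ['grade 1'] * 10 + ['grade 2'] * 30 + ['grade 3'] * 10
counts = Counter(t_grade)
n = len(t_grade)
for level in ['grade 1', 'grade 2', 'grade 3']:  #keep the natural order of the levels
    print('%s: %d (%.1f%%)' % (level, counts[level], counts[level] / n * 100))
```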

2. Statistical inference

When the grouping variable is unordered and the outcome is an ordered categorical variable, the resulting contingency table should not be analyzed with the Pearson chi-square test; use a rank-based nonparametric test instead, such as the Mann-Whitney U test or the Kruskal-Wallis H test.

import scipy.stats

tumor_type = ['large'] * 50 + ['small'] * 50 + ['adeno'] * 50
tumor_grade = ['grade 1'] * 20 + ['grade 2'] * 20 + ['grade 3'] * 10 + ['grade 1'] * 5 + ['grade 2'] * 10 + ['grade 3'] * 35 + ['grade 1'] * 45 + ['grade 2'] * 5 + ['grade 3'] * 0
obs_frequency = scipy.stats.contingency.crosstab(tumor_type, tumor_grade)
print(obs_frequency)
'''
((array(['adeno', 'large', 'small'], dtype='<U5'), 
array(['grade 1', 'grade 2', 'grade 3'], dtype='<U7')), 
array([[45,  5,  0],
       [20, 20, 10],
       [ 5, 10, 35]]))
'''

This constructs an ordered categorical variable with multiple groups and multiple levels:

|       | grade 1 | grade 2 | grade 3 |
|-------|---------|---------|---------|
| adeno | 45      | 5       | 0       |
| large | 20      | 20      | 10      |
| small | 5       | 10      | 35      |

Split the data by tumor type, convert it into the required format, and run the test.

grade_rank = {'grade 1': 1, 'grade 2': 2, 'grade 3': 3}
variable = []
for t_type in list(set(tumor_type)):
	variable.append([grade_rank[tumor_grade[i]] for i in range(len(tumor_type)) if tumor_type[i] == t_type])

statistic, p = scipy.stats.kruskal(*variable)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 75.59127846790895
p: 3.850901770374324e-17
'''

V. Practical Application

In practice, sample information is usually stored in an Excel file; read it with pandas and preprocess it into the format SciPy's functions expect.

import scipy.stats
import numpy as np
import pandas as pd

#build a dataset of 100 patients: grouping variable gender, normally distributed numeric variable age, unordered categorical variable smoke, and ordered categorical variable T_grade
gender = ['M'] * 50 + ['F'] * 50
age = np.append(scipy.stats.norm.rvs(loc = 50, scale = 10, size = 50), scipy.stats.norm.rvs(loc = 40, scale = 10, size = 50))
smoke = [0] * 10 + [1] * 40 + [0] * 40 + [1] * 10
T_grade = ['grade 1'] * 10 + ['grade 2'] * 30 + ['grade 3'] * 10 + ['grade 1'] * 30 + ['grade 2'] * 10 + ['grade 3'] * 10
patient_info = pd.DataFrame({
	'gender': gender, 
	'age': age, 
	'smoke': smoke, 
	'T_grade': T_grade
	})
print(patient_info)
'''
   gender        age  smoke  T_grade
0       M  49.247392      0  grade 1
1       M  52.176272      0  grade 1
2       M  33.843182      0  grade 1
3       M  50.104804      0  grade 1
4       M  51.601580      0  grade 1
..    ...        ...    ...      ...
95      F  25.617178      1  grade 3
96      F  51.324556      1  grade 3
97      F  27.539726      1  grade 3
98      F  31.268796      1  grade 3
99      F  29.931567      1  grade 3

[100 rows x 4 columns]
'''

To split the data by group, the following approach works:

groupby = 'gender'
group = list(set(patient_info[groupby]))
group_num = len(group)
#split the data by group; here data is a list of two DataFrames
data = [patient_info[patient_info[groupby] == group[i]] for i in range(group_num)]
#extract each variable from every DataFrame as a Series; here each result is a list of two Series
age_by_gender = [data[i]['age'] for i in range(len(data))]
smoke_by_gender = [data[i]['smoke'] for i in range(len(data))]
T_grade_rank = {'grade 1': 1, 'grade 2': 2, 'grade 3': 3}
T_grade_by_gender = [data[i]['T_grade'].map(lambda x : T_grade_rank[x]) for i in range(len(data))]

After this preprocessing, calling the test functions is straightforward.

statistic, p = scipy.stats.ttest_ind(*age_by_gender)
print('statistic:', statistic)
print('p:', p)

statistic, p = scipy.stats.kruskal(*T_grade_by_gender)
print('statistic:', statistic)
print('p:', p)
'''
statistic: -2.9976193849488832
p: 0.00344782005912233
statistic: 8.800000000000008
p: 0.0030123054507454694
'''
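Putting the pieces together, the whole numeric-variable row of Table 1 can be automated with a small helper that checks normality and variance homogeneity and then picks the description format and test. The function `numeric_row` is a sketch of one possible design, not a fixed standard; the thresholds and formats are the conventions described in the sections above.

```python
import numpy as np
import scipy.stats

def numeric_row(groups, alpha = 0.05):
    """Describe and compare a numeric variable split into groups.

    Uses mean±SD and a t-test/ANOVA when every group passes the
    Shapiro-Wilk test and Levene's test; otherwise median (P25-P75)
    and a rank-based test.
    """
    normal = all(scipy.stats.shapiro(g)[1] > alpha for g in groups)
    equal_var = scipy.stats.levene(*groups)[1] > alpha
    if normal and equal_var:
        desc = ['%.2f±%.2f' % (np.mean(g), np.std(g, ddof = 1)) for g in groups]
        test = scipy.stats.ttest_ind if len(groups) == 2 else scipy.stats.f_oneway
    else:
        desc = ['%.2f (%.2f-%.2f)' % tuple(np.percentile(g, [50, 25, 75])) for g in groups]
        test = scipy.stats.mannwhitneyu if len(groups) == 2 else scipy.stats.kruskal
    p = test(*groups)[1]
    return desc, p

#fixed seeds so the example is reproducible
a = scipy.stats.norm.rvs(loc = 50, scale = 10, size = 50, random_state = 0)
b = scipy.stats.norm.rvs(loc = 40, scale = 10, size = 50, random_state = 1)
desc, p = numeric_row([a, b])
print(desc, 'p = %.4f' % p)
```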