I. Overview
In clinical research, Table 1 of most studies presents descriptive statistics for the basic characteristics of the enrolled sample. Depending on the study design, the sample is often split into several groups; in machine learning, for example, into a training set and a validation set. Statistical inference is then needed to test whether these baseline characteristics differ between groups. Variables generally fall into three types: numeric, ordinal categorical, and nominal categorical. The tables below summarize the descriptive and inferential methods appropriate for each type.
| Variable type | Descriptive statistic format |
|---|---|
| Numeric, normally distributed | mean±SD (min-max) |
| Numeric, not normally distributed | median (Q1-Q3) |
| Nominal categorical | count (percentage) |
| Ordinal categorical | count (percentage) |
| Variable type | Inference: 2 groups | Inference: >2 groups |
|---|---|---|
| Numeric, meeting normality, independence, and homogeneity of variance | t-test | ANOVA |
| Other numeric | Mann-Whitney U test | Kruskal-Wallis H test |
| Nominal categorical | Chi-square test or Fisher's exact test | Chi-square test |
| Ordinal categorical | Mann-Whitney U test | Kruskal-Wallis H test |
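The decision logic in the table above for numeric variables can be sketched as a small helper. This is a minimal sketch, assuming independent samples; `choose_numeric_test` is a hypothetical name for illustration (SciPy only provides the individual tests, not this dispatcher):

```python
import scipy.stats

def choose_numeric_test(*groups, alpha=0.05):
    # check normality per group (Shapiro-Wilk) and equal variances (Levene)
    normal = all(scipy.stats.shapiro(g).pvalue > alpha for g in groups)
    equal_var = scipy.stats.levene(*groups).pvalue > alpha
    if normal and equal_var:
        # parametric branch: t-test for 2 groups, one-way ANOVA for more
        if len(groups) == 2:
            return 't-test', scipy.stats.ttest_ind(*groups)
        return 'ANOVA', scipy.stats.f_oneway(*groups)
    # nonparametric branch: rank-based tests
    if len(groups) == 2:
        return 'Mann-Whitney U', scipy.stats.mannwhitneyu(*groups)
    return 'Kruskal-Wallis H', scipy.stats.kruskal(*groups)

a = scipy.stats.norm.rvs(loc = 5, scale = 1, size = 200)
b = scipy.stats.norm.rvs(loc = 5.5, scale = 1, size = 200)
name, result = choose_numeric_test(a, b)
print(name, result)
```

Since the samples are random, Shapiro-Wilk may occasionally reject normality, so the nonparametric branch can also be taken on some runs; for a real Table 1 the chosen test should be recorded per variable.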
II. Numeric Variables
The first step for a numeric variable is a normality test, usually the Shapiro-Wilk test; P > 0.05 means normality is not rejected.
import scipy.stats
#draw 500 samples from a normal distribution with population mean 5 and population SD 1
a = scipy.stats.norm.rvs(loc = 5, scale = 1, size = 500)
statistic, p = scipy.stats.shapiro(a)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 0.9985348582267761
p: 0.9555099606513977
'''
1. Descriptive statistics
For normally distributed variables, the usual format is mean±SD (min-max). One Python detail: when computing the sample standard deviation, pass ddof=1 explicitly, otherwise np.std returns the population standard deviation.
import scipy.stats
import numpy as np
a = scipy.stats.norm.rvs(loc = 5, scale = 1, size = 500)
mean = np.mean(a)
std = np.std(a, ddof = 1)
maximum = np.max(a)
minimum = np.min(a)
print('a: %.2f±%.2f (%.2f-%.2f)' % (mean, std, minimum, maximum))
'''
a: 4.99±0.93 (1.52-7.62)
'''
For variables that are not normally distributed, the usual format is median (Q1-Q3).
import scipy.stats
import numpy as np
t = scipy.stats.t.rvs(df = 10, loc = 10, scale = 1, size = 1000)
statistic, p = scipy.stats.shapiro(t)
print('statistic:', statistic)
print('p:', p)
#NumPy >= 1.22 calls this keyword `method`; older versions call it `interpolation`
median = np.percentile(t, 50, method = 'lower')
p25 = np.percentile(t, 25, method = 'lower')
p75 = np.percentile(t, 75, method = 'lower')
print('t: %.2f (%.2f-%.2f)' % (median, p25, p75))
'''
statistic: 0.9941165447235107
p: 0.0005829539732076228
t: 10.08 (9.30-10.74)
'''
2. Statistical inference
Before inference, test the variable within each group for homogeneity of variance, usually with Levene's test; P > 0.05 means equal variances are not rejected.
import scipy.stats
a = scipy.stats.norm.rvs(loc = 14, scale = 1, size = 500)
b = scipy.stats.norm.rvs(loc = 15, scale = 1, size = 500)
c = scipy.stats.norm.rvs(loc = 16, scale = 1, size = 500)
statistic, p = scipy.stats.levene(a, b, c)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 1.7860593248859278
p: 0.16797640180524823
'''
For variables meeting normality, independence, and homogeneity of variance, use the t-test (two groups) or one-way ANOVA (more than two groups).
import scipy.stats
a = scipy.stats.norm.rvs(loc = 14, scale = 1, size = 500)
b = scipy.stats.norm.rvs(loc = 15, scale = 1, size = 500)
c = scipy.stats.norm.rvs(loc = 16, scale = 1, size = 500)
statistic, p = scipy.stats.ttest_ind(a, b)
print('t-statistic:', statistic)
print('t-p:', p)
statistic,p = scipy.stats.f_oneway(a, b, c)
print('F-statistic:', statistic)
print('F-p:', p)
'''
t-statistic: -14.1443043435967
t-p: 1.562697280872398e-41
F-statistic: 518.4924180778447
F-p: 8.057769374547428e-172
'''
If any of these assumptions fails, use the Mann-Whitney U test (two groups) or the Kruskal-Wallis H test (more than two groups).
import scipy.stats
#draw three t-distributed samples
t1 = scipy.stats.t.rvs(df = 10, loc = 0, scale = 5, size = 100)
t2 = scipy.stats.t.rvs(df = 10, loc = 5, scale = 4, size = 100)
t3 = scipy.stats.t.rvs(df = 10, loc = 10, scale = 3, size = 100)
statistic, p = scipy.stats.mannwhitneyu(t1, t2)
print('statistic:', statistic)
print('p:', p)
statistic, p = scipy.stats.kruskal(t1, t2, t3)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 2038.0
p: 4.617838670739567e-13
statistic: 138.61406777408638
p: 7.949459951836258e-31
'''
III. Nominal Categorical Variables
1. Descriptive statistics
import scipy.stats
group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 5 + ['B'] * 10 + ['C'] * 10 + ['D'] * 5
obs_frequency = scipy.stats.contingency.crosstab(group, value)
print(obs_frequency)
'''
((array(['group1', 'group2'], dtype='<U6'),
array(['A', 'B', 'C', 'D'], dtype='<U1')),
array([[5, 5, 5, 0],
[0, 5, 5, 5]]))
'''
The output is equivalent to
| | A | B | C | D |
|---|---|---|---|---|
| group1 | 5 | 5 | 5 | 0 |
| group2 | 0 | 5 | 5 | 5 |
To compute percentages and format the output more cleanly, first split the data by group, then describe each group separately.
import scipy.stats
group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 5 + ['B'] * 10 + ['C'] * 10 + ['D'] * 5
group_names = list(set(group))  # note: set() does not preserve order
print(group_names)
levels = ['A', 'B', 'C', 'D']
for group_name in group_names:
    subgroup = [value[i] for i in range(len(group)) if group[i] == group_name]
    print(subgroup)
    obs_frequency = scipy.stats.contingency.crosstab(subgroup, levels = (levels,))
    array = obs_frequency[1]
    fre = array/array.sum()*100
    describe = ['{} ({}%)'.format(round(i, 2), round(j, 2)) for i, j in zip(array, fre)]
    print('%s: %s' % (group_name, describe))
'''
['group2', 'group1']
['B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
group2: ['0 (0.0%)', '5 (33.33%)', '5 (33.33%)', '5 (33.33%)']
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
group1: ['5 (33.33%)', '5 (33.33%)', '5 (33.33%)', '0 (0.0%)']
'''
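If pandas is available, an alternative (separate from the SciPy approach above) is `pd.crosstab`, which produces the counts and row-normalized percentages in one call each. A sketch on the same toy data:

```python
import pandas as pd

group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 5 + ['B'] * 10 + ['C'] * 10 + ['D'] * 5

# counts per group and level
counts = pd.crosstab(pd.Series(group, name='group'), pd.Series(value, name='value'))
# normalize='index' divides each row by its row total, giving within-group percentages
percents = pd.crosstab(pd.Series(group, name='group'), pd.Series(value, name='value'),
                       normalize='index') * 100
print(counts)
print(percents.round(2))
```

This yields the same 5/5/5/0 and 0/5/5/5 counts as the loop above, with levels kept as named index/columns rather than positional arrays.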
2. Statistical inference
For two-group inference on a nominal variable with only two levels, when the sample size is less than 40 or the minimum expected frequency is less than 1, use Fisher's exact test.
import scipy.stats
group = ['group1'] * 10 + ['group2'] * 10 + ['group1'] * 5 + ['group2'] * 5
value = ['A'] * 15 + ['B'] * 15
obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)
oddsratio, p = scipy.stats.fisher_exact(obs_frequency)
print('oddsratio:', oddsratio)
print('p:', p)
'''
sample_size: 30
expected frequency:
[[7.5 7.5]
[7.5 7.5]]
oddsratio: 4.0
p: 0.1431109780507063
'''
For a two-group, two-level (2×2) table, if the sample size is greater than 40 but the minimum expected frequency is at least 1 and less than 5, use the Pearson chi-square test with Yates' continuity correction.
import scipy.stats
group = ['group1'] * 10 + ['group2'] * 60
value = ['A'] * 5 + ['B'] * 5 + ['A'] * 5 + ['B'] * 55
obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)
statistic, p, dof, expctd = scipy.stats.chi2_contingency(obs_frequency, correction = True)
print('statistic:', statistic)
print('p:', p)
print('degree of freedom:', dof)
'''
sample_size: 70
expected frequency:
[[ 1.42857143 8.57142857]
[ 8.57142857 51.42857143]]
statistic: 8.988194444444444
p: 0.002717293527058299
degree of freedom: 1
'''
In all other cases, including tables with more than two groups or more than two levels, the Pearson chi-square test needs no correction.
import scipy.stats
group = ['group1'] * 500 + ['group2'] * 500 + ['group3'] * 500
value = ['A'] * 10 + ['B'] * 490 + ['A'] * 250 + ['B'] * 250 + ['A'] * 490 + ['B'] * 10
obs_frequency = scipy.stats.contingency.crosstab(group, value)[1]
exp_frequency = scipy.stats.contingency.expected_freq(obs_frequency)
sample_size = obs_frequency.sum()
print('sample_size:', sample_size)
print('expected frequency:\n', exp_frequency)
statistic, p, dof, expctd = scipy.stats.chi2_contingency(obs_frequency)
print('statistic:', statistic)
print('p:', p)
print('degree of freedom:', dof)
'''
sample_size: 1500
expected frequency:
[[250. 250.]
[250. 250.]
[250. 250.]]
statistic: 921.6
p: 7.53533802560768e-201
degree of freedom: 2
'''
IV. Ordinal Categorical Variables
1. Descriptive statistics
Ordinal categorical variables are described the same way as nominal ones: count (percentage) per level.
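For instance, the grouped count (percentage) format can be applied to an ordinal variable by passing the levels in their natural order. A minimal sketch on hypothetical tumor-grade data for a single group:

```python
import scipy.stats

# hypothetical ordinal data: tumor grades within one group
grades = ['grade 1'] * 20 + ['grade 2'] * 20 + ['grade 3'] * 10
levels = ['grade 1', 'grade 2', 'grade 3']  # keep the natural ordering explicit
counts = scipy.stats.contingency.crosstab(grades, levels = (levels,))[1]
percents = counts / counts.sum() * 100
for level, n, pct in zip(levels, counts, percents):
    print('%s: %d (%.2f%%)' % (level, n, pct))
```

Passing `levels` explicitly both fixes the display order and keeps zero-count levels in the output, which the set-based approach would drop.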
2. Statistical inference
When the grouping variable is nominal and the outcome variable is ordinal, the resulting contingency table should not be analyzed with the Pearson chi-square test; use rank-based nonparametric tests instead, such as the Mann-Whitney U test (two groups) or the Kruskal-Wallis H test (more than two groups).
import scipy.stats
tumor_type = ['large'] * 50 + ['small'] * 50 + ['adeno'] * 50
tumor_grade = ['grade 1'] * 20 + ['grade 2'] * 20 + ['grade 3'] * 10 + ['grade 1'] * 5 + ['grade 2'] * 10 + ['grade 3'] * 35 + ['grade 1'] * 45 + ['grade 2'] * 5 + ['grade 3'] * 0
obs_frequency = scipy.stats.contingency.crosstab(tumor_type, tumor_grade)
print(obs_frequency)
'''
((array(['adeno', 'large', 'small'], dtype='<U5'),
array(['grade 1', 'grade 2', 'grade 3'], dtype='<U7')),
array([[45, 5, 0],
[20, 20, 10],
[ 5, 10, 35]]))
'''
This constructs an ordinal categorical variable with multiple groups and multiple levels:
| | grade 1 | grade 2 | grade 3 |
|---|---|---|---|
| adeno | 45 | 5 | 0 |
| large | 20 | 20 | 10 |
| small | 5 | 10 | 35 |
Split the data by tumor type, convert the grades to ranks in the format the test requires, and run the inference (continuing from the code above):
grade_rank = {'grade 1': 1, 'grade 2': 2, 'grade 3': 3}
variable = []
for t_type in list(set(tumor_type)):
    variable.append([grade_rank[tumor_grade[i]] for i in range(len(tumor_type)) if tumor_type[i] == t_type])
statistic, p = scipy.stats.kruskal(*variable)
print('statistic:', statistic)
print('p:', p)
'''
statistic: 75.59127846790895
p: 3.850901770374324e-17
'''
V. Practical Application
In practice, sample information is usually stored in an Excel file; read it with pandas and preprocess it into the format SciPy's functions expect.
import scipy.stats
import numpy as np
import pandas as pd
#build a dataset of 100 samples: grouping variable gender, normally distributed numeric variable age, nominal variable smoke, ordinal variable T_grade
gender = ['M'] * 50 + ['F'] * 50
age = np.append(scipy.stats.norm.rvs(loc = 50, scale = 10, size = 50), scipy.stats.norm.rvs(loc = 40, scale = 10, size = 50))
smoke = [0] * 10 + [1] * 40 + [0] * 40 + [1] * 10
T_grade = ['grade 1'] * 10 + ['grade 2'] * 30 + ['grade 3'] * 10 + ['grade 1'] * 30 + ['grade 2'] * 10 + ['grade 3'] * 10
patient_info = pd.DataFrame({
'gender': gender,
'age': age,
'smoke': smoke,
'T_grade': T_grade
})
print(patient_info)
'''
gender age smoke T_grade
0 M 49.247392 0 grade 1
1 M 52.176272 0 grade 1
2 M 33.843182 0 grade 1
3 M 50.104804 0 grade 1
4 M 51.601580 0 grade 1
.. ... ... ... ...
95 F 25.617178 1 grade 3
96 F 51.324556 1 grade 3
97 F 27.539726 1 grade 3
98 F 31.268796 1 grade 3
99 F 29.931567 1 grade 3
[100 rows x 4 columns]
'''
The data can be split by group as follows:
groupby = 'gender'
group = list(set(patient_info[groupby]))
group_num = len(group)
#split the data by group; here data is a list of two DataFrames
data = [patient_info[patient_info[groupby] == group[i]] for i in range(group_num)]
#extract each variable from each DataFrame as a Series; here a list of two Series
age_by_gender = [data[i]['age'] for i in range(len(data))]
smoke_by_gender = [data[i]['smoke'] for i in range(len(data))]
T_grade_rank = {'grade 1': 1, 'grade 2': 2, 'grade 3': 3}
T_grade_by_gender = [data[i]['T_grade'].map(lambda x : T_grade_rank[x]) for i in range(len(data))]
After this preprocessing, the test functions can be called directly.
statistic, p = scipy.stats.ttest_ind(*age_by_gender)
print('statistic:', statistic)
print('p:', p)
statistic, p = scipy.stats.kruskal(*T_grade_by_gender)
print('statistic:', statistic)
print('p:', p)
'''
statistic: -2.9976193849488832
p: 0.00344782005912233
statistic: 8.800000000000008
p: 0.0030123054507454694
'''
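The pieces above can be assembled into a Table 1-style summary. This is a minimal sketch; the row formats and overall structure are illustrative choices, not a fixed recipe, and it assumes age meets the t-test conditions as in the construction above:

```python
import numpy as np
import pandas as pd
import scipy.stats

# rebuild the hypothetical cohort from above
gender = ['M'] * 50 + ['F'] * 50
age = np.append(scipy.stats.norm.rvs(loc = 50, scale = 10, size = 50),
                scipy.stats.norm.rvs(loc = 40, scale = 10, size = 50))
smoke = [0] * 10 + [1] * 40 + [0] * 40 + [1] * 10
T_grade = ['grade 1'] * 10 + ['grade 2'] * 30 + ['grade 3'] * 10 + \
          ['grade 1'] * 30 + ['grade 2'] * 10 + ['grade 3'] * 10
df = pd.DataFrame({'gender': gender, 'age': age, 'smoke': smoke, 'T_grade': T_grade})

rows = []
groups = [sub for _, sub in df.groupby('gender')]  # sorted group order: F, M

# numeric variable: mean±SD per group, t-test (assumed to meet its conditions here)
desc = ['%.2f±%.2f' % (g['age'].mean(), g['age'].std(ddof = 1)) for g in groups]
p = scipy.stats.ttest_ind(*[g['age'] for g in groups]).pvalue
rows.append(['age'] + desc + ['%.3f' % p])

# nominal variable: n (%) of smokers, chi-square test
obs = pd.crosstab(df['gender'], df['smoke']).values
# expected frequencies are all >= 5 here, so no Yates correction (per the rule above)
p = scipy.stats.chi2_contingency(obs, correction = False)[1]
desc = ['%d (%.1f%%)' % ((g['smoke'] == 1).sum(), (g['smoke'] == 1).mean() * 100)
        for g in groups]
rows.append(['smoke'] + desc + ['%.3f' % p])

# ordinal variable: counts per grade, Kruskal-Wallis on ranks
rank = {'grade 1': 1, 'grade 2': 2, 'grade 3': 3}
p = scipy.stats.kruskal(*[g['T_grade'].map(rank) for g in groups]).pvalue
desc = ['/'.join(str((g['T_grade'] == lv).sum()) for lv in rank) for g in groups]
rows.append(['T_grade (n per grade)'] + desc + ['%.3f' % p])

group_names = [name for name, _ in df.groupby('gender')]
table1 = pd.DataFrame(rows, columns = ['variable'] + group_names + ['p'])
print(table1)
```

Each row reuses the descriptive format and the test matched to its variable type from the tables in the Overview, so the printed DataFrame is already close to a publishable Table 1.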