Introduction to Data Science in Python Week 4

Article link: https://blog.csdn.net/yaoyao_chen/article/details/105945472

Week 4: Statistical Analysis in Python and Project

Distributions
Hypothesis Testing in Python
- T-test
- P-hacking (Dredging)

Distributions

A distribution describes the set of all values a random variable can take and how likely each value is.

Binomial Distribution

import pandas as pd
import numpy as np
np.random.binomial(1, 0.5)  # first argument: number of trials (n); second argument: probability of success (p), i.e. the chance of getting a 1

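To estimate how often tornadoes occur on two consecutive days:

chance_of_tornado = 0.01  # probability of a tornado on any given day

tornado_events = np.random.binomial(1, chance_of_tornado, 1000000)  # simulate 1,000,000 days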
    
two_days_in_a_row = 0
for j in range(1,len(tornado_events)):
    if tornado_events[j]==1 and tornado_events[j-1]==1:
        two_days_in_a_row+=1

print('{} tornadoes back to back in {} years'.format(two_days_in_a_row, 1000000/365))
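The loop above can also be written without an explicit Python loop. The following is a minimal sketch, not part of the course material: the seeded generator `rng` and the variable `expected` are assumptions added here for reproducibility and as a sanity check. Because the days are independent, each adjacent pair of days is a double hit with probability p², so roughly (n − 1)·p² back-to-back events are expected.

```python
import numpy as np

# Assumption: a seeded Generator is used only so the run is reproducible.
rng = np.random.default_rng(0)

chance_of_tornado = 0.01
n_days = 1_000_000
tornado_events = rng.binomial(1, chance_of_tornado, n_days)

# A day counts when both it and the previous day had a tornado.
two_days_in_a_row = int(np.sum((tornado_events[1:] == 1) & (tornado_events[:-1] == 1)))

# With independent days, each adjacent pair is a double hit with probability p**2.
expected = (n_days - 1) * chance_of_tornado ** 2

print('{} tornadoes back to back in {:.0f} years (expected about {:.0f})'.format(
    two_days_in_a_row, n_days / 365, expected))
```

The simulated count should land close to the analytic expectation, which is a quick way to check that the simulation is wired up correctly.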

Uniform Distribution

np.random.uniform(0, 1)
# np.random.uniform(low,high,size)

Normal (Gaussian) Distribution (the standard normal has mean zero and standard deviation one)

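np.random.normal(0.75)  # loc=0.75: the first positional argument is the mean; scale defaults to 1.0
# np.random.normal(loc, scale, size)
# loc: center (mean) of the distribution
# scale: standard deviation; the larger the scale, the wider and flatter the bell curve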

How to calculate the standard deviation

distribution = np.random.normal(0.75,size=1000)
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))
or
np.std(distribution)
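The two expressions above agree because `np.std` divides by N by default (the population standard deviation). A small sketch of the difference from the sample standard deviation, which divides by N − 1; the seeded generator `rng` and the names `manual_std` and `sample_std` are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility
distribution = rng.normal(0.75, size=1000)

# Population standard deviation: divide by N (what np.std does by default).
manual_std = np.sqrt(np.sum((distribution - np.mean(distribution)) ** 2) / len(distribution))

# Sample standard deviation: divide by N - 1, selected with ddof=1.
sample_std = np.std(distribution, ddof=1)

print(manual_std, np.std(distribution), sample_std)
```

With 1000 samples the two estimates differ only slightly; the gap matters more for small samples.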

Kurtosis

import scipy.stats as stats
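stats.kurtosis(distribution)  # excess kurtosis: negative means flatter than a normal distribution, positive means more peaked
stats.skew(distribution)  # skewness: how asymmetric the distribution is around its mean (0 for a symmetric distribution)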

Chi-Squared Distribution

Right-skewed (positive skewness: most values sit on the left, with a long tail to the right)
One parameter: degrees of freedom

chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)
>>>2.067857561010524
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
>>>1.3091894938388848
# As the degrees of freedom increase, the skewness shrinks and the curve becomes more symmetric
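The sampled skewness values can be compared with the known closed form for a chi-squared distribution with k degrees of freedom, which is sqrt(8/k). A minimal sketch under assumptions: the seeded generator `rng`, the extra case `df=50`, and the `results` dictionary are additions for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility

results = {}
for df in (2, 5, 50):
    sample = rng.chisquare(df, size=100_000)
    # Theoretical skewness of a chi-squared distribution with df degrees of freedom.
    theoretical = np.sqrt(8 / df)
    results[df] = (stats.skew(sample), theoretical)
    print('df={}: sample skew={:.3f}, theoretical={:.3f}'.format(df, *results[df]))
```

The theoretical values for 2 and 5 degrees of freedom are 2.0 and about 1.26, in line with the sampled values shown above.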

Modality

A distribution can have more than one peak; one with two peaks is called bimodal.
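One simple way to get a distribution with two peaks is to mix samples from two normal distributions. This is only an illustrative sketch: the component centers (-2 and 2), the scale of 0.5, and the bin count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility

# Assumption: two normal components centered at -2 and 2 are illustrative choices.
bimodal = np.concatenate([
    rng.normal(-2, 0.5, size=5000),
    rng.normal(2, 0.5, size=5000),
])

counts, edges = np.histogram(bimodal, bins=40)
centers = (edges[:-1] + edges[1:]) / 2

# Report the most populated bin on each side of zero: one peak per component.
left_peak = centers[centers < 0][np.argmax(counts[centers < 0])]
right_peak = centers[centers > 0][np.argmax(counts[centers > 0])]
print(left_peak, right_peak)
```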

Hypothesis Testing in Python

Hypothesis: A statement we can test
- Alternative hypothesis: our idea, e.g. there’s a difference between groups
- Null hypothesis: the opposite of our idea, e.g. there’s no difference between groups
- The goal is to find enough evidence to reject the null hypothesis.

df = pd.read_csv('grades.csv')
df.head()
len(df)
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
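early.mean(numeric_only=True)  # skip the string date columns; recent pandas versions raise an error on them otherwise
late.mean(numeric_only=True)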

Critical value alpha ($\alpha$)
- The threshold for how much chance of a false positive (rejecting a true null hypothesis) you are willing to accept
- typical values in social science are 0.1, 0.05 or 0.01

T-test

from scipy import stats
stats.ttest_ind?
>>>Signature: stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
>>>Docstring:
Calculates the T-test for the means of *two independent* samples of scores.

stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])
>>>Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
# The p-value (about 0.16) is larger than any of the typical alpha values, so we cannot reject the null hypothesis: there is no statistically significant difference between the two sample means.
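Since `grades.csv` is not included here, the decision rule can be tried on synthetic data. The sketch below is an assumption-laden illustration, not the course dataset: the group means, standard deviation, and sample sizes are made up, and `equal_var=False` (Welch's t-test) is one reasonable choice when equal variances are not guaranteed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility
alpha = 0.05

# Hypothetical grades: same_a and same_b share one distribution, shifted has a higher mean.
same_a = rng.normal(70, 10, size=200)
same_b = rng.normal(70, 10, size=200)
shifted = rng.normal(80, 10, size=200)

for label, other in (('same mean', same_b), ('shifted mean', shifted)):
    # equal_var=False selects Welch's t-test, which does not assume equal variances.
    result = stats.ttest_ind(same_a, other, equal_var=False)
    verdict = 'reject' if result.pvalue < alpha else 'cannot reject'
    print('{}: p={:.4f}, {} the null hypothesis'.format(label, result.pvalue, verdict))
```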

P-hacking (Dredging)

Finding spurious correlations rather than results that generalize
Running many tests until one of them turns out to be statistically significant
At a significance level of 0.05, we expect about 1 false positive in every 20 tests, even when there is no real effect
Remedies
- Bonferroni correction (tighten $\alpha$ as the number of tests grows, by dividing it by the number of tests)
- hold-out sets
- investigation pre-registration
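The effect of running many tests, and of the Bonferroni correction, can be seen in a small simulation. This is a sketch under assumptions: every test compares two samples drawn from the same distribution, so any "significant" result is a false positive; the sample size of 50 and the count of 1000 tests are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility
alpha = 0.05
n_tests = 1000

# Both samples come from the same distribution, so the null hypothesis is true every time.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
])

# Uncorrected: roughly alpha * n_tests tests look significant by chance alone.
naive_hits = int(np.sum(p_values < alpha))

# Bonferroni: compare each p-value against alpha divided by the number of tests.
bonferroni_hits = int(np.sum(p_values < alpha / n_tests))

print('uncorrected false positives:', naive_hits)
print('after Bonferroni correction:', bonferroni_hits)
```

The uncorrected count should be near 5% of the tests, while the corrected threshold removes almost all of them, at the cost of making real effects harder to detect.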