Introduction to Data Science in Python Week 4

Distributions

  • A distribution describes the set of all possible values of a random variable and how likely each value is.

Binomial Distribution

import pandas as pd
import numpy as np
np.random.binomial(1, 0.5) # first argument: number of trials; second argument: probability of success (getting a 1) on each trial
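A small sketch of how the first argument behaves, assuming a seeded generator for repeatability (the seed and the variable names are illustrative, not from the course). With more than one trial the call returns the number of successes, so dividing by the number of trials gives a proportion close to the success probability.

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

n_trials = 1000
# Number of successes (1s) observed in 1000 fair coin flips.
heads = rng.binomial(n_trials, 0.5)
proportion = heads / n_trials

print(heads, proportion)
```

The proportion should land near 0.5, and it gets closer as n_trials grows.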

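To estimate the probability of a tornado occurring on two consecutive days:

chance_of_tornado = 0.01 # probability of a tornado on any given day

tornado_events = np.random.binomial(1, chance_of_tornado, 1000000) # 1,000,000 samples, one per simulated day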
    
two_days_in_a_row = 0
for j in range(1,len(tornado_events)):
    if tornado_events[j]==1 and tornado_events[j-1]==1:
        two_days_in_a_row+=1

print('{} tornadoes back to back in {} years'.format(two_days_in_a_row, 1000000/365))
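The loop above can also be written as a comparison between the array and a copy of itself shifted by one day. The sketch below assumes a seeded generator (arbitrary seed) and adds a rough analytic check: since days are independent, each adjacent pair of days is a double hit with probability p * p.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for repeatability
chance_of_tornado = 0.01
n_days = 1_000_000

tornado_events = rng.binomial(1, chance_of_tornado, n_days)

# Day j and day j-1 both have a tornado: compare the array with a shifted copy.
two_days_in_a_row = int(
    np.sum((tornado_events[1:] == 1) & (tornado_events[:-1] == 1))
)

# Expected number of back-to-back pairs over all adjacent day pairs.
expected_pairs = (n_days - 1) * chance_of_tornado ** 2

print(two_days_in_a_row, expected_pairs)
```

The simulated count should be in the neighborhood of the expected value of about 100, and the vectorized version avoids a Python-level loop over a million elements.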

Uniform Distribution

np.random.uniform(0, 1)
# np.random.uniform(low,high,size)
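As a quick check of the low/high/size signature, this sketch (seed and sample size chosen arbitrarily) draws many values and compares the sample mean and variance with the textbook values (low + high) / 2 and (high - low)^2 / 12.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
low, high = 0.0, 1.0
samples = rng.uniform(low, high, size=100_000)

# Every sample lies in the half-open interval [low, high).
print(samples.min(), samples.max())
# Sample mean versus the theoretical mean.
print(samples.mean(), (low + high) / 2)
# Sample variance versus the theoretical variance.
print(samples.var(), (high - low) ** 2 / 12)
```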

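Normal (Gaussian) Distribution (default mean is zero)

np.random.normal(0.75) # loc=0.75 (the mean); scale defaults to 1.0
# np.random.normal(loc, scale, size)
# loc: center (mean) of the distribution
# scale: standard deviation; the larger the scale, the wider and flatter the bell curve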
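Because the first positional argument of np.random.normal is loc, not scale, it can help to pass the parameters by keyword. A minimal sketch, with an arbitrary seed and sample size, that checks which parameter each argument controls:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Keyword arguments make the roles explicit.
wide = rng.normal(loc=0.75, scale=2.0, size=100_000)
print(wide.mean())  # close to loc
print(wide.std())   # close to scale

# A single positional argument sets loc only; scale stays at its default of 1.0.
default_scale = rng.normal(0.75, size=100_000)
print(default_scale.mean(), default_scale.std())
```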

How to calculate the standard deviation

distribution = np.random.normal(0.75,size=1000)
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))
or
np.std(distribution)
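One detail worth knowing: the manual formula above divides by n, which matches np.std with its default ddof=0 (population standard deviation). pandas defaults to ddof=1 (sample standard deviation), so the two libraries give slightly different numbers on the same data. A sketch with an arbitrary seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
distribution = rng.normal(0.75, size=1000)

# Manual population standard deviation (divide by n).
manual_std = np.sqrt(
    np.sum((np.mean(distribution) - distribution) ** 2) / len(distribution)
)
population_std = np.std(distribution)        # ddof=0 by default
sample_std = np.std(distribution, ddof=1)    # divide by n - 1
pandas_std = pd.Series(distribution).std()   # pandas defaults to ddof=1

print(manual_std, population_std, sample_std, pandas_std)
```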

Kurtosis

import scipy.stats as stats
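stats.kurtosis(distribution) # negative: flatter (lighter tails) than a normal distribution; positive: more peaked (heavier tails) than a normal distribution
stats.skew(distribution) # measures how asymmetric the distribution is around its center (0 for a symmetric distribution)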
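stats.kurtosis reports excess kurtosis by default (fisher=True), so a normal sample scores near 0. To make the sign concrete, the sketch below compares three distributions; the choice of uniform and Laplace as the flat and peaked examples, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(4)  # arbitrary seed
size = 100_000

normal_sample = rng.normal(size=size)
uniform_sample = rng.uniform(-1, 1, size=size)  # flat, light tails
laplace_sample = rng.laplace(size=size)         # sharp peak, heavy tails

kurt_normal = stats.kurtosis(normal_sample)    # near 0
kurt_uniform = stats.kurtosis(uniform_sample)  # negative
kurt_laplace = stats.kurtosis(laplace_sample)  # positive
skew_normal = stats.skew(normal_sample)        # near 0, symmetric

print(kurt_normal, kurt_uniform, kurt_laplace, skew_normal)
```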

Chi-Squared Distribution

  • Right-skewed (positive skew: the long tail is on the right)
  • Degrees of freedom (one parameter)
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)
>>>2.067857561010524
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
>>>1.3091894938388848
# As the degrees of freedom increase, the skew shrinks and the curve becomes more symmetric
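The printed skew values are positive and shrink as the degrees of freedom grow. This matches the theoretical skewness of a chi-squared distribution with k degrees of freedom, which is sqrt(8 / k). A sketch comparing sample skew with that formula (the seed, the sample size, and the extra df=50 case are illustrative choices):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(5)  # arbitrary seed

sample_skew = {}
theory_skew = {}
for df in (2, 5, 50):
    sample = rng.chisquare(df, size=100_000)
    sample_skew[df] = stats.skew(sample)
    theory_skew[df] = np.sqrt(8 / df)  # theoretical skewness
    print(df, sample_skew[df], theory_skew[df])
```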

Modality Distribution

A distribution with more than one peak (for example, a bimodal distribution has two).
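One simple way to produce a bimodal sample is to mix draws from two normal distributions with different centers. The sketch below is only an illustration: the centers, window widths, and seed are arbitrary assumptions. It counts points near each peak and in the valley between them.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed

# Mixture of two normals centered at -3 and +3.
sample = np.concatenate([
    rng.normal(loc=-3.0, scale=1.0, size=5000),
    rng.normal(loc=3.0, scale=1.0, size=5000),
])

# Count points in three equal-width windows: each peak and the valley between.
left_peak = int(np.sum((sample > -4) & (sample < -2)))
valley = int(np.sum((sample > -1) & (sample < 1)))
right_peak = int(np.sum((sample > 2) & (sample < 4)))

print(left_peak, valley, right_peak)
```

Both peak windows should hold far more points than the valley window, which is what a histogram of the sample would show as two humps.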

Hypothesis Testing in Python

  • Hypothesis: A statement we can test
    • Alternative hypothesis: our idea, e.g. there’s a difference between groups
    • Null hypothesis: the opposite of our idea, e.g. there’s no difference between groups
    • The goal is to find evidence that lets us reject the null hypothesis.
df = pd.read_csv('grades.csv')
df.head()
len(df)
early = df[df['assignment1_submission'] <= '2015-12-31']
late = df[df['assignment1_submission'] > '2015-12-31']
early.mean(numeric_only=True) # skip non-numeric columns such as the submission timestamps
late.mean(numeric_only=True)
  • Critical Value alpha (α)
    • The threshold as to how much chance you are willing to accept
    • typical values in social science are 0.1, 0.05 or 0.01

T-test

from scipy import stats
stats.ttest_ind?
>>>Signature: stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
>>>Docstring:
Calculates the T-test for the means of *two independent* samples of scores.
stats.ttest_ind(early['assignment1_grade'], late['assignment1_grade'])
>>>Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
# The p-value (about 0.16) is larger than the usual alpha values, so we cannot reject the null hypothesis: there is no statistically significant difference between the two sample means.
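The decision rule (compare the p-value with alpha) can be sketched on synthetic data. The numbers below are NOT from grades.csv; the two groups are generated with clearly different means purely to illustrate a case where the null hypothesis is rejected. The names group_a and group_b and the seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed

# Synthetic grades (not the course data): group means differ by 10 points.
group_a = rng.normal(loc=80, scale=10, size=500)
group_b = rng.normal(loc=70, scale=10, size=500)

alpha = 0.05
result = stats.ttest_ind(group_a, group_b)

# Reject the null hypothesis only when the p-value falls below alpha.
reject_null = result.pvalue < alpha

print(result.statistic, result.pvalue, reject_null)
```

With a gap this large relative to the spread, the p-value is far below 0.05. The real grades data went the other way, which is why the null hypothesis was kept there.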

P-hacking (Dredging)

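  • Spurious correlations rather than generalizable results
  • Doing many tests until you find one that is statistically significant
  • At a significance level of 0.05, we expect about 1 false positive in every 20 tests even when there is no real effect
  • Remedies
    • Bonferroni correction (tighten α as the number of tests grows, typically by dividing α by the number of tests)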
    • hold-out sets
    • investigation pre-registration
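The 1-in-20 figure and the effect of a Bonferroni correction can be checked with a small simulation. This is a sketch under stated assumptions: both groups in every test come from the same normal distribution, so every "significant" result is a false positive. The test count, group size, and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)  # arbitrary seed
n_tests, n_per_group, alpha = 1000, 30, 0.05

# Same distribution for both groups: the null hypothesis is true in every test.
a = rng.normal(size=(n_tests, n_per_group))
b = rng.normal(size=(n_tests, n_per_group))

# One t-test per row.
p_values = stats.ttest_ind(a, b, axis=1).pvalue

# Without correction, roughly alpha * n_tests tests look significant by chance.
false_positives = int(np.sum(p_values < alpha))

# Bonferroni: compare each p-value with alpha divided by the number of tests.
bonferroni_hits = int(np.sum(p_values < alpha / n_tests))

print(false_positives, bonferroni_hits)
```

The uncorrected count should come out near 50 of 1000, while the Bonferroni threshold removes almost all of them. The price is lower power to detect real effects.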