[Python] Data Analysis, Section 1.4: Basic Statistical Testing | Coursera "Applied Data Science with Python"

In this lecture we're going to review some of the basics of statistical testing in python. We're going to talk about hypothesis testing, statistical significance, and using scipy to run student's t-tests.

# We use statistics in a lot of different ways in data science, and in this lecture, I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats

# Now, scipy is an interesting collection of libraries for data science, and you'll use most or perhaps all
# of these libraries at some point. The broader SciPy ecosystem includes numpy and pandas, as well as plotting
# libraries such as matplotlib, while the scipy package itself provides a number of scientific functions,
# including the statistical tests we'll use here

# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to determine whether the null
# hypothesis is true or not. If we find that there is a difference between groups, then we can reject the null
# hypothesis and we accept our alternative.

# Let's see an example of this; we're going to use some grade data
df=pd.read_csv('datasets/grades.csv')
df.head()

# If we take a look at the data frame inside, we see we have six different assignments. Let's look at some
# summary statistics for this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

>>> There are 2315 rows and 13 columns

# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it 
# sometime after that, we'll call them late finishers.

early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()

# So, you have lots of skills now with pandas, how would you go about getting the late_finishers dataframe?
# Why don't you pause the video and give it a try.

# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers
late_finishers=df[~df.index.isin(early_finishers.index)]
late_finishers.head()

# There are lots of other ways to do this. For instance, you could just copy and paste the first projection
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also merge the
# dataframe df with early_finishers - if you do a left merge with indicator=True, you can keep just the rows
# which appear only in the left dataframe, so this would have been a good answer. You also could have written
# a function that determines if someone is early or late, and then called .apply() on the dataframe and added
# a new column to the dataframe. This is a pretty reasonable answer as well.

# As you've seen, the pandas data frame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the data frame, we see that each of the means for the assignments are
# calculated. Let's compare the means for our two populations

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

>>>
74.94728457024304
74.0450648477065
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a
# chance we're willing to accept. This significance level is typically called alpha. For this example, let's
# use a threshold of 0.05 for our alpha, or 5%. Now this is a commonly used number but it's really quite
# arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python, and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations are not related to one another). The results of ttest_ind() are the t-statistic and a p-value.
# It's this latter value, the probability, which is most important to us, as it indicates the chance (between
# 0 and 1) of observing a difference at least this large if the null hypothesis were true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])

>>> Ttest_indResult(statistic=1.3223540853721596, pvalue=0.18618101101713855)
# So here we see that the probability is 0.18, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have enough certainty in our evidence (because it is greater than alpha) to come to a conclusion to
# the contrary. This doesn't mean that we have proven the populations are the same.
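# To make that reject / fail-to-reject decision concrete, here's a small sketch on synthetic data (the
# samples, seed, and alpha below are made up for illustration, not taken from grades.csv)

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Two samples drawn from the same distribution: we expect to fail to reject
a = rng.normal(loc=75, scale=10, size=500)
b = rng.normal(loc=75, scale=10, size=500)
# A third sample with a genuinely different mean: we expect to reject
c = rng.normal(loc=50, scale=10, size=500)

alpha = 0.05

same = ttest_ind(a, b)
shifted = ttest_ind(a, c)

def decision(pvalue, alpha):
    # The p-value only lets us reject or fail to reject the null hypothesis;
    # failing to reject never proves that the populations are the same
    return "reject" if pvalue <= alpha else "fail to reject"
```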

# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))

>>>
Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has
# a p-value around 0.1. This means that if we had chosen an alpha of 0.11, this result would have been
# considered statistically significant. As a researcher, this would suggest to me that there is something
# here worth following up on. For instance, if we had a small number of participants (we don't) or if
# there was something unique about this assignment as it relates to our experiment (whatever it was) then
# there may be followup experiments we could run.

# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.
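# One common remedy for this multiple-testing problem, not covered in the lecture itself, is the Bonferroni
# correction: divide alpha by the number of tests you ran. A minimal sketch, using made-up p-values:

```python
import numpy as np

# Hypothetical p-values from five separate tests (invented for illustration)
pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.60])
alpha = 0.05

# Naive approach: compare each p-value against alpha; three of five pass
naive = pvals <= alpha

# Bonferroni correction: compare against alpha / number of tests, so the
# chance of *any* false positive across all five tests stays near alpha
bonferroni = pvals <= alpha / len(pvals)
```

The correction is deliberately conservative: only the smallest p-value survives it here.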

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()

# Pause this and reflect -- do you understand the list comprehension and how I created this DataFrame? You
# don't have to use a list comprehension to do this, but you should be able to read this and figure out how it
# works as this is a commonly used approach on web forums.
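# For comparison, here's a sketch of the same dataframe built without a list comprehension, by handing numpy
# the full shape up front in a single call

```python
import numpy as np
import pandas as pd

# The lecture's approach: build the frame one 100-number row at a time
df1 = pd.DataFrame([np.random.random(100) for x in range(100)])

# Equivalent result without a comprehension: ask numpy for a 100x100 array
df1_alt = pd.DataFrame(np.random.random((100, 100)))
```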

# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])

# Are these two DataFrames the same? Maybe a better question is, for a given row inside of df1, is it the same
# as the row inside df2?

# Let's take a look. Let's say our critical value is 0.1, or an alpha of 10%. And we're going to compare each
# column in df1 to the same numbered column in df2. And we'll report when the p-value is less than 10%,
# which means that we have sufficient evidence to say that the columns are different.

# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col],df2[col])
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()

>>>
Col 3 is statistically significantly different at alpha=0.1, pval=0.021685096862095007
Col 4 is statistically significantly different at alpha=0.1, pval=0.09134886090874293
Col 8 is statistically significantly different at alpha=0.1, pval=0.06928980679771227
Col 26 is statistically significantly different at alpha=0.1, pval=0.04588502618355945
Col 31 is statistically significantly different at alpha=0.1, pval=0.06989401755010603
Col 33 is statistically significantly different at alpha=0.1, pval=0.056409723796422014
Col 62 is statistically significantly different at alpha=0.1, pval=0.09525112969763211
Col 66 is statistically significantly different at alpha=0.1, pval=0.031173925944036347
Col 87 is statistically significantly different at alpha=0.1, pval=0.04038510918018904
Col 88 is statistically significantly different at alpha=0.1, pval=0.07786581526163536
Col 89 is statistically significantly different at alpha=0.1, pval=0.06029139598486774
Col 91 is statistically significantly different at alpha=0.1, pval=0.01959282545656669
Col 94 is statistically significantly different at alpha=0.1, pval=0.01816106677943939
Total number different was 13, which is 13.0%
# Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a
# lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember
# that all the t-test does is check if two sets are similar given some level of confidence, in our case, 10%.
# The more random comparisons you do, the more will just happen to appear different by chance. In this
# example, we checked 100 columns, so we would expect roughly 10 of them to appear different if our alpha
# was 0.1.
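# We can check that intuition directly by repeating the whole null-vs-null comparison many times; the
# fraction of comparisons flagged as different should hover near alpha. (The seed, sample size, and trial
# count below are arbitrary choices for this sketch.)

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.1
trials = 2000

# Repeatedly compare two samples drawn from the *same* distribution and count
# how often the t-test flags them as different at our chosen alpha
false_positives = 0
for _ in range(trials):
    a = rng.random(100)
    b = rng.random(100)
    if ttest_ind(a, b).pvalue <= alpha:
        false_positives += 1

rate = false_positives / trials  # should land close to alpha = 0.1
```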

# We can test some other alpha values as well
test_columns(0.05)

>>>
Col 3 is statistically significantly different at alpha=0.05, pval=0.021685096862095007
Col 26 is statistically significantly different at alpha=0.05, pval=0.04588502618355945
Col 66 is statistically significantly different at alpha=0.05, pval=0.031173925944036347
Col 87 is statistically significantly different at alpha=0.05, pval=0.04038510918018904
Col 91 is statistically significantly different at alpha=0.05, pval=0.01959282545656669
Col 94 is statistically significantly different at alpha=0.05, pval=0.01816106677943939
Total number different was 6, which is 6.0%

# So, keep this in mind when you are doing statistical tests like the t-test which produce a p-value.
# Understand that this p-value isn't magic; it's a threshold for you when reporting results and trying to
# answer your hypothesis. What's a reasonable threshold? It depends on your question, and you need to engage
# domain experts to better understand what they would consider significant.
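# One alternative way to report a result, mentioned above, is a confidence interval. Here's a sketch of a
# 95% confidence interval for the difference between two means, assuming equal variances and using synthetic
# grade-like samples (the numbers are made up; the real data lives in grades.csv)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=75, scale=10, size=400)
b = rng.normal(loc=74, scale=10, size=400)

# Pooled two-sample confidence interval for the difference in means
diff = a.mean() - b.mean()
n1, n2 = len(a), len(b)
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                    / (n1 + n2 - 2))
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - t_crit * se, diff + t_crit * se)
```

Unlike a bare p-value, the interval shows both the size of the estimated difference and the uncertainty around it.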

# Just for fun, let's recreate that second dataframe using a non-normal distribution; I'll arbitrarily choose
# chi-squared
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()

>>>
Col 0 is statistically significantly different at alpha=0.1, pval=0.0007209737267769657
Col 1 is statistically significantly different at alpha=0.1, pval=0.0005960975114707457
Col 2 is statistically significantly different at alpha=0.1, pval=0.0053502188475608855
Col 3 is statistically significantly different at alpha=0.1, pval=0.059676248690011605
Col 4 is statistically significantly different at alpha=0.1, pval=0.0010973303366079742
Col 5 is statistically significantly different at alpha=0.1, pval=0.00887623318572617
Col 6 is statistically significantly different at alpha=0.1, pval=0.0029215703807162737
Col 7 is statistically significantly different at alpha=0.1, pval=0.00024758981881588493
Col 8 is statistically significantly different at alpha=0.1, pval=0.002160969862822123
Col 9 is statistically significantly different at alpha=0.1, pval=7.89411012927914e-06
Col 10 is statistically significantly different at alpha=0.1, pval=4.138636032916614e-05
Col 11 is statistically significantly different at alpha=0.1, pval=0.0001280646995677526
Col 12 is statistically significantly different at alpha=0.1, pval=0.0010446949374846996
Col 13 is statistically significantly different at alpha=0.1, pval=0.0005138383093177963
Col 14 is statistically significantly different at alpha=0.1, pval=5.5645171178136285e-06
Col 15 is statistically significantly different at alpha=0.1, pval=0.031768897620408264
Col 16 is statistically significantly different at alpha=0.1, pval=0.003996166130532832
Col 17 is statistically significantly different at alpha=0.1, pval=0.000253464856286431
Col 18 is statistically significantly different at alpha=0.1, pval=0.0032391102782203572
Col 19 is statistically significantly different at alpha=0.1, pval=0.00016204139267184144
Col 20 is statistically significantly different at alpha=0.1, pval=0.0027610055681302733
Col 21 is statistically significantly different at alpha=0.1, pval=0.0008889318934250981
Col 22 is statistically significantly different at alpha=0.1, pval=1.6156590711490786e-05
Col 23 is statistically significantly different at alpha=0.1, pval=0.007822489352105772
Col 24 is statistically significantly different at alpha=0.1, pval=0.010397123237153302
Col 25 is statistically significantly different at alpha=0.1, pval=0.015564233506873745
Col 26 is statistically significantly different at alpha=0.1, pval=0.00020695576377467668
Col 27 is statistically significantly different at alpha=0.1, pval=0.00010711145223822568
Col 28 is statistically significantly different at alpha=0.1, pval=2.8293321708732755e-05
Col 29 is statistically significantly different at alpha=0.1, pval=0.0010056732631818
Col 30 is statistically significantly different at alpha=0.1, pval=0.0060612836235019885
Col 31 is statistically significantly different at alpha=0.1, pval=0.006474197545671199
Col 32 is statistically significantly different at alpha=0.1, pval=0.0023930366923458884
Col 33 is statistically significantly different at alpha=0.1, pval=2.3217485381966707e-05
Col 34 is statistically significantly different at alpha=0.1, pval=0.0002012338935093803
Col 35 is statistically significantly different at alpha=0.1, pval=0.011080885607364262
Col 36 is statistically significantly different at alpha=0.1, pval=0.0002652267349298995
Col 37 is statistically significantly different at alpha=0.1, pval=0.02647170479481299
Col 38 is statistically significantly different at alpha=0.1, pval=1.3890074489793989e-06
Col 39 is statistically significantly different at alpha=0.1, pval=0.000543465906010951
Col 40 is statistically significantly different at alpha=0.1, pval=0.000198561781971486
Col 41 is statistically significantly different at alpha=0.1, pval=0.0034150323353820153
Col 42 is statistically significantly different at alpha=0.1, pval=0.00020155469599566947
Col 43 is statistically significantly different at alpha=0.1, pval=8.378915375251076e-05
Col 44 is statistically significantly different at alpha=0.1, pval=0.0014582162312103482
Col 45 is statistically significantly different at alpha=0.1, pval=0.0005074253002306509
Col 46 is statistically significantly different at alpha=0.1, pval=0.001511253172177943
Col 47 is statistically significantly different at alpha=0.1, pval=0.059544122335964275
Col 48 is statistically significantly different at alpha=0.1, pval=0.0006522525361603703
Col 49 is statistically significantly different at alpha=0.1, pval=0.009130703344368347
Col 50 is statistically significantly different at alpha=0.1, pval=0.00842930169427871
Col 51 is statistically significantly different at alpha=0.1, pval=0.0002932528018229352
Col 52 is statistically significantly different at alpha=0.1, pval=0.025447445975655802
Col 53 is statistically significantly different at alpha=0.1, pval=0.02541034893473839
Col 54 is statistically significantly different at alpha=0.1, pval=0.0038375952017761774
Col 55 is statistically significantly different at alpha=0.1, pval=0.0012191470926761264
Col 56 is statistically significantly different at alpha=0.1, pval=2.3400817917288324e-06
Col 57 is statistically significantly different at alpha=0.1, pval=0.0014082978794533988
Col 58 is statistically significantly different at alpha=0.1, pval=0.0025718315605676645
Col 59 is statistically significantly different at alpha=0.1, pval=0.0011833959529114965
Col 60 is statistically significantly different at alpha=0.1, pval=0.0007803387857644339
Col 61 is statistically significantly different at alpha=0.1, pval=1.8110882243741185e-05
Col 62 is statistically significantly different at alpha=0.1, pval=0.0059960689131191795
Col 63 is statistically significantly different at alpha=0.1, pval=0.0020061905593917752
Col 64 is statistically significantly different at alpha=0.1, pval=0.0003956109132514863
Col 65 is statistically significantly different at alpha=0.1, pval=0.005936859985996531
Col 66 is statistically significantly different at alpha=0.1, pval=9.430674662967803e-05
Col 67 is statistically significantly different at alpha=0.1, pval=0.0279358042339124
Col 68 is statistically significantly different at alpha=0.1, pval=3.4587295625687876e-06
Col 69 is statistically significantly different at alpha=0.1, pval=0.000498577140303116
Col 70 is statistically significantly different at alpha=0.1, pval=0.00022759200798044786
Col 71 is statistically significantly different at alpha=0.1, pval=0.0018896140668639868
Col 72 is statistically significantly different at alpha=0.1, pval=0.0003168602572464858
Col 73 is statistically significantly different at alpha=0.1, pval=0.014380320807627574
Col 74 is statistically significantly different at alpha=0.1, pval=1.0527920892873465e-05
Col 75 is statistically significantly different at alpha=0.1, pval=0.0050203063454928405
Col 76 is statistically significantly different at alpha=0.1, pval=0.0038047810950857363
Col 77 is statistically significantly different at alpha=0.1, pval=0.009898022773669569
Col 78 is statistically significantly different at alpha=0.1, pval=0.00011721288507198929
Col 79 is statistically significantly different at alpha=0.1, pval=4.299983438137586e-05
Col 80 is statistically significantly different at alpha=0.1, pval=0.004007844770139461
Col 81 is statistically significantly different at alpha=0.1, pval=5.0869526021992136e-05
Col 82 is statistically significantly different at alpha=0.1, pval=3.570500035090507e-05
Col 83 is statistically significantly different at alpha=0.1, pval=0.0016671823707496275
Col 84 is statistically significantly different at alpha=0.1, pval=0.0005338982101227677
Col 85 is statistically significantly different at alpha=0.1, pval=9.848076066128683e-07
Col 86 is statistically significantly different at alpha=0.1, pval=0.004311267380358322
Col 87 is statistically significantly different at alpha=0.1, pval=2.4910265961500223e-05
Col 88 is statistically significantly different at alpha=0.1, pval=0.00024178605830937384
Col 89 is statistically significantly different at alpha=0.1, pval=1.6926124275479522e-05
Col 90 is statistically significantly different at alpha=0.1, pval=0.0008196961556449748
Col 91 is statistically significantly different at alpha=0.1, pval=0.004242831386729519
Col 92 is statistically significantly different at alpha=0.1, pval=0.010786057811310165
Col 93 is statistically significantly different at alpha=0.1, pval=0.00014979923484925253
Col 94 is statistically significantly different at alpha=0.1, pval=0.00012426046220523812
Col 95 is statistically significantly different at alpha=0.1, pval=0.0009730590247640563
Col 96 is statistically significantly different at alpha=0.1, pval=0.0077915194604716775
Col 97 is statistically significantly different at alpha=0.1, pval=0.002053425935732665
Col 98 is statistically significantly different at alpha=0.1, pval=0.0003384129033471704
Col 99 is statistically significantly different at alpha=0.1, pval=0.0058863075005354286
Total number different was 100, which is 100.0%

# Now we see that all (or nearly all) of the columns test as statistically significantly different at the 10% level.

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical issues which arise from looking for statistical significance. There's much more to learn about hypothesis testing. For instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this should give you a basic idea of where to start when comparing two populations for differences, which is a common task for data scientists.
