Fundamentals of Statistical Inference

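  • Data description: this dataset contains housing price growth rates by district
  • Column name - meaning
  • dis_name - name of the residential community
  • rate - year-over-year growth rate of housing prices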
import os

os.chdir('Q:/data')
os.getcwd()
'Q:\\data'
import pandas as pd

house_price_gr = pd.read_csv('Q:/data/house_price_gr.csv', encoding='gbk')
house_price_gr
     dis_name  rate
  0  东城区甘南小区  0.169747
  1  东城区察慈小区  0.165484
  2  东城区胡家园小区  0.141358
  3  东城区台基厂小区  0.063197
  4  东城区青年湖小区  0.101528
  5  东城区小黄庄小区  0.068467
  6  东城区和平里六区  0.118572
  7  东城区京香福苑小区  0.161386
  8  东城区安贞苑50号院  0.085863
  9  东城区安馨园小区  0.104397
 10  东城区外交部街33号院  0.178980
 11  西城区新文化街小区  0.057328
 12  西城区新融苑小区  0.089179
 13  西城区裕中西里小区  0.067066
 14  西城区国英园小区  0.063661
 15  西城区国家广电总局新302住宅小区  0.074919
 16  崇文区东四块玉小区  0.108691
 17  崇文区金鱼池危改小区  0.171723
 18  崇文区新景家园  0.162617
 19  宣武区法源寺小区  0.222625
 20  宣武区建功南里小区  0.129224
 21  宣武区椿树园小区  0.036800
 22  宣武区恒昌花园  0.098843
 23  宣武区康乐里小区  0.113615
 24  宣武区小马厂电信住宅小区  0.109990
 25  宣武区天桥小区  0.177385
 26  宣武区牛街东里民族团结小区  0.067636
 27  宣武区云河公寓  0.143818
 28  朝阳区团结湖小区  0.106157
 29  朝阳区西坝河东里  0.071425
...  ...  ...
120  通州区天赐良园小区  0.080639
121  通州区翠屏北里西区  0.083920
122  通州区京贸国际公寓  0.060896
123  大兴区义和庄北里小区  0.121176
124  大兴区清源西里小区  0.139761
125  大兴县宏福园小区  0.114332
126  大兴区菊源里小区  0.110707
127  怀柔区南华园一区  0.148249
128  怀柔县龙湖花园小区  0.120356
129  怀柔县梅苑花园小区  0.096335
130  怀柔县南华园四区  0.116468
131  怀柔县馥郁苑小区  0.086261
132  怀柔区迎宾北路12号院  0.113126
133  顺义区华中园别墅  0.112126
134  顺义区裕龙花园  0.112064
135  顺义区双裕小区  0.067941
136  顺义区西辛小区  0.097185
137  顺义区中央电视台影视培训中心  0.104725
138  顺义区裕祥花园  0.178573
139  房山区桥梁厂生活区  0.126083
140  房山区原子能科学研究院生活区  0.142602
141  房山区碧桂园温泉小区  0.029540
142  房山区北京输油公司生活小区  0.159211
143  房山区西厢苑小区  0.135552
144  延庆县川北小区  0.161761
145  密云县沿湖小区  0.121524
146  密云县东菜园小区  0.104666
147  密云县花园小区  0.137225
148  开发区鹿鸣苑  0.073119
149  开发区星岛嘉园  0.048391

150 rows × 2 columns

Parameter Estimation

Run a descriptive statistical analysis

house_price_gr.describe(include='all')
        dis_name  rate
count   150  150.000000
unique  150  NaN
top     东城区甘南小区  NaN
freq    1  NaN
mean    NaN  0.110061
std     NaN  0.041333
min     NaN  0.029540
25%     NaN  0.080027
50%     NaN  0.104908
75%     NaN  0.140066
max     NaN  0.243743

Histogram

%matplotlib inline
import seaborn as sns
from scipy import stats

sns.histplot(house_price_gr.rate, kde=True, stat='density')  # histogram with a KDE curve (distplot is deprecated in recent seaborn)

Q-Q Plot

import statsmodels.api as sm
from matplotlib import pyplot as plt

fig = sm.qqplot(house_price_gr.rate, fit=True, line='45')
plt.show()  # fig.show() only warns under the non-GUI inline backend

[Figure: Q-Q plot of rate against the fitted normal distribution (output_8_1.png)]
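As a rough illustration of what a Q-Q plot compares, the sketch below pairs each standardized sample value with the standard normal quantile at the same rank. It uses only the standard library. The plotting position (i - 0.5)/n and the toy `sample` values are assumptions made for illustration, and they need not match what `sm.qqplot` does internally.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    # standardize the observations and sort them
    observed = sorted((v - m) / s for v in sample)
    # standard normal quantiles at the assumed plotting positions (i - 0.5) / n
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# hypothetical toy values, not taken from house_price_gr
sample = [0.17, 0.06, 0.10, 0.12, 0.09, 0.14, 0.11, 0.08]
points = qq_points(sample)
for theo, obs in points:
    print(f"{theo:7.3f} {obs:7.3f}")
```

If the sample is close to normal, the pairs fall near the 45-degree line, which is what `line='45'` draws for reference.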

Box Plots

house_price_gr.plot(kind='box') # Box Plots
<matplotlib.axes._subplots.AxesSubplot at 0x135079e8>

[Figure: box plot of rate (output_10_1.png)]

Confidence Interval Estimation

se = house_price_gr.rate.std() / len(house_price_gr) ** 0.5
LB = house_price_gr.rate.mean() - 3 * se
UB = house_price_gr.rate.mean() + 3 * se
(LB, UB)
(0.09993649947438818, 0.12018549392945813)
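The interval above is the sample mean plus or minus three standard errors. A minimal sketch, assuming a normal approximation for the sample mean, shows which confidence level that multiplier corresponds to. The three summary numbers are copied from the `describe()` output, so the interval only matches the one above up to rounding.

```python
from math import sqrt
from statistics import NormalDist

# summary statistics as printed by describe() (rounded to six decimals)
mean_rate, std_rate, n = 0.110061, 0.041333, 150

se = std_rate / sqrt(n)  # standard error of the mean
lb, ub = mean_rate - 3 * se, mean_rate + 3 * se

# coverage of +/- 3 standard errors under the normal approximation
coverage = 2 * NormalDist().cdf(3) - 1

print(f"interval = ({lb:.6f}, {ub:.6f})")
print(f"approximate confidence level = {coverage:.4%}")
```

Under that assumption the multiplier 3 gives roughly a 99.7% interval, noticeably wider than a 95% interval.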
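# To get a confidence interval at any confidence level, write a small helper function
def confint(x, alpha=0.05):
    n = len(x)
    xb = x.mean()
    df = n - 1
    # half-width: standard error times the two-sided t critical value
    tmp = (x.std() / n ** 0.5) * stats.t.ppf(1 - alpha / 2, df)
    return {'Mean': xb, 'Degree of Freedom': df, 'LB': xb - tmp, 'UB': xb + tmp}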

confint(house_price_gr.rate, 0.05)
{'Degree of Freedom': 149,
 'LB': 0.10339228338892809,
 'Mean': 0.11006099670192315,
 'UB': 0.11672971001491822}
# Alternatively, use DescrStatsW
d1 = sm.stats.DescrStatsW(house_price_gr.rate)
d1.tconfint_mean(0.05)  # alpha=0.05 gives a 95% confidence interval
(0.10339228338892814, 0.11672971001491828)
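To see how the interval width responds to the confidence level, here is a small sketch that rebuilds the t interval from summary statistics alone. The helper name `t_interval` is hypothetical, and the inputs are the rounded values from `describe()`, so the 95% result agrees with the output above only to about six decimals.

```python
from math import sqrt
from scipy import stats

def t_interval(mean, std, n, alpha=0.05):
    """Two-sided t confidence interval for a mean, from summary statistics."""
    se = std / sqrt(n)                          # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)  # two-sided critical value
    return mean - t_crit * se, mean + t_crit * se

# summary statistics as printed by describe() (rounded to six decimals)
for alpha in (0.10, 0.05, 0.01):
    lb, ub = t_interval(0.110061, 0.041333, 150, alpha)
    print(f"{1 - alpha:.0%} CI: ({lb:.6f}, {ub:.6f})")
```

A smaller alpha asks for more confidence, so the interval gets wider.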

Hypothesis Testing and the One-Sample t-Test

Did the growth rate of housing prices that year exceed the 10% threshold?

print('t-statistic=%6.4f, p-value=%6.4f, df=%s' %d1.ttest_mean(0.1))
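# Exercise: customers with a FICO score above 690 are generally considered creditworthy; test whether this product's customers have an overall score above 690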
t-statistic=2.9812, p-value=0.0034, df=149.0
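The p-value printed above is two-sided, while the question ("did growth exceed 10%?") is one-sided. The sketch below recomputes the statistic from the rounded `describe()` values and reports both p-values. It is an illustration under those rounded inputs, not a replacement for the call above. In statsmodels, passing `alternative='larger'` to `ttest_mean` should give the one-sided version directly.

```python
from math import sqrt
from scipy import stats

mean_rate, std_rate, n = 0.110061, 0.041333, 150  # rounded values from describe()
mu0 = 0.10                                         # threshold under the null hypothesis

t_stat = (mean_rate - mu0) / (std_rate / sqrt(n))
df = n - 1

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # H1: mean != mu0
p_one_sided = stats.t.sf(t_stat, df)           # H1: mean > mu0

print(f"t = {t_stat:.4f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

Because the statistic is positive, the one-sided p-value is half the two-sided one, so the evidence that the mean exceeds 10% is somewhat stronger than the two-sided figure suggests.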

Two-Sample t-Test

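  • Data description: this dataset contains credit card spending data
  • Column name - meaning
  • id - id
  • Acc - whether a card has been opened (1 = opened)
  • avg_exp - average monthly credit card spending (yuan)
  • avg_exp_ln - natural logarithm of average monthly credit card spending
  • gender - gender (male = 1)
  • Age - age
  • Income - annual income (10,000 yuan)
  • Ownrent - owns a home (yes = 1; no = 0)
  • Selfempl - self-employed (1 = yes, 0 = no)
  • dist_home_val - average home price in the residential community (10,000 yuan)
  • dist_avg_income - local per-capita income
  • high_avg - amount by which income exceeds the local average
  • edu_class - education level: primary school or below = 0, secondary school = 1, undergraduate = 2, postgraduate = 3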

Import and clean the data

creditcard_exp = pd.read_csv('creditcard_exp.csv', skipinitialspace=True)
creditcard_exp = creditcard_exp.dropna(how='any')
creditcard_exp.head()

   id  Acc  avg_exp  avg_exp_ln  gender  Age    Income  Ownrent  Selfempl  dist_home_val  dist_avg_income  age2  high_avg  edu_class
0  19    1  1217.03    7.104169       1   40  16.03515        1         1          99.93        15.932789  1600  0.102361          3
1   5    1  1251.50    7.132098       1   32  15.84750        1         0          49.88        15.796316  1024  0.051184          2
3  86    1   856.57    6.752936       1   41  11.47285        1         0          16.10        11.275632  1681  0.197218          3
4  50    1  1321.83    7.186772       1   28  13.40915        1         0         100.39        13.346474   784  0.062676          2
5  67    1   816.03    6.704451       1   41  10.03015        0         1         119.76        10.332263  1681 -0.302113          3

Compare spending by gender

creditcard_exp['avg_exp'].groupby(creditcard_exp['gender']).describe()
gender       
0       count      50.000000
        mean      925.705200
        std       430.833365
        min       163.180000
        25%       593.312500
        50%       813.650000
        75%      1204.777500
        max      1992.390000
1       count      20.000000
        mean     1128.531000
        std       462.281389
        min       648.150000
        25%       829.860000
        50%      1020.005000
        75%      1238.202500
        max      2430.030000
dtype: float64
  • Step 1: test for homogeneity of variance
gender0 = creditcard_exp[creditcard_exp['gender'] == 0]['avg_exp']
gender1 = creditcard_exp[creditcard_exp['gender'] == 1]['avg_exp']
leveneTestRes = stats.levene(gender0, gender1, center='median')
print('w-value=%6.4f, p-value=%6.4f' %leveneTestRes)
w-value=0.0683, p-value=0.7946
  • Step 2: t-test
stats.ttest_ind(gender0, gender1, equal_var=True)  # pooled variance, since Levene's test did not reject equal variances
# Or try: sm.stats.ttest_ind(gender0, gender1, usevar='pooled')
Ttest_indResult(statistic=-1.7429013868086289, pvalue=0.085871228784484485)
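The two steps above can be wrapped into one helper that lets the Levene result decide between the pooled and the Welch t-test. This is a minimal sketch: the function name `two_sample_ttest`, the 0.05 cut-off, and the synthetic groups are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy import stats

def two_sample_ttest(a, b, alpha=0.05):
    """Run Levene's test, then a pooled or Welch t-test depending on its result."""
    _, levene_p = stats.levene(a, b, center='median')
    # keep the pooled test unless the variances clearly differ
    equal_var = bool(levene_p >= alpha)
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {'levene_p': levene_p, 'equal_var': equal_var, 't': t_stat, 'p': p_value}

# synthetic groups with the same mean but very different spreads (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1000, scale=50, size=200)
group_b = rng.normal(loc=1000, scale=400, size=200)

result = two_sample_ttest(group_a, group_b)
print(result)
```

With the credit card data, the Levene p-value of 0.7946 is well above 0.05, so a helper like this would choose `equal_var=True`, the same setting used in step 2.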

Analysis of Variance (ANOVA)

  • One-way ANOVA
import pandas as pd
pd.set_option('display.max_columns', None) # show all columns
creditcard_exp.groupby('edu_class')[['avg_exp']].describe()
                      avg_exp
edu_class
0          count     2.000000
           mean    207.370000
           std      62.494097
           min     163.180000
           25%     185.275000
           50%     207.370000
           75%     229.465000
           max     251.560000
1          count    23.000000
           mean    641.937826
           std     147.577741
           min     418.780000
           25%     525.595000
           50%     593.920000
           75%     736.140000
           max     987.660000
2          count    23.000000
           mean    973.321304
           std     229.163196
           min     610.250000
           25%     807.820000
           50%     959.830000
           75%    1075.270000
           max    1472.820000
3          count    22.000000
           mean   1422.280909
           std     435.281442
           min     816.030000
           25%    1166.997500
           50%    1343.025000
           75%    1661.412500
           max    2430.030000
import numpy as np
edu = []
for i in range(4):
    edu.append(creditcard_exp[creditcard_exp['edu_class'] == i]['avg_exp'])
stats.f_oneway(*edu)
F_onewayResult(statistic=31.825683356937645, pvalue=7.658361691248968e-13)
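The F statistic reported by `f_oneway` can be rebuilt from the per-group counts, means and standard deviations in the `describe()` table above. The sketch below does that by hand, as a check on how the between-group and within-group sums of squares combine. It uses the rounded table values, so it agrees with the reported statistic only up to rounding.

```python
# per-group summary statistics copied from the describe() output
groups = [
    # (n, mean, std)
    (2, 207.370000, 62.494097),
    (23, 641.937826, 147.577741),
    (23, 973.321304, 229.163196),
    (22, 1422.280909, 435.281442),
]

n_total = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / n_total

# between-group and within-group sums of squares
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * s ** 2 for n, _, s in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.4f}")
```

Note that the group with `edu_class` 0 has only two observations, so its variance estimate carries very little weight in the within-group term.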
  • Multi-way ANOVA
from statsmodels.formula.api import ols

sm.stats.anova_lm(ols('avg_exp ~ C(edu_class) + C(gender)',data=creditcard_exp).fit())
                df        sum_sq       mean_sq          F        PR(>F)
C(edu_class)   3.0  8.126056e+06  2.708685e+06  31.578365  1.031496e-12
C(gender)      1.0  4.178273e+04  4.178273e+04   0.487111  4.877082e-01
Residual      65.0  5.575481e+06  8.577662e+04        NaN           NaN
ana = ols('avg_exp ~ C(edu_class) + C(gender) +C(edu_class)*C(gender)', data= creditcard_exp).fit()
sm.stats.anova_lm(ana)
                          df        sum_sq       mean_sq          F        PR(>F)
C(edu_class)             3.0  8.126056e+06  2.708685e+06  33.839350  3.753889e-13
C(gender)                1.0  4.178273e+04  4.178273e+04   0.521988  4.726685e-01
C(edu_class):C(gender)   3.0  5.406989e+05  1.802330e+05   2.251633  9.097723e-02
Residual                63.0  5.042862e+06  8.004544e+04        NaN           NaN
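To read the ANOVA table for the interaction model, it helps to see how the F column follows from the other columns: each term's mean square (sum_sq / df) is divided by the residual mean square. The sketch below redoes that arithmetic with the df and sum_sq values copied from the table. It is only a check on the table's internal logic and does not refit the model.

```python
# (df, sum_sq) for each row of the interaction model's ANOVA table
rows = {
    'C(edu_class)': (3, 8.126056e+06),
    'C(gender)': (1, 4.178273e+04),
    'C(edu_class):C(gender)': (3, 5.406989e+05),
    'Residual': (63, 5.042862e+06),
}

resid_df, resid_ss = rows['Residual']
ms_resid = resid_ss / resid_df  # residual mean square

f_values = {}
for term, (df, ss) in rows.items():
    if term == 'Residual':
        continue
    # F = term mean square / residual mean square
    f_values[term] = (ss / df) / ms_resid
    print(f"{term:<24} F = {f_values[term]:.4f}")
```

Adding the interaction moves some variation out of the residual, which is why the F values for `C(edu_class)` and `C(gender)` differ from those of the additive model even though their sums of squares are unchanged.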

Correlation Analysis

Scatter plot

creditcard_exp.plot(x='Income', y='avg_exp', kind='scatter')

<matplotlib.axes._subplots.AxesSubplot at 0x13af56d8>

[Figure: scatter plot of avg_exp against Income (output_34_1.png)]

Correlation methods: "spearman", "pearson", and "kendall"

creditcard_exp[['Income', 'avg_exp']].corr(method='pearson')
           Income   avg_exp
Income   1.000000  0.674011
avg_exp  0.674011  1.000000
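To make the difference between two of the `method` options concrete, here is a small pure-Python sketch: Pearson measures linear association, while Spearman is the Pearson correlation of the ranks and so only asks whether the relationship is monotonic. The helper functions and the toy data are hypothetical, and the rank helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# hypothetical toy data: y grows monotonically but not linearly with x
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
print(f"pearson = {r_pearson:.4f}, spearman = {r_spearman:.4f}")
```

On data like this, Spearman reaches 1 because the ordering is perfectly preserved, while Pearson stays below 1 because the relationship is curved.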

Chi-Square Test

accepts = pd.read_csv('accepts.csv')
# accepts = accepts.sample(30)  # subsampling disabled: the cross table below is built from the full data
cross_table = pd.crosstab(accepts.bankruptcy_ind, columns=accepts.bad_ind)
# Or try this: accepts.pivot_table(index='bankruptcy_ind',columns='bad_ind', values='application_id', aggfunc='count')
cross_table
bad_ind            0     1
bankruptcy_ind
N               4163  1017
Y                345   103
print('chisq = %6.4f\n p-value = %6.4f\n dof = %i\n expected_freq = %s'  %stats.chi2_contingency(cross_table))
chisq = 2.7098
 p-value = 0.0997
 dof = 1
 expected_freq = [[ 4149.15422886  1030.84577114]
 [  358.84577114    89.15422886]]
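The numbers above can be reproduced by hand from the cross table, which also shows that `chi2_contingency` applies Yates' continuity correction by default for a 2x2 table (`correction=True`). The following is a minimal standard-library sketch. The closed form used for the p-value holds only for one degree of freedom.

```python
from math import erfc, sqrt

observed = [[4163, 1017],  # bankruptcy_ind = N
            [345, 103]]    # bankruptcy_ind = Y

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
total = sum(row_tot)

# expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_tot] for r in row_tot]

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
chi2 = sum((abs(o - e) - 0.5) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# for 1 degree of freedom the chi-square survival function equals erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi2 / 2))
print(f"chisq = {chi2:.4f}, p-value = {p_value:.4f}")
```

Without the 0.5 correction the statistic would be somewhat larger, so it matters which variant is reported when the p-value sits this close to a conventional cut-off.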