第十四周作业

最新推荐文章于 2022-06-16 10:01:48 发布

marsmonkey

最新推荐文章于 2022-06-16 10:01:48 发布

阅读量190

点赞数

本文链接：https://blog.csdn.net/marsmonkey/article/details/80668735

版权

今天老师讲了 pandas ，要求我们用 pandas 、numpy 等工具完成下面两项作业。

老师上课介绍了 Jupyter ，用它来记录 Python 的历史运行记录非常方便，而且还能一键排版，生成 md ，粘上博客。本文就是用 Jupyter 生成的。

%matplotlib inline

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

Anscombe’s quartet

Anscombe’s quartet comprises of four datasets, and is rather famous. Why? You’ll find out in this exercise.

anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()

.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

Part 1

For each of the four datasets…
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)

anascombe.groupby(['dataset'])[['x', 'y']].mean()

.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	x	y
dataset
I	9.0	7.500909
II	9.0	7.500909
III	9.0	7.500000
IV	9.0	7.500909

anascombe.groupby(['dataset'])[['x', 'y']].var()

.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }

	x	y
dataset
I	11.0	4.127269
II	11.0	4.127629
III	11.0	4.122620
IV	11.0	4.123249

anascombe['x'].corr(anascombe['y'])

0.81636624276147

We can see that the correlation coefficient between x and y is $0.81636624276147$

lin_model = smf.ols('y ~ x', anascombe).fit()
lin_model.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.666
Model:	OLS	Adj. R-squared:	0.659
Method:	Least Squares	F-statistic:	83.92
Date:	Tue, 12 Jun 2018	Prob (F-statistic):	1.44e-11
Time:	17:11:06	Log-Likelihood:	-67.358
No. Observations:	44	AIC:	138.7
Df Residuals:	42	BIC:	142.3
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	3.0013	0.521	5.765	0.000	1.951	4.052
x	0.4999	0.055	9.161	0.000	0.390	0.610

Omnibus:	1.513	Durbin-Watson:	2.327
Prob(Omnibus):	0.469	Jarque-Bera (JB):	0.896
Skew:	0.339	Prob(JB):	0.639
Kurtosis:	3.167	Cond. No.	29.1

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see that $\beta_0 = 3.0013, \beta_1 = 0.4999, \varepsilon = 0.521$

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

g = sns.FacetGrid(anascombe, hue='dataset', size=7.5)
g.map(plt.scatter,'x','y').add_legend()

<seaborn.axisgrid.FacetGrid at 0x7f12fbb63128>

这里写图片描述

marsmonkey

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
第十四周作业

今天老师讲了 pandas ，要求我们用 pandas 、numpy 等工具完成下面两项作业。%matplotlib inlineimport randomimport numpy as npimport scipy as spimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsimp...
复制链接

扫一扫