Jupyter exercise answers (for reference only)

Data source and tutorial for this exercise:
https://github.com/schmit/cme193-ipython-notebooks-lecture 

Part 1

For each of the four datasets...

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)
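The answers that follow work on a DataFrame named `anascombe` with the columns `dataset`, `x` and `y`, presumably loaded earlier in the course notebook. A minimal offline sketch of such a frame, assuming the standard Anscombe quartet values typed in by hand (the names `x_123` and `quartet` are illustrative and not from the notebook):

```python
import pandas as pd

# Anscombe's quartet, entered by hand so the sketch runs without a data file.
# Datasets I, II and III share the same x values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long format: one row per observation, labelled with its dataset name.
frames = [
    pd.DataFrame({"dataset": name, "x": xs, "y": ys})
    for name, (xs, ys) in quartet.items()
]
anascombe = pd.concat(frames, ignore_index=True)

print(anascombe.groupby("dataset").size())
```

seaborn also offers the same data through `sns.load_dataset("anscombe")`, which needs network access the first time it is called.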

Compute the mean and variance of both x and y.

print("          mean")
print(anascombe.groupby("dataset").mean())
print("\n        variance")
print(anascombe.groupby("dataset").var())

Output:

          mean
           x         y
dataset               
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

        variance
            x         y
dataset                
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249
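The variances above are sample variances: pandas `var()` divides by n - 1 by default. A small cross-check for dataset I using only the standard library, assuming the standard Anscombe values typed in by hand (`sample_variance` is an illustrative helper, not part of the notebook):

```python
from statistics import mean, variance

# Dataset I of Anscombe's quartet, entered by hand (assumed standard values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def sample_variance(values):
    """Unbiased sample variance: squared deviations divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

mean_x, mean_y = mean(x), mean(y)
var_x, var_y = sample_variance(x), sample_variance(y)

print("mean:     x = {:.6f}, y = {:.6f}".format(mean_x, mean_y))
print("variance: x = {:.6f}, y = {:.6f}".format(var_x, var_y))
```

The hand-written helper should agree with `statistics.variance`, which uses the same n - 1 denominator.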

Compute the correlation coefficient between x and y

anascombe.groupby("dataset").corr()

Output:

                  x         y
dataset                      
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
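The off-diagonal entries are Pearson correlation coefficients, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. A hedged sketch for dataset I in plain Python, assuming the standard Anscombe values typed in by hand (`pearson_r` is an illustrative name):

```python
from math import sqrt

# Dataset I of Anscombe's quartet, entered by hand (assumed standard values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def pearson_r(xs, ys):
    """Pearson correlation from sums of deviation products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(x, y)
print("r = {:.6f}".format(r))
```

The n - 1 factors of the sample covariance and variances cancel in the ratio, so they are left out here.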

Compute the linear regression line: y=β0+β1x+ϵ (hint: use statsmodels and look at the Statsmodels notebook)

import numpy as np
import statsmodels.formula.api as smf

def root_sse(y, yhat):
    # Square root of the sum of squared residuals. It is not divided by n,
    # so it is not the RMSE; replace np.sum with np.mean to get the RMSE.
    return np.sum((y - yhat)**2)**0.5

def ols_by_dataset(anascombe, dataset):
    print("For dataset {}:".format(dataset))
    # Keep only the rows that belong to the requested dataset
    is_dataset = anascombe["dataset"] == dataset
    subset = anascombe[is_dataset].reset_index(drop = True)
#     print(subset)
    lin_model = smf.ols("y ~ x", subset).fit()
    # params is indexed by term name: "Intercept" is β0 and "x" is β1
    print("y = {}x".format(lin_model.params["x"]) + " + {}".format(lin_model.params["Intercept"]))
#     print(lin_model.summary())
    preds = lin_model.predict(subset['x'])
    print('The root of the sum of squared errors is {}\n'.format(root_sse(subset['y'], preds)))

ols_by_dataset(anascombe, 'I')
ols_by_dataset(anascombe, 'II')
ols_by_dataset(anascombe, 'III')
ols_by_dataset(anascombe, 'IV')

Output:

For dataset I:
y = 0.500090909090909x + 3.000090909090909
The root of the sum of squared errors is 3.7098099681789622

For dataset II:
y = 0.5x + 3.0009090909090905
The root of the sum of squared errors is 3.711642616024731

For dataset III:
y = 0.4997272727272728x + 3.002454545454545
The root of the sum of squared errors is 3.708934054169988

For dataset IV:
y = 0.49990909090909075x + 3.0017272727272735
The root of the sum of squared errors is 3.7070864570441304
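For a single predictor the least-squares coefficients have a closed form: β1 = Sxy / Sxx and β0 = mean(y) - β1 * mean(x). The sketch below applies it to dataset I without statsmodels, assuming the standard Anscombe values typed in by hand. It also computes two error measures side by side, the square root of the sum of squared residuals and the RMSE, which differ by a factor of sqrt(n). All names are illustrative.

```python
from math import sqrt

# Dataset I of Anscombe's quartet, entered by hand (assumed standard values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Closed-form simple linear regression
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
slope = sxy / sxx                      # β1
intercept = mean_y - slope * mean_x    # β0

# Residuals of the fitted line
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

root_sse = sqrt(sse)     # square root of the sum of squared errors
rmse = sqrt(sse / n)     # root of the mean squared error

print("y = {:.4f}x + {:.4f}".format(slope, intercept))
print("sqrt(SSE) = {:.4f}, RMSE = {:.4f}".format(root_sse, rmse))
```

With n = 11 points per dataset, the RMSE is the square root of the summed squared errors divided by sqrt(11).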

Part 2

Using Seaborn, visualize all four datasets. 

hint: use sns.FacetGrid combined with plt.scatter

import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(anascombe, col="dataset")
g.map(plt.scatter, "x", "y")
 
 
 
  
 
   


