This post is a running record of problems I ran into while studying.
2018/9/20
q1: How do I flatten a DataFrame with a multi-level (hierarchical) index into a single-level one so that I can process its data?
from pandas_datareader import wb
matches = wb.search("gdp.*capita.*const")
dat = wb.download(indicator="NY.GDP.PCAP.KD",country=["US","CA","MX","CHN"],start=2015,end=2017)
print(dat)
                    NY.GDP.PCAP.KD
country       year
Canada        2017    51315.888975
              2016    50407.341330
              2015    50303.836848
China         2017     7329.089299
              2016     6894.464522
              2015     6496.624013
Mexico        2017     9946.157994
              2016     9871.670109
              2015     9717.898430
United States 2017    53128.539700
              2016    52319.163351
              2015    51933.404806
a1: Solved after learning how hierarchical indexing works.
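A minimal sketch of one common way to flatten such a frame: reset_index() moves every index level into an ordinary column. The frame below is rebuilt by hand from two of the countries in the printed output so that no network call is needed; writing the year labels as strings is an assumption.

```python
import pandas as pd

# Values copied from the printed output above (two countries only).
# Year labels are written as strings here, which is an assumption.
index = pd.MultiIndex.from_product(
    [["Canada", "China"], ["2017", "2016", "2015"]],
    names=["country", "year"],
)
dat = pd.DataFrame(
    {"NY.GDP.PCAP.KD": [51315.888975, 50407.341330, 50303.836848,
                        7329.089299, 6894.464522, 6496.624013]},
    index=index,
)

# reset_index() turns both index levels into regular columns,
# leaving a plain 0..n-1 integer index.
flat = dat.reset_index()
print(flat)
```

The result has three ordinary columns (country, year, NY.GDP.PCAP.KD), which is convenient for filtering or grouping. The unstack approach used in the next entry is the better fit when one column per country is wanted.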
2018/9/22
First, unstack the outer index level (country) to turn the long format into a wide format with one column per country.
dat2 = dat.unstack(0)
print(dat2)
        NY.GDP.PCAP.KD
country         Canada        China       Mexico United States
year
2015      50303.836848  6496.624013  9717.898430  51933.404806
2016      50407.341330  6894.464522  9871.670109  52319.163351
2017      51315.888975  7329.089299  9946.157994  53128.539700
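The wide frame still has two levels on its columns (the indicator code on top, the country below). A small sketch of how that outer level could be removed, using a hand-built frame with two of the countries from the output above; the variable names single and single2 are only illustrative.

```python
import pandas as pd

# Hand-built copy of the wide frame (two countries only).
columns = pd.MultiIndex.from_product(
    [["NY.GDP.PCAP.KD"], ["Canada", "China"]], names=[None, "country"]
)
dat2 = pd.DataFrame(
    [[50303.836848, 6496.624013],
     [50407.341330, 6894.464522],
     [51315.888975, 7329.089299]],
    index=pd.Index(["2015", "2016", "2017"], name="year"),
    columns=columns,
)

# Selecting the outer key returns a frame whose columns are single-level.
single = dat2["NY.GDP.PCAP.KD"]

# Equivalent: drop the outer column level explicitly.
single2 = dat2.droplevel(0, axis=1)

print(single)
```

Either form leaves one plain column per country, so a column can then be reached as single["Canada"].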
Then run the calculations on the data.
# Compute the year-over-year percent change
returns = dat2.pct_change()
print(returns)
        NY.GDP.PCAP.KD
country         Canada     China    Mexico United States
year
2015               NaN       NaN       NaN           NaN
2016          0.002058  0.061238  0.015824      0.007428
2017          0.018024  0.063040  0.007546      0.015470
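As a sanity check on what pct_change() computes, the sketch below reproduces the Canada column by hand as value / previous value - 1. The series is typed in from the output above; the names gdp, manual and auto are illustrative.

```python
import pandas as pd

# Canada's GDP per capita, copied from the wide table above.
gdp = pd.Series(
    [50303.836848, 50407.341330, 51315.888975],
    index=["2015", "2016", "2017"],
    name="Canada",
)

# Manual formula: each value divided by the previous one, minus 1.
manual = gdp / gdp.shift(1) - 1

# Built-in equivalent.
auto = gdp.pct_change()

print(manual.round(6))
```

The first row is NaN in both cases because 2015 has no preceding year to compare against.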
# Compute the correlation coefficients
print(returns.corr())
                             NY.GDP.PCAP.KD
country                              Canada China Mexico United States
               country
NY.GDP.PCAP.KD Canada                   1.0   1.0   -1.0           1.0
               China                    1.0   1.0   -1.0           1.0
               Mexico                  -1.0  -1.0    1.0          -1.0
               United States            1.0   1.0   -1.0           1.0
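Every coefficient in this matrix is exactly 1.0 or -1.0 because only two non-NaN rows remain after pct_change(), and two points always lie on a straight line. A short sketch with three of the columns, typed in from the percent-change output above, illustrates this:

```python
import pandas as pd

# The two usable rows (2016 and 2017) from the pct_change output above.
returns = pd.DataFrame(
    {
        "Canada": [0.002058, 0.018024],
        "China": [0.061238, 0.063040],
        "Mexico": [0.015824, 0.007546],
    },
    index=["2016", "2017"],
)

# With two observations, the Pearson correlation only reflects whether
# the two columns moved in the same direction or in opposite directions.
corr = returns.corr()
print(corr.round(6))
```

Canada and China both rose between the two years, so their coefficient is 1; Mexico fell, so it correlates at -1 with the others. A longer date range would be needed for the correlations to carry real information.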
# Compute the covariance
print(returns.cov())
                             NY.GDP.PCAP.KD
country                              Canada     China    Mexico United States
               country
NY.GDP.PCAP.KD Canada              0.000127  0.000014 -0.000066      0.000064
               China               0.000014  0.000002 -0.000007      0.000007
               Mexico             -0.000066 -0.000007  0.000034     -0.000033
               United States       0.000064  0.000007 -0.000033      0.000032
2018/9/22
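q2: When creating a NumPy array from nested lists of unequal length, how can the missing positions be filled with a placeholder value?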
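import numpy as np
# Ragged nested lists: NumPy >= 1.24 raises ValueError here unless dtype=object is passed
arr10 = np.array([[[1,2,3],[4,5]],[[6,7,8],[9,10,11]]], dtype=object)
print(arr10)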
[[list([1, 2, 3]) list([4, 5])]
[list([6, 7, 8]) list([9, 10, 11])]]
# The intent was to get a three-dimensional array like this:
[[['1' '2' '3']
['4' '5' nan]]
[['6' '7' '8']
['9' '10' '11']]]
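The notes do not record an answer to this question. One possible approach, sketched here as an assumption rather than a definitive solution, is to allocate an array filled with NaN and copy each row into it. NaN requires a float dtype, so the result holds floats rather than the strings shown in the desired output above. The sketch also assumes every outer block contains the same number of rows.

```python
import numpy as np

nested = [[[1, 2, 3], [4, 5]], [[6, 7, 8], [9, 10, 11]]]

# The longest innermost row decides the size of the last axis.
width = max(len(row) for block in nested for row in block)

# Start from an array full of NaN (float dtype), then copy each row in.
# Assumes every block has the same number of rows as the first one.
arr = np.full((len(nested), len(nested[0]), width), np.nan)
for i, block in enumerate(nested):
    for j, row in enumerate(block):
        arr[i, j, :len(row)] = row

print(arr.shape)
print(arr)
```

Positions that had no value in the input stay NaN, and functions such as np.nanmean can then skip them.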
2018/9/25
q3: The code on page 249 of Python for Data Analysis (1st edition) does not reproduce the result shown in the book.
party_counts =pd.crosstab(tips.day,tips.size)
party_counts
col_0 1708
day
Fri 19
Sat 87
Sun 76
Thur 62
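a3: This looks like an attribute-precedence problem: tips.size resolves to the built-in DataFrame.size attribute (the total number of elements, 1708 here) rather than the "size" column. Solved by changing the code to select the column with tips["size"].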
2018/9/27
import pandas as pd
tips = pd.read_csv('tips.csv')
# print(tips[:5])
# print(tips.size)
# party_counts =pd.crosstab(tips.day,tips.size)
party_counts =pd.crosstab(tips["day"],tips["size"])
party_counts2 =pd.crosstab(tips.day,tips["size"])
print(party_counts)
print('===========')
print(party_counts2)
The output now matches the book. Note that tips.day is equivalent to tips['day'].
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
===========
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
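The difference between the attribute and the column can be seen on a tiny frame. The data below is made up for illustration and is not the book's tips.csv.

```python
import pandas as pd

# Hypothetical toy data, not the real tips.csv.
tips = pd.DataFrame(
    {"day": ["Fri", "Sat", "Sat", "Sun"], "size": [2, 2, 3, 4]}
)

# Attribute access finds DataFrame.size first: rows * columns = 4 * 2.
attr = tips.size

# Bracket access always returns the column named "size".
col = tips["size"]

party_counts = pd.crosstab(tips["day"], tips["size"])
print(attr)
print(party_counts)
```

Bracket access is the safer habit for any column whose name collides with a DataFrame attribute or method, such as size, count, or shape.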