1. Pandas教程-10MintoPandas

介绍:这部分程序是10分钟进入pandas教程. 教程在Windows10环境,使用Anaconda,jupyter下测试的。注意版本区别,若版本有更新某些函数被弃用,请以官方手册为准。文章是jupyter编辑器转化成的markdown,可能会有一些格式问题,但是不影响复现

# 导入必要的包
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
print(pd.__version__)
print(np.__version__)
print(matplotlib.__version__)
1.0.1
1.18.1
3.1.3
# 通过列表构造Series(pandas中一维数据)
s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
# 通过列表构造DataFrame(pandas中二维数据)
sales_vol = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["Apple","Strawberry", "Pineapple","Watermelon"])
sales_vol
AppleStrawberryPineappleWatermelon
2013-01-011.5162040.495784-0.1534950.477701
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
2013-01-04-0.631352-0.492635-1.1860380.574631
2013-01-05-1.3432251.0913071.156641-0.228815
2013-01-06-0.5132850.401775-1.1628861.530610
df2 = pd.DataFrame({ 'A' : 1.,
                    'Date' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2
ADateCDEF
01.02013-01-021.03testfoo
11.02013-01-021.03trainfoo
21.02013-01-021.03testfoo
31.02013-01-021.03trainfoo
df2.dtypes
A              float64
Date    datetime64[ns]
C              float32
D                int32
E             category
F               object
dtype: object

查看数据的头部尾部

sales_vol.head() #查看数据前5行. df.head(3) #查看数据前3行
AppleStrawberryPineappleWatermelon
2013-01-011.5162040.495784-0.1534950.477701
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
2013-01-04-0.631352-0.492635-1.1860380.574631
2013-01-05-1.3432251.0913071.156641-0.228815
sales_vol.tail() #查看数据后5行. df.tail(3) #查看数据后3行
AppleStrawberryPineappleWatermelon
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
2013-01-04-0.631352-0.492635-1.1860380.574631
2013-01-05-1.3432251.0913071.156641-0.228815
2013-01-06-0.5132850.401775-1.1628861.530610

显示索引、列和底层的numpy数据:

sales_vol.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
sales_vol.columns
Index(['Apple', 'Strawberry', 'Pineapple', 'Watermelon'], dtype='object')
sales_vol.values
array([[ 1.51620355,  0.49578439, -0.15349507,  0.4777014 ],
       [-0.18143898, -0.57451669, -1.40313854, -0.01822073],
       [-0.37968056, -1.42305498, -1.0345705 , -0.23502055],
       [-0.63135233, -0.49263546, -1.186038  ,  0.57463084],
       [-1.34322502,  1.09130739,  1.15664088, -0.22881497],
       [-0.51328525,  0.40177536, -1.16288585,  1.53061048]])
sales_vol.describe() # describe()函数对于数据的快速统计汇总
AppleStrawberryPineappleWatermelon
count6.0000006.0000006.0000006.000000
mean-0.255463-0.083557-0.6305810.350148
std0.9540130.9114970.9771090.674310
min-1.343225-1.423055-1.403139-0.235021
25%-0.601836-0.554046-1.180250-0.176166
50%-0.446483-0.045430-1.0987280.229740
75%-0.2309990.472282-0.3737640.550398
max1.5162041.0913071.1566411.530610
sales_vol.T # 对数据进行转置
2013-01-012013-01-022013-01-032013-01-042013-01-052013-01-06
Apple1.516204-0.181439-0.379681-0.631352-1.343225-0.513285
Strawberry0.495784-0.574517-1.423055-0.4926351.0913070.401775
Pineapple-0.153495-1.403139-1.034571-1.1860381.156641-1.162886
Watermelon0.477701-0.018221-0.2350210.574631-0.2288151.530610
sales_vol.sort_index(axis=1, ascending=True)
ApplePineappleStrawberryWatermelon
2013-01-011.516204-0.1534950.4957840.477701
2013-01-02-0.181439-1.403139-0.574517-0.018221
2013-01-03-0.379681-1.034571-1.423055-0.235021
2013-01-04-0.631352-1.186038-0.4926350.574631
2013-01-05-1.3432251.1566411.091307-0.228815
2013-01-06-0.513285-1.1628860.4017751.530610
sales_vol.sort_values(by='Watermelon', ascending=True)
AppleStrawberryPineappleWatermelon
2013-01-03-0.379681-1.423055-1.034571-0.235021
2013-01-05-1.3432251.0913071.156641-0.228815
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-011.5162040.495784-0.1534950.477701
2013-01-04-0.631352-0.492635-1.1860380.574631
2013-01-06-0.5132850.401775-1.1628861.530610
sales_vol['Apple'] #等价于sales_vol.Apple
2013-01-01    1.516204
2013-01-02   -0.181439
2013-01-03   -0.379681
2013-01-04   -0.631352
2013-01-05   -1.343225
2013-01-06   -0.513285
Freq: D, Name: Apple, dtype: float64
sales_vol[0:3]
AppleStrawberryPineappleWatermelon
2013-01-011.5162040.495784-0.1534950.477701
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
# sales_vol["2013"] 返回所有2013年数据
# sales_vol["2013-01"] 返回所有2013年1月数据
# 返回所有周六数据/周日数据/
sales_vol['2013-01-02':'2013-01-04'] # 传入一个时间段
AppleStrawberryPineappleWatermelon
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
2013-01-04-0.631352-0.492635-1.1860380.574631
sales_vol.loc[dates[0]] # 
Apple         1.516204
Strawberry    0.495784
Pineapple    -0.153495
Watermelon    0.477701
Name: 2013-01-01 00:00:00, dtype: float64
sales_vol.loc[:,['Apple','Strawberry']]
AppleStrawberry
2013-01-011.5162040.495784
2013-01-02-0.181439-0.574517
2013-01-03-0.379681-1.423055
2013-01-04-0.631352-0.492635
2013-01-05-1.3432251.091307
2013-01-06-0.5132850.401775
sales_vol.loc['20130102':'20130104',['Apple','Strawberry']]
AppleStrawberry
2013-01-02-0.181439-0.574517
2013-01-03-0.379681-1.423055
2013-01-04-0.631352-0.492635
sales_vol.loc['20130102',['Apple','Strawberry']]
Apple        -0.181439
Strawberry   -0.574517
Name: 2013-01-02 00:00:00, dtype: float64
sales_vol.loc[dates[0],'Apple']
1.5162035528655158
sales_vol.at[dates[0],'Apple']
1.5162035528655158
sales_vol.iloc[3] #通过传递数值进行位置选择(选择的是第4行)
# 等价于:sales_vol.loc[dates[3]]
Apple        -0.631352
Strawberry   -0.492635
Pineapple    -1.186038
Watermelon    0.574631
Name: 2013-01-04 00:00:00, dtype: float64
#通过数值进行切片,与numpy/python中的情况类似(含首不含尾)
sales_vol.iloc[3:5,0:2]

# 也可进行位置列表操作
sales_vol.iloc[[1,2,4],[0,2]]
ApplePineapple
2013-01-02-0.181439-1.403139
2013-01-03-0.379681-1.034571
2013-01-05-1.3432251.156641
# 对行进行切片操作
sales_vol.iloc[1:3,:]
AppleStrawberryPineappleWatermelon
2013-01-02-0.181439-0.574517-1.403139-0.018221
2013-01-03-0.379681-1.423055-1.034571-0.235021
# 对列进行切片
sales_vol.iloc[:,1:3]
StrawberryPineapple
2013-01-010.495784-0.153495
2013-01-02-0.574517-1.403139
2013-01-03-1.423055-1.034571
2013-01-04-0.492635-1.186038
2013-01-051.0913071.156641
2013-01-060.401775-1.162886
# 获取特定位置的值
sales_vol.iloc[1,1]
sales_vol.iat[1,1]
-0.574516686350344

布尔索引操作(查)

# boolean操作查找符合条件的数据
sales_vol[sales_vol.Apple > 0]
AppleStrawberryPineappleWatermelon
2013-01-011.5162040.495784-0.1534950.477701
df2 = sales_vol.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
AppleStrawberryPineappleWatermelonE
2013-01-011.5162040.495784-0.1534950.477701one
2013-01-02-0.181439-0.574517-1.403139-0.018221one
2013-01-03-0.379681-1.423055-1.034571-0.235021two
2013-01-04-0.631352-0.492635-1.1860380.574631three
2013-01-05-1.3432251.0913071.156641-0.228815four
2013-01-06-0.5132850.401775-1.1628861.530610three
df2[df2['E'].isin(['one','four'])]
AppleStrawberryPineappleWatermelonE
2013-01-011.5162040.495784-0.1534950.477701one
2013-01-02-0.181439-0.574517-1.403139-0.018221one
2013-01-05-1.3432251.0913071.156641-0.228815four
df2['E'].isin(['one','four'])
2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

插入操作(增)

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
sales_vol['Cherry'] = s1
sales_vol
AppleStrawberryPineappleWatermelonCherry
2013-01-011.5162040.495784-0.1534950.477701NaN
2013-01-02-0.181439-0.574517-1.403139-0.0182211.0
2013-01-03-0.379681-1.423055-1.034571-0.2350212.0
2013-01-04-0.631352-0.492635-1.1860380.5746313.0
2013-01-05-1.3432251.0913071.156641-0.2288154.0
2013-01-06-0.5132850.401775-1.1628861.5306105.0

修改操作(改)

# 通过标签 .at 方法,设置新值
sales_vol.at[dates[0],'Apple'] = 0
# 通过位置 .iat 设置新的值
sales_vol.iat[0,1] = 0
# 通过一个numpy数组设置一组新值 .loc
sales_vol.loc[:,'Pineapple'] = np.array([5] * len(sales_vol))
sales_vol
AppleStrawberryPineappleWatermelonCherry
2013-01-010.0000000.00000050.477701NaN
2013-01-02-0.181439-0.5745175-0.0182211.0
2013-01-03-0.379681-1.4230555-0.2350212.0
2013-01-04-0.631352-0.49263550.5746313.0
2013-01-05-1.3432251.0913075-0.2288154.0
2013-01-06-0.5132850.40177551.5306105.0
df2 = sales_vol.copy()
df2[df2 > 0] = -df2 # 这里神奇的地方在于左右的dataframe大小明显是不一样的,居然很只能的给赋值了
df2
AppleStrawberryPineappleWatermelonCherry
2013-01-010.0000000.000000-5-0.477701NaN
2013-01-02-0.181439-0.574517-5-0.018221-1.0
2013-01-03-0.379681-1.423055-5-0.235021-2.0
2013-01-04-0.631352-0.492635-5-0.574631-3.0
2013-01-05-1.343225-1.091307-5-0.228815-4.0
2013-01-06-0.513285-0.401775-5-1.530610-5.0

4. 缺失值处理

#  reindex()方法可以对指定轴上的索引进行改变/增加/删除操作,这将返回原始数据的一个拷贝
df1 = sales_vol.reindex(index=dates[0:4], columns=list(sales_vol.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
AppleStrawberryPineappleWatermelonCherryE
2013-01-010.0000000.00000050.477701NaN1.0
2013-01-02-0.181439-0.5745175-0.0182211.01.0
2013-01-03-0.379681-1.4230555-0.2350212.0NaN
2013-01-04-0.631352-0.49263550.5746313.0NaN
# 去掉包含缺失值的行-》执行之后返回一个dataframe,但是并没有对df1进行操作
df_drop = df1.dropna(how='any')
df_drop
AppleStrawberryPineappleWatermelonCherryE
2013-01-02-0.181439-0.5745175-0.0182211.01.0
# 对缺失值进行填充
df_fill = df1.fillna(value=5)
df_fill
AppleStrawberryPineappleWatermelonCherryE
2013-01-010.0000000.00000050.4777015.01.0
2013-01-02-0.181439-0.5745175-0.0182211.01.0
2013-01-03-0.379681-1.4230555-0.2350212.05.0
2013-01-04-0.631352-0.49263550.5746313.05.0
# 对数据进行bool填充
pd.isnull(df1)
AppleStrawberryPineappleWatermelonCherryE
2013-01-01FalseFalseFalseFalseTrueFalse
2013-01-02FalseFalseFalseFalseFalseFalse
2013-01-03FalseFalseFalseFalseFalseTrue
2013-01-04FalseFalseFalseFalseFalseTrue

5相关操作

# 执行描述性统计,计算每列的平均值
sales_vol.mean()
Apple        -0.508164
Strawberry   -0.166187
Pineapple     5.000000
Watermelon    0.350148
Cherry        3.000000
dtype: float64
# 计算每一行的平均值
sales_vol.mean(1)
2013-01-01    1.369425
2013-01-02    1.045165
2013-01-03    0.992449
2013-01-04    1.490129
2013-01-05    1.703853
2013-01-06    2.283820
Freq: D, dtype: float64
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
sales_vol.sub(s, axis='index')
AppleStrawberryPineappleWatermelonCherry
2013-01-010.0000000.00000050.477701NaN
2013-01-02-0.181439-0.5745175-0.0182211.0
2013-01-03-0.379681-1.4230555-0.2350212.0
2013-01-04-0.631352-0.49263550.5746313.0
2013-01-05-1.3432251.0913075-0.2288154.0
2013-01-06-0.5132850.40177551.5306105.0
sales_vol
AppleStrawberryPineappleWatermelonCherry
2013-01-010.0000000.00000050.477701NaN
2013-01-02-0.181439-0.5745175-0.0182211.0
2013-01-03-0.379681-1.4230555-0.2350212.0
2013-01-04-0.631352-0.49263550.5746313.0
2013-01-05-1.3432251.0913075-0.2288154.0
2013-01-06-0.5132850.40177551.5306105.0
# 对数据应用函数
sales_vol.apply(np.cumsum)
AppleStrawberryPineappleWatermelonCherry
2013-01-010.0000000.00000050.477701NaN
2013-01-02-0.181439-0.574517100.4594811.0
2013-01-03-0.561120-1.997572150.2244603.0
2013-01-04-1.192472-2.490207200.7990916.0
2013-01-05-2.535697-1.398900250.57027610.0
2013-01-06-3.048982-0.997124302.10088615.0
sales_vol.apply(lambda x: x.max() - x.min())
Apple         1.343225
Strawberry    2.514362
Pineapple     0.000000
Watermelon    1.765631
Cherry        4.000000
dtype: float64
s = pd.Series(np.random.randint(0, 7, size=10))
s
0    4
1    5
2    3
3    2
4    2
5    6
6    1
7    1
8    4
9    6
dtype: int32
s.value_counts()
6    2
4    2
2    2
1    2
5    1
3    1
dtype: int64
# 字符串方法
# Series对象在其str属性中配备了一组字符串处理方法,可以很容易的应用到数组中的每个元素,如下段代码所示。
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# 返回所有元素的小写格式,的一个拷贝
s.str.lower()
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

合并

df = pd.DataFrame(np.random.randn(10, 4))
df
0123
0-0.064357-0.729935-0.2579290.658859
1-0.0879681.5126431.4321600.060939
20.721509-0.417664-0.785848-0.440885
31.355722-1.766013-0.595825-0.507181
40.127415-0.3308550.155734-0.395429
50.110239-2.378898-0.139588-2.090046
6-0.8755290.9801702.0732910.470351
7-1.3340290.8734971.1640640.624527
80.123787-0.077460-0.2763212.126241
9-0.914960-1.583755-0.5289520.816373
# 将dataframe切片,通过concat()方法合并
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0 -0.064357 -0.729935 -0.257929  0.658859
 1 -0.087968  1.512643  1.432160  0.060939
 2  0.721509 -0.417664 -0.785848 -0.440885,
           0         1         2         3
 3  1.355722 -1.766013 -0.595825 -0.507181
 4  0.127415 -0.330855  0.155734 -0.395429
 5  0.110239 -2.378898 -0.139588 -2.090046
 6 -0.875529  0.980170  2.073291  0.470351,
           0         1         2         3
 7 -1.334029  0.873497  1.164064  0.624527
 8  0.123787 -0.077460 -0.276321  2.126241
 9 -0.914960 -1.583755 -0.528952  0.816373]
pd.concat(pieces)
0123
0-0.064357-0.729935-0.2579290.658859
1-0.0879681.5126431.4321600.060939
20.721509-0.417664-0.785848-0.440885
31.355722-1.766013-0.595825-0.507181
40.127415-0.3308550.155734-0.395429
50.110239-2.378898-0.139588-2.090046
6-0.8755290.9801702.0732910.470351
7-1.3340290.8734971.1640640.624527
80.123787-0.077460-0.2763212.126241
9-0.914960-1.583755-0.5289520.816373
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
keylvalrval
0foo14
1foo15
2foo24
3foo25
# Append 将一行连接到一个DataFrame上
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
ABCD
0-1.0227771.015712-0.275568-0.505822
1-0.449308-2.0280960.9767981.341397
20.150794-0.277167-0.324461-0.700945
3-0.043391-1.0905330.8546480.778152
4-1.3910910.847236-0.126084-1.004139
50.0656300.229489-2.0330060.825722
61.3154190.3777260.0923370.290093
70.615620-0.402542-0.047299-1.149747
8-0.043391-1.0905330.8546480.778152

7.分组

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
df
ABCD
0fooone0.9422550.461081
1barone-0.9881181.099893
2footwo0.7737050.513753
3barthree0.8441890.341648
4footwo1.325598-0.200073
5bartwo0.313239-0.007345
6fooone1.4169632.522686
7foothree-0.4874021.928968
# 对'A'列进行分组,然后求组内的和
df.groupby('A').sum()
CD
A
bar0.1693101.434197
foo3.9711195.226417
# 对多个列进行分组,形成层次索引,然后求和
df.groupby(['A','B']).sum()
CD
AB
barone-0.9881181.099893
three0.8441890.341648
two0.313239-0.007345
fooone2.3592182.983768
three-0.4874021.928968
two2.0993030.313681

8 reshape

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
tuples
# 定义一个多重索引
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# 利用多重索引生成dataframe对象
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
AB
firstsecond
barone1.182011-0.351810
two-1.464586-0.377918
bazone-0.6969951.227232
two0.416556-1.173721
# stack()方法在DataFrame列的层次上压缩
stacked = df2.stack()
stacked
first  second   
bar    one     A    1.182011
               B   -0.351810
       two     A   -1.464586
               B   -0.377918
baz    one     A   -0.696995
               B    1.227232
       two     A    0.416556
               B   -1.173721
dtype: float64
# stack()的逆运算是unstack(),默认情况下unStack上一次的情况
stacked.unstack()
AB
firstsecond
barone1.182011-0.351810
two-1.464586-0.377918
bazone-0.6969951.227232
two0.416556-1.173721
stacked.unstack(1)
secondonetwo
first
barA1.182011-1.464586
B-0.351810-0.377918
bazA-0.6969950.416556
B1.227232-1.173721
stacked.unstack(0)
firstbarbaz
second
oneA1.182011-0.696995
B-0.3518101.227232
twoA-1.4645860.416556
B-0.377918-1.173721
# 数据透视表
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : np.random.randn(12),
                    'E' : np.random.randn(12)})
df
ABCDE
0oneAfoo0.7337912.254117
1oneBfoo0.8898670.172840
2twoCfoo-0.673481-1.287785
3threeAbar1.6211920.286708
4oneBbar1.8270561.477342
5oneCbar-1.5862551.145898
6twoAfoo0.932376-1.259903
7threeBfoo-2.2461420.178643
8oneCfoo-1.4838780.055398
9oneAbar-0.134426-0.629503
10twoBbar-1.7849990.498491
11threeCbar0.956670-0.446973
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Cbarfoo
AB
oneA-0.1344260.733791
B1.8270560.889867
C-1.586255-1.483878
threeA1.621192NaN
BNaN-2.246142
C0.956670NaN
twoANaN0.932376
B-1.784999NaN
CNaN-0.673481

9 时间序列

Pandas在对频率转换进行重新采样时拥有简单、强大且高效的功能(如将按秒采样的数据转换为按5分钟为单位进行采样的数据)。这种操作在金融领域非常常见。

rng = pd.date_range('20120101', periods=100, freq='S')
rng
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01    25643
Freq: 5T, dtype: int32
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06    0.617556
2012-03-07   -0.140357
2012-03-08    0.720688
2012-03-09   -0.002317
2012-03-10   -1.039849
Freq: D, dtype: float64
#  时区表示
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00    0.617556
2012-03-07 00:00:00+00:00   -0.140357
2012-03-08 00:00:00+00:00    0.720688
2012-03-09 00:00:00+00:00   -0.002317
2012-03-10 00:00:00+00:00   -1.039849
Freq: D, dtype: float64
# 时区转换
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00    0.617556
2012-03-06 19:00:00-05:00   -0.140357
2012-03-07 19:00:00-05:00    0.720688
2012-03-08 19:00:00-05:00   -0.002317
2012-03-09 19:00:00-05:00   -1.039849
Freq: D, dtype: float64
# 时间跨度转换
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31    0.228717
2012-02-29   -1.027588
2012-03-31    0.908395
2012-04-30    0.625547
2012-05-31   -0.109560
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01    0.228717
2012-02   -1.027588
2012-03    0.908395
2012-04    0.625547
2012-05   -0.109560
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01    0.228717
2012-02-01   -1.027588
2012-03-01    0.908395
2012-04-01    0.625547
2012-05-01   -0.109560
Freq: MS, dtype: float64
ps.to_timestamp()
Period('2012-01', 'M')
# 时期和时间戳之间的转换使得可以使用一些方便的算术函数
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

Categorical

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
# 排序是按照Categorical的顺序进行的而不是按照字典顺序进行
df.sort_values(by="grade")
idraw_gradegrade
56every bad
12bgood
23bgood
01avery good
34avery good
45avery good
#   对Categorical列进行计数/排序时,存在空的类别
df.groupby("grade").size()

画图

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1788d4ef1c8>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Uw25svMu-1586607531658)(1.%20Pandas%E6%95%99%E7%A8%8B-10MintoPandas_files/1.%20Pandas%E6%95%99%E7%A8%8B-10MintoPandas_92_1.svg)]

# 对于DataFrame来说,plot是一种将所有列及其标签进行绘制的简便方法
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); 
df.plot(); 
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1788cbdbe88>




<Figure size 432x288 with 0 Axes>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-o37hZq0z-1586607531659)(1.%20Pandas%E6%95%99%E7%A8%8B-10MintoPandas_files/1.%20Pandas%E6%95%99%E7%A8%8B-10MintoPandas_93_2.svg)]

导入和保存数据

# 读取,写入cxv文件
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
Unnamed: 0ABCD
02000-01-01-1.1252251.577122-0.6311040.226040
12000-01-02-1.6366180.825889-1.7026680.742720
22000-01-03-1.874304-0.924554-0.3964820.757374
32000-01-04-2.804663-1.0152250.5639581.102884
42000-01-05-2.183829-1.1261470.987302-0.351845
..................
9952002-09-22-10.052230-5.444433-49.40896119.674901
9962002-09-23-9.058395-5.363573-47.81530719.377475
9972002-09-24-8.024636-6.063298-47.26281921.506791
9982002-09-25-8.128653-5.967861-46.71000822.240115
9992002-09-26-9.313855-5.906042-47.30299922.040534

1000 rows × 5 columns

# 读取,写入hdf5文件
df.to_hdf('foo.h5','df')
pd.read_hdf('foo.h5','df')
ABCD
2000-01-01-1.1252251.577122-0.6311040.226040
2000-01-02-1.6366180.825889-1.7026680.742720
2000-01-03-1.874304-0.924554-0.3964820.757374
2000-01-04-2.804663-1.0152250.5639581.102884
2000-01-05-2.183829-1.1261470.987302-0.351845
...............
2002-09-22-10.052230-5.444433-49.40896119.674901
2002-09-23-9.058395-5.363573-47.81530719.377475
2002-09-24-8.024636-6.063298-47.26281921.506791
2002-09-25-8.128653-5.967861-46.71000822.240115
2002-09-26-9.313855-5.906042-47.30299922.040534

1000 rows × 4 columns

# 写入,读取Excel文件
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Unnamed: 0ABCD
02000-01-01-1.1252251.577122-0.6311040.226040
12000-01-02-1.6366180.825889-1.7026680.742720
22000-01-03-1.874304-0.924554-0.3964820.757374
32000-01-04-2.804663-1.0152250.5639581.102884
42000-01-05-2.183829-1.1261470.987302-0.351845
..................
9952002-09-22-10.052230-5.444433-49.40896119.674901
9962002-09-23-9.058395-5.363573-47.81530719.377475
9972002-09-24-8.024636-6.063298-47.26281921.506791
9982002-09-25-8.128653-5.967861-46.71000822.240115
9992002-09-26-9.313855-5.906042-47.30299922.040534

1000 rows × 5 columns

  <th>A</th>
  <th>B</th>
  <th>C</th>
  <th>D</th>
</tr>
2000-01-01 -1.125225 1.577122 -0.631104 0.226040 2000-01-02 -1.636618 0.825889 -1.702668 0.742720 2000-01-03 -1.874304 -0.924554 -0.396482 0.757374 2000-01-04 -2.804663 -1.015225 0.563958 1.102884 2000-01-05 -2.183829 -1.126147 0.987302 -0.351845 ... ... ... ... ... 2002-09-22 -10.052230 -5.444433 -49.408961 19.674901 2002-09-23 -9.058395 -5.363573 -47.815307 19.377475 2002-09-24 -8.024636 -6.063298 -47.262819 21.506791 2002-09-25 -8.128653 -5.967861 -46.710008 22.240115 2002-09-26 -9.313855 -5.906042 -47.302999 22.040534

1000 rows × 4 columns

# 写入,读取Excel文件
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Unnamed: 0ABCD
02000-01-01-1.1252251.577122-0.6311040.226040
12000-01-02-1.6366180.825889-1.7026680.742720
22000-01-03-1.874304-0.924554-0.3964820.757374
32000-01-04-2.804663-1.0152250.5639581.102884
42000-01-05-2.183829-1.1261470.987302-0.351845
..................
9952002-09-22-10.052230-5.444433-49.40896119.674901
9962002-09-23-9.058395-5.363573-47.81530719.377475
9972002-09-24-8.024636-6.063298-47.26281921.506791
9982002-09-25-8.128653-5.967861-46.71000822.240115
9992002-09-26-9.313855-5.906042-47.30299922.040534

1000 rows × 5 columns

本博客的目的是总结和分享,大部分程序来自互联网,作者为标榜任何代码所有权。读者可以在未经本博客作者授权的情况下随意复制和使用。转发时若声明出处将不胜感激。另外如果程序难免有未优化或欢迎批评指正(chenaijun1298 at yeah dot net)

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值