Introduction: this program follows the "10 Minutes to pandas" tutorial. It was tested on Windows 10 with Anaconda and Jupyter. Mind version differences: if a newer release has deprecated any of these functions, defer to the official manual. The article was converted from a Jupyter notebook to Markdown, so there may be some formatting glitches, but they do not affect reproducing the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
print(pd.__version__)
print(np.__version__)
print(matplotlib.__version__)
1.0.1
1.18.1
3.1.3
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
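A Series can also carry an explicit label index instead of the default 0..n-1 positions. A minimal sketch (the labels 'a', 'b', 'c' are made up for illustration):

```python
import pandas as pd

# Labels are hypothetical; any hashable values work as an index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # label-based access -> 20
```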
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
sales_vol = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["Apple", "Strawberry", "Pineapple", "Watermelon"])
sales_vol
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
df2 = pd.DataFrame({'A': 1.,
                    'Date': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
     A       Date    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df2.dtypes
A float64
Date datetime64[ns]
C float32
D int32
E category
F object
dtype: object
View the head and tail of the data:
sales_vol.head()
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
sales_vol.tail()
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
Display the index, the columns, and the underlying NumPy data:
sales_vol.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
sales_vol.columns
Index(['Apple', 'Strawberry', 'Pineapple', 'Watermelon'], dtype='object')
sales_vol.values
array([[ 1.51620355, 0.49578439, -0.15349507, 0.4777014 ],
[-0.18143898, -0.57451669, -1.40313854, -0.01822073],
[-0.37968056, -1.42305498, -1.0345705 , -0.23502055],
[-0.63135233, -0.49263546, -1.186038 , 0.57463084],
[-1.34322502, 1.09130739, 1.15664088, -0.22881497],
[-0.51328525, 0.40177536, -1.16288585, 1.53061048]])
sales_vol.describe()
           Apple  Strawberry  Pineapple  Watermelon
count   6.000000    6.000000   6.000000    6.000000
mean   -0.255463   -0.083557  -0.630581    0.350148
std     0.954013    0.911497   0.977109    0.674310
min    -1.343225   -1.423055  -1.403139   -0.235021
25%    -0.601836   -0.554046  -1.180250   -0.176166
50%    -0.446483   -0.045430  -1.098728    0.229740
75%    -0.230999    0.472282  -0.373764    0.550398
max     1.516204    1.091307   1.156641    1.530610
sales_vol.T
            2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
Apple         1.516204   -0.181439   -0.379681   -0.631352   -1.343225   -0.513285
Strawberry    0.495784   -0.574517   -1.423055   -0.492635    1.091307    0.401775
Pineapple    -0.153495   -1.403139   -1.034571   -1.186038    1.156641   -1.162886
Watermelon    0.477701   -0.018221   -0.235021    0.574631   -0.228815    1.530610
sales_vol.sort_index(axis=1, ascending=True)
                Apple  Pineapple  Strawberry  Watermelon
2013-01-01   1.516204  -0.153495    0.495784    0.477701
2013-01-02  -0.181439  -1.403139   -0.574517   -0.018221
2013-01-03  -0.379681  -1.034571   -1.423055   -0.235021
2013-01-04  -0.631352  -1.186038   -0.492635    0.574631
2013-01-05  -1.343225   1.156641    1.091307   -0.228815
2013-01-06  -0.513285  -1.162886    0.401775    1.530610
sales_vol.sort_values(by='Watermelon', ascending=True)
                Apple  Strawberry  Pineapple  Watermelon
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
sales_vol['Apple']
2013-01-01 1.516204
2013-01-02 -0.181439
2013-01-03 -0.379681
2013-01-04 -0.631352
2013-01-05 -1.343225
2013-01-06 -0.513285
Freq: D, Name: Apple, dtype: float64
sales_vol[0:3]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
sales_vol['2013-01-02':'2013-01-04']
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
sales_vol.loc[dates[0]]
Apple 1.516204
Strawberry 0.495784
Pineapple -0.153495
Watermelon 0.477701
Name: 2013-01-01 00:00:00, dtype: float64
sales_vol.loc[:, ['Apple', 'Strawberry']]
                Apple  Strawberry
2013-01-01   1.516204    0.495784
2013-01-02  -0.181439   -0.574517
2013-01-03  -0.379681   -1.423055
2013-01-04  -0.631352   -0.492635
2013-01-05  -1.343225    1.091307
2013-01-06  -0.513285    0.401775
sales_vol.loc['20130102':'20130104', ['Apple', 'Strawberry']]
                Apple  Strawberry
2013-01-02  -0.181439   -0.574517
2013-01-03  -0.379681   -1.423055
2013-01-04  -0.631352   -0.492635
sales_vol.loc['20130102', ['Apple', 'Strawberry']]
Apple -0.181439
Strawberry -0.574517
Name: 2013-01-02 00:00:00, dtype: float64
sales_vol.loc[dates[0], 'Apple']
1.5162035528655158
sales_vol.at[dates[0], 'Apple']
1.5162035528655158
sales_vol.iloc[3]
Apple -0.631352
Strawberry -0.492635
Pineapple -1.186038
Watermelon 0.574631
Name: 2013-01-04 00:00:00, dtype: float64
sales_vol.iloc[3:5, 0:2]
sales_vol.iloc[[1, 2, 4], [0, 2]]
                Apple  Pineapple
2013-01-02  -0.181439  -1.403139
2013-01-03  -0.379681  -1.034571
2013-01-05  -1.343225   1.156641
sales_vol.iloc[1:3, :]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
sales_vol.iloc[:, 1:3]
            Strawberry  Pineapple
2013-01-01    0.495784  -0.153495
2013-01-02   -0.574517  -1.403139
2013-01-03   -1.423055  -1.034571
2013-01-04   -0.492635  -1.186038
2013-01-05    1.091307   1.156641
2013-01-06    0.401775  -1.162886
sales_vol.iloc[1, 1]
sales_vol.iat[1, 1]
-0.574516686350344
Boolean indexing (query)
sales_vol[sales_vol.Apple > 0]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
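A boolean mask can also be applied element-wise to the whole frame, in which case entries failing the condition become NaN. A small sketch with a made-up two-column frame (not the tutorial's sales_vol):

```python
import numpy as np
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [-4.0, 5.0, -6.0]})

# Element-wise mask: positive entries kept, the rest become NaN
masked = df[df > 0]
print(masked)
```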
df2 = sales_vol.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
                Apple  Strawberry  Pineapple  Watermelon      E
2013-01-01   1.516204    0.495784  -0.153495    0.477701    one
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221    one
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021    two
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631  three
2013-01-05  -1.343225    1.091307   1.156641   -0.228815   four
2013-01-06  -0.513285    0.401775  -1.162886    1.530610  three
df2[df2['E'].isin(['one', 'four'])]
                Apple  Strawberry  Pineapple  Watermelon     E
2013-01-01   1.516204    0.495784  -0.153495    0.477701   one
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221   one
2013-01-05  -1.343225    1.091307   1.156641   -0.228815  four
df2['E'].isin(['one', 'four'])
2013-01-01 True
2013-01-02 True
2013-01-03 False
2013-01-04 False
2013-01-05 True
2013-01-06 False
Freq: D, Name: E, dtype: bool
Insertion (create)
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
sales_vol['Cherry'] = s1
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   1.516204    0.495784  -0.153495    0.477701     NaN
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221     1.0
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021     2.0
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631     3.0
2013-01-05  -1.343225    1.091307   1.156641   -0.228815     4.0
2013-01-06  -0.513285    0.401775  -1.162886    1.530610     5.0
Modification (update)
sales_vol.at[dates[0], 'Apple'] = 0
sales_vol.iat[0, 1] = 0
sales_vol.loc[:, 'Pineapple'] = np.array([5] * len(sales_vol))
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0
2013-01-05  -1.343225    1.091307          5   -0.228815     4.0
2013-01-06  -0.513285    0.401775          5    1.530610     5.0
df2 = sales_vol.copy()
df2[df2 > 0] = -df2
df2
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000         -5   -0.477701     NaN
2013-01-02  -0.181439   -0.574517         -5   -0.018221    -1.0
2013-01-03  -0.379681   -1.423055         -5   -0.235021    -2.0
2013-01-04  -0.631352   -0.492635         -5   -0.574631    -3.0
2013-01-05  -1.343225   -1.091307         -5   -0.228815    -4.0
2013-01-06  -0.513285   -0.401775         -5   -1.530610    -5.0
4. Handling missing data
df1 = sales_vol.reindex(index=dates[0:4], columns=list(sales_vol.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-01   0.000000    0.000000          5    0.477701     NaN  1.0
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0  NaN
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0  NaN
df_drop = df1.dropna(how='any')
df_drop
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
df_fill = df1.fillna(value=5)
df_fill
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-01   0.000000    0.000000          5    0.477701     5.0  1.0
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0  5.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0  5.0
pd.isnull(df1)
            Apple  Strawberry  Pineapple  Watermelon  Cherry      E
2013-01-01  False       False      False       False    True  False
2013-01-02  False       False      False       False   False  False
2013-01-03  False       False      False       False   False   True
2013-01-04  False       False      False       False   False   True
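Besides dropping rows or filling with a constant, gaps can be filled forward from the last valid observation. A minimal sketch with a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill()  # each NaN takes the last valid value before it
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```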
5. Operations
sales_vol.mean()
Apple -0.508164
Strawberry -0.166187
Pineapple 5.000000
Watermelon 0.350148
Cherry 3.000000
dtype: float64
sales_vol.mean(1)
2013-01-01 1.369425
2013-01-02 1.045165
2013-01-03 0.992449
2013-01-04 1.490129
2013-01-05 1.703853
2013-01-06 2.283820
Freq: D, dtype: float64
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
sales_vol.sub(s, axis='index')
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01        NaN         NaN        NaN         NaN     NaN
2013-01-02        NaN         NaN        NaN         NaN     NaN
2013-01-03  -1.379681   -2.423055        4.0   -1.235021     1.0
2013-01-04  -3.631352   -3.492635        2.0   -2.425369     0.0
2013-01-05  -6.343225   -3.908693        0.0   -5.228815    -1.0
2013-01-06        NaN         NaN        NaN         NaN     NaN
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0
2013-01-05  -1.343225    1.091307          5   -0.228815     4.0
2013-01-06  -0.513285    0.401775          5    1.530610     5.0
sales_vol.apply(np.cumsum)
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517         10    0.459481     1.0
2013-01-03  -0.561120   -1.997572         15    0.224460     3.0
2013-01-04  -1.192472   -2.490207         20    0.799091     6.0
2013-01-05  -2.535697   -1.398900         25    0.570276    10.0
2013-01-06  -3.048982   -0.997124         30    2.100886    15.0
sales_vol.apply(lambda x: x.max() - x.min())
Apple 1.343225
Strawberry 2.514362
Pineapple 0.000000
Watermelon 1.765631
Cherry 4.000000
dtype: float64
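By default `apply` works column by column; passing `axis=1` applies the function once per row instead. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=1: the lambda receives one row (a Series) at a time
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sums.tolist())  # [11, 22]
```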
s = pd.Series(np.random.randint(0, 7, size=10))
s
0 4
1 5
2 3
3 2
4 2
5 6
6 1
7 1
8 4
9 6
dtype: int32
s.value_counts()
6 2
4 2
2 2
1 2
5 1
3 1
dtype: int64
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
6. Merge
df = pd.DataFrame(np.random.randn(10, 4))
df
          0         1         2         3
0 -0.064357 -0.729935 -0.257929  0.658859
1 -0.087968  1.512643  1.432160  0.060939
2  0.721509 -0.417664 -0.785848 -0.440885
3  1.355722 -1.766013 -0.595825 -0.507181
4  0.127415 -0.330855  0.155734 -0.395429
5  0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529  0.980170  2.073291  0.470351
7 -1.334029  0.873497  1.164064  0.624527
8  0.123787 -0.077460 -0.276321  2.126241
9 -0.914960 -1.583755 -0.528952  0.816373
pieces = [df[:3], df[3:7], df[7:]]
pieces
[ 0 1 2 3
0 -0.064357 -0.729935 -0.257929 0.658859
1 -0.087968 1.512643 1.432160 0.060939
2 0.721509 -0.417664 -0.785848 -0.440885,
0 1 2 3
3 1.355722 -1.766013 -0.595825 -0.507181
4 0.127415 -0.330855 0.155734 -0.395429
5 0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529 0.980170 2.073291 0.470351,
0 1 2 3
7 -1.334029 0.873497 1.164064 0.624527
8 0.123787 -0.077460 -0.276321 2.126241
9 -0.914960 -1.583755 -0.528952 0.816373]
pd.concat(pieces)
          0         1         2         3
0 -0.064357 -0.729935 -0.257929  0.658859
1 -0.087968  1.512643  1.432160  0.060939
2  0.721509 -0.417664 -0.785848 -0.440885
3  1.355722 -1.766013 -0.595825 -0.507181
4  0.127415 -0.330855  0.155734 -0.395429
5  0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529  0.980170  2.073291  0.470351
7 -1.334029  0.873497  1.164064  0.624527
8  0.123787 -0.077460 -0.276321  2.126241
9 -0.914960 -1.583755 -0.528952  0.816373
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
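Because `key` is duplicated on both sides above, the merge produces the 2×2 cross product. With unique keys it behaves like a plain join instead; a sketch:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# Unique keys: one output row per key, no duplication
merged = pd.merge(left, right, on='key')
print(merged)
```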
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
          A         B         C         D
0 -1.022777  1.015712 -0.275568 -0.505822
1 -0.449308 -2.028096  0.976798  1.341397
2  0.150794 -0.277167 -0.324461 -0.700945
3 -0.043391 -1.090533  0.854648  0.778152
4 -1.391091  0.847236 -0.126084 -1.004139
5  0.065630  0.229489 -2.033006  0.825722
6  1.315419  0.377726  0.092337  0.290093
7  0.615620 -0.402542 -0.047299 -1.149747
8 -0.043391 -1.090533  0.854648  0.778152
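Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; the same row-append can be written with `pd.concat`. A sketch reproducing the step above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Turn the row Series into a 1-row frame, then concatenate
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out.shape)  # (9, 4)
```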
7. Grouping
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.942255  0.461081
1  bar    one -0.988118  1.099893
2  foo    two  0.773705  0.513753
3  bar  three  0.844189  0.341648
4  foo    two  1.325598 -0.200073
5  bar    two  0.313239 -0.007345
6  foo    one  1.416963  2.522686
7  foo  three -0.487402  1.928968
df.groupby('A').sum()
            C         D
A
bar  0.169310  1.434197
foo  3.971119  5.226417
df.groupby(['A', 'B']).sum()
                  C         D
A   B
bar one   -0.988118  1.099893
    three  0.844189  0.341648
    two    0.313239 -0.007345
foo one    2.359218  2.983768
    three -0.487402  1.928968
    two    2.099303  0.313681
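`groupby` is not limited to a single reduction; `agg` can compute several statistics at once. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})

# One row per group, one column per aggregation
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
print(stats)
```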
8. Reshaping
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
                     A         B
first second
bar   one     1.182011 -0.351810
      two    -1.464586 -0.377918
baz   one    -0.696995  1.227232
      two     0.416556 -1.173721
stacked = df2.stack()
stacked
first second
bar one A 1.182011
B -0.351810
two A -1.464586
B -0.377918
baz one A -0.696995
B 1.227232
two A 0.416556
B -1.173721
dtype: float64
stacked.unstack()
                     A         B
first second
bar   one     1.182011 -0.351810
      two    -1.464586 -0.377918
baz   one    -0.696995  1.227232
      two     0.416556 -1.173721
stacked.unstack(1)
second        one       two
first
bar   A  1.182011 -1.464586
      B -0.351810 -0.377918
baz   A -0.696995  0.416556
      B  1.227232 -1.173721
stacked.unstack(0)
first          bar       baz
second
one    A  1.182011 -0.696995
       B -0.351810  1.227232
two    A -1.464586  0.416556
       B -0.377918 -1.173721
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  0.733791  2.254117
1     one  B  foo  0.889867  0.172840
2     two  C  foo -0.673481 -1.287785
3   three  A  bar  1.621192  0.286708
4     one  B  bar  1.827056  1.477342
5     one  C  bar -1.586255  1.145898
6     two  A  foo  0.932376 -1.259903
7   three  B  foo -2.246142  0.178643
8     one  C  foo -1.483878  0.055398
9     one  A  bar -0.134426 -0.629503
10    two  B  bar -1.784999  0.498491
11  three  C  bar  0.956670 -0.446973
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A -0.134426  0.733791
      B  1.827056  0.889867
      C -1.586255 -1.483878
three A  1.621192       NaN
      B       NaN -2.246142
      C  0.956670       NaN
two   A       NaN  0.932376
      B -1.784999       NaN
      C       NaN -0.673481
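The NaN cells above mark A/B/C combinations that never occur in the data; `pivot_table`'s `fill_value` parameter substitutes a constant for them. A sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': ['u', 'v', 'u'],
                   'D': [1.0, 2.0, 3.0]})

# ('y', 'v') never occurs, so fill_value=0 fills that cell
pt = pd.pivot_table(df, values='D', index='A', columns='C', fill_value=0)
print(pt)
```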
9. Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into data sampled every 5 minutes). This kind of operation is very common in finance.
rng = pd.date_range('20120101', periods=100, freq='S')
rng
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01 25643
Freq: 5T, dtype: int32
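The sum above depends on random input; a deterministic sketch (a made-up series of ones at 1-second ticks) makes the bucket arithmetic visible:

```python
import numpy as np
import pandas as pd

# 10 minutes of 1-second ticks, each carrying the value 1
rng = pd.date_range('2012-01-01', periods=600, freq='S')
ts = pd.Series(np.ones(len(rng)), index=rng)

# Two 5-minute buckets, each summing 300 one-second values
res = ts.resample('5Min').sum()
print(res)
```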
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06 0.617556
2012-03-07 -0.140357
2012-03-08 0.720688
2012-03-09 -0.002317
2012-03-10 -1.039849
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00 0.617556
2012-03-07 00:00:00+00:00 -0.140357
2012-03-08 00:00:00+00:00 0.720688
2012-03-09 00:00:00+00:00 -0.002317
2012-03-10 00:00:00+00:00 -1.039849
Freq: D, dtype: float64
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00 0.617556
2012-03-06 19:00:00-05:00 -0.140357
2012-03-07 19:00:00-05:00 0.720688
2012-03-08 19:00:00-05:00 -0.002317
2012-03-09 19:00:00-05:00 -1.039849
Freq: D, dtype: float64
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31 0.228717
2012-02-29 -1.027588
2012-03-31 0.908395
2012-04-30 0.625547
2012-05-31 -0.109560
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01 0.228717
2012-02 -1.027588
2012-03 0.908395
2012-04 0.625547
2012-05 -0.109560
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 0.228717
2012-02-01 -1.027588
2012-03-01 0.908395
2012-04-01 0.625547
2012-05-01 -0.109560
Freq: MS, dtype: float64
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
10. Categorical
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
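Assigning to `.cat.categories`, as done above, was later deprecated and then removed in pandas 2.0; `rename_categories` is the supported equivalent. A minimal sketch with made-up labels:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b'], dtype='category')

# Rename positionally: 'a' -> 'good', 'b' -> 'bad'
s = s.cat.rename_categories(['good', 'bad'])
print(list(s))  # ['good', 'bad', 'bad']
```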
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
df.groupby("grade").size()
11. Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1788d4ef1c8>
[figure: line plot of the cumulative-sum series produced by ts.plot()]
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure();
df.plot();
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1788cbdbe88>
<Figure size 432x288 with 0 Axes>
[figure: line plot of the four cumulative-sum columns A-D with a legend]
12. Importing and saving data
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
    Unnamed: 0          A         B          C          D
0   2000-01-01  -1.125225  1.577122  -0.631104   0.226040
1   2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2   2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
3   2000-01-04  -2.804663 -1.015225   0.563958   1.102884
4   2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
..         ...        ...       ...        ...        ...
995 2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
996 2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
997 2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
998 2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
999 2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 5 columns
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')
                    A         B          C          D
2000-01-01  -1.125225  1.577122  -0.631104   0.226040
2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
2000-01-04  -2.804663 -1.015225   0.563958   1.102884
2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
...               ...       ...        ...        ...
2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 4 columns
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
    Unnamed: 0          A         B          C          D
0   2000-01-01  -1.125225  1.577122  -0.631104   0.226040
1   2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2   2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
3   2000-01-04  -2.804663 -1.015225   0.563958   1.102884
4   2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
..         ...        ...       ...        ...        ...
995 2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
996 2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
997 2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
998 2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
999 2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 5 columns
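The `Unnamed: 0` column appears because the index was written out as an ordinary column; passing `index_col=0` when reading restores it as the index. A sketch with a tiny made-up frame and a throwaway file name:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df.to_csv('tmp_roundtrip_demo.csv')  # the index becomes the first CSV column

# index_col=0 turns that first column back into the index
back = pd.read_csv('tmp_roundtrip_demo.csv', index_col=0)
print(back.index.tolist())  # ['x', 'y']
```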
The purpose of this blog is to summarize and share. Most of the code comes from the internet, and the author does not claim ownership of any of it. Readers may copy and use it freely without the author's authorization; crediting the source when reposting would be appreciated. If anything here is unpolished or wrong, criticism and corrections are welcome (chenaijun1298 at yeah dot net).