Follow the WeChat official account: 小程在线
Follow the CSDN blog: 程志伟的博客
These are personal study notes only
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 7.12.0 -- An enhanced Interactive Python.
#I. Creating objects
#1. Create a Series by passing a list; pandas creates a default integer index:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s = pd.Series([1,3,5,np.nan,9,8])
print(s)
0 1.0
1 3.0
2 5.0
3 NaN
4 9.0
5 8.0
dtype: float64
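# A small sketch of the two behaviours visible above: the default index runs from 0 to n-1,
# and a single np.nan is enough to make the dtype float64. The names s_int, s_nan and
# s_labeled are illustrative and do not come from the original notes.

```python
import numpy as np
import pandas as pd

s_int = pd.Series([1, 3, 5])                              # all integers: integer dtype
s_nan = pd.Series([1, 3, np.nan])                         # NaN is a float, so the dtype becomes float64
s_labeled = pd.Series([1, 3, 5], index=['a', 'b', 'c'])   # explicit labels replace the default index

print(s_int.dtype, s_nan.dtype)
print(list(s_int.index), list(s_labeled.index))
```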
#2. Create a DataFrame by passing a numpy array, a datetime index, and column labels
dates = pd.date_range('20130101', periods=6)
print(dates)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698
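# np.random.randn draws new values on every run, so the numbers printed above will not
# repeat. A minimal sketch of one way to get repeatable data, assuming a seeded generator
# is acceptable; the seed value 0 and the names df_seeded / df_again are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Hypothetical seed; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

# A fresh generator with the same seed reproduces exactly the same values.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df_seeded.equals(df_again))
```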
#3. Create a DataFrame by passing a dict of objects that can be converted to series-like structures
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
print(df2)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
#4. View the data type of each column:
print(df2.dtypes)
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
'''
II. Viewing data
'''
#1. View the top and bottom rows of a DataFrame:
print(df.head())
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
print(df.tail(3))
A B C D
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698
#2. Display the index, the columns, and the underlying numpy data:
print(df.index)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
print(df.columns)
Index(['A', 'B', 'C', 'D'], dtype='object')
print(df.values)
[[-1.86393821 0.40282769 1.17014473 1.21964729]
[-0.25573943 0.90825332 -1.19425769 0.92671242]
[-0.09869885 -0.79461508 -1.08363359 -0.71968247]
[ 1.17222981 -0.15551869 0.38053457 -0.69428843]
[ 0.02021829 0.86144052 -0.71066191 -1.18176438]
[-0.31592033 -2.37142305 -0.57265269 0.88969805]]
#3. describe() gives a quick statistical summary of the data:
print(df.describe())
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.223641 -0.191506 -0.335088 0.073387
std 0.971973 1.248076 0.924535 1.049012
min -1.863938 -2.371423 -1.194258 -1.181764
25% -0.300875 -0.634841 -0.990391 -0.713334
50% -0.177219 0.123655 -0.641657 0.097705
75% -0.009511 0.746787 0.142238 0.917459
max 1.172230 0.908253 1.170145 1.219647
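# To make the summary rows easier to read, here is a sketch on a tiny frame whose
# statistics can be checked by hand. The frame `small` is an illustrative assumption.

```python
import pandas as pd

small = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

# Rows of the summary are addressed by label, like any other DataFrame.
print(summary.loc['count', 'x'])   # number of non-missing values
print(summary.loc['mean', 'x'])    # arithmetic mean
print(summary.loc['50%', 'x'])     # the median
# 'std' is the sample standard deviation (ddof=1), the same quantity as Series.std().
print(summary.loc['std', 'x'], small['x'].std())
```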
#4. Transpose the data:
print(df.T)
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -1.863938 -0.255739 -0.098699 1.172230 0.020218 -0.315920
B 0.402828 0.908253 -0.794615 -0.155519 0.861441 -2.371423
C 1.170145 -1.194258 -1.083634 0.380535 -0.710662 -0.572653
D 1.219647 0.926712 -0.719682 -0.694288 -1.181764 0.889698
#5. Sort by an axis (here the column labels, in descending order)
print(df.sort_index(axis=1, ascending=False))
D C B A
2013-01-01 1.219647 1.170145 0.402828 -1.863938
2013-01-02 0.926712 -1.194258 0.908253 -0.255739
2013-01-03 -0.719682 -1.083634 -0.794615 -0.098699
2013-01-04 -0.694288 0.380535 -0.155519 1.172230
2013-01-05 -1.181764 -0.710662 0.861441 0.020218
2013-01-06 0.889698 -0.572653 -2.371423 -0.315920
#6. Sort by values
print(df.sort_values(by='B'))
A B C D
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
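# sort_values also accepts several keys with a direction for each. A short sketch on
# made-up data; the frame `scores` and its column names are illustrative assumptions.

```python
import pandas as pd

scores = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                       'score': [2, 3, 1, 4]})

# Sort by group ascending, then by score descending within each group.
ordered = scores.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)

# sort_values returns a new object; the original row order is unchanged.
print(list(scores.index), list(ordered.index))
```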
'''
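III. Selection
Standard Python/NumPy expressions for selecting and setting work here, but for
production code the optimized pandas accessors are recommended: .at, .iat,
.loc and .iloc. The older .ix accessor was deprecated in pandas 0.20 and removed
in pandas 1.0, so it is not used in these notes. For details, see "Indexing and
selecting data" and "MultiIndex / advanced indexing" in the pandas documentation.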
'''
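# Code that once mixed a position with a label in one lookup has to pick one accessor.
# A minimal sketch of the two usual rewrites, on an illustrative frame named `frame`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=['A', 'B'])

# "First row, column 'A'": turn the position into a label and use .loc ...
by_label = frame.loc[frame.index[0], 'A']
# ... or turn the label into a position and use .iloc.
by_position = frame.iloc[0, frame.columns.get_loc('A')]

print(by_label, by_position)
```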
#1. Select a single column, which returns a Series; equivalent to df.A:
print(df['A'])
2013-01-01 -1.863938
2013-01-02 -0.255739
2013-01-03 -0.098699
2013-01-04 1.172230
2013-01-05 0.020218
2013-01-06 -0.315920
Freq: D, Name: A, dtype: float64
#2. Selecting with [] and a slice selects rows
print(df[0:3])
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
print(df['20130102':'20130104'])
A B C D
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
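# The two slices above follow different endpoint rules: integer slices exclude the stop
# position, while label slices include the end label. A sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame({'A': np.arange(6)}, index=dates)

positional = frame[0:3]                   # positions 0, 1, 2; position 3 is excluded
by_label = frame['20130102':'20130104']   # 2013-01-02 through 2013-01-04, end label included

print(len(positional), len(by_label))
print(by_label.index[-1])
```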
'''
IV. Selection by label
'''
#1. Get a cross-section using a label
print(df.loc[dates[0]])
A -1.863938
B 0.402828
C 1.170145
D 1.219647
Name: 2013-01-01 00:00:00, dtype: float64
#2. Select on multiple axes by label
print(df.loc[:,['A','B']])
A B
2013-01-01 -1.863938 0.402828
2013-01-02 -0.255739 0.908253
2013-01-03 -0.098699 -0.794615
2013-01-04 1.172230 -0.155519
2013-01-05 0.020218 0.861441
2013-01-06 -0.315920 -2.371423
#3. Label slicing (both endpoints are included)
print(df.loc['20130102':'20130104',['A','B']])
A B
2013-01-02 -0.255739 0.908253
2013-01-03 -0.098699 -0.794615
2013-01-04 1.172230 -0.155519
#4. Reduce the dimensions of the returned object
print(df.loc['20130102',['A','B']])
A -0.255739
B 0.908253
Name: 2013-01-02 00:00:00, dtype: float64
#5. Get a scalar value
print(df.loc[dates[0],'A'])
-1.8639382082056544
'''
V. Selection by position
'''
#1. Select by position by passing an integer (this selects a row)
print(df.iloc[3])
A 1.172230
B -0.155519
C 0.380535
D -0.694288
Name: 2013-01-04 00:00:00, dtype: float64
#2. Slice by integer positions, similar to numpy/python
print(df.iloc[3:5,0:2])
A B
2013-01-04 1.172230 -0.155519
2013-01-05 0.020218 0.861441
#3. Select by lists of integer positions, similar to numpy/python
df.iloc[[1,2,4],[0,2]]
Out[26]:
A C
2013-01-02 -0.255739 -1.194258
2013-01-03 -0.098699 -1.083634
2013-01-05 0.020218 -0.710662
#4. Slice rows explicitly
df.iloc[1:3,:]
Out[27]:
A B C D
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
#5. Slice columns explicitly
df.iloc[:,1:3]
Out[28]:
B C
2013-01-01 0.402828 1.170145
2013-01-02 0.908253 -1.194258
2013-01-03 -0.794615 -1.083634
2013-01-04 -0.155519 0.380535
2013-01-05 0.861441 -0.710662
2013-01-06 -2.371423 -0.572653
#6. Get a specific value
df.iloc[1,1]
Out[29]: 0.9082533192518502
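# For single values, .at and .iat are the scalar counterparts of .loc and .iloc.
# A sketch showing that all four reach the same cell; `frame` is an illustrative frame.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=['A', 'B'])

v_loc = frame.loc[dates[1], 'B']    # label-based, general purpose
v_at = frame.at[dates[1], 'B']      # label-based, scalar only
v_iloc = frame.iloc[1, 1]           # position-based, general purpose
v_iat = frame.iat[1, 1]             # position-based, scalar only

print(v_loc, v_at, v_iloc, v_iat)
```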
'''
VI. Boolean indexing
'''
#1. Select rows using the values of a single column
df[df.A > 0]
Out[30]:
A B C D
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
#2. Select values where a boolean condition holds; all other cells become NaN
df[df > 0]
Out[31]:
A B C D
2013-01-01 NaN 0.402828 1.170145 1.219647
2013-01-02 NaN 0.908253 NaN 0.926712
2013-01-03 NaN NaN NaN NaN
2013-01-04 1.172230 NaN 0.380535 NaN
2013-01-05 0.020218 0.861441 NaN NaN
2013-01-06 NaN NaN NaN 0.889698
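# Indexing with a boolean DataFrame behaves like DataFrame.where. A sketch of the
# explicit method, its replacement value, and the inverse method mask(); the frame
# and variable names are illustrative assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

kept = frame.where(frame > 0)            # same result as frame[frame > 0]
replaced = frame.where(frame > 0, 0.0)   # cells failing the condition become 0.0 instead of NaN
inverse = frame.mask(frame > 0)          # mask() blanks the cells where the condition holds

print(kept)
print(replaced)
print(inverse)
```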
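#3. Filter using the isin() method
# (The values from here on differ from the earlier output, presumably because df was regenerated in a new session.)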
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
Out[33]:
A B C D E
2013-01-01 -0.309948 -0.216393 -0.056348 -0.820341 one
2013-01-02 -0.025978 -0.487552 -0.128323 -1.403817 one
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483 two
2013-01-04 0.159498 -1.440484 -1.916870 -0.202248 three
2013-01-05 1.628307 -0.472663 -0.196632 0.561092 four
2013-01-06 -1.359379 0.442662 0.898418 -0.943424 three
df2[df2['E'].isin(['two','four'])]
Out[34]:
A B C D E
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483 two
2013-01-05 1.628307 -0.472663 -0.196632 0.561092 four
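# An isin() mask is an ordinary boolean Series, so it can be negated with ~ or combined
# with other conditions using & and |. A sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, -2, 3, -4],
                      'E': ['one', 'two', 'three', 'four']})

# Rows whose E is NOT in the list.
not_in = frame[~frame['E'].isin(['two', 'four'])]

# Rows whose E is in the list AND whose A is below -3; each condition needs parentheses.
both = frame[frame['E'].isin(['two', 'four']) & (frame['A'] < -3)]

print(not_in)
print(both)
```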
'''
VII. Setting
'''
#1. Build a Series to add as a new column; note that its index starts one day later than df's
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
Out[35]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
#2. Set a value by label
df.at[dates[0],'A'] = 0
df
Out[37]:
A B C D
2013-01-01 0.000000 -0.216393 -0.056348 -0.820341
2013-01-02 -0.025978 -0.487552 -0.128323 -1.403817
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483
2013-01-04 0.159498 -1.440484 -1.916870 -0.202248
2013-01-05 1.628307 -0.472663 -0.196632 0.561092
2013-01-06 -1.359379 0.442662 0.898418 -0.943424
#3. Set a value by position
df.iat[0,1] = 0
#4. Set a whole column by assigning a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df
Out[40]:
A B C D
2013-01-01 0.000000 0.000000 -0.056348 5
2013-01-02 -0.025978 -0.487552 -0.128323 5
2013-01-03 -0.675241 1.801300 -0.564131 5
2013-01-04 0.159498 -1.440484 -1.916870 5
2013-01-05 1.628307 -0.472663 -0.196632 5
2013-01-06 -1.359379 0.442662 0.898418 5
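# Assign s1 as a new column; pandas aligns it on the index, so 2013-01-01 has no match and gets NaN
df['F'] = s1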
df
Out[42]:
A B C D F
2013-01-01 0.000000 0.000000 -0.056348 5 NaN
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0
2013-01-05 1.628307 -0.472663 -0.196632 5 4.0
2013-01-06 -1.359379 0.442662 0.898418 5 5.0
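# Why F starts with NaN: assignment matches on index labels, not on position. A minimal
# sketch on an illustrative frame; labels missing from the Series become NaN and labels
# missing from the frame are dropped.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [10, 20, 30]}, index=dates)

# The Series starts one day later, so it shares only two labels with the frame.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20130102', periods=3))
frame['F'] = shifted

print(frame)
```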
#5. Set values with a where operation (here every positive value is replaced by its negative)
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[43]:
A B C D F
2013-01-01 0.000000 0.000000 -0.056348 -5 NaN
2013-01-02 -0.025978 -0.487552 -0.128323 -5 -1.0
2013-01-03 -0.675241 -1.801300 -0.564131 -5 -2.0
2013-01-04 -0.159498 -1.440484 -1.916870 -5 -3.0
2013-01-05 -1.628307 -0.472663 -0.196632 -5 -4.0
2013-01-06 -1.359379 -0.442662 -0.898418 -5 -5.0
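# For this particular rule (make every value non-positive), negating the absolute value
# gives the same frame. A sketch comparing the two on illustrative data; NaN stays NaN
# in both versions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [0.5, -1.5, 2.0], 'F': [np.nan, 1.0, -3.0]})

# Boolean-mask assignment, as in the step above.
flipped = frame.copy()
flipped[flipped > 0] = -flipped

# Alternative formulation with abs().
alternative = -frame.abs()

print(flipped)
print(alternative)
```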
'''
VIII. Missing data
'''
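#1. reindex() changes, adds, or deletes the labels on a given axis; it returns a copy of the data
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1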
df1
Out[46]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.056348 5 NaN 1.0
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0 NaN
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0 NaN
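# A compact sketch of what reindex() does: it keeps the requested labels, fills new ones
# with NaN, and leaves the source untouched. The names `frame` and `subset` are illustrative.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=4)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=dates)

# Keep the first two rows and add a column 'E' that does not exist yet.
subset = frame.reindex(index=dates[0:2], columns=['A', 'E'])

print(subset)
print(frame.shape)   # the original frame is unchanged
```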
#2. Drop every row that contains a missing value
df1.dropna(how='any')
Out[47]:
A B C D F E
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
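# how='any' is the strictest setting. A sketch of two gentler variants, on illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                      'B': [2.0, 5.0, np.nan]})

any_missing = frame.dropna(how='any')    # drop rows with at least one NaN
all_missing = frame.dropna(how='all')    # drop only rows where every value is NaN
subset_b = frame.dropna(subset=['B'])    # look at column B only

print(list(any_missing.index), list(all_missing.index), list(subset_b.index))
```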
#3. Fill missing values
df1.fillna(value=5)
Out[48]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.056348 5 5.0 1.0
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0 5.0
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0 5.0
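# A single fill value is not always appropriate. A sketch of two alternatives, assuming
# per-column defaults or carrying the last observation forward fits the data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'F': [np.nan, 1.0, 2.0],
                      'E': [1.0, np.nan, np.nan]})

per_column = frame.fillna({'F': 0.0, 'E': -1.0})   # a different fill value for each column
forward = frame.ffill()                            # propagate the last valid value downward

print(per_column)
print(forward)
```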
#4. Get a boolean mask that is True wherever a value is missing
pd.isnull(df1)
Out[49]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
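# The mask is mostly useful as input to further steps. A sketch that counts missing values
# per column and pulls out the incomplete rows; pd.isna is an alias of pd.isnull, and the
# frame below is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'F': [np.nan, 1.0, 2.0, 3.0],
                      'E': [1.0, np.nan, np.nan, 4.0]})

missing = frame.isna()                         # same mask as pd.isnull(frame)
per_column = missing.sum()                     # True counts as 1, so this counts NaNs per column
rows_with_gap = frame[missing.any(axis=1)]     # rows with at least one missing value

print(per_column)
print(rows_with_gap)
```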