Learning pandas (1)

远山待

Article link: https://blog.csdn.net/weixin_40835556/article/details/83346631

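[Draft notes kept only as a record of my own study.]

The most important parts of pandas are its two data structures: Series and DataFrame.

Let's start by getting familiar with the basic commands.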

1. import:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

2. Create Series and DataFrame

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
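
To make "axis labels" and "columns of different types" concrete, here is a minimal sketch. The labels `a`, `b`, `c` and the column names `name` and `score` are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = list(s.index)   # the axis labels, i.e. the index
value_b = s["b"]         # look up a value by its label

# A DataFrame whose columns hold different types (text and floats)
people = pd.DataFrame({"name": ["x", "y"], "score": [1.5, 2.5]})
kinds = people.dtypes.astype(str).to_dict()   # column name -> dtype name
```

Each column of a DataFrame behaves like a Series sharing the same row index, which is why `people["score"]` has its own dtype.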

# create Series by passing a list of values
>>>s = pd.Series([1,3,5,np.nan,6,8])   
>>>s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

# create DataFrame by passing a NumPy array with a datetime index and labeled columns
>>>dates = pd.date_range('20181024',periods=6)      
>>>dates
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
               '2018-10-28', '2018-10-29'],
              dtype='datetime64[ns]', freq='D')

>>>df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list('ABCD'))
>>>df
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# create DataFrame by passing a dict whose values can each be converted to a Series-like column
>>>df2 = pd.DataFrame({'A':1.,                      
                       'B':pd.Timestamp('20181024'),
                       'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                       'D':np.array([3]*4,dtype='int32'),
                       'E':pd.Categorical(["test","train","test","train"]),
                       'F':'foo'})
>>>df2 
     A          B    C  D      E    F
0  1.0 2018-10-24  1.0  3   test  foo
1  1.0 2018-10-24  1.0  3  train  foo
2  1.0 2018-10-24  1.0  3   test  foo
3  1.0 2018-10-24  1.0  3  train  foo
>>>df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

3. Viewing Data

>>>df.head()
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
>>>df.tail()
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397
>>>df.tail(2)
                   A         B         C         D
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# Display the index, the columns and the underlying NumPy data:
>>>df.index
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
               '2018-10-28', '2018-10-29'],
              dtype='datetime64[ns]', freq='D')
>>>df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>>df.values
array([[-0.44857797, -0.43094069,  0.1163714 ,  0.02990564],
       [ 1.47278109, -1.88029967,  1.62607715, -0.39742327],
       [-1.18227842,  0.11046589, -0.36528377, -0.75864786],
       [ 0.79296163,  1.97246094, -0.0098655 ,  0.58424183],
       [ 1.86912973, -1.19133507, -0.90369126, -1.42584953],
       [ 0.71844329,  0.99475901, -1.20281179,  0.09639716]])
# describe() shows a quick statistical summary of each column:
>>>df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.537077 -0.070815 -0.123201 -0.311896
std    1.155506  1.414410  0.996348  0.711954
min   -1.182278 -1.880300 -1.202812 -1.425850
25%   -0.156823 -1.001236 -0.769089 -0.668342
50%    0.755702 -0.160237 -0.187575 -0.183759
75%    1.302826  0.773686  0.084812  0.079774
max    1.869130  1.972461  1.626077  0.584242
# Transposing the data
>>>df.T
   2018-10-24  2018-10-25     ...      2018-10-28  2018-10-29
A   -0.448578    1.472781     ...        1.869130    0.718443
B   -0.430941   -1.880300     ...       -1.191335    0.994759
C    0.116371    1.626077     ...       -0.903691   -1.202812
D    0.029906   -0.397423     ...       -1.425850    0.096397
# sort by axis
>>>df.sort_index(axis=0,ascending=False)     
                   A         B         C         D
2018-10-29  0.718443  0.994759 -1.202812  0.096397
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
# sort by values
>>>df.sort_values(by='B')                    
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-29  0.718443  0.994759 -1.202812  0.096397
2018-10-27  0.792962  1.972461 -0.009865  0.584242
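
The two sorting calls above are easier to compare on a tiny frame with known values. This is only a sketch with made-up data (row labels `r0` to `r2`); it assumes nothing beyond `sort_index` and `sort_values` as used above.

```python
import pandas as pd

small = pd.DataFrame(
    {"B": [3, 1, 2], "A": [0.5, 0.1, 0.9]},
    index=["r2", "r0", "r1"],
)

by_col_name = small.sort_index(axis=1)                   # columns reordered to A, B
by_row_label = small.sort_index(axis=0)                  # rows reordered to r0, r1, r2
by_b_desc = small.sort_values(by="B", ascending=False)   # largest B first
```

Both methods return a new DataFrame; `small` itself keeps its original order.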

4. Selection

# Getting a single column
>>>df['A']
2018-10-24   -0.448578
2018-10-25    1.472781
2018-10-26   -1.182278
2018-10-27    0.792962
2018-10-28    1.869130
2018-10-29    0.718443
Freq: D, Name: A, dtype: float64
>>>df.A
2018-10-24   -0.448578
2018-10-25    1.472781
2018-10-26   -1.182278
2018-10-27    0.792962
2018-10-28    1.869130
2018-10-29    0.718443
Freq: D, Name: A, dtype: float64

# Selecting via [], which slices the rows
>>>df
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397
>>>df[0:3]
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
>>>df['2018-10-27':]
                   A         B         C         D
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# Selection by Label
>>>df.loc[:,['A','B']]
                   A         B
2018-10-24 -0.448578 -0.430941
2018-10-25  1.472781 -1.880300
2018-10-26 -1.182278  0.110466
2018-10-27  0.792962  1.972461
2018-10-28  1.869130 -1.191335
2018-10-29  0.718443  0.994759

>>>df.loc['2018-10-24':'2018-10-25',['A','B']]
                   A         B
2018-10-24 -0.448578 -0.430941
2018-10-25  1.472781 -1.880300

>>>df.loc['2018-10-24',['A','B']]
A   -0.448578
B   -0.430941
Name: 2018-10-24 00:00:00, dtype: float64

>>>df.loc[dates[0],'A']    # Getting a scalar value:
-0.4485779663058197
>>>df.at[dates[0],'A']
-0.4485779663058197

# Selection by Position
>>>df.iloc[3]
A    0.792962
B    1.972461
C   -0.009865
D    0.584242
Name: 2018-10-27 00:00:00, dtype: float64
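
`iloc` also accepts slices and lists of positions. The sketch below uses `np.arange` instead of random numbers so the selected values are predictable; the variable names are my own.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181024", periods=6)
grid = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

block = grid.iloc[3:5, 0:2]            # rows 3-4 and columns 0-1 (end position excluded)
picked = grid.iloc[[1, 2, 4], [0, 2]]  # explicit lists of row and column positions
whole_rows = grid.iloc[1:3, :]         # rows 1-2, all columns
scalar = grid.iat[1, 1]                # fast scalar access by position
```

Note the difference from label slicing: `df.loc['2018-10-24':'2018-10-25']` above includes both end labels, while positional slices with `iloc` exclude the end, like ordinary Python slices.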

Boolean Indexing

>>>df[df.A>0]
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# isin() method for filtering:
>>>df2 = df.copy()
>>>df2['E'] = ['one','two','three','four','five','six']
>>>df2
                   A         B         C         D      E
2018-10-24 -0.448578 -0.430941  0.116371  0.029906    one
2018-10-25  1.472781 -1.880300  1.626077 -0.397423    two
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  three
2018-10-27  0.792962  1.972461 -0.009865  0.584242   four
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850   five
2018-10-29  0.718443  0.994759 -1.202812  0.096397    six
>>>df2[df2['E'].isin(['one','six'])]
                   A         B         C         D    E
2018-10-24 -0.448578 -0.430941  0.116371  0.029906  one
2018-10-29  0.718443  0.994759 -1.202812  0.096397  six
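
Boolean conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch with made-up values:

```python
import pandas as pd

t = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0], "B": [5.0, -6.0, 7.0, 8.0]})

both_pos = t[(t["A"] > 0) & (t["B"] > 0)]     # rows where A and B are both positive
either_neg = t[(t["A"] < 0) | (t["B"] < 0)]   # rows where at least one is negative
not_pos_a = t[~(t["A"] > 0)]                  # rows where A is not positive
```

Python's `and` / `or` keywords do not work here, since they try to reduce a whole Series to a single truth value.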

Setting

>>>s1 = pd.Series([1,2,3,4,5,6],index = pd.date_range('20181024',periods=6))
>>>s1
2018-10-24    1
2018-10-25    2
2018-10-26    3
2018-10-27    4
2018-10-28    5
2018-10-29    6
Freq: D, dtype: int64

>>>df['G'] = s1
>>>df
                   A         B         C         D  G
2018-10-24 -0.448578 -0.430941  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6
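
Assigning a Series as a new column, as `df['G'] = s1` does, aligns the values by index label rather than by position. The sketch below uses a small made-up frame to show this; labels missing from the Series become NaN.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])

# The Series is deliberately in a different order and has no entry for "z"
col = pd.Series([20, 10], index=["y", "x"])

frame["G"] = col   # values are matched by label, not by position
```

After the assignment, row `x` holds 10, row `y` holds 20 and row `z` holds NaN, so the column becomes float64.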

# Setting values by label:
>>>df.at[dates[0],'A'] = 0
>>>df
                   A         B         C         D  G
2018-10-24  0.000000 -0.430941  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6

# Setting values by position:
>>>df.iat[0,1] = 0
>>>df
                   A         B         C         D  G
2018-10-24  0.000000  0.000000  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6

# Setting by assigning with a NumPy array:
>>>df.loc[:,'D'] = np.array([5]*len(df))
>>>df
                   A         B         C  D  G
2018-10-24  0.000000  0.000000  0.116371  5  1
2018-10-25  1.472781 -1.880300  1.626077  5  2
2018-10-26 -1.182278  0.110466 -0.365284  5  3
2018-10-27  0.792962  1.972461 -0.009865  5  4
2018-10-28  1.869130 -1.191335 -0.903691  5  5
2018-10-29  0.718443  0.994759 -1.202812  5  6

# Note: [5]*n repeats the list n times, so np.array([5]*n) is an array of n fives:
>>>np.array([5])
array([5])
>>>np.array([5]*8)
array([5, 5, 5, 5, 5, 5, 5, 5])

Missing Data

The value np.nan represents missing data.

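# .loc selects by label, so use a label from the DatetimeIndex (integer slices such as 0:1 are rejected by recent pandas):
>>>df.loc[dates[0],'G'] = np.nan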
>>>df
                   A         B         C  D    G
2018-10-24  0.000000  0.000000  0.116371  5  NaN
2018-10-25  1.472781 -1.880300  1.626077  5  2.0
2018-10-26 -1.182278  0.110466 -0.365284  5  3.0
2018-10-27  0.792962  1.972461 -0.009865  5  4.0
2018-10-28  1.869130 -1.191335 -0.903691  5  5.0
2018-10-29  0.718443  0.994759 -1.202812  5  6.0

# Drop any rows that have missing data:
>>>df.dropna(how='any')
                   A         B         C  D    G
2018-10-25  1.472781 -1.880300  1.626077  5  2.0
2018-10-26 -1.182278  0.110466 -0.365284  5  3.0
2018-10-27  0.792962  1.972461 -0.009865  5  4.0
2018-10-28  1.869130 -1.191335 -0.903691  5  5.0
2018-10-29  0.718443  0.994759 -1.202812  5  6.0
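
Besides dropping rows, missing values can be detected with `isna()` and replaced with `fillna()`. A minimal sketch on a small made-up frame (the fill value 5 is arbitrary):

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"A": [1.0, np.nan, 3.0], "G": [np.nan, 2.0, 3.0]})

mask = gaps.isna()                   # True where a value is missing
filled = gaps.fillna(value=5)        # replace every NaN with 5
n_missing = int(mask.sum().sum())    # total number of missing cells
```

Like `dropna`, `fillna` returns a new DataFrame and leaves `gaps` unchanged.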
