Learning pandas (1)

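[Draft notes kept only as a record of my own learning]

The most important parts of pandas are its two data structures: Series and DataFrame.

First, get familiar with the basic commands.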

1. Import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2. Create Series and DataFrame

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. 
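
To make the two definitions above concrete, here is a minimal sketch. The fruit names, prices, and variable names are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# A Series pairs each value with an index label (labels here are hypothetical).
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])
banana_price = prices["banana"]  # look a value up by its label

# A DataFrame holds several columns that share one index;
# each column keeps its own dtype.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "in_stock": [True, False, True]},
    index=["apple", "banana", "cherry"],
)

print(banana_price)
print(fruit.dtypes)
```

Each column of a DataFrame is itself a Series, so fruit["price"] carries the same labels as prices.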

# create Series by passing a list of values
>>>s = pd.Series([1,3,5,np.nan,6,8])   
>>>s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

# create DataFrame by passing a NumPy array with a datetime index and labeled columns
>>>dates = pd.date_range('20181024',periods=6)      
>>>dates
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
               '2018-10-28', '2018-10-29'],
              dtype='datetime64[ns]', freq='D')

>>>df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list('ABCD'))
>>>df
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# create DataFrame by passing a dict of objects that can be converted to Series-like structures
>>>df2 = pd.DataFrame({'A':1.,                      
                       'B':pd.Timestamp('20181024'),
                       'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                       'D':np.array([3]*4,dtype='int32'),
                       'E':pd.Categorical(["test","train","test","train"]),
                       'F':'foo'})
>>>df2 
     A          B    C  D      E    F
0  1.0 2018-10-24  1.0  3   test  foo
1  1.0 2018-10-24  1.0  3  train  foo
2  1.0 2018-10-24  1.0  3   test  foo
3  1.0 2018-10-24  1.0  3  train  foo
>>>df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

3. Viewing Data

# View the first and last rows; head() and tail() return 5 rows unless a count is given
>>>df.head()
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
>>>df.tail()
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397
>>>df.tail(2)
                   A         B         C         D
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# Display the index, columns, and the underlying NumPy data:
>>>df.index
DatetimeIndex(['2018-10-24', '2018-10-25', '2018-10-26', '2018-10-27',
               '2018-10-28', '2018-10-29'],
              dtype='datetime64[ns]', freq='D')
>>>df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>>df.values
array([[-0.44857797, -0.43094069,  0.1163714 ,  0.02990564],
       [ 1.47278109, -1.88029967,  1.62607715, -0.39742327],
       [-1.18227842,  0.11046589, -0.36528377, -0.75864786],
       [ 0.79296163,  1.97246094, -0.0098655 ,  0.58424183],
       [ 1.86912973, -1.19133507, -0.90369126, -1.42584953],
       [ 0.71844329,  0.99475901, -1.20281179,  0.09639716]])

# Show a quick statistical summary of each numeric column
>>>df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.537077 -0.070815 -0.123201 -0.311896
std    1.155506  1.414410  0.996348  0.711954
min   -1.182278 -1.880300 -1.202812 -1.425850
25%   -0.156823 -1.001236 -0.769089 -0.668342
50%    0.755702 -0.160237 -0.187575 -0.183759
75%    1.302826  0.773686  0.084812  0.079774
max    1.869130  1.972461  1.626077  0.584242

# Transposing the data
>>>df.T
   2018-10-24  2018-10-25     ...      2018-10-28  2018-10-29
A   -0.448578    1.472781     ...        1.869130    0.718443
B   -0.430941   -1.880300     ...       -1.191335    0.994759
C    0.116371    1.626077     ...       -0.903691   -1.202812
D    0.029906   -0.397423     ...       -1.425850    0.096397

# Sort by an axis (axis=0 is the row index; here in descending order)
>>>df.sort_index(axis=0,ascending=False)     
                   A         B         C         D
2018-10-29  0.718443  0.994759 -1.202812  0.096397
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-24 -0.448578 -0.430941  0.116371  0.029906

# Sort by the values in column B
>>>df.sort_values(by='B')                    
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-29  0.718443  0.994759 -1.202812  0.096397
2018-10-27  0.792962  1.972461 -0.009865  0.584242
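
The two sorts above both reorder rows. As a small sketch of the remaining variations, the example below sorts the columns instead (axis=1) and sorts values in descending order. The frame named small and its numbers are made up so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical frame whose columns are deliberately out of order.
small = pd.DataFrame(
    {"B": [2, 1, 3], "A": [9, 8, 7]},
    index=pd.date_range("20181024", periods=3),
)

by_columns = small.sort_index(axis=1)                    # columns reordered to A, B
by_b_desc = small.sort_values(by="B", ascending=False)   # rows ordered by B, largest first

print(by_columns)
print(by_b_desc)
```

Both methods return a new DataFrame and leave small unchanged.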

4. Selection

# Getting a single column
>>>df['A']
2018-10-24   -0.448578
2018-10-25    1.472781
2018-10-26   -1.182278
2018-10-27    0.792962
2018-10-28    1.869130
2018-10-29    0.718443
Freq: D, Name: A, dtype: float64
>>>df.A
2018-10-24   -0.448578
2018-10-25    1.472781
2018-10-26   -1.182278
2018-10-27    0.792962
2018-10-28    1.869130
2018-10-29    0.718443
Freq: D, Name: A, dtype: float64

# Selecting via [], which slices the rows
>>>df
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397
>>>df[0:3]
                   A         B         C         D
2018-10-24 -0.448578 -0.430941  0.116371  0.029906
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648
>>>df['2018-10-27':]
                   A         B         C         D
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# Selection by Label
>>>df.loc[:,['A','B']]
                   A         B
2018-10-24 -0.448578 -0.430941
2018-10-25  1.472781 -1.880300
2018-10-26 -1.182278  0.110466
2018-10-27  0.792962  1.972461
2018-10-28  1.869130 -1.191335
2018-10-29  0.718443  0.994759

>>>df.loc['2018-10-24':'2018-10-25',['A','B']]
                   A         B
2018-10-24 -0.448578 -0.430941
2018-10-25  1.472781 -1.880300

>>>df.loc['2018-10-24',['A','B']]
A   -0.448578
B   -0.430941
Name: 2018-10-24 00:00:00, dtype: float64

# Getting a scalar value (.at gives fast access to the same scalar):
>>>df.loc[dates[0],'A']
-0.4485779663058197
>>>df.at[dates[0],'A']
-0.4485779663058197

# Selection by Position
>>>df.iloc[3]
A    0.792962
B    1.972461
C   -0.009865
D    0.584242
Name: 2018-10-27 00:00:00, dtype: float64
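
One detail worth checking when moving between .loc and .iloc: label slices include both endpoints, while position slices exclude the end. The sketch below uses a made-up frame named demo filled with consecutive integers so the selections are easy to verify.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20181024", periods=6)
# Hypothetical data: the integers 0..23 laid out in 6 rows and 4 columns.
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = demo.loc["2018-10-24":"2018-10-26", "A":"B"]  # both endpoints included
by_position = demo.iloc[0:3, 0:2]                         # end positions excluded
scalar = demo.iat[1, 1]                                   # scalar access by position

print(by_label)
print(by_position)
print(scalar)
```

Here both selections return the same three rows and two columns, even though the label slice names its last row and the position slice stops one past it.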


Boolean Indexing

>>>df[df.A>0]
                   A         B         C         D
2018-10-25  1.472781 -1.880300  1.626077 -0.397423
2018-10-27  0.792962  1.972461 -0.009865  0.584242
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850
2018-10-29  0.718443  0.994759 -1.202812  0.096397

# isin() method for filtering:
>>>df2 = df.copy()
>>>df2['E'] = ['one','two','three','four','five','six']
>>>df2
                   A         B         C         D      E
2018-10-24 -0.448578 -0.430941  0.116371  0.029906    one
2018-10-25  1.472781 -1.880300  1.626077 -0.397423    two
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  three
2018-10-27  0.792962  1.972461 -0.009865  0.584242   four
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850   five
2018-10-29  0.718443  0.994759 -1.202812  0.096397    six
>>>df2[df2['E'].isin(['one','six'])]
                   A         B         C         D    E
2018-10-24 -0.448578 -0.430941  0.116371  0.029906  one
2018-10-29  0.718443  0.994759 -1.202812  0.096397  six
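
Boolean conditions such as the comparison and the isin() filter above can also be combined. A minimal sketch, assuming a small made-up frame named demo: use & for "and", | for "or", ~ for "not", and wrap each comparison in parentheses.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame(
    {"A": [-1.0, 0.5, 2.0, 3.0], "E": ["one", "two", "three", "four"]}
)

# Rows where A is positive AND E is one of the listed values.
mask = (demo["A"] > 0) & demo["E"].isin(["two", "four"])
selected = demo[mask]
not_selected = demo[~mask]  # ~ inverts the boolean mask

print(selected)
print(not_selected)
```

Python's own and / or keywords do not work element-wise on a Series, which is why the operators &, | and ~ are used here.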

Setting

>>>s1 = pd.Series([1,2,3,4,5,6],index = pd.date_range('20181024',periods=6))
>>>s1
2018-10-24    1
2018-10-25    2
2018-10-26    3
2018-10-27    4
2018-10-28    5
2018-10-29    6
Freq: D, dtype: int64

>>>df['G'] = s1
>>>df
                   A         B         C         D  G
2018-10-24 -0.448578 -0.430941  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6
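
The assignment df['G'] = s1 works cleanly because s1 has exactly the same dates as df. Assigning a Series aligns it on the index, not on position. The sketch below uses a made-up frame demo and a Series partial whose index only partly overlaps, to show what alignment does.

```python
import pandas as pd

dates = pd.date_range("20181024", periods=4)
demo = pd.DataFrame({"A": [10, 20, 30, 40]}, index=dates)

# partial covers two of the four dates, plus one date that demo does not have.
partial = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(["2018-10-25", "2018-10-27", "2018-11-01"]),
)

demo["G"] = partial  # values are matched to rows by date
print(demo)
```

Dates missing from partial receive NaN, the extra date 2018-11-01 is ignored, and the column becomes float64 because an integer column cannot hold NaN.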

# Setting values by label:
>>>df.at[dates[0],'A'] = 0
>>>df
                   A         B         C         D  G
2018-10-24  0.000000 -0.430941  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6

# Setting values by position:
>>>df.iat[0,1] = 0
>>>df
                   A         B         C         D  G
2018-10-24  0.000000  0.000000  0.116371  0.029906  1
2018-10-25  1.472781 -1.880300  1.626077 -0.397423  2
2018-10-26 -1.182278  0.110466 -0.365284 -0.758648  3
2018-10-27  0.792962  1.972461 -0.009865  0.584242  4
2018-10-28  1.869130 -1.191335 -0.903691 -1.425850  5
2018-10-29  0.718443  0.994759 -1.202812  0.096397  6

# Setting by assigning with a NumPy array:
>>>df.loc[:,'D'] = np.array([5]*len(df))
>>>df
                   A         B         C  D  G
2018-10-24  0.000000  0.000000  0.116371  5  1
2018-10-25  1.472781 -1.880300  1.626077  5  2
2018-10-26 -1.182278  0.110466 -0.365284  5  3
2018-10-27  0.792962  1.972461 -0.009865  5  4
2018-10-28  1.869130 -1.191335 -0.903691  5  5
2018-10-29  0.718443  0.994759 -1.202812  5  6

# Note: [5]*n repeats the list n times before it is converted to an array:
>>>np.array([5])
array([5])
>>>np.array([5]*8)
array([5, 5, 5, 5, 5, 5, 5, 5])
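
The order of operations matters in the expression above: multiplying the list repeats it, while multiplying the array scales its values. A short sketch of the difference, with np.full shown as one alternative way to build a constant array (the variable names are made up):

```python
import numpy as np

repeated = np.array([5] * 8)   # list repetition first, then conversion: eight 5s
scaled = np.array([5]) * 8     # conversion first, then element-wise multiply: one 40
filled = np.full(8, 5)         # constant array without building a Python list

print(repeated)
print(scaled)
print(filled)
```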

5. Missing Data

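pandas primarily uses the value np.nan to represent missing data; by default it is excluded from computations.

# Set G to NaN in the first row. .loc selects by label, so pass the date rather than the integer slice 0:1
>>>df.loc[dates[0],'G'] = np.nan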
>>>df
                   A         B         C  D    G
2018-10-24  0.000000  0.000000  0.116371  5  NaN
2018-10-25  1.472781 -1.880300  1.626077  5  2.0
2018-10-26 -1.182278  0.110466 -0.365284  5  3.0
2018-10-27  0.792962  1.972461 -0.009865  5  4.0
2018-10-28  1.869130 -1.191335 -0.903691  5  5.0
2018-10-29  0.718443  0.994759 -1.202812  5  6.0

# Drop any rows that have missing data:
>>>df.dropna(how='any')
                   A         B         C  D    G
2018-10-25  1.472781 -1.880300  1.626077  5  2.0
2018-10-26 -1.182278  0.110466 -0.365284  5  3.0
2018-10-27  0.792962  1.972461 -0.009865  5  4.0
2018-10-28  1.869130 -1.191335 -0.903691  5  5.0
2018-10-29  0.718443  0.994759 -1.202812  5  6.0
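
Dropping rows is one option; filling or simply locating the missing values are the usual alternatives. A minimal sketch on a small made-up frame named demo (the fill value 5 is arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing value in column G.
demo = pd.DataFrame({"A": [0.0, 1.5, -1.2], "G": [np.nan, 2.0, 3.0]})

filled = demo.fillna(value=5)      # replace every NaN with 5
mask = demo.isna()                 # True where a value is missing
dropped = demo.dropna(how="any")   # for comparison: rows with any NaN removed

print(filled)
print(mask)
print(dropped)
```

All three calls return new objects; demo itself still contains the NaN afterwards.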

 
