Using pandas:
1. Data structures
2. Creating objects
3. Viewing data
4. Selecting data
5. Assigning data
I. Data structures:
pandas has two main data structures:
Dimensions | Name | Description |
1 | Series | 1-D labeled, homogeneously typed array |
2 | DataFrame | General 2-D labeled, size-mutable tabular structure |
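pandas data structures act as flexible containers for lower-dimensional data: a DataFrame is a container for Series, and a Series is a container for scalar values. The main dtypes stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category, and object.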
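The container relationship can be checked directly. The following is a minimal sketch with made-up values; the names s, df and column are illustrative and do not come from the sessions below.

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A DataFrame: a table whose columns are Series sharing one index.
# The plain list is matched to the index of s by position.
df = pd.DataFrame({"x": s, "y": [10, 20, 30]})

# Selecting a single column gives back a Series.
column = df["x"]
print(type(column).__name__)
print(df.dtypes)
```

Each column keeps its own dtype, which is why a DataFrame can mix float and int columns while a single Series cannot.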
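II. Creating objects:
The sessions below assume numpy is imported as np and pandas as pd (import numpy as np; import pandas as pd).
1. Creating a Series: passing a list of values builds a Series with a default integer index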
In [26]: s = pd.Series([1,3,5,np.nan,6,8])
In [27]: s
Out[27]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
2. Creating a DataFrame: a 6-row, 4-column frame with a datetime index and labeled columns
In [29]: dates = pd.date_range('20130101', periods=6)
In [30]: dates
Out[30]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [31]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [32]: df
Out[32]:
A B C D
2013-01-01 -0.927711 1.086539 1.480014 -0.930739
2013-01-02 -0.829482 2.504326 1.782794 1.030505
2013-01-03 -1.766893 0.932729 0.553364 -0.173031
2013-01-04 -0.174698 -0.166893 -0.763745 0.353329
2013-01-05 0.586179 0.391955 -1.113228 1.034821
2013-01-06 1.124303 2.044700 -0.910196 0.254656
3. Creating a DataFrame from a dict of Series-like objects:
In [36]: df2 = pd.DataFrame({ 'A' : 1.,
    ...:                      'B' : pd.Timestamp('20130102'),
    ...:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
    ...:                      'D' : np.array([3] * 4,dtype='int32'),
    ...:                      'E' : pd.Categorical(["test","train","test","train"]),
    ...:                      'F' : 'foo' })
In [37]: df2
Out[37]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
4. Checking the dtypes of a data structure:
Use dtypes to see the dtype of each column of a DataFrame, and dtype to see the dtype of a Series.
In [38]: df2.dtypes
Out[38]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
In [39]: s.dtype
Out[39]: dtype('float64')
If a column holds values of more than one type, its dtype is upcast to object.
In [40]: pd.Series([1, 2, 3, 6., 'foo'])
Out[40]:
0 1
1 2
2 3
3 6
4 foo
dtype: object
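An object column often signals that stray non-numeric values slipped in. As a hedged sketch (the variable names are illustrative, and converting is only one possible way to handle such a column), pd.to_numeric with errors="coerce" turns the values it cannot parse into NaN and returns a numeric dtype:

```python
import pandas as pd

# Mixing numbers and a string forces the object dtype.
mixed = pd.Series([1, 2, 3, 6.0, "foo"])
print(mixed.dtype)

# errors="coerce" replaces unparseable entries with NaN
# instead of raising, so the result is a float column.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)
print(numeric.isna().sum())
```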
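Use get_dtype_counts() to count the columns of each dtype in a DataFrame. (get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0; on newer versions use df3.dtypes.value_counts() instead.)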
In [41]: a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]]
In [42]: df3 = pd.DataFrame(a, columns=['str', 'int', 'float'])
In [43]: df3
Out[43]:
str int float
0 a 1 1.0
1 b 2 2.0
2 c 3 3.0
In [44]: df3.get_dtype_counts()
Out[44]:
float64 1
int64 1
object 1
dtype: int64
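On pandas versions where get_dtype_counts() is no longer available, the same count can be obtained from dtypes. This sketch rebuilds df3 from the session above; the name dtype_counts is illustrative.

```python
import pandas as pd

rows = [["a", 1, 1.0], ["b", 2, 2.0], ["c", 3, 3.0]]
df3 = pd.DataFrame(rows, columns=["str", "int", "float"])

# dtypes is a Series of dtype objects, one per column;
# value_counts() tallies how many columns share each dtype.
dtype_counts = df3.dtypes.value_counts()
print(dtype_counts)
```

The exact dtype shown for the string column depends on the pandas version, so the sketch does not state an expected output.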
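III. Viewing data
Use head() to view the first rows and tail() to view the last rows. Both take the number of rows to show and default to 5.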
In [53]: df4 = pd.DataFrame(np.random.randn(6,4),columns=list('ABCD'))
In [54]: df4
Out[54]:
A B C D
0 -1.075377 2.554021 -0.374659 -0.192440
1 0.179418 0.628350 -0.700365 0.064725
2 0.955096 0.329108 0.509071 0.024015
3 -0.023552 0.700166 -0.030874 1.588215
4 -1.082844 0.477065 1.357116 -1.083416
5 -1.619765 -0.510836 -0.904413 -0.386017
In [55]: df4.head(1)
Out[55]:
A B C D
0 -1.075377 2.554021 -0.374659 -0.19244
In [56]: df4.tail(1)
Out[56]:
A B C D
5 -1.619765 -0.510836 -0.904413 -0.386017
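A small sketch of the row-count argument, using a made-up ten-row frame rather than df4 so the results are deterministic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"n": np.arange(10)})

first_five = df.head()          # no argument: the first 5 rows
last_three = df.tail(3)         # the last 3 rows
all_but_last_two = df.head(-2)  # negative n: everything except the last 2 rows

print(len(first_five), len(last_three), len(all_but_last_two))
```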
Use index to see the row index, columns to see the column labels, and values to get the underlying data as a NumPy array.
In [57]: df4.index
Out[57]: RangeIndex(start=0, stop=6, step=1)
In [58]: df4.columns
Out[58]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [60]: df4.values
Out[60]:
array([[-1.07537664, 2.55402142, -0.37465895, -0.19243966],
[ 0.17941807, 0.62835002, -0.70036508, 0.06472549],
[ 0.95509596, 0.32910841, 0.50907057, 0.02401518],
[-0.02355206, 0.70016557, -0.03087417, 1.58821536],
[-1.08284367, 0.47706484, 1.3571161 , -1.08341606],
[-1.61976495, -0.51083645, -0.90441259, -0.38601714]])
Use describe() to get summary statistics of the data.
In [61]: df4.describe()
Out[61]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.444504 0.696312 -0.024021 0.002514
std 0.970781 1.009538 0.842289 0.881702
min -1.619765 -0.510836 -0.904413 -1.083416
25% -1.080977 0.366098 -0.618939 -0.337623
50% -0.549464 0.552707 -0.202767 -0.084212
75% 0.128676 0.682212 0.374084 0.054548
max 0.955096 2.554021 1.357116 1.588215
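The rows of describe() correspond to ordinary column-wise methods. The sketch below uses a small made-up frame to show that correspondence; the names df and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()

# Each row of the summary matches a direct call on the frame.
print(summary.loc["mean", "A"], df["A"].mean())    # mean
print(summary.loc["50%", "A"], df["A"].median())   # 50% is the median
print(summary.loc["std", "A"], df["A"].std())      # sample standard deviation
```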
Use T to transpose the data (rows become columns and columns become rows).
In [62]: df4.T
Out[62]:
0 1 2 3 4 5
A -1.075377 0.179418 0.955096 -0.023552 -1.082844 -1.619765
B 2.554021 0.628350 0.329108 0.700166 0.477065 -0.510836
C -0.374659 -0.700365 0.509071 -0.030874 1.357116 -0.904413
D -0.192440 0.064725 0.024015 1.588215 -1.083416 -0.386017
Reversing the column order by sorting the column labels in descending order
In [63]: df4.sort_index(axis=1, ascending=False)
Out[63]:
D C B A
0 -0.192440 -0.374659 2.554021 -1.075377
1 0.064725 -0.700365 0.628350 0.179418
2 0.024015 0.509071 0.329108 0.955096
3 1.588215 -0.030874 0.700166 -0.023552
4 -1.083416 1.357116 0.477065 -1.082844
5 -0.386017 -0.904413 -0.510836 -1.619765
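Sorting can be done by index labels (sort_index), by values (sort_values), or by a combination of the two.
Sorting by index:
df.sort_index() sorts the rows by index label in ascending order
df.sort_index(ascending=False) sorts the rows by index label in descending order
df.sort_index(axis=1) sorts the columns by column label instead of sorting the rows
The following example sorts the rows by the values in column B.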
In [64]: df4.sort_values(by='B')
Out[64]:
A B C D
5 -1.619765 -0.510836 -0.904413 -0.386017
2 0.955096 0.329108 0.509071 0.024015
4 -1.082844 0.477065 1.357116 -1.083416
1 0.179418 0.628350 -0.700365 0.064725
3 -0.023552 0.700166 -0.030874 1.588215
0 -1.075377 2.554021 -0.374659 -0.192440
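The difference between the axis settings is easiest to see on a frame whose rows and columns both start out unordered. This is a minimal sketch with made-up values, not the df4 from the session:

```python
import pandas as pd

# Rows are labeled 2, 0, 1 and columns are in the order B, A.
df = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=[2, 0, 1])

by_row_label = df.sort_index()                            # reorders the rows
by_col_label = df.sort_index(axis=1)                      # reorders the columns
by_value_desc = df.sort_values(by="B", ascending=False)   # rows by values in B

print(list(by_row_label.index))
print(list(by_col_label.columns))
print(list(by_value_desc["B"]))
```

None of these calls modifies df; each returns a new, sorted frame.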
IV. Selecting data
Selecting a single column
In [66]: df4['A']
Out[66]:
0 -1.075377
1 0.179418
2 0.955096
3 -0.023552
4 -1.082844
5 -1.619765
Name: A, dtype: float64
Selecting rows by slicing on row position
In [67]: df4[2:5]
Out[67]:
A B C D
2 0.955096 0.329108 0.509071 0.024015
3 -0.023552 0.700166 -0.030874 1.588215
4 -1.082844 0.477065 1.357116 -1.083416
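Selecting by label with loc, which can select rows and columns at the same time. Both endpoints of a label slice are included. The row labels of df4 are integers, so the row slice uses integers.
In [71]: df4.loc[2:4,'A':'B']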
Out[71]:
A B
2 0.955096 0.329108
3 -0.023552 0.700166
4 -1.082844 0.477065
Selecting by integer position with iloc (the end of a position slice is excluded)
In [73]: df4.iloc[2:5,0:2]
Out[73]:
A B
2 0.955096 0.329108
3 -0.023552 0.700166
4 -1.082844 0.477065
In [74]: df4.iloc[[2,3,4],[0,1]]
Out[74]:
A B
2 0.955096 0.329108
3 -0.023552 0.700166
4 -1.082844 0.477065
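With a default integer index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a made-up frame with string row labels to make the endpoint rules visible:

```python
import pandas as pd

df = pd.DataFrame({"A": range(6), "B": range(10, 16)},
                  index=list("abcdef"))

# Label slice: the end label "e" is included.
by_label = df.loc["c":"e", "A":"B"]

# Position slice: the end position 5 is excluded, as in plain Python.
by_position = df.iloc[2:5, 0:2]

print(by_label.equals(by_position))
```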
Boolean indexing:
Show the rows where the value in one column is greater than 0
In [77]: df4[df4.A>0]
Out[77]:
A B C D
1 0.179418 0.628350 -0.700365 0.064725
2 0.955096 0.329108 0.509071 0.024015
Show the values of the whole DataFrame that are greater than 0 (all other cells become NaN)
In [78]: df4[df4>0]
Out[78]:
A B C D
0 NaN 2.554021 NaN NaN
1 0.179418 0.628350 NaN 0.064725
2 0.955096 0.329108 0.509071 0.024015
3 NaN 0.700166 NaN 1.588215
4 NaN 0.477065 1.357116 NaN
5 NaN NaN NaN NaN
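Row filters can be combined, and the whole-frame form behaves like where(). The following sketch uses made-up values; note that each condition needs its own parentheses because & binds more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0], "B": [3.0, -4.0, 5.0]})

# A boolean Series selects whole rows; combine conditions with & or |.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]

# A boolean DataFrame keeps the shape and blanks out the False cells.
masked = df[df > 0]

print(both_positive)
print(masked)
```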
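Filtering with isin (df5 is not defined in the sessions above; judging by the output, it is df4 with an added string column E)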
In [82]: df5[df5['E'].isin(['one'])]
Out[82]:
A B C D E
0 -1.075377 2.554021 -0.374659 -0.192440 one
1 0.179418 0.628350 -0.700365 0.064725 one
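Since df5 is not constructed in the session, here is a self-contained sketch of the same filter. The frame contents are made up for illustration.

```python
import pandas as pd

df5 = pd.DataFrame({"A": [1, 2, 3, 4],
                    "E": ["one", "one", "two", "three"]})

# isin returns a boolean Series: True where the value is in the list.
only_one = df5[df5["E"].isin(["one"])]
one_or_three = df5[df5["E"].isin(["one", "three"])]

# ~ inverts the mask to keep the rows that are NOT in the list.
not_one = df5[~df5["E"].isin(["one"])]

print(list(only_one.index), list(one_or_three.index), list(not_one.index))
```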
V. Assigning data
Adding a new column from a Series (the Series is aligned to the DataFrame by index label)
In [88]: s1 = pd.Series([1,2,3,4,5,6], index=np.arange(6))
In [89]: s1
Out[89]:
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
In [90]: df4
Out[90]:
A B C D
0 -1.075377 2.554021 -0.374659 -0.192440
1 0.179418 0.628350 -0.700365 0.064725
2 0.955096 0.329108 0.509071 0.024015
3 -0.023552 0.700166 -0.030874 1.588215
4 -1.082844 0.477065 1.357116 -1.083416
5 -1.619765 -0.510836 -0.904413 -0.386017
In [91]: df4['E'] = s1
In [92]: df4
Out[92]:
A B C D E
0 -1.075377 2.554021 -0.374659 -0.192440 1
1 0.179418 0.628350 -0.700365 0.064725 2
2 0.955096 0.329108 0.509071 0.024015 3
3 -0.023552 0.700166 -0.030874 1.588215 4
4 -1.082844 0.477065 1.357116 -1.083416 5
5 -1.619765 -0.510836 -0.904413 -0.386017 6
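In the session above the Series index matches the DataFrame index exactly, so the alignment is invisible. This sketch, with made-up values, shows what happens when the labels only partly overlap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 1, 2])
s = pd.Series([1, 2, 3], index=[1, 2, 3])

# Values are matched on index labels, not on position:
# label 0 has no match and becomes NaN, label 3 is dropped.
df["E"] = s
print(df)
```

To assign by position instead, one option is to pass the raw values, for example s.to_numpy(), which requires the lengths to match.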
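Setting a single value by label with at (the values of df4 differ from those above, apparently because the frame was regenerated with new random numbers)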
In [108]: df4
Out[108]:
A B C D E
0 1.942578 2.020022 1.638648 0.018943 1
1 -0.858475 0.464734 -0.479293 0.146777 2
2 -1.841956 1.089131 1.113484 -1.513879 3
3 0.855033 -0.573659 1.262123 0.480048 4
4 -0.714338 0.057483 -0.048462 -1.075702 5
5 -1.061789 -0.530627 -2.421183 -1.098107 6
In [109]: df4.at[0,'A'] = 0
In [110]: df4
Out[110]:
A B C D E
0 0.000000 2.020022 1.638648 0.018943 1
1 -0.858475 0.464734 -0.479293 0.146777 2
2 -1.841956 1.089131 1.113484 -1.513879 3
3 0.855033 -0.573659 1.262123 0.480048 4
4 -0.714338 0.057483 -0.048462 -1.075702 5
5 -1.061789 -0.530627 -2.421183 -1.098107 6
Setting a single value by integer position with iat
In [121]: df4.iat[0,1] = 0
In [122]: df4
Out[122]:
A B C D E
0 0.000000 0.000000 1.638648 0.018943 1
1 -0.858475 0.464734 -0.479293 0.146777 2
2 -1.841956 1.089131 1.113484 -1.513879 3
3 0.855033 -0.573659 1.262123 0.480048 4
4 -0.714338 0.057483 -0.048462 -1.075702 5
5 -1.061789 -0.530627 -2.421183 -1.098107 6
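at and iat are the scalar counterparts of loc and iloc. A minimal sketch on a made-up frame with string row labels, so that the label and the position are clearly different things:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["x", "y"])

df.at["x", "A"] = 0.0   # row label "x", column label "A"
df.iat[1, 1] = 0.0      # second row, second column (row "y", column "B")

print(df)
```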
Assigning a column from a NumPy array (len(df4) returns the number of rows, so the array holds one value per row)
In [128]: df4.loc[:,'F'] = np.array([6] * len(df4))
In [129]: df4
Out[129]:
A B C D E F
0 0.000000 0.000000 1.638648 0.018943 1 6
1 -0.858475 0.464734 -0.479293 0.146777 2 6
2 -1.841956 1.089131 1.113484 -1.513879 3 6
3 0.855033 -0.573659 1.262123 0.480048 4 6
4 -0.714338 0.057483 -0.048462 -1.075702 5 6
5 -1.061789 -0.530627 -2.421183 -1.098107 6 6
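The array length matters because a NumPy array carries no index to align on. The sketch below uses a made-up frame; the names length_error and the column G are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# One value per row: the array is assigned by position.
df["F"] = np.array([6] * len(df))

# An array of the wrong length is rejected with a ValueError.
length_error = None
try:
    df["G"] = np.array([1, 2])
except ValueError as exc:
    length_error = exc

print(df)
print(length_error)
```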
Assigning through a boolean mask: every value greater than 0 is replaced by its negative
In [130]: df4[df4>0]
Out[130]:
A B C D E F
0 NaN NaN 1.638648 0.018943 1 6
1 NaN 0.464734 NaN 0.146777 2 6
2 NaN 1.089131 1.113484 NaN 3 6
3 0.855033 NaN 1.262123 0.480048 4 6
4 NaN 0.057483 NaN NaN 5 6
5 NaN NaN NaN NaN 6 6
In [131]: df4[df4>0] = -df4
In [132]: df4
Out[132]:
A B C D E F
0 0.000000 0.000000 -1.638648 -0.018943 -1 -6
1 -0.858475 -0.464734 -0.479293 -0.146777 -2 -6
2 -1.841956 -1.089131 -1.113484 -1.513879 -3 -6
3 -0.855033 -0.573659 -1.262123 -0.480048 -4 -6
4 -0.714338 -0.057483 -0.048462 -1.075702 -5 -6
5 -1.061789 -0.530627 -2.421183 -1.098107 -6 -6
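The masked assignment above changes df4 in place. As a hedged sketch on made-up values, the same result can be produced without mutating the original by using mask(), or, for this particular transformation, by negating the absolute value:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# In-place form, as in the session: positive cells take the negated value.
flipped = df.copy()
flipped[flipped > 0] = -flipped

# mask() replaces the cells where the condition is True and returns a new frame.
same_result = df.mask(df > 0, -df)

# For this specific case, -|x| gives the same values.
also_same = -df.abs()

print(flipped)
```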
References:
[1]Package overview — pandas 0.23.4 documentation http://pandas.pydata.org/pandas-docs/stable/overview.html
A useful related site:
Python数据分析之pandas学习 (Learning pandas for Python data analysis) - Little_Rookie - 博客园 (cnblogs)
http://www.cnblogs.com/nxld/p/6058591.html