pandas study notes 1

Using pandas:

1. Data structures

2. Creating objects

3. Viewing data

4. Selecting data

5. Setting data

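I. Data structures:

The two main pandas data structures:

Dimensions  Name       Description
1           Series     1D labeled array of a single data type
2           DataFrame  Generally 2D, labeled, size-mutable tabular structure

pandas data structures are flexible containers for lower-dimensional data: a DataFrame is a container for Series, and a Series is a container for scalars. The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category and object.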
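The container relationship described above can be checked directly. This is a minimal sketch with made-up data; the names df, col and item are illustrative only.

```python
import numpy as np
import pandas as pd

# A small DataFrame with two columns (made-up data).
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

col = df["x"]        # selecting one column of a DataFrame gives a Series
item = col.iloc[0]   # selecting one element of a Series gives a scalar

print(type(df).__name__)    # DataFrame
print(type(col).__name__)   # Series
print(np.isscalar(item))    # True
```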

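II. Creating objects:

1. Creating a Series: passing a list of values lets pandas build a default integer index. The sessions in these notes assume numpy and pandas were imported beforehand as np and pd.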

In [26]: s = pd.Series([1,3,5,np.nan,6,8])

In [27]: s
Out[27]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

2. Creating a DataFrame: build a frame of six rows and four columns with a datetime index and labeled columns

In [29]: dates = pd.date_range('20130101', periods=6)

In [30]: dates
Out[30]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [31]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

In [32]: df
Out[32]:
                   A         B         C         D
2013-01-01 -0.927711  1.086539  1.480014 -0.930739
2013-01-02 -0.829482  2.504326  1.782794  1.030505
2013-01-03 -1.766893  0.932729  0.553364 -0.173031
2013-01-04 -0.174698 -0.166893 -0.763745  0.353329
2013-01-05  0.586179  0.391955 -1.113228  1.034821
2013-01-06  1.124303  2.044700 -0.910196  0.254656
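Because np.random.randn draws new numbers on every run, the values printed above will not reappear when the session is repeated. One way to get repeatable data, sketched here on the assumption that NumPy's Generator API is available (np.random.default_rng, NumPy 1.17 or later), is to seed the generator. The seed 0 and the name rng are arbitrary choices, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same numbers on every run.
rng = np.random.default_rng(0)

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)  # (6, 4)
```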

3. Creating a DataFrame from a dict of objects that can be converted to series-like columns:

In [36]: df2 = pd.DataFrame({'A': 1.,
    ...:                     'B': pd.Timestamp('20130102'),
    ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    ...:                     'D': np.array([3] * 4, dtype='int32'),
    ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
    ...:                     'F': 'foo'})

In [37]: df2
Out[37]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

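4. Checking the data types of a structure:

Use the dtypes attribute to see the data type of each DataFrame column, and the dtype attribute to see the data type of a Series.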

In [38]: df2.dtypes
Out[38]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [39]: s.dtype
Out[39]: dtype('float64')

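If a column holds values of several types, its dtype is chosen to accommodate all of them, which in the mixed case below is object.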

In [40]: pd.Series([1, 2, 3, 6., 'foo'])
Out[40]:
0      1
1      2
2      3
3      6
4    foo
dtype: object
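The upcasting rule can be seen by comparing three small Series. This is a minimal sketch with made-up values; the variable names are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                      # integers only
mixed_numeric = pd.Series([1, 2, 3, 6.0])        # integers and a float
mixed_object = pd.Series([1, 2, 3, 6.0, "foo"])  # numbers and a string

print(ints.dtype)           # int64
print(mixed_numeric.dtype)  # float64
print(mixed_object.dtype)   # object
```

Integers mixed with a float are upcast to float64; only when no common numeric type exists does the column fall back to object.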

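Use get_dtype_counts() to count the DataFrame columns of each dtype. This method was deprecated in pandas 0.25 and removed in 1.0; on newer versions df3.dtypes.value_counts() gives the same counts.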

In [41]: a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]]

In [42]: df3 = pd.DataFrame(a, columns=['str', 'int', 'float'])

In [43]: df3
Out[43]:
  str  int  float
0   a    1    1.0
1   b    2    2.0
2   c    3    3.0

In [44]: df3.get_dtype_counts()
Out[44]:
float64    1
int64      1
object     1
dtype: int64
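Since get_dtype_counts() is not available in pandas 1.0 and later, the same per-dtype column count can be sketched with dtypes.value_counts(). The frame below rebuilds df3 from the session above. The exact dtype shown for the string column depends on the pandas version, so no output is listed here.

```python
import pandas as pd

rows = [["a", 1, 1.0], ["b", 2, 2.0], ["c", 3, 3.0]]
df3 = pd.DataFrame(rows, columns=["str", "int", "float"])

# dtypes is a Series of dtype objects; value_counts() tallies the columns per dtype.
counts = df3.dtypes.value_counts()
print(counts)
```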

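III. Viewing data

Use head() to view the first rows and tail() to view the last rows. Both take the number of rows as an argument, which defaults to 5.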

In [53]: df4 = pd.DataFrame(np.random.randn(6,4),columns=list('ABCD'))

In [54]: df4
Out[54]:
          A         B         C         D
0 -1.075377  2.554021 -0.374659 -0.192440
1  0.179418  0.628350 -0.700365  0.064725
2  0.955096  0.329108  0.509071  0.024015
3 -0.023552  0.700166 -0.030874  1.588215
4 -1.082844  0.477065  1.357116 -1.083416
5 -1.619765 -0.510836 -0.904413 -0.386017

In [55]: df4.head(1)
Out[55]:
          A         B         C        D
0 -1.075377  2.554021 -0.374659 -0.19244

In [56]: df4.tail(1)
Out[56]:
          A         B         C         D
5 -1.619765 -0.510836 -0.904413 -0.386017
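The sessions above pass 1 explicitly. As a small sketch of the default and of another row count, using a made-up ten-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"n": np.arange(10)})

first = df.head()    # no argument: the first 5 rows
last = df.tail(3)    # the last 3 rows

print(first.index.tolist())  # [0, 1, 2, 3, 4]
print(last.index.tolist())   # [7, 8, 9]
```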

Use index to view the row index, columns to view the column labels, and values to view the underlying data as a NumPy array

In [57]: df4.index
Out[57]: RangeIndex(start=0, stop=6, step=1)

In [58]: df4.columns
Out[58]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [60]: df4.values
Out[60]:
array([[-1.07537664,  2.55402142, -0.37465895, -0.19243966],
       [ 0.17941807,  0.62835002, -0.70036508,  0.06472549],
       [ 0.95509596,  0.32910841,  0.50907057,  0.02401518],
       [-0.02355206,  0.70016557, -0.03087417,  1.58821536],
       [-1.08284367,  0.47706484,  1.3571161 , -1.08341606],
       [-1.61976495, -0.51083645, -0.90441259, -0.38601714]])

Use describe() to view a quick statistical summary of the data

In [61]: df4.describe()
Out[61]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.444504  0.696312 -0.024021  0.002514
std    0.970781  1.009538  0.842289  0.881702
min   -1.619765 -0.510836 -0.904413 -1.083416
25%   -1.080977  0.366098 -0.618939 -0.337623
50%   -0.549464  0.552707 -0.202767 -0.084212
75%    0.128676  0.682212  0.374084  0.054548
max    0.955096  2.554021  1.357116  1.588215
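Each row of the describe() output corresponds to an ordinary column-wise method. The sketch below checks three of them on seeded random data; the seed and the names rng and summary are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))
summary = df.describe()

# The summary rows agree with the individual column-wise statistics.
print(np.allclose(summary.loc["mean"], df.mean()))    # True
print(np.allclose(summary.loc["std"], df.std()))      # True (sample standard deviation)
print(np.allclose(summary.loc["50%"], df.median()))   # True
```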

Use T to transpose the data, swapping rows and columns

In [62]: df4.T
Out[62]:
          0         1         2         3         4         5
A -1.075377  0.179418  0.955096 -0.023552 -1.082844 -1.619765
B  2.554021  0.628350  0.329108  0.700166  0.477065 -0.510836
C -0.374659 -0.700365  0.509071 -0.030874  1.357116 -0.904413
D -0.192440  0.064725  0.024015  1.588215 -1.083416 -0.386017

Sorting the columns by label in descending order, which here reverses the column order

In [63]: df4.sort_index(axis=1, ascending=False)
Out[63]:
          D         C         B         A
0 -0.192440 -0.374659  2.554021 -1.075377
1  0.064725 -0.700365  0.628350  0.179418
2  0.024015  0.509071  0.329108  0.955096
3  1.588215 -0.030874  0.700166 -0.023552
4 -1.083416  1.357116  0.477065 -1.082844
5 -0.386017 -0.904413 -0.510836 -1.619765

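Rows and columns can be sorted by their labels with sort_index(), or by their values with sort_values(). Since pandas 0.23, the by argument of sort_values() may also name index levels alongside column labels.

Sorting by index:

df.sort_index() sorts the rows by index label in ascending order
df.sort_index(ascending=False) sorts the rows by index label in descending order
df.sort_index(axis=1) sorts the columns by column label instead of the rows, as in In [63] above

Sorting by value, here by column B: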

In [64]: df4.sort_values(by='B')
Out[64]:
          A         B         C         D
5 -1.619765 -0.510836 -0.904413 -0.386017
2  0.955096  0.329108  0.509071  0.024015
4 -1.082844  0.477065  1.357116 -1.083416
1  0.179418  0.628350 -0.700365  0.064725
3 -0.023552  0.700166 -0.030874  1.588215
0 -1.075377  2.554021 -0.374659 -0.192440
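The three kinds of sorting are easier to tell apart on a frame whose rows and columns both start out of order. A minimal sketch with made-up data:

```python
import pandas as pd

# Rows labeled 2, 0, 1 and columns in the order B, A.
df = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=[2, 0, 1])

by_rows = df.sort_index()            # reorder rows by index label
by_cols = df.sort_index(axis=1)      # reorder columns by column label
by_value = df.sort_values(by="B")    # reorder rows by the values in column B

print(by_rows.index.tolist())    # [0, 1, 2]
print(by_cols.columns.tolist())  # ['A', 'B']
print(by_value["B"].tolist())    # [1, 2, 3]
```

All three calls return a new DataFrame and leave df unchanged.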

IV. Selecting data

Selecting a single column, which returns a Series

In [66]: df4['A']
Out[66]:
0   -1.075377
1    0.179418
2    0.955096
3   -0.023552
4   -1.082844
5   -1.619765
Name: A, dtype: float64

Selecting rows: slicing with [] selects rows by position, with the end of the slice excluded

In [67]: df4[2:5]
Out[67]:
          A         B         C         D
2  0.955096  0.329108  0.509071  0.024015
3 -0.023552  0.700166 -0.030874  1.588215
4 -1.082844  0.477065  1.357116 -1.083416

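Selecting by label with loc, which can select rows and columns at the same time. Label slices include both endpoints. df4 has an integer index, so the row labels are the integers 2 to 4.

In [71]: df4.loc[2:4,'A':'B']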
Out[71]:
          A         B
2  0.955096  0.329108
3 -0.023552  0.700166
4 -1.082844  0.477065

Selecting by position with iloc, where the end of a slice is excluded

In [73]: df4.iloc[2:5,0:2]
Out[73]:
          A         B
2  0.955096  0.329108
3 -0.023552  0.700166
4 -1.082844  0.477065

In [74]: df4.iloc[[2,3,4],[0,1]]
Out[74]:
          A         B
2  0.955096  0.329108
3 -0.023552  0.700166
4 -1.082844  0.477065
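The difference between label slices and position slices shows up in how the end of the slice is treated. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [10, 20, 30, 40, 50], "B": [1, 2, 3, 4, 5]},
    index=[0, 1, 2, 3, 4],
)

by_label = df.loc[1:3, "A"]     # labels 1, 2 and 3: both ends included
by_position = df.iloc[1:3, 0]   # positions 1 and 2: end excluded

print(by_label.tolist())     # [20, 30, 40]
print(by_position.tolist())  # [20, 30]
```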

Boolean indexing:

Select the rows in which the value of one column is greater than 0

In [77]: df4[df4.A>0]
Out[77]:
          A         B         C         D
1  0.179418  0.628350 -0.700365  0.064725
2  0.955096  0.329108  0.509071  0.024015

Select the values of the whole DataFrame that are greater than 0; all other cells become NaN

In [78]: df4[df4>0]
Out[78]:
          A         B         C         D
0       NaN  2.554021       NaN       NaN
1  0.179418  0.628350       NaN  0.064725
2  0.955096  0.329108  0.509071  0.024015
3       NaN  0.700166       NaN  1.588215
4       NaN  0.477065  1.357116       NaN
5       NaN       NaN       NaN       NaN
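The two forms of boolean indexing behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and blanks out individual cells. A minimal sketch with made-up data, which also compares the second form with DataFrame.where():

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

rows = df[df["A"] > 0]   # boolean Series: keeps the rows where A > 0
cells = df[df > 0]       # boolean DataFrame: same shape, non-matching cells become NaN

print(rows.index.tolist())            # [1]
print(int(cells.isna().sum().sum()))  # 2
print(cells.equals(df.where(df > 0))) # True
```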

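Filtering with isin(), which keeps the rows whose value in a column is in a given list. df5 is not constructed anywhere in these notes; the output suggests it is df4 with an additional string column E.

In [82]: df5[df5['E'].isin(['one'])]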
Out[82]:
          A         B         C         D    E
0 -1.075377  2.554021 -0.374659 -0.192440  one
1  0.179418  0.628350 -0.700365  0.064725  one
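Since the frame used above is not shown being built, here is a self-contained sketch of the same filter. The frame and the values in its E column are made up for illustration and are not the author's df5.

```python
import pandas as pd

# Hypothetical data: a numeric column plus a string column E.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "E": ["one", "one", "two", "three"]}
)

# isin() returns a boolean Series, True where E is one of the listed values.
selected = df[df["E"].isin(["one", "three"])]

print(selected.index.tolist())  # [0, 1, 3]
```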

V. Setting data

Adding a new column from a Series; the values are aligned to the DataFrame by index label

In [88]: s1 = pd.Series([1,2,3,4,5,6], index=np.arange(6))

In [89]: s1
Out[89]:
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [90]: df4
Out[90]:
          A         B         C         D
0 -1.075377  2.554021 -0.374659 -0.192440
1  0.179418  0.628350 -0.700365  0.064725
2  0.955096  0.329108  0.509071  0.024015
3 -0.023552  0.700166 -0.030874  1.588215
4 -1.082844  0.477065  1.357116 -1.083416
5 -1.619765 -0.510836 -0.904413 -0.386017

In [91]: df4['E'] = s1

In [92]: df4
Out[92]:
          A         B         C         D  E
0 -1.075377  2.554021 -0.374659 -0.192440  1
1  0.179418  0.628350 -0.700365  0.064725  2
2  0.955096  0.329108  0.509071  0.024015  3
3 -0.023552  0.700166 -0.030874  1.588215  4
4 -1.082844  0.477065  1.357116 -1.083416  5
5 -1.619765 -0.510836 -0.904413 -0.386017  6
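In the session above the Series and the DataFrame share the same index, so the alignment is invisible. A minimal sketch with deliberately mismatched labels (made-up data) shows what alignment does:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
s = pd.Series([10, 20, 30], index=[1, 2, 3])

# Values are matched by index label, not by position.
df["E"] = s

print(df["E"].isna().tolist())   # [True, False, False]
print(df.loc[1, "E"])            # 10.0
print(df.loc[2, "E"])            # 20.0
```

Row 0 has no matching label in s and gets NaN, while the value labeled 3 in s is dropped because df has no such row.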

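Setting a single value by label with at. The values in df4 differ from the earlier sessions; the random data appears to have been regenerated in between.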

In [108]: df4
Out[108]:
          A         B         C         D  E
0  1.942578  2.020022  1.638648  0.018943  1
1 -0.858475  0.464734 -0.479293  0.146777  2
2 -1.841956  1.089131  1.113484 -1.513879  3
3  0.855033 -0.573659  1.262123  0.480048  4
4 -0.714338  0.057483 -0.048462 -1.075702  5
5 -1.061789 -0.530627 -2.421183 -1.098107  6

In [109]: df4.at[0,'A'] = 0

In [110]: df4
Out[110]:
          A         B         C         D  E
0  0.000000  2.020022  1.638648  0.018943  1
1 -0.858475  0.464734 -0.479293  0.146777  2
2 -1.841956  1.089131  1.113484 -1.513879  3
3  0.855033 -0.573659  1.262123  0.480048  4
4 -0.714338  0.057483 -0.048462 -1.075702  5
5 -1.061789 -0.530627 -2.421183 -1.098107  6

Setting a single value by position with iat

In [121]: df4.iat[0,1] = 0

In [122]: df4
Out[122]:
          A         B         C         D  E
0  0.000000  0.000000  1.638648  0.018943  1
1 -0.858475  0.464734 -0.479293  0.146777  2
2 -1.841956  1.089131  1.113484 -1.513879  3
3  0.855033 -0.573659  1.262123  0.480048  4
4 -0.714338  0.057483 -0.048462 -1.075702  5
5 -1.061789 -0.530627 -2.421183 -1.098107  6
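The two scalar setters side by side, as a minimal sketch on made-up data: at addresses a cell by row and column label, iat by integer position.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5], "B": [3.5, 4.5]})

df.at[0, "A"] = 0.0   # by label: row label 0, column "A"
df.iat[1, 1] = 0.0    # by position: second row, second column

print(df["A"].tolist())  # [0.0, 2.5]
print(df["B"].tolist())  # [3.5, 0.0]
```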

Setting a column from a NumPy array. len(df4) returns the number of rows in df4, so the array holds one value per row.

In [128]: df4.loc[:,'F'] = np.array([6] * len(df4))

In [129]: df4
Out[129]:
          A         B         C         D  E  F
0  0.000000  0.000000  1.638648  0.018943  1  6
1 -0.858475  0.464734 -0.479293  0.146777  2  6
2 -1.841956  1.089131  1.113484 -1.513879  3  6
3  0.855033 -0.573659  1.262123  0.480048  4  6
4 -0.714338  0.057483 -0.048462 -1.075702  5  6
5 -1.061789 -0.530627 -2.421183 -1.098107  6  6

Setting values with a boolean mask: every value greater than 0 is replaced by its negative

In [130]: df4[df4>0]
Out[130]:
          A         B         C         D  E  F
0       NaN       NaN  1.638648  0.018943  1  6
1       NaN  0.464734       NaN  0.146777  2  6
2       NaN  1.089131  1.113484       NaN  3  6
3  0.855033       NaN  1.262123  0.480048  4  6
4       NaN  0.057483       NaN       NaN  5  6
5       NaN       NaN       NaN       NaN  6  6

In [131]: df4[df4>0] = -df4

In [132]: df4
Out[132]:
          A         B         C         D  E  F
0  0.000000  0.000000 -1.638648 -0.018943 -1 -6
1 -0.858475 -0.464734 -0.479293 -0.146777 -2 -6
2 -1.841956 -1.089131 -1.113484 -1.513879 -3 -6
3 -0.855033 -0.573659 -1.262123 -0.480048 -4 -6
4 -0.714338 -0.057483 -0.048462 -1.075702 -5 -6
5 -1.061789 -0.530627 -2.421183 -1.098107 -6 -6
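The masked assignment above modifies the frame in place. As an alternative sketch, not taken from the original notes, DataFrame.mask() returns a new frame and leaves the original untouched. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# mask() replaces the cells where the condition is True with values from -df.
flipped = df.mask(df > 0, -df)

# The in-place form used in the session, applied to a copy for comparison.
in_place = df.copy()
in_place[in_place > 0] = -in_place

print(flipped["A"].tolist())     # [-1.0, -2.0]
print(flipped["B"].tolist())     # [-3.0, -4.0]
print(flipped.equals(in_place))  # True
```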

References:

[1] Package overview, pandas 0.23.4 documentation. http://pandas.pydata.org/pandas-docs/stable/overview.html

A useful site on the same topic:

Python数据分析之pandas学习 (pandas for Python data analysis), Little_Rookie, cnblogs
http://www.cnblogs.com/nxld/p/6058591.html
