Python Pandas - Operations

Pandas supports a number of useful operations, illustrated below.

Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'col1': [1, 2, 3, 4],
  'col2': [444, 555, 666, 444],
  'col3': ['abc', 'def', 'ghi', 'xyz']
})

print(df.head())

'''
Output:
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''

Finding unique values in a data frame

To find the unique values in a column:

# returns a NumPy array of all unique values
print(df['col2'].unique())
# Output: [444 555 666]

# returns the number of unique values
print(df['col2'].nunique())
# Output: 3

# to get a table of the unique values
# and how many times each one occurs
print(df['col2'].value_counts())
'''
Output:
444    2
555    1
666    1
Name: col2, dtype: int64
'''

Selecting data from a data frame

Using the DataFrame defined above, conditional selection lets us pick out rows as follows:

print(df['col1']>2)

'''
Output:
0    False
1    False
2     True
3     True
Name: col1, dtype: bool
'''

print(df[(df['col1']>2)])

'''
Output:
   col1  col2 col3
2     3   666  ghi
3     4   444  xyz
'''

# each condition needs its own parentheses,
# because & binds more tightly than > and ==
print(df[(df['col1']>2) & (df['col2']==444)])

'''
Output:
   col1  col2 col3
3     4   444  xyz
'''
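Conditions can be combined with & (and) or | (or), as long as each comparison is wrapped in its own parentheses. The following is a minimal sketch that rebuilds the same DataFrame; the names both and either are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz']
})

# AND: rows where col1 > 2 and col2 == 444
both = df[(df['col1'] > 2) & (df['col2'] == 444)]
print(both)

# OR: rows where col1 > 3 or col2 == 555
either = df[(df['col1'] > 3) | (df['col2'] == 555)]
print(either)
```

Without the parentheses, Python evaluates the & before the comparisons, which gives a different and usually unintended result.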

Applied Methods

Consider a simple function:

def times2(x):
  return x*2

We already know that we can grab a column and call a built-in method on it, for example:

print(df['col1'].sum())
# Output: 10

To apply a custom function, such as times2 defined above, pandas provides the apply method:

print(df['col2'].apply(times2))

'''
Output:
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
'''

Built-in functions can be applied in the same way:

print(df['col3'].apply(len))

'''
Output:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
'''

The apply method becomes more powerful when combined with lambda expressions. For instance:

print(df['col2'].apply(lambda x: x*2))

'''
Output:
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
'''
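For simple arithmetic such as doubling, the same result can also be obtained by operating on the column directly, without apply. The sketch below compares the two; the names applied and vectorized are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col2': [444, 555, 666, 444]})

# apply calls the lambda once per element
applied = df['col2'].apply(lambda x: x * 2)

# arithmetic on the Series works on the whole column at once
vectorized = df['col2'] * 2

# both approaches produce the same Series
print(applied.equals(vectorized))
```

apply remains the more general tool when the logic cannot be written as a plain column expression.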

Some more operations

# returns the column names
print(df.columns)
# Output: Index(['col1', 'col2', 'col3'], dtype='object')

# since this is a RangeIndex, it reports
# the start, stop and step values too
print(df.index)
# Output: RangeIndex(start=0, stop=4, step=1)

# sort by column
print(df.sort_values('col2'))

'''
Output:
   col1  col2 col3
0     1   444  abc
3     4   444  xyz
1     2   555  def
2     3   666  ghi
'''

In the above result, note that the index values do not change: each row keeps its original index label after sorting, so it can still be traced back to its position in the original DataFrame.
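If a fresh 0-based index is wanted after sorting, reset_index can renumber the rows. This is a minimal sketch that rebuilds the same DataFrame; sorted_df and renumbered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz']
})

# sorting keeps the original index labels attached to each row
sorted_df = df.sort_values('col2')
print(sorted_df.index.tolist())

# drop=True discards the old labels instead of adding them as a column
renumbered = sorted_df.reset_index(drop=True)
print(renumbered.index.tolist())
```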

isnull

# isnull
print(df.isnull())

'''
Output
    col1   col2   col3
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
'''

isnull() returns a DataFrame of booleans indicating whether each value is null. In the example above every entry is False, because the DataFrame contains no nulls.
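A common follow-up is to count the missing values per column by summing the boolean result. The sketch below uses a small hypothetical frame, df_missing, that does contain nulls.

```python
import numpy as np
import pandas as pd

# hypothetical frame with one missing value in each column
df_missing = pd.DataFrame({
    'col1': [1, 2, np.nan],
    'col2': [np.nan, 555, 666]
})

# True counts as 1, so the sum is the number of nulls per column
null_counts = df_missing.isnull().sum()
print(null_counts)
```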

Drop NaN values

print(df.dropna())

'''
Output:
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''
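The DataFrame above has no missing values, so dropna returns it unchanged. The effect is easier to see on a frame that does contain NaN; the following sketch uses an illustrative frame named df_missing.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'col1': [1, 2, 3, np.nan],
    'col2': [np.nan, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz']
})

# default (axis=0): drop every row that contains at least one NaN
rows_kept = df_missing.dropna()
print(rows_kept)

# axis=1: drop every column that contains at least one NaN
cols_kept = df_missing.dropna(axis=1)
print(cols_kept)
```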

Fill NaN values with custom values

df = pd.DataFrame({
  'col1': [1, 2, 3, np.nan],
  'col2': [np.nan, 555, 666, 444],
  'col3': ['abc', 'def', 'ghi', 'xyz']
})

print(df)

'''
Output:
   col1   col2 col3
0   1.0    NaN  abc
1   2.0  555.0  def
2   3.0  666.0  ghi
3   NaN  444.0  xyz
'''

print(df.fillna('FILL'))

'''
Output:
   col1  col2 col3
0     1  FILL  abc
1     2   555  def
2     3   666  ghi
3  FILL   444  xyz
'''
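fillna also accepts a dictionary, so that each column can get its own replacement value. As a sketch, assuming we want the mean for col1 and 0 for col2 (both choices are arbitrary and only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, np.nan],
    'col2': [np.nan, 555, 666, 444]
})

# per-column fill values: the column mean for col1, zero for col2
filled = df.fillna({'col1': df['col1'].mean(), 'col2': 0})
print(filled)
```

Filling with numbers keeps the columns numeric, whereas filling with a string such as 'FILL' turns them into object columns.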

Usage of pivot table

This technique will be familiar to advanced Excel users. Consider a new DataFrame:

data = {
  'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
  'B': ['one', 'one', 'two', 'two', 'one', 'one'],
  'C': ['x', 'y', 'x', 'y', 'x', 'y'],
  'D': [1, 3, 2, 5, 4, 1]
}

df = pd.DataFrame(data)

print(df)

'''
Output:
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1
'''

The pivot_table method creates a multi-index DataFrame. It takes three main arguments: values, index and columns.

print(df.pivot_table(values='D',index=['A', 'B'],columns=['C']))

'''
Output:
C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN
'''
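When several rows fall into the same cell, pivot_table aggregates them, using the mean by default. The aggfunc argument selects a different aggregation. A minimal sketch on the same data, summing D per value of A and C (the name totals is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one', 'one'],
    'C': ['x', 'y', 'x', 'y', 'x', 'y'],
    'D': [1, 3, 2, 5, 4, 1]
})

# sum D for each (A, C) pair instead of taking the default mean
totals = df.pivot_table(values='D', index='A', columns='C', aggfunc='sum')
print(totals)
```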


Translated from: https://www.includehelp.com/python/python-pandas-operations.aspx
