Pandas support very useful operations which are illustrated below,
熊猫支持非常有用的操作,如下所示,
Consider the below dataFrame,
考虑下面的dataFrame,
import numpy as np
import pandas as pd
df = pd.DataFrame({
'col1': [1, 2, 3, 4],
'col2': [444, 555, 666, 444],
'col3': ['abc', 'def', 'ghi', 'xyz']
})
print(df.head())
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
在数据框中查找唯一值 (Finding unique values in a data frame)
In order to find unique values from columns,
为了从列中找到唯一值,
# returns numpy array of all unique values
print(df['col2'].unique() )
# Output: array([444, 555, 666])
# returns length / number of unique values
# in a numpy array
print(df['col2'].nunique())
# Output: 3
# if we want the table of the unique values
# and how many times they show up
print(df['col2'].value_counts() )
'''
Output:
444 2
555 1
666 1
Name: col2, dtype: int64
'''
从数据框中选择数据 (Selecting data from a data frame)
Consider the dataFrame,
考虑一下dataFrame,
Using the conditional selection, we could select data as follows,
使用条件选择,我们可以选择以下数据,
print(df['col1']>2)
'''
Output:
0 False
1 False
2 True
3 True
Name: col1, dtype: bool
'''
print(df[(df['col1']>2)])
'''
Output:
col1 col2 col3
2 3 666 ghi
3 4 444 xyz
'''
print(df[df['col1']>2 & (df['col2']==44)])
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
应用方法 (Applied Methods)
Consider a simple method,
考虑一个简单的方法,
def times2(x):
return x*2
We already are aware that we can grab a column and call a built-in function off of it. Such as below,
我们已经知道我们可以抓住一列并从中调用一个内置函数。 如下
print(df['col1'].sum())
# Output: 10
Now, in order to apply the custom function, such as one defined above (times2), pandas provide an option to do that as well, as explained below,
现在,为了应用自定义功能(例如上面定义的时间(times2)),熊猫也提供了执行此功能的选项,如下所述,
print(df['col2'].apply(times2))
'''
Output:
0 888
1 1110
2 1332
3 888
Name: col2, dtype: int64
'''
Apply built-in functions,
应用内置功能,
print(df['col3'].apply(len))
'''
Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
'''
Apply method will be more powerful, when combined with lambda expressions. For instance,
与lambda表达式结合使用时,apply方法将更强大。 例如,
print(df['col2'].apply(lambda x: x*2))
'''
Output:
0 888
1 1110
2 1332
3 888
Name: col2, dtype: int64
'''
更多操作 (Some more operations)
# returns the columns names
print(df.columns)
# Output: Index(['col1', 'col2', 'col3'], dtype='object')
#since this is a rangeindex, it actually reports
# start, stop and step values too
print(df.index)
# Output: RangeIndex(start=0, stop=4, step=1)
# sort by column
print(df.sort_values('col2'))
'''
Output:
col1 col2 col3
0 1 444 abc
3 4 444 xyz
1 2 555 def
2 3 666 ghi
'''
In the above result, note that the index values doesn't change, this is to ensure that the values is retained.
在上面的结果中,请注意索引值不会更改,这是为了确保保留这些值。
isnull
一片空白
# isnull
print(df.isnull())
'''
Output
col1 col2 col3
0 False False False
1 False False False
2 False False False
3 False False False
'''
The isnull() will return a dataframe of booleans indicating whether or not the value was null or not. In the above, we get a boolean of all false because we have nulls in our dataframe.
notull()将返回一个布尔值数据框,指示该值是否为null。 在上面的代码中,由于我们的数据帧中包含null,因此我们得到的布尔值均为false。
Drop NAN values
降低NAN值
print(df.dropna())
'''
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
'''
Fill NAN values with custom values
用自定义值填充NAN值
df = pd.DataFrame({
'col1': [1, 2, 3, np.nan],
'col2': [np.nan, 555, 666, 444],
'col3': ['abc', 'def', 'ghi', 'xyz']
})
print(df)
'''
Output:
col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
'''
print(df.fillna('FILL'))
'''
Output:
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz
'''
Usage of pivot table
数据透视表的用法
This methodology will be familiar for the Advanced Excel users. Consider a new dataFrame,
Advanced Excel用户将熟悉这种方法。 考虑一个新的dataFrame,
data = {
'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'two', 'one', 'one'],
'C': ['x', 'y', 'x', 'y', 'x', 'y'],
'D': [1, 3, 2, 5, 4, 1]
}
df = pd.DataFrame(data)
print(df)
'''
Output:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
'''
The pivot table, creates a multi index dataFrame. The pivot table takes three main arguments, the values, the index and the columns.
数据透视表创建一个多索引dataFrame。 数据透视表采用三个主要参数,即值,索引和列。
print(df.pivot_table(values='D',index=['A', 'B'],columns=['C']))
'''
Output:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
'''
翻译自: https://www.includehelp.com/python/python-pandas-operations.aspx