Python | Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects.
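To make the label alignment concrete, here is a minimal sketch. The frames `left` and `right` and their labels are made up for illustration and do not come from the example later in this article.

```python
import pandas as pd

# Two small frames whose row and column labels only partly overlap
# (hypothetical data, chosen for illustration).
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["y", "z"])

# Addition matches cells by label, not by position. A cell that exists
# in only one of the frames becomes NaN in the result.
total = left + right
print(total)
```

Only the cell at row `y`, column `b` is present in both frames, so it is the only non-missing value in `total`.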
Syntax:
class pandas.DataFrame(
data=None,
index=None,
columns=None,
dtype=None,
copy=False
)
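The sketch below shows how the constructor parameters might be combined when the data is a dict of lists. The column names, row labels, and values are assumptions invented for this illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column.
data = {"price": [1, 2, 3], "qty": [10, 20, 30]}

frame = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],   # row labels
    columns=["qty", "price"],   # selects and orders the dict keys
    dtype="float64",            # forces one dtype on every column
)
print(frame)
print(frame.dtypes)
```

Passing `columns` together with a dict does not rename anything; it only picks which keys to use and in what order.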
Example: creating a DataFrame
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
# 5 rows x 4 columns of random values, with explicit row and column labels
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])
print(df)
Output:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
In the above example, each column is a Series, and the row labels form the index shared by all of them.
To select a single column, index the DataFrame with the column label in square brackets:
print(df['W'])
'''
Output:
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
'''
print(type(df['W']))
'''
Output:
<class 'pandas.core.series.Series'>
'''
The above shows that a DataFrame is a collection of Series sharing a common index. Another way to retrieve a Series is attribute (dot) access, reminiscent of SQL's table.column notation. It is less preferred, because it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method:
print(df.W)
'''
Output:
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
'''
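A short sketch of why bracket access is the safer habit. The column names `count` and `unit price` are hypothetical, picked because one collides with a DataFrame method and the other is not a valid identifier.

```python
import pandas as pd

# Hypothetical frame with two awkward column names.
frame = pd.DataFrame({"count": [1, 2], "unit price": [3.5, 4.0]})

# "count" is also a DataFrame method, so dot access returns the method,
# not the column.
dot_result = frame.count
print(callable(dot_result))

# Bracket access always returns the column, including names with spaces.
count_column = frame["count"]
price_column = frame["unit price"]
print(count_column.tolist())
print(price_column.tolist())
```

A name such as `unit price` cannot be written after a dot at all, so brackets are the only option there.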
To get multiple columns from the DataFrame, pass a list of column labels:
print(df[['W','X']])
'''
Output:
W X
A 2.706850 0.628133
B 0.651118 -0.319318
C -2.018168 0.740122
D 0.188695 -0.758872
E 0.190794 1.978757
'''
# list('WX') expands the string into ['W', 'X'], so this is the same selection
print(df[list('WX')])
'''
Output:
W X
A 2.706850 0.628133
B 0.651118 -0.319318
C -2.018168 0.740122
D 0.188695 -0.758872
E 0.190794 1.978757
'''
To create a new column in a DataFrame, assign to a label that does not exist yet:
df['new'] = df['X']+df['Y']
print(df)
'''
Output:
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 1.536102
B 0.651118 -0.319318 -0.848077 0.605965 -1.167395
C -2.018168 0.740122 0.528813 -0.589001 1.268936
D 0.188695 -0.758872 -0.933237 0.955057 -1.692109
E 0.190794 1.978757 2.605967 0.683509 4.584725
'''
To remove a column from a DataFrame
# drop() returns a new DataFrame; df itself is left unchanged
df.drop('W', axis=1)
print(df)
'''
Output:
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 1.536102
B 0.651118 -0.319318 -0.848077 0.605965 -1.167395
C -2.018168 0.740122 0.528813 -0.589001 1.268936
D 0.188695 -0.758872 -0.933237 0.955057 -1.692109
E 0.190794 1.978757 2.605967 0.683509 4.584725
'''
df = df.drop('W', axis=1)
print(df)
'''
Output:
X Y Z new
A 0.628133 0.907969 0.503826 1.536102
B -0.319318 -0.848077 0.605965 -1.167395
C 0.740122 0.528813 -0.589001 1.268936
D -0.758872 -0.933237 0.955057 -1.692109
E 1.978757 2.605967 0.683509 4.584725
'''
# inplace=True modifies df directly instead of returning a new DataFrame
df.drop('X', axis=1, inplace=True)
print(df)
'''
Output:
Y Z new
A 0.907969 0.503826 1.536102
B -0.848077 0.605965 -1.167395
C 0.528813 -0.589001 1.268936
D -0.933237 0.955057 -1.692109
E 2.605967 0.683509 4.584725
'''
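As a possible alternative to `axis=1`, `drop` also accepts a `columns` keyword, which some readers find easier to scan. The frame below is a small stand-in with made-up values, not the `df` used above.

```python
import pandas as pd

# Hypothetical stand-in frame.
frame = pd.DataFrame({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]})

# Equivalent to frame.drop("X", axis=1); returns a new frame.
trimmed = frame.drop(columns=["X"])

# errors="ignore" skips labels that are not present instead of raising.
unchanged = frame.drop(columns=["missing"], errors="ignore")

print(list(frame.columns))     # the original still has all three columns
print(list(trimmed.columns))
print(list(unchanged.columns))
```

Reassigning the result, as in `frame = frame.drop(columns=["X"])`, has the same visible effect as `inplace=True` while keeping the call free of side effects.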
To remove a row from the DataFrame
df.drop('E', axis=0, inplace = True)
print(df)
'''
Output:
Y Z new
A 0.907969 0.503826 1.536102
B -0.848077 0.605965 -1.167395
C 0.528813 -0.589001 1.268936
D -0.933237 0.955057 -1.692109
'''
To explain why axis takes the values 0 and 1, we need to look at the shape of the DataFrame:
print(df)
'''
Output:
Y Z new
A 0.907969 0.503826 1.536102
B -0.848077 0.605965 -1.167395
C 0.528813 -0.589001 1.268936
D -0.933237 0.955057 -1.692109
'''
print(df.shape)
'''
Output:
(4, 3)
'''
df.shape returns a tuple. In the example above, the element at index 0 (4) is the number of rows and the element at index 1 (3) is the number of columns. The axis argument follows the same order: axis=0 refers to rows and axis=1 refers to columns, which is why those values are passed when dropping a row or a column.
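The correspondence between `shape` and `axis` can be checked directly. This is a minimal sketch on a made-up 2 x 3 frame; the labels mirror the article's, but the values are invented.

```python
import pandas as pd

# Hypothetical 2 x 3 frame.
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["A", "B"],
    columns=["Y", "Z", "new"],
)

n_rows, n_cols = frame.shape                 # position 0: rows, position 1: columns

fewer_rows = frame.drop("A", axis=0)         # shrinks shape[0]
fewer_cols = frame.drop("Y", axis=1)         # shrinks shape[1]
same_as_axis0 = frame.drop("A", axis="index")  # string alias for axis=0

print(frame.shape, fewer_rows.shape, fewer_cols.shape)
```

The string aliases `"index"` and `"columns"` can be used in place of 0 and 1 when the numbers are hard to remember.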
Selecting rows in a DataFrame
print(df)
'''
Output:
Y Z new
A 0.907969 0.503826 1.536102
B -0.848077 0.605965 -1.167395
C 0.528813 -0.589001 1.268936
D -0.933237 0.955057 -1.692109
'''
# loc selects by label: the argument is the row's index label
print(df.loc['B'])
'''
Output:
Y -0.848077
Z 0.605965
new -1.167395
Name: B, dtype: float64
'''
# iloc selects by integer position: 1 is the second row
print(df.iloc[1])
'''
Output:
Y -0.848077
Z 0.605965
new -1.167395
Name: B, dtype: float64
'''
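The difference between `loc` and `iloc` is easiest to see when the index labels are themselves integers. The following sketch uses a made-up frame whose labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical frame: the index labels are integers, but not in positional order.
frame = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = frame.loc[0, "value"]    # the row whose label is 0 (second row)
by_position = frame.iloc[0, 0]      # the first row, first column

print(by_label, by_position)
```

Here `loc[0]` and `iloc[0]` refer to different rows, which is why it helps to choose between them deliberately.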
Selecting subsets of rows and columns
print(df)
'''
Output:
Y Z new
A 0.907969 0.503826 1.536102
B -0.848077 0.605965 -1.167395
C 0.528813 -0.589001 1.268936
D -0.933237 0.955057 -1.692109
'''
# row, column
print(df.loc['C','Y'])
'''
Output: 0.5288134940893595
'''
# pass the list of rows and columns to get the subsets
print(df.loc[['B','C'],['Y','Z']])
'''
Output:
Y Z
B -0.848077 0.605965
C 0.528813 -0.589001
'''
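Subsets can also be taken with slices instead of lists. This sketch assumes a small made-up frame with the same labels as above; note that label slices in `loc` include the end label, while positional slices in `iloc` exclude the stop position.

```python
import pandas as pd

# Hypothetical frame reusing the article's row and column labels.
frame = pd.DataFrame(
    {"Y": [1, 2, 3, 4], "Z": [5, 6, 7, 8], "new": [9, 10, 11, 12]},
    index=list("ABCD"),
)

by_label = frame.loc["B":"C", "Y":"Z"]   # both endpoints are included
by_position = frame.iloc[1:3, 0:2]       # stop positions are excluded

print(by_label)
print(by_label.equals(by_position))
```

Both expressions pick rows B and C and columns Y and Z, so the two results are identical.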
Translated from: https://www.includehelp.com/python/pandas-dataframe-in-python.aspx