pandas.DataFrame
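DataFrame is the primary data structure in Pandas: a two-dimensional table with a row index and column labels. A DataFrame behaves like a dict of Series objects, where each column label is a key and the column's data (a Series) is the corresponding value. Most Pandas workflows start by loading or converting data into a DataFrame before any other operation is applied. The main parameters of the pandas.DataFrame class are: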
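data: the data. Accepts arrays, dicts, lists, and other iterables.
index: the row index (row labels).
columns: the column labels.
dtype: the data type to force. Defaults to None, in which case the type is inferred from the data.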
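To illustrate the dtype parameter listed above, here is a minimal sketch. The variable names raw, inferred, and forced are made up for this example; the integer width that pandas infers can depend on the platform.

```python
import numpy as np
import pandas as pd

# Hypothetical integer input; without dtype, pandas infers an integer dtype.
raw = [[9, 8, 7], [8, 8, 8]]
cols = ['Strength', 'Speed', 'Endurance']

inferred = pd.DataFrame(data=raw, columns=cols)
forced = pd.DataFrame(data=raw, columns=cols, dtype=np.float64)

print(inferred.dtypes)  # an integer dtype for every column
print(forced.dtypes)    # float64 for every column
```

Leaving dtype at None is usually enough; passing it explicitly is mainly useful when every column should share one type.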
When the data is a NumPy array or a list, it can be converted to a pandas.DataFrame as follows:
import numpy as np
import pandas as pd

d = np.array([[9, 8, 7], [8, 8, 8], [7, 9, 10]])
data = pd.DataFrame(data=d,
                    index=['Tiger', 'Lion', 'Horse'],
                    columns=['Strength', 'Speed', 'Endurance'])
print(data)
# Output:
Strength Speed Endurance
Tiger 9 8 7
Lion 8 8 8
Horse 7 9 10
d = [[9, 8, 7], [8, 8, 8], [7, 9, 10]]
data = pd.DataFrame(data=d,
                    index=['Tiger', 'Lion', 'Horse'],
                    columns=['Strength', 'Speed', 'Endurance'])
print(data)
# Output:
Strength Speed Endurance
Tiger 9 8 7
Lion 8 8 8
Horse 7 9 10
When the data is a dict, the keys become the column labels, and the conversion to a pandas.DataFrame looks like this:
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data)
# Output:
Strength Speed Endurance
Tiger 9 8 7
Lion 8 8 8
Horse 7 9 10
pandas.DataFrame attributes
- T: transpose (swaps rows and columns)
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.T)
# Output:
Tiger Lion Horse
Strength 9 8 7
Speed 8 8 9
Endurance 7 8 10
- at: reads or sets a single value by row and column label; the row label must come first and the column label second
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.at['Horse', 'Endurance'])
data.at['Tiger', 'Speed'] = 10
print(data.at['Tiger', 'Speed'])
# Output:
10
10
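- loc: accesses a group of rows and columns by label or by boolean array
loc accepts the following inputs:
- A single label, e.g. 5 or 'a' (here 5 is interpreted as an index label, not as an integer position along the row or column axis)
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f'
- A boolean array, e.g. [True, False, True]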
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
# Single label. Note this returns the row as a Series.
print(data.loc['Tiger'])
# Output:
Strength 9
Speed 8
Endurance 7
Name: Tiger, dtype: int64
# List of labels. Note using [[]] returns a DataFrame.
print(data.loc[['Tiger', 'Horse']])
# Output:
Strength Speed Endurance
Tiger 9 8 7
Horse 7 9 10
# Single label for row and column
print(data.loc['Tiger', 'Endurance'])
# Output:
7
# Slice with labels for row and single label for column.
# both the start and stop of the slice are included.
print(data.loc['Tiger': 'Horse', 'Speed'])
# Output:
Tiger 8
Lion 8
Horse 9
Name: Speed, dtype: int64
# Boolean list with the same length as the row axis
print(data.loc[[False, False, True]])
# Output:
Strength Speed Endurance
Horse 7 9 10
# Conditional that returns a boolean Series
print(data.loc[data['Speed'] > 8])
# Output:
Strength Speed Endurance
Horse 7 9 10
# Conditional that returns a boolean Series with column labels specified
print(data.loc[data['Speed'] > 8, ['Strength']])
# Output:
Strength
Horse 7
# Callable that returns a boolean Series
print(data.loc[lambda df: df['Endurance'] == 8])
# Output:
Strength Speed Endurance
Lion 8 8 8
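The remark that an integer passed to loc is a label rather than a position is easy to miss, so here is a small sketch. The frame df and its out-of-order integer index are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: the integer index labels are deliberately out of order.
df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[5, 3, 1])

first = df.loc[5, 'value']  # the row whose label is 5 (the first row here)
last = df.loc[1, 'value']   # the row whose label is 1 (the last row here)
print(first, last)  # a c
```

If loc worked by position, df.loc[1] would refer to the second row; instead it picks the row labelled 1.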
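- iloc: used much like loc, except that it selects by position and accepts only integers, lists or arrays of integers, integer slice objects, and boolean arrays as input.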
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.iloc[[0, 1]])
# Output:
Strength Speed Endurance
Tiger 9 8 7
Lion 8 8 8
print(data.iloc[[True, False, False]])
# Output:
Strength Speed Endurance
Tiger 9 8 7
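The examples above cover integer lists and boolean lists. A short sketch of the remaining input types (single integers and integer slices) follows, reusing the same data; the variable names are chosen for this illustration only.

```python
import pandas as pd

d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])

# Integer slice: rows at positions 0 and 1; the stop position 2 is excluded.
first_two = data.iloc[0:2]
# Single integers for row and column return a scalar.
horse_endurance = data.iloc[2, 2]
# Slices on both axes: all rows, first two columns.
left_cols = data.iloc[:, 0:2]

print(first_two)
print(horse_endurance)  # 10
print(left_cols)
```

Unlike label slices with loc, which include both endpoints, integer slices with iloc follow the usual Python convention and exclude the stop position.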
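- shape: returns a tuple representing the dimensions of the DataFrame, as (number of rows, number of columns)
- values: returns the NumPy array representation of the DataFrame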
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.values)
# Output:
[[ 9 8 7]
[ 8 8 8]
[ 7 9 10]]
DataFrame.to_numpy() is the recommended replacement for values.
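Since shape has no example above and to_numpy() is only mentioned, the following sketch shows both on the same data. The name arr is introduced here for illustration.

```python
import numpy as np
import pandas as pd

d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])

print(data.shape)  # (3, 3)

# to_numpy() returns the underlying values as a NumPy array;
# the row index and column labels are not part of the result.
arr = data.to_numpy()
print(type(arr))
print(arr.shape)   # (3, 3)
```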
pandas.DataFrame methods
- Arithmetic operations: + (DataFrame.add), - (DataFrame.sub), * (DataFrame.mul), / (DataFrame.div), % (DataFrame.mod), ** (DataFrame.pow)
d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
index=['row1', 'row2', 'row3'],
columns=['col1', 'col2', 'col3'])
print(data)
print(data.add(1))
print(data.sub(1))
print(data.pow(2))
# Output:
col1 col2 col3
row1 -10 10 5
row2 -3 0 6
row3 -6 2 7
col1 col2 col3
row1 -9 11 6
row2 -2 1 7
row3 -5 3 8
col1 col2 col3
row1 -11 9 4
row2 -4 -1 5
row3 -7 1 6
col1 col2 col3
row1 100 100 25
row2 9 0 36
row3 36 4 49
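The example above exercises add, sub, and pow. As a sketch of the remaining methods, and of the claim that each operator matches its method, the snippet below uses the same data; the result names are invented for this example.

```python
import pandas as pd

d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])

doubled = data.mul(2)   # same as data * 2
halved = data.div(2)    # same as data / 2; true division, so floats
remainder = data.mod(3) # same as data % 3; sign follows the divisor

# Operator and method forms give identical frames.
same = (data * 2).equals(doubled) and (data % 3).equals(remainder)

print(halved)
print(remainder)
print(same)  # True
```

The method forms are mostly useful when extra arguments are needed; for plain scalar arithmetic the operators read more naturally.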
- pandas.DataFrame.drop: removes the specified rows or columns
Parameters:
- labels: single label or list-like
- axis: {0 or 'index', 1 or 'columns'}, default 0. Whether to drop labels from the index (0 or 'index') or columns (1 or 'columns').
- index:Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns:Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
index=['row1', 'row2', 'row3'],
columns=['col1', 'col2', 'col3'])
print(data.drop(labels=['col1', 'col2'], axis=1))
# Equivalent to: data.drop(columns=['col1', 'col2'])
# Output:
col3
row1 5
row2 6
row3 7
print(data.drop(labels=['row1', 'row2'], axis=0))
# Equivalent to: data.drop(index=['row1', 'row2'])
# Output:
col1 col2 col3
row3 -6 2 7
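One detail the examples above rely on implicitly is that drop, by default, returns a new DataFrame and leaves the original untouched. A minimal sketch, with trimmed and small as illustrative names:

```python
import pandas as pd

d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])

# drop returns a new frame; data itself keeps all three columns.
trimmed = data.drop(columns=['col1'])
print(trimmed.columns.tolist())  # ['col2', 'col3']
print(data.columns.tolist())     # ['col1', 'col2', 'col3']

# index and columns can be given together to drop rows and columns at once.
small = data.drop(index=['row1'], columns=['col1', 'col2'])
print(small)
```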
- pandas.DataFrame.dropna: removes missing values
Parameters:
- axis: {0 or 'index', 1 or 'columns'}, default 0. Determines whether rows or columns that contain missing values are removed. 0 or 'index': drop rows that contain missing values. 1 or 'columns': drop columns that contain missing values. (Passing a tuple or list to drop along several axes at once was deprecated in version 0.23.0; only a single axis is allowed.)
- how: {'any', 'all'}, default 'any'. Determines whether a row or column is removed when it has at least one NA or only when all its values are NA. 'any': if any NA values are present, drop that row or column. 'all': if all values are NA, drop that row or column.
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
'toy': [np.nan, 'Batmobile', 'Bullwhip'],
'born': [pd.NaT, pd.Timestamp("1940-04-25"),
pd.NaT]})
print(df)
# Output:
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
print(df.dropna())
# Output:
name toy born
1 Batman Batmobile 1940-04-25
print(df.dropna(axis='columns'))
# Output:
name
0 Alfred
1 Batman
2 Catwoman
print(df.dropna(how='all'))
# Output:
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
print(df.dropna(how='any'))
# Output:
name toy born
1 Batman Batmobile 1940-04-25
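In the data above no row is entirely missing, so how='all' leaves the frame unchanged. The sketch below uses a small hypothetical frame with one all-NA row to make the difference between 'any' and 'all' visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 0 only partly.
df = pd.DataFrame({'name': ['Alfred', np.nan, 'Catwoman'],
                   'toy': [np.nan, np.nan, 'Bullwhip']})

all_dropped = df.dropna(how='all')  # removes row 1 only
any_dropped = df.dropna(how='any')  # removes rows 0 and 1

print(all_dropped)
print(any_dropped)
```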
- pandas.DataFrame.fillna: fills NA/NaN values
Parameters:
- value:scalar, dict, Series, or DataFrame
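- method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None. Method used to fill holes. Recent pandas releases (2.1 and later) deprecate this parameter in favor of DataFrame.ffill() and DataFrame.bfill().
- axis: {0 or 'index', 1 or 'columns'}. Axis along which to fill missing values.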
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
'toy': [np.nan, 'Batmobile', 'Bullwhip'],
'born': [np.nan, pd.Timestamp("1940-04-25"),
np.nan]})
print(df)
print(df.fillna(0))
# Output:
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
name toy born
0 Alfred 0 0
1 Batman Batmobile 1940-04-25 00:00:00
2 Catwoman Bullwhip 0
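The value parameter also accepts a dict, which allows a different fill value per column. The sketch below shows that, along with forward filling via DataFrame.ffill(). The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in each column.
df = pd.DataFrame({'score': [1.0, np.nan, 3.0],
                   'label': ['a', np.nan, 'c']})

# A dict maps each column label to its own fill value.
filled = df.fillna(value={'score': 0.0, 'label': 'unknown'})

# Forward fill: propagate the last valid value down each column.
forward = df.ffill()

print(filled)
print(forward)
```

Filling per column avoids mixing types within a column, which is what happens when a single scalar such as 0 is written into a text or datetime column.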
- pandas.DataFrame.groupby: groups a DataFrame or Series by one or more keys so that an aggregation can be applied to each group
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
print(df)
print(df.groupby(['Animal']).mean())
# Output:
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
Max Speed
Animal
Falcon 375.0
Parrot 25.0
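Beyond mean(), a grouped object can compute several aggregates at once, and a Series can be grouped as well. The following is a sketch on the same data; stats and series_max are names chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# Select one column from the groups, then compute several aggregates.
stats = df.groupby('Animal')['Max Speed'].agg(['min', 'max', 'mean'])
print(stats)

# Grouping a Series directly, using another Series as the key.
series_max = df['Max Speed'].groupby(df['Animal']).max()
print(series_max)
```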