pandas.DataFrame

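DataFrame is the primary data structure in pandas: a two-dimensional table with a row index and column labels. It behaves like a dictionary of Series objects, where each column label is a key and the column's data (a Series) is the corresponding value. Tabular data is usually converted to a DataFrame first, before any further processing with pandas. The main parameters of the pandas.DataFrame constructor are: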

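data: the data. Accepts arrays, dictionaries, lists, and other iterables.
index: the row index (row labels).
columns: the column labels.
dtype: the data type to force; defaults to None, in which case the type is inferred from the data.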
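To make the dictionary analogy and the dtype parameter concrete, here is a minimal sketch. The variable names (scores, speed) and the choice of 'float64' are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Hypothetical example data, reusing the animal scores used in this article.
scores = pd.DataFrame(
    data={'Strength': [9, 8, 7], 'Speed': [8, 8, 9]},
    index=['Tiger', 'Lion', 'Horse'],
    dtype='float64',  # force float columns instead of the inferred integer type
)

# Selecting by column label works like a dictionary lookup and yields a Series.
speed = scores['Speed']
print(type(speed).__name__)

# The column labels play the role of the dictionary keys.
print(list(scores.keys()))

# Every column now has the forced dtype.
print(scores.dtypes)
```

Leaving dtype at its default of None would let pandas infer an integer type for these columns.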

When the data is an array or a list, it can be converted to a pandas.DataFrame as follows:

import numpy as np
import pandas as pd

# Build the table from a 2-D NumPy array
d = np.array([[9, 8, 7], [8, 8, 8], [7, 9, 10]])
data = pd.DataFrame(data=d,
                    index=['Tiger', 'Lion', 'Horse'],
                    columns=['Strength', 'Speed', 'Endurance'])
print(data)

# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
Lion          8      8          8
Horse         7      9         10

# Build the same table from a nested list
d = [[9, 8, 7], [8, 8, 8], [7, 9, 10]]
data = pd.DataFrame(data=d,
                    index=['Tiger', 'Lion', 'Horse'],
                    columns=['Strength', 'Speed', 'Endurance'])
print(data)

# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
Lion          8      8          8
Horse         7      9         10

When the data is a dictionary, it can be converted to a pandas.DataFrame as follows (the dictionary keys become the column labels):

d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data)

# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
Lion          8      8          8
Horse         7      9         10

pandas.DataFrame attributes

  • T: the transpose (rows and columns swapped)
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.T)

# Output:
           Tiger  Lion  Horse
Strength       9     8      7
Speed          8     8      9
Endurance      7     8     10
  • at: reads or sets a single value by row and column; the row label must come first and the column label second
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.at['Horse', 'Endurance'])
data.at['Tiger', 'Speed'] = 10
print(data.at['Tiger', 'Speed'])

# Output:
10
10
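  • loc: accesses a group of rows and columns by label(s) or a boolean array

loc accepts the following inputs:

  • A single label, e.g. 5 or 'a' (here 5 is interpreted as a label of the index, not as the integer position of a row or column)
  • A list or array of labels, e.g. ['a', 'b', 'c']
  • A slice object with labels, e.g. 'a':'f' (both the start and the stop are included)
  • A boolean array, e.g. [True, False, True]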
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
# Single label. Note this returns the row as a Series.
print(data.loc['Tiger'])

# Output:
Strength     9
Speed        8
Endurance    7
Name: Tiger, dtype: int64

# List of labels. Note using [[]] returns a DataFrame.
print(data.loc[['Tiger', 'Horse']])

# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
Horse         7      9         10

# Single label for row and column
print(data.loc['Tiger', 'Endurance'])

# Output:
7

# Slice with labels for row and single label for column.
# both the start and stop of the slice are included.
print(data.loc['Tiger': 'Horse', 'Speed'])

# Output:
Tiger    8
Lion     8
Horse    9
Name: Speed, dtype: int64

# Boolean list with the same length as the row axis
print(data.loc[[False, False, True]])

# Output:
       Strength  Speed  Endurance
Horse         7      9         10

# Conditional that returns a boolean Series
print(data.loc[data['Speed'] > 8])

# Output:
       Strength  Speed  Endurance
Horse         7      9         10

# Conditional that returns a boolean Series with column labels specified
print(data.loc[data['Speed'] > 8, ['Strength']])

# Output:
       Strength
Horse         7

# Callable that returns a boolean Series
print(data.loc[lambda df: df['Endurance'] == 8])

# Output:
      Strength  Speed  Endurance
Lion         8      8          8
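
The remark that loc treats 5 as a label rather than a position is easiest to see with an integer index that is not in positional order. The following is a small sketch; the frame df and its 'value' column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical frame whose row labels are integers in a non-sorted order.
df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[5, 3, 1])

# loc looks up the row whose label is 5, which happens to be the first row.
by_label = df.loc[5, 'value']

# iloc looks up by position: position 2 is the third row (label 1).
by_position = df.iloc[2, 0]

print(by_label, by_position)

# Asking loc for a label that is not in the index raises KeyError,
# even though a row exists at position 0.
try:
    df.loc[0]
except KeyError:
    print('label 0 is not in the index')
```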

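  • iloc: similar to loc, but selection is by integer position. iloc accepts only integers, lists or arrays of integers, slice objects with integers, and boolean arrays as input.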
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.iloc[[0, 1]])
# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
Lion          8      8          8

print(data.iloc[[True, False, False]])
# Output:
       Strength  Speed  Endurance
Tiger         9      8          7
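
One practical difference between the two indexers is how slices treat the stop value. The sketch below reuses the article's animal data; the variable names label_slice and position_slice are only illustrative.

```python
import pandas as pd

d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])

# Label slice with loc: the stop label 'Lion' is included.
label_slice = data.loc['Tiger':'Lion']

# Position slice with iloc: the stop position 1 is excluded,
# as with ordinary Python slicing.
position_slice = data.iloc[0:1]

print(list(label_slice.index))     # ['Tiger', 'Lion']
print(list(position_slice.index))  # ['Tiger']

# A single cell by position: third row, third column.
print(data.iloc[2, 2])             # 10
```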

  • shape: returns a tuple representing the dimensions of the DataFrame
  • values: returns the NumPy array representation of the DataFrame
d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])
print(data.values)

# Output:
[[ 9  8  7]
 [ 8  8  8]
 [ 7  9 10]]

Using DataFrame.to_numpy() instead is recommended.
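
As a brief sketch of shape and to_numpy() on the same animal data (the name arr is just an illustrative choice):

```python
import numpy as np
import pandas as pd

d = {'Strength': [9, 8, 7], 'Speed': [8, 8, 9], 'Endurance': [7, 8, 10]}
data = pd.DataFrame(data=d, index=['Tiger', 'Lion', 'Horse'])

# shape is (number of rows, number of columns).
print(data.shape)  # (3, 3)

# to_numpy() returns the underlying values as a NumPy array;
# the row and column labels are not part of the result.
arr = data.to_numpy()
print(type(arr).__name__)
print(arr[2, 2])   # 10
```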

pandas.DataFrame methods

  • Arithmetic operations: + (DataFrame.add), - (DataFrame.sub), * (DataFrame.mul), / (DataFrame.div), % (DataFrame.mod), ** (DataFrame.pow)
d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])
print(data)
print(data.add(1))
print(data.sub(1))
print(data.pow(2))

# Output:
      col1  col2  col3
row1   -10    10     5
row2    -3     0     6
row3    -6     2     7

      col1  col2  col3
row1    -9    11     6
row2    -2     1     7
row3    -5     3     8

      col1  col2  col3
row1   -11     9     4
row2    -4    -1     5
row3    -7     1     6

      col1  col2  col3
row1   100   100    25
row2     9     0    36
row3    36     4    49
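
The operator and method forms can be checked against each other. The sketch below also shows one reason to prefer the method form: it takes an axis argument for aligning a Series. The offsets Series is a hypothetical value introduced only for this example.

```python
import pandas as pd

data = pd.DataFrame([[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]],
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])

# Each operator gives the same result as its method counterpart.
same_add = (data + 1).equals(data.add(1))
same_pow = (data ** 2).equals(data.pow(2))
same_mod = (data % 3).equals(data.mod(3))
print(same_add, same_pow, same_mod)

# Subtract a different (hypothetical) offset from each column;
# the Series is aligned on the column labels.
offsets = pd.Series({'col1': 1, 'col2': 2, 'col3': 3})
shifted = data.sub(offsets, axis='columns')
print(shifted)
```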
  • pandas.DataFrame.drop: drops the specified rows or columns

Parameters:

  • labels: single label or list-like. The index or column labels to drop.
  • axis: {0 or 'index', 1 or 'columns'}, default 0. Whether to drop labels from the index (0 or 'index') or columns (1 or 'columns').
  • index: Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
  • columns: Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
d = [[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]]
data = pd.DataFrame(data=d,
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])

print(data.drop(labels=['col1', 'col2'], axis=1))
# Equivalent to: data.drop(columns=['col1', 'col2'])
# Output:
      col3
row1     5
row2     6
row3     7

print(data.drop(labels=['row1', 'row2'], axis=0))
# Equivalent to: data.drop(index=['row1', 'row2'])
# Output:
      col1  col2  col3
row3    -6     2     7
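
Note that drop returns a new DataFrame and, by default, leaves the original untouched. A short sketch under that assumption follows; the names trimmed and corner are illustrative.

```python
import pandas as pd

data = pd.DataFrame([[-10, 10, 5], [-3, 0, 6], [-6, 2, 7]],
                    index=['row1', 'row2', 'row3'],
                    columns=['col1', 'col2', 'col3'])

# drop returns a new object; data itself keeps all three columns.
trimmed = data.drop(columns=['col1'])
print(list(trimmed.columns))  # ['col2', 'col3']
print(list(data.columns))     # ['col1', 'col2', 'col3']

# Rows and columns can be dropped in a single call.
corner = data.drop(index=['row1'], columns=['col3'])
print(corner.shape)           # (2, 2)

# Dropping a label that does not exist raises KeyError by default.
try:
    data.drop(columns=['missing'])
except KeyError:
    print('no such column')
```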
  • pandas.DataFrame.dropna: removes missing values

Parameters:

  • axis: {0 or 'index', 1 or 'columns'}, default 0. Determines whether rows or columns that contain missing values are removed. 0 or 'index': drop rows that contain missing values. 1 or 'columns': drop columns that contain missing values. Only a single axis is allowed; passing a tuple or list to drop on multiple axes was deprecated in version 0.23.0.
  • how: {'any', 'all'}, default 'any'. Determines whether a row or column is removed when it has at least one NA or only when it is all NA. 'any': if any NA values are present, drop that row or column. 'all': if all values are NA, drop that row or column.
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
                   'toy': [np.nan, 'Batmobile', 'Bullwhip'],
                   'born': [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})

print(df)
# Output:
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

print(df.dropna())
# Output:
     name        toy       born
1  Batman  Batmobile 1940-04-25

print(df.dropna(axis='columns'))
# Output:
       name
0    Alfred
1    Batman
2  Catwoman

print(df.dropna(how='all'))
# Output:
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

print(df.dropna(how='any'))
# Output:
     name        toy       born
1  Batman  Batmobile 1940-04-25
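
In the example above, how='all' removes nothing because no row is entirely missing. The sketch below uses a small hypothetical frame (columns 'a' and 'b' are made up for illustration) in which the two settings give different results.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely NA, row 2 is only partly NA.
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, np.nan, 5.0]})

# how='any' drops every row that has at least one NA.
any_dropped = df.dropna(how='any')

# how='all' drops only the rows in which every value is NA.
all_dropped = df.dropna(how='all')

print(list(any_dropped.index))  # [0]
print(list(all_dropped.index))  # [0, 2]
```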
  • pandas.DataFrame.fillna: fills NA/NaN values

Parameters:

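  • value: scalar, dict, Series, or DataFrame. The value used to fill missing entries.
  • method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None. The method used to fill missing entries. Recent pandas releases (2.1 and later) deprecate this parameter in favor of DataFrame.ffill() and DataFrame.bfill().
  • axis: {0 or 'index', 1 or 'columns'}. Axis along which to fill missing values.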
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
                   'toy': [np.nan, 'Batmobile', 'Bullwhip'],
                   'born': [np.nan, pd.Timestamp("1940-04-25"),
                            np.nan]})
print(df)
print(df.fillna(0))

# Output:
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

       name        toy                 born
0    Alfred          0                    0
1    Batman  Batmobile  1940-04-25 00:00:00
2  Catwoman   Bullwhip                    0
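
Filling every column with the same scalar is often not what is wanted. As a sketch of the dict form of value, the example below fills each column with its own replacement and also shows forward filling; the 'price' column and the fill values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and one numeric column.
df = pd.DataFrame({'toy': [np.nan, 'Batmobile', 'Bullwhip'],
                   'price': [np.nan, 30.0, np.nan]})

# A dict maps each column label to its own fill value.
filled = df.fillna({'toy': 'unknown', 'price': 0.0})
print(filled)

# Forward fill propagates the last valid value downward;
# the leading NaN stays because nothing precedes it.
forward = df['price'].ffill()
print(forward)
```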
  • pandas.DataFrame.groupby: groups the rows of a DataFrame (or Series) so that an operation can be applied to each group
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

print(df)
print(df.groupby(['Animal']).mean())
# Output:
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

        Max Speed
Animal
Falcon      375.0
Parrot       25.0
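
Beyond a single mean, several aggregations can be computed per group in one call. This is a minimal sketch using agg on the same data; the name summary and the chosen set of aggregations are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# Select the column to aggregate, then compute several statistics per group.
summary = df.groupby('Animal')['Max Speed'].agg(['mean', 'max', 'count'])
print(summary)
```

The result has one row per animal and one column per requested aggregation.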