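Pandas, a Python Data Analysis Tool: Pandas Data Structures

Overview

The name Pandas comes from "panel data" and "Python data analysis".

Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy and provides high-level data structures and data manipulation tools, and it is one of the main reasons Python works well as a powerful, efficient data analysis environment.

  • A toolkit for analyzing and manipulating large structured datasets

  • Built on NumPy, which provides high-performance array and matrix operations

  • Offers many functions and methods for handling data quickly and conveniently

  • Used in data mining and data analysis

  • Provides data cleaning features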


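Installing Pandas

pip install pandas

Installing from a mirror is recommended:

pip install pandas -i http://pypi.douban.com/simple

Importing Pandas

import numpy as np
import pandas as pd

Pandas has two main data structures, which are also its most important ones: Series and DataFrame.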

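I. Series

A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of index labels.

  • An object similar to a one-dimensional array
  • Made up of data and an index
    • The index is on the left and the data (values) on the right
    • If no index is given, one is created automatically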

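1.1 Creating a Series

1. From a list

ser_obj = pd.Series(range(10))

Example code:

# Build a Series from a list-like object (here a range)
ser_obj = pd.Series(range(10, 20))

# Read the first three rows
print(ser_obj.head(3))

# Read all the data
print(ser_obj)

# Check the type of the object
print(type(ser_obj))

Output: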

0    10
1    11
2    12
dtype: int64

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

<class 'pandas.core.series.Series'>

2. From a dict

ser_obj = pd.Series(dict())

Example code:

# Build a Series from a dict
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)

Output:

2001    17.8
2002    20.1
2003    16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')

3. From a one-dimensional array

ser_obj = pd.Series(arr)

Example code:

arr = np.random.randn(5)
ser_obj = pd.Series(arr)
print(arr)
print(ser_obj)
# By default the index is integers starting at 0 with a step of 1

ser_obj = pd.Series(arr, index = ['a','b','c','d','e'],dtype = object)
print(ser_obj)
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the data type

Output:

[ 0.11206121  0.1324684   0.59930544  0.34707543 -0.15652941]
0    0.112061
1    0.132468
2    0.599305
3    0.347075
4   -0.156529
dtype: float64

a    0.112061
b    0.132468
c    0.599305
d    0.347075
e   -0.156529
dtype: object

4. From a scalar

Example code:

ser_obj = pd.Series(10, index = range(4))
print(ser_obj)
# If data is a scalar, provide an index: the value is repeated to match the length of the index

Output:

0    10
1    10
2    10
3    10
dtype: int64

1.2 The name attribute

Name of the object: ser_obj.name

Name of the object's index: ser_obj.index.name

Example code:

# name attributes
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())

Output:

year
2001    17.8
2002    20.1
2003    16.5
Name: temp, dtype: float64
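
The two names become visible when a Series is turned into a table. As a minimal sketch (the temperature data is rebuilt here only so the block runs on its own, and the variable names temps and table are illustrative), reset_index() should use the index name and the Series name as column labels:

```python
import pandas as pd

# Illustrative data only, rebuilt so the example is self-contained.
temps = pd.Series({2001: 17.8, 2002: 20.1, 2003: 16.5}, name='temp')
temps.index.name = 'year'

# reset_index() moves the index into a regular column and returns a DataFrame.
table = temps.reset_index()
print(table.columns.tolist())
print(table)
```

Without the two names, the resulting columns would get generic labels, so setting them early tends to pay off later.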

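1.3 Indexing a Series

Indexing falls into positional indexing, label indexing, slice indexing, and boolean indexing.

0. Getting the values and the index

Before the indexing methods, here is how to get the values and the index:

ser_obj.index and ser_obj.values

Example code:

# Get the data
print(ser_obj.values)

# Get the index
print(ser_obj.index)

Output: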

# print(ser_obj.values)
[10 11 12 13 14 15 16 17 18 19]

# print(ser_obj.index)
RangeIndex(start=0, stop=10, step=1)

Note that arithmetic and comparisons keep each value paired with its index label:

# Operations do not change the pairing between index and data
print(ser_obj * 2)
print(ser_obj > 15)

Output:

0    20
1    22
2    24
3    26
4    28
5    30
6    32
7    34
8    36
9    38
dtype: int64

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool

1. Positional indexing

ser_obj[idx]

Example code:

# Get data by index (with the default integer index, label and position coincide)
print(ser_obj[0])
print(ser_obj[8])

Output:

10
18
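
With the default index, position and label happen to be the same number. When the index holds other integers, plain [] looks up labels, so the sketch below (made-up data, with labels chosen to differ from the positions) uses .loc for labels and .iloc for positions to keep the two apart:

```python
import pandas as pd

# Hypothetical data: the integer labels deliberately differ from the positions.
s = pd.Series([1.5, 2.5, 3.5], index=[10, 20, 30])

by_label = s.loc[10]      # label 10 is the first element
by_position = s.iloc[0]   # position 0 is also the first element
last = s.iloc[-1]         # negative positions count from the end

print(by_label, by_position, last)
```

Here s[0] would be treated as a label lookup and fail, because no label 0 exists.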

2. Label indexing

ser_obj[str]

Example code:

ser_obj = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(ser_obj)
print(ser_obj['a'],type(ser_obj['a']),ser_obj['a'].dtype)
# Works like positional indexing: put the index label inside []; note that the label here is a string

sci = ser_obj[['a','b','e']]
print(sci,type(sci))
# To select several labels, use [[]] (a list inside the [])
# Selecting several labels returns a new Series

Output:

a    0.714630
b    0.213957
c    0.172188
d    0.972158
e    0.875175
dtype: float64
0.714630383451 <class 'numpy.float64'> float64

a    0.714630
b    0.213957
e    0.875175
dtype: float64 <class 'pandas.core.series.Series'>

3. Slice indexing

ser_obj[start: end: step]

Example code:

ser_obj1 = pd.Series(np.random.rand(5))
ser_obj2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(ser_obj1[1:4],ser_obj1[4])
print(ser_obj2['a':'c'],ser_obj2['c'])
print(ser_obj2[0:3],ser_obj2.iloc[3])  # .iloc makes the positional lookup explicit
print('-----')
# Note: slicing by index label includes the end point

print(ser_obj2[:-1])
print(ser_obj2[::2])  # step = 2
# Slicing by position uses the same syntax as a list

Output:

1    0.865967
2    0.114500
3    0.369301
dtype: float64 0.411702342342
a    0.717378
b    0.642561
c    0.391091
dtype: float64 0.39109096261
a    0.717378
b    0.642561
c    0.391091
dtype: float64 0.998978363818
-----
a    0.717378
b    0.642561
c    0.391091
d    0.998978
dtype: float64
a    0.717378
c    0.391091
e    0.957639
dtype: float64
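
The difference between the two kinds of slice is easy to miss with random data. A small sketch with fixed, made-up values, using .loc and .iloc so that it is explicit which kind of slice is meant:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

label_slice = s.loc['a':'c']     # end label 'c' is included: 3 elements
position_slice = s.iloc[0:2]     # end position 2 is excluded: 2 elements

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```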

4. Boolean indexing

Example code:

# Boolean indexing
s = pd.Series(np.random.rand(3)*100)
s[4] = None  # add a missing value
print(s)
bs1 = s > 50
bs2 = s.isnull()
bs3 = s.notnull()
print(bs1, type(bs1), bs1.dtype)
print(bs2, type(bs2), bs2.dtype)
print(bs3, type(bs3), bs3.dtype)
print('-----')
# A comparison on a Series returns a new Series of boolean values
# .isnull() / .notnull() test for missing values (None means no value, NaN means not a number; both count as missing)

print(s[s > 50])
print(s[bs3])
# Boolean indexing: write the condition inside []; it can be an expression or a boolean Series

Output:

0    2.03802
1    40.3989
2    25.2001
4       None
dtype: object
0    False
1    False
2    False
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
0    False
1    False
2    False
4     True
dtype: bool <class 'pandas.core.series.Series'> bool
0     True
1     True
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
-----
Series([], dtype: object)
0    2.03802
1    40.3989
2    25.2001
dtype: object
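
Conditions can also be combined. The following is a sketch with fixed, made-up values; it relies on the element-wise operators & and ~ rather than Python's and / not, which do not work element by element on a Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([12.0, 55.0, np.nan, 78.0, 30.0])

# Element-wise AND uses &, and each condition needs its own parentheses.
in_range = s[(s > 20) & (s < 60)]
# ~ negates a boolean Series; here it keeps the values that are not missing.
present = s[~s.isnull()]

print(in_range)
print(present)
```

A comparison against NaN evaluates to False, so the missing value drops out of in_range without any extra handling.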

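1.4 Basic Series techniques

This covers viewing data, reindexing, alignment, adding, modifying, and deleting values.

1. Viewing data

.head() shows the first rows, 5 by default
.tail() shows the last rows, 5 by default

Example code:

s = pd.Series(np.random.rand(50))

# .head() shows the first rows
print(s.head(10))
# .tail() shows the last rows
print(s.tail())

Output: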

0    0.730540
1    0.116711
2    0.787693
3    0.969764
4    0.324540
5    0.061827
6    0.377060
7    0.820383
8    0.964477
9    0.451936
dtype: float64
45    0.899540
46    0.237008
47    0.298762
48    0.848487
49    0.829858
dtype: float64

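2. Reindexing

Reindex with .reindex
.reindex reorders the data to match the given index; a label that is not in the current index introduces a missing value

Example code:

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
s1 = s.reindex(['c','b','a','d'])
print(s1)
# .reindex() also takes a list
# The label 'd' does not exist here, so its value is NaN

s2 = s.reindex(['c','b','a','d'], fill_value = 0)
print(s2)
# fill_value parameter: the value used to fill missing entries

Output: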

a    0.343718
b    0.322228
c    0.746720
dtype: float64

c    0.746720
b    0.322228
a    0.343718
d         NaN
dtype: float64

c    0.746720
b    0.322228
a    0.343718
d    0.000000
dtype: float64
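
fill_value puts the same constant in every gap. As a hedged sketch of one alternative, reindex also accepts a method parameter; with method='ffill' each new label takes the value of the closest earlier label. The data below is made up, and this option assumes the original index is sorted:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

# method='ffill' carries the last known value forward into the new labels.
filled = s.reindex(range(6), method='ffill')
print(filled)
```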

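3. Alignment

The main difference between a Series and an ndarray is that operations on Series align automatically by label
The order of the index does not affect the calculation; values are matched by label
Any calculation involving a missing value still gives a missing value

Example code:

s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
print(s1)
print(s2)

# Alignment
print(s1+s2)

Output: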

Jack     0.753732
Marry    0.180223
Tom      0.283704
dtype: float64

Wang     0.309128
Jack     0.533997
Marry    0.626126
dtype: float64

Jack     1.287729
Marry    0.806349
Tom           NaN
Wang          NaN
dtype: float64

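4. Adding

Add a value directly by assigning to a new position or index label
Join on another Series with pd.concat, which returns a new Series and leaves the originals unchanged (the older .append method was removed in pandas 2.0)

Example code:

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s1)
print(s2)

# Add a value directly by assigning to a new position or index label
s1[5] = 100
s2['a'] = 100
print(s1)
print(s2)
print('-----')

# Join on another Series with pd.concat
# pd.concat returns a new Series and leaves the originals unchanged
s3 = pd.concat([s1, s2])
print(s3)
print(s1)

Output: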

0    0.516447
1    0.699382
2    0.469513
3    0.589821
4    0.402188
dtype: float64

n    0.615641
g    0.451192
j    0.022328
u    0.977568
r    0.902041
dtype: float64

0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
dtype: float64

n      0.615641
g      0.451192
j      0.022328
u      0.977568
r      0.902041
a    100.000000
dtype: float64
-----
0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
n      0.615641
g      0.451192
j      0.022328
u      0.977568
r      0.902041
a    100.000000
dtype: float64

0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
dtype: float64
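
Joining two Series keeps the original labels, so duplicates can appear when both use the default index. A short sketch with made-up data, using pd.concat and its ignore_index option to renumber the result:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])

kept = pd.concat([s1, s2])                             # labels 0, 1, 2, 0, 1
renumbered = pd.concat([s1, s2], ignore_index=True)    # labels 0 to 4

print(kept.index.tolist())
print(renumbered.index.tolist())
```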

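5. Modifying

Modify values directly through the index, as with a sequence

Example code:

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)

# Modify values directly through the index, as with a sequence
s['a'] = 100
s[['b','c']] = 200
print(s)

Output: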

a    0.873604
b    0.244707
c    0.888685
dtype: float64

a    100.0
b    200.0
c    200.0
dtype: float64

6. Deleting values

drop removes elements and returns a copy (inplace=False)

Example code:

s = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s)

# drop removes elements and returns a copy (inplace=False)
s1 = s.drop('n')
s2 = s.drop(['g','j'])
print(s1)
print(s2)
print(s)

Output:

n    0.876587
g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64

g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64

n    0.876587
u    0.360634
r    0.454483
dtype: float64

n    0.876587
g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64

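II. DataFrame

A DataFrame is a tabular data structure with an ordered set of columns, each of which can hold a different type of value. It has both a row index and a column index, and can be viewed as a dict of Series that share the same index. The data is stored in a two-dimensional layout.

  • Similar to a multi-dimensional array or tabular data (for example, Excel or a data.frame in R)
  • Each column can have a different type
  • The index includes a column index and a row index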



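2.1 Creating a DataFrame

1. From a dict of arrays/lists

Constructor: pandas.DataFrame()
A DataFrame built from a dict of arrays/lists takes its columns from the dict keys and gets the default integer index
The values in the dict must all have the same length!
columns parameter: a list that can reorder the columns; a column that is not in the data (for example 'd') is filled with NaN
When columns is specified, it may list fewer columns than the data has

Example code:

data1 = {'a':[1,2,3],
        'b':[3,4,5],
        'c':[5,6,7]}
data2 = {'one':np.random.rand(3),
        'two':np.random.rand(3)}   # What happens if you try 'two':np.random.rand(4) here?
print(data1)
print(data2)

# A DataFrame built from a dict of arrays/lists takes its columns from the dict keys and gets the default integer index
# The values in the dict must all have the same length!
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)

# columns parameter: a list that can reorder the columns; a column that is not in the data (for example 'd') is filled with NaN
# When columns is specified, it may list fewer columns than the data has
df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)

# index parameter: redefines the index; it is a list whose length must match the data
df2 = pd.DataFrame(data2, index = ['f1','f2','f3'])  # What happens if you try index = ['f1','f2','f3','f4'] here?
print(df2)

Output: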

{'a': [1, 2, 3], 'c': [5, 6, 7], 'b': [3, 4, 5]}
{'one': array([ 0.00101091,  0.08807153,  0.58345056]), 'two': array([ 0.49774634,  0.16782565,  0.76443489])}

   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7

        one       two
0  0.001011  0.497746
1  0.088072  0.167826
2  0.583451  0.764435

   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN

   b  c
0  3  5
1  4  6
2  5  7

         one       two
f1  0.001011  0.497746
f2  0.088072  0.167826
f3  0.583451  0.764435
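
The two "what happens if" questions in the comments can be checked directly. In the sketch below, try_build is a hypothetical helper written only for this illustration; it attempts the construction and reports whether pandas raised a ValueError:

```python
import numpy as np
import pandas as pd

def try_build(data, **kwargs):
    """Hypothetical helper: return 'ok' or the name of the exception raised."""
    try:
        pd.DataFrame(data, **kwargs)
        return 'ok'
    except ValueError as exc:
        print(exc)
        return type(exc).__name__

# Columns of different lengths (3 and 4).
unequal_columns = try_build({'one': np.zeros(3), 'two': np.zeros(4)})
# Four index labels for three rows of data.
long_index = try_build({'one': np.zeros(3)}, index=['f1', 'f2', 'f3', 'f4'])

print(unequal_columns, long_index)
```

Both attempts are expected to fail, which is why the text insists that the lengths match.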

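2. From a dict of Series

Constructor: pandas.DataFrame()
A DataFrame built from a dict of Series takes its columns from the dict keys and its index from the Series labels (the default integer labels if the Series have none)
The Series may differ in length; the resulting DataFrame then contains NaN values

Example code:

data1 = {'one':pd.Series(np.random.rand(2)),
        'two':pd.Series(np.random.rand(3))}  # Series without an explicit index
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
        'two':pd.Series(np.random.rand(3),index = ['a','b','c'])}  # Series with an explicit index
print(data1)
print(data2)

# A DataFrame built from a dict of Series takes its columns from the dict keys and its index from the Series labels (the default integer labels if the Series have none)
# The Series may differ in length; the resulting DataFrame then contains NaN values
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)

Output: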

{'one': 0    0.892580
1    0.834076
dtype: float64, 'two': 0    0.301309
1    0.977709
2    0.489000
dtype: float64}

{'one': a    0.470947
b    0.584577
dtype: float64, 'two': a    0.122659
b    0.136429
c    0.396825
dtype: float64}

        one       two
0  0.892580  0.301309
1  0.834076  0.977709
2       NaN  0.489000

        one       two
a  0.470947  0.122659
b  0.584577  0.136429
c       NaN  0.396825

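3. Directly from a two-dimensional array (common)

Constructor: pandas.DataFrame()
A DataFrame built directly from a two-dimensional array has the same shape as the array; if index and columns are not given, both default to integers
The lengths of index and columns must match the array

Example code:

ar = np.random.rand(9).reshape(3,3)
print(ar)

df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three'])  # Try an index or columns whose length does not match the array

# A DataFrame built directly from a two-dimensional array has the same shape as the array; if index and columns are not given, both default to integers
# The lengths of index and columns must match the array
print(df1)
print(df2)

Output: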

[[ 0.54492282  0.28956161  0.46592269]
 [ 0.30480674  0.12917132  0.38757672]
 [ 0.2518185   0.13544544  0.13930429]]
 
          0         1         2
0  0.544923  0.289562  0.465923
1  0.304807  0.129171  0.387577
2  0.251819  0.135445  0.139304

        one       two     three
a  0.544923  0.289562  0.465923
b  0.304807  0.129171  0.387577
c  0.251819  0.135445  0.139304

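4. From a list of dicts

Constructor: pandas.DataFrame()
A DataFrame built from a list of dicts takes its columns from the dict keys; if index is not given, the default integer labels are used
The columns and index parameters set the column and row labels respectively

Example code:

data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
print(data)

# A DataFrame built from a list of dicts takes its columns from the dict keys; if index is not given, the default integer labels are used
# The columns and index parameters set the column and row labels respectively
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print(df1)
print(df2)
print(df3)

Output: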

[{'one': 1, 'two': 2}, {'one': 5, 'three': 20, 'two': 10}]
   one  three  two
0    1    NaN    2
1    5   20.0   10
   one  three  two
a    1    NaN    2
b    5   20.0   10
   one  two
0    1    2
1    5   10

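5. From a dict of dicts

Constructor: pandas.DataFrame()
A DataFrame built from a dict of dicts takes its columns from the outer dict keys and its index from the inner dict keys
The columns parameter can add or remove columns; a new column is filled with NaN
Unlike before, index here cannot relabel the existing index; labels that are not already present give rows of NaN (very important!)

Example code:

data = {'Jack':{'math':90,'english':89,'art':78},
       'Marry':{'math':82,'english':95,'art':92},
       'Tom':{'math':78,'english':67}}

# A DataFrame built from a dict of dicts takes its columns from the outer dict keys and its index from the inner dict keys
df1 = pd.DataFrame(data)
print(df1)

# The columns parameter can add or remove columns; a new column is filled with NaN
# Unlike before, index here cannot relabel the existing index; labels that are not already present give rows of NaN (very important!)
df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)

Output: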

         Jack  Marry   Tom
art        78     92   NaN
english    89     95  67.0
math       90     82  78.0

         Jack   Tom  Bob
art        78   NaN  NaN
english    89  67.0  NaN
math       90  78.0  NaN

   Jack  Marry  Tom
a   NaN    NaN  NaN
b   NaN    NaN  NaN
c   NaN    NaN  NaN
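
Since the index parameter only selects labels in this case, relabeling has to happen after construction. One way, sketched here with a shortened, made-up version of the score data, is DataFrame.rename with a mapping from old to new row labels:

```python
import pandas as pd

# Shortened illustrative data in the same dict-of-dicts shape.
data = {'Jack': {'math': 90, 'art': 78},
        'Tom': {'math': 78, 'art': 65}}
df = pd.DataFrame(data)

# rename maps old row labels to new ones and returns a new DataFrame.
renamed = df.rename(index={'math': 'a', 'art': 'b'})
print(renamed)
```

The values stay attached to their rows; only the labels change.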

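2.2 Indexing a DataFrame

A DataFrame has both a row index and a column index, and can be viewed as a dict of Series that share one index

Indexing falls into selecting columns, selecting rows, slicing, and boolean tests.

1. Selecting rows and columns

Example code:

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)

# Select columns by name: one column gives a Series, several columns give a DataFrame
data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')

# Select rows by index label: one row gives a Series, several rows give a DataFrame
data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))

Output: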

               a          b          c          d
one    64.413076  64.375994  40.627911  37.738178
two    59.671212   5.855122  80.103200  69.379653
three   5.767027  61.162748  53.995211  10.903334

one      64.413076
two      59.671212
three     5.767027
Name: a, dtype: float64 
<class 'pandas.core.series.Series'>

               a          c
one    64.413076  40.627911
two    59.671212  80.103200
three   5.767027  53.995211 
<class 'pandas.core.frame.DataFrame'>
-----
a    64.413076
b    64.375994
c    40.627911
d    37.738178
Name: one, dtype: float64 
<class 'pandas.core.series.Series'>

             a          b          c          d
one  64.413076  64.375994  40.627911  37.738178
two  59.671212   5.855122  80.103200  69.379653 
<class 'pandas.core.frame.DataFrame'>

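2. Selecting columns

df[] - select columns
Usually used to select columns; it can also select rows
Key point: df[col] is generally used to select columns, with the column name inside []

Example code:

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)
print('-----')

# df[] selects columns by default, with the column name inside [] (so column names are usually set explicitly rather than left as default integers, to avoid confusion with the index)
# One column gives a Series, printed in Series format
# Several columns give a DataFrame, printed in DataFrame format
data1 = df['a']
data2 = df[['b','c']]  # Try data2 = df[['b','c','e']]
print(data1)
print(data2)

# With a numeric slice inside df[], rows are selected; only slices work, not a single position (df[0])
# The result is a DataFrame, even for a single row
# df[] cannot select a row by its index label (df['one'])
data3 = df[:1]
#data3 = df[0]
#data3 = df['one']
print(data3,type(data3))

Output: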

               a          b          c          d
one    88.490183  93.588825   1.605172  74.610087
two    45.905361  49.257001  87.852426  97.490521
three  95.801001  97.991028  74.451954  64.290587
-----
one      88.490183
two      45.905361
three    95.801001
Name: a, dtype: float64

               b          c
one    93.588825   1.605172
two    49.257001  87.852426
three  97.991028  74.451954

             a          b         c          d
one  88.490183  93.588825  1.605172  74.610087 
<class 'pandas.core.frame.DataFrame'>

3. Selecting rows

df.loc[] - select rows by index label
Key point: df.loc[label] selects rows by index label, and works with both a user-defined index and the default integer index

Example code:

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)
print('-----')

# A single label returns a Series
data1 = df1.loc['one']
data2 = df2.loc[1]
print('Single-label indexing\n-----')
print(data1)
print(data2)

# Several labels; the order can be changed
# .loc raises a KeyError for a label that does not exist, so reindex is used to get a row of NaN for 'five'
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print('Multi-label indexing\n-----')
print(data3)
print(data4)

# Slices are supported
# The end point is included
data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print('Slice indexing')
print(data5)
print(data6)

Output:

               a          b          c          d
one    73.070679   7.169884  80.820532  62.299367
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
four   76.723039  44.995970  22.408450  23.273089
           a          b          c          d
0  93.871055  28.031989  57.093181  34.695293
1  22.882809  47.499852  86.466393  86.140909
2  80.840336  98.120735  84.495414   8.413039
3  59.695834   1.478707  15.069485  48.775008
-----
Single-label indexing
a    73.070679
b     7.169884
c    80.820532
d    62.299367
Name: one, dtype: float64
a    22.882809
b    47.499852
c    86.466393
d    86.140909
Name: 1, dtype: float64
-----
Multi-label indexing
               a          b          c          d
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  59.695834   1.478707  15.069485  48.775008
2  80.840336  98.120735  84.495414   8.413039
1  22.882809  47.499852  86.466393  86.140909
-----
Slice indexing
               a          b          c          d
one    73.070679   7.169884  80.820532  62.299367
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
           a          b          c          d
1  22.882809  47.499852  86.466393  86.140909
2  80.840336  98.120735  84.495414   8.413039
3  59.695834   1.478707  15.069485  48.775008
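
.loc also takes a second argument for the columns, which the examples above do not show. A minimal sketch with a small made-up frame of consecutive integers, so the selected cells are easy to verify by eye:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

cell = df.loc['two', 'c']                       # a single value
block = df.loc[['one', 'three'], ['a', 'd']]    # a 2x2 DataFrame
span = df.loc['one':'two', 'b':'c']             # both label slices include their end

print(cell)
print(block)
print(span)
```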

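4. Selecting rows by integer position

df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
Works like list indexing: the order is the DataFrame's integer position, counted from 0

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

# Single position
# Unlike loc, which looks up labels, iloc cannot index a position beyond the number of rows
print('Single-position indexing\n-----')
print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])

# Several positions
# The order can be changed
print('Multi-position indexing\n-----')
print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])

# Slices
# The end point is not included
print('Slice indexing')
print(df.iloc[1:3])
print(df.iloc[::2])

Output: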

               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
four   75.007726  84.208576  69.445779  75.546038
------
Single-position indexing
a    21.848926
b     2.482328
c    17.338355
d    73.014166
Name: one, dtype: float64
a    75.007726
b    84.208576
c    69.445779
d    75.546038
Name: four, dtype: float64
-----
Multi-position indexing
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
four   75.007726  84.208576  69.445779  75.546038
three  87.183015  85.973426  48.839267  99.930097
two    99.092794   0.601173  18.598736  61.166478
-----
Slice indexing
               a          b          c          d
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
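
The commented-out df.iloc[4] line hints at what happens past the last row. The sketch below, on a made-up 3x4 frame, shows that case together with positional selection of rows and columns in one call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

cell = df.iloc[1, 2]          # row position 1, column position 2
last_col = df.iloc[:2, -1]    # first two rows of the last column

try:
    df.iloc[3]                # only positions 0, 1 and 2 exist
    out_of_range = False
except IndexError:
    out_of_range = True

print(cell, last_col.tolist(), out_of_range)
```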

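5. Boolean indexing

Works the same way as for a Series

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

# Without selecting anything first, the test is applied to every value
# The result keeps the full shape: True keeps the original value, False becomes NaN
b1 = df < 20
print(b1,type(b1))
print(df[b1])  # can also be written as df[df < 20]
print('------')

# Test a single column
# The result keeps the rows where that column is True, including the other columns
b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])  # can also be written as df[df['a'] > 50]
print('------')

# Test several columns
# The result keeps the full shape: True keeps the original value, False becomes NaN
b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])  # can also be written as df[df[['a','b']] > 50]
print('------')

# Test several rows
# The result keeps the full shape: True keeps the original value, False becomes NaN
b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])  # can also be written as df[df.loc[['one','three']] < 50]
print('------')

Output: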

               a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two    50.105112  28.478878  93.669529  90.029489
three  35.496053  19.248457  74.811841  20.711431
four   24.604478  57.731456  49.682717  82.132866
------
           a      b      c      d
one     True  False  False  False
two    False  False  False  False
three  False   True  False  False
four   False  False  False  False <class 'pandas.core.frame.DataFrame'>

               a          b   c   d
one    19.185849        NaN NaN NaN
two          NaN        NaN NaN NaN
three        NaN  19.248457 NaN NaN
four         NaN        NaN NaN NaN
------
one      False
two       True
three    False
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>

             a          b          c          d
two  50.105112  28.478878  93.669529  90.029489
------
           a      b
one    False  False
two     True  False
three  False  False
four   False   True <class 'pandas.core.frame.DataFrame'>

               a          b   c   d
one          NaN        NaN NaN NaN
two    50.105112        NaN NaN NaN
three        NaN        NaN NaN NaN
four         NaN  57.731456 NaN NaN
------
          a     b      c     d
one    True  True   True  True
three  True  True  False  True <class 'pandas.core.frame.DataFrame'>

               a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two          NaN        NaN        NaN        NaN
three  35.496053  19.248457        NaN  20.711431
four         NaN        NaN        NaN        NaN
------
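
Row filters on several columns can be combined as well. A sketch with fixed, made-up numbers, joining two single-column conditions with & (and) and | (or); each condition is wrapped in parentheses because these operators bind more tightly than the comparisons:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 60, 70, 20],
                   'b': [5, 40, 15, 80]},
                  index=['one', 'two', 'three', 'four'])

both = df[(df['a'] > 50) & (df['b'] < 30)]     # rows where both conditions hold
either = df[(df['a'] > 50) | (df['b'] > 50)]   # rows where at least one holds

print(both)
print(either)
```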

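6. Combined indexing (for example, selecting rows and columns together)

Select columns first and then rows: for a dataset, this amounts to picking the fields first and then the records

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']])   # rows one and three of column a
print(df[['b','c','d']].iloc[::2])   # rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2])   # the first two rows that satisfy the condition

Output: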

               a          b          c          d
one    50.660904  89.827374  51.096827   3.844736
two    70.699721  78.750014  52.988276  48.833037
three  33.653032  27.225202  24.864712  29.662736
four   21.792339  26.450939   6.122134  52.323963
------
one      50.660904
three    33.653032
Name: a, dtype: float64

               b          c          d
one    89.827374  51.096827   3.844736
three  27.225202  24.864712  29.662736

               a          b          c          d
three  33.653032  27.225202  24.864712  29.662736
four   21.792339  26.450939   6.122134  52.323963
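
Two-step selection works for reading, but for assignment a single .loc call is generally the safer form, since a two-step expression may end up writing to a temporary copy. A sketch on made-up data comparing the two for reading, then assigning in one step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4) * 10,
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

chained = df['a'].loc[['one', 'three']]    # two steps: column, then rows
single = df.loc[['one', 'three'], 'a']     # one step: rows and column together
same = chained.equals(single)

# Assigning through .loc in one step modifies df itself.
df.loc[df['a'] < 50, 'd'] = 0
print(same)
print(df)
```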

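2.3 Basic DataFrame techniques

This covers viewing data, transposing, alignment, sorting, adding, modifying, and deleting values.

1. Viewing and transposing data

.head() shows the first rows
.tail() shows the last rows
.T transposes

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,
                   columns = ['a','b'])

# .head() shows the first rows
# .tail() shows the last rows
# Both show 5 rows by default
print(df.head(2))
print(df.tail())

# .T transposes
print(df.T)

Output: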

           a          b
0   5.777208  18.374283
1  85.961515  55.120036
           a          b
3  21.236577  15.902872
4  46.137564  29.350647
5  70.157709  58.972728
6   8.368292  42.011356
7  29.824574  87.062295
           0          1          2          3          4          5  \
a   5.777208  85.961515  11.005284  21.236577  46.137564  70.157709   
b  18.374283  55.120036  35.595598  15.902872  29.350647  58.972728   

           6          7  
a   8.368292  29.824574  
b  42.011356  87.062295 

2. Adding and modifying

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

# Add a new column/row and assign values
df['e'] = 10
df.loc[4] = 20
print(df)

# Select and then assign to modify values directly
df['e'] = 20
df[['a','c']] = 100
print(df)

Output:

           a          b          c          d
0  17.148791  73.833921  39.069417   5.675815
1  91.572695  66.851601  60.320698  92.071097
2  79.377105  24.314520  44.406357  57.313429
3  84.599206  61.310945   3.916679  30.076458
           a          b          c          d   e
0  17.148791  73.833921  39.069417   5.675815  10
1  91.572695  66.851601  60.320698  92.071097  10
2  79.377105  24.314520  44.406357  57.313429  10
3  84.599206  61.310945   3.916679  30.076458  10
4  20.000000  20.000000  20.000000  20.000000  20
     a          b    c          d   e
0  100  73.833921  100   5.675815  20
1  100  66.851601  100  92.071097  20
2  100  24.314520  100  57.313429  20
3  100  61.310945  100  30.076458  20
4  100  20.000000  100  20.000000  20

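3. Deleting

del / drop()

Example code:

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

# del statement - deletes a column
del df['a']
print(df)
print('-----')

# drop() deletes rows; with inplace=False (the default) it returns new data and leaves the original unchanged
print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')

# drop() with axis=1 deletes columns; the original is again left unchanged
print(df.drop(['d'], axis = 1))
print(df)

Output: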

           a          b          c          d
0  91.866806  88.753655  18.469852  71.651277
1  64.835568  33.844967   6.391246  54.916094
2  75.930985  19.169862  91.042457  43.648258
3  15.863853  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
-----
           b          c          d
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
3  24.788866  10.625684  82.135316
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
-----
           b          c
0  88.753655  18.469852
1  33.844967   6.391246
2  19.169862  91.042457
3  24.788866  10.625684
           b          c          d
0  88.753655  18.469852  71.651277
1  33.844967   6.391246  54.916094
2  19.169862  91.042457  43.648258
3  24.788866  10.625684  82.135316
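
drop also accepts the keyword arguments index and columns, which some readers find clearer than remembering the axis number. A brief sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('abcd'))

without_d = df.drop(columns=['d'])     # same effect as df.drop(['d'], axis=1)
without_row0 = df.drop(index=[0])      # same effect as df.drop([0])

print(without_d)
print(without_row0)
print(df.shape)                        # the original is unchanged
```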

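4. Alignment

Data in DataFrame objects is aligned automatically on both the columns and the index (row labels)

Example code:

df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

# Data in DataFrame objects is aligned automatically on both the columns and the index (row labels)
print(df1 + df2)

Output: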

          A         B         C   D
0 -0.281123 -2.529461  1.325663 NaN
1 -0.310514 -0.408225 -0.760986 NaN
2 -0.172169 -2.355042  1.521342 NaN
3  1.113505  0.325933  3.689586 NaN
4  0.107513 -0.503907 -1.010349 NaN
5 -0.845676 -2.410537 -1.406071 NaN
6  1.682854 -0.576620 -0.981622 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
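
As with Series, the add method with fill_value can stand in for the missing side during alignment. A small sketch with made-up frames of different shapes:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
df2 = pd.DataFrame({'A': [10.0]})

plain = df1 + df2                      # cells missing on one side become NaN
filled = df1.add(df2, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```

A cell that is missing in both frames would still come out as NaN; fill_value only helps when one side has a value.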

5.1 Sorting - by value

.sort_values
Also works on a Series

Example code:

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)

# Sort by a single column
# ascending parameter: ascending or descending order, ascending by default
print(df1.sort_values(['a'], ascending = True))  # ascending
print(df1.sort_values(['a'], ascending = False))  # descending
print('------')

# Sort by several columns, in the order the columns are listed
df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                  'b':list(range(8)),
                  'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))

Output:

           a          b          c          d
0  16.519099  19.601879  35.464189  58.866972
1  34.506472  97.106578  96.308244  54.049359
2  87.177828  47.253416  92.098847  19.672678
3  66.673226  51.969534  71.789055  14.504191
           a          b          c          d
0  16.519099  19.601879  35.464189  58.866972
1  34.506472  97.106578  96.308244  54.049359
3  66.673226  51.969534  71.789055  14.504191
2  87.177828  47.253416  92.098847  19.672678
           a          b          c          d
2  87.177828  47.253416  92.098847  19.672678
3  66.673226  51.969534  71.789055  14.504191
1  34.506472  97.106578  96.308244  54.049359
0  16.519099  19.601879  35.464189  58.866972
------
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
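
When sorting by several columns, ascending can also be a list with one flag per column. A sketch with made-up data that sorts a in ascending order and, within equal a, c in descending order:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'c': [5, 8, 3, 9]})

# One ascending flag per sort column, in the same order as the column list.
mixed = df.sort_values(['a', 'c'], ascending=[True, False])
print(mixed)
```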

5.2 Sorting - by index

.sort_index

Example code:

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = [5,4,3,2],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['h','s','x','g'],
                   columns = ['a','b','c','d'])

# Sort by index
# Defaults: ascending=True, inplace=False
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())

Output:

           a          b          c          d
5  57.327269  87.623119  93.655538   5.859571
4  69.739134  80.084366  89.005538  56.825475
3  88.148296   6.211556  68.938504  41.542563
2  29.248036  72.005306  57.855365  45.931715
           a          b          c          d
2  29.248036  72.005306  57.855365  45.931715
3  88.148296   6.211556  68.938504  41.542563
4  69.739134  80.084366  89.005538  56.825475
5  57.327269  87.623119  93.655538   5.859571
           a          b          c          d
h  50.579469  80.239138  24.085110  39.443600
s  30.906725  39.175302  11.161542  81.010205
x  19.900056  18.421110   4.995141  12.605395
g  67.760755  72.573568  33.507090  69.854906
           a          b          c          d
g  67.760755  72.573568  33.507090  69.854906
h  50.579469  80.239138  24.085110  39.443600
s  30.906725  39.175302  11.161542  81.010205
x  19.900056  18.421110   4.995141  12.605395
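
sort_index can also sort in descending order, and with axis=1 it sorts the column labels instead of the row labels. A minimal sketch with a made-up frame whose labels start out unordered:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=[2, 5],
                  columns=['c', 'a', 'b'])

rows_desc = df.sort_index(ascending=False)   # row labels from high to low
cols_sorted = df.sort_index(axis=1)          # column labels in alphabetical order

print(rows_desc)
print(cols_sorted)
```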