[Python] Pandas: Basic Concepts and Common Functions

Latest recommended article published on 2023-02-20 10:57:35

小西几y

Category: Python. Tags: Python, Pandas, machine learning libraries

Article link: https://blog.csdn.net/qq_41748260/article/details/102420358

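Pandas Study Notes

Compiled from my instructor's lecture slides and the Yiibai tutorial.

I. Data Structures

pandas works with the following three data structures:

Series
DataFrame
Panel

These data structures are built on top of NumPy arrays, which makes them fast.

Dimensions and description

The best way to think about these data structures is that each higher-dimensional structure is a container for the lower-dimensional one. For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

Data structure	Dimensions	Description
Series	1	1D labeled homogeneous array; size-immutable.
DataFrame	2	General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel	3	General 3D labeled, size-mutable array.

Building and handling arrays of two or more dimensions is tedious, because the user has to keep the orientation of the data set in mind when writing functions. pandas data structures reduce that mental effort.

For example, with tabular data (DataFrame) it is more natural to think in terms of the index (rows) and the columns than in terms of axis 0 and axis 1.

Mutability

All pandas data structures are value-mutable (their contents can be changed), and all except Series are size-mutable. A Series is size-immutable.

Note: DataFrame is widely used and is one of the most important data structures. Panel is used far less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so on current versions 3D data is usually held in a DataFrame with a MultiIndex.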

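Series

A Series is a one-dimensional array-like structure holding homogeneous data. For example, the following Series is a collection of the integers 10, 23, 56, ...

Key points

Homogeneous data
Size-immutable
Values are mutable

DataFrame

A DataFrame is a two-dimensional array with heterogeneous data. For example:

Name	Age	Gender	Rating
Maxsu	25	Male	4.45
Katie	34	Female	2.78
Vina	46	Female	3.9
Lia	…	Female	4.6

The table above holds the data of an organization's sales team together with an overall performance rating. The data is laid out in rows and columns: each column represents an attribute and each row represents a person.

Column data types

The four columns of the DataFrame above have the following data types:

Column	Type
Name	String
Age	Integer
Gender	String
Rating	Float

Key points

Heterogeneous data
Size-mutable
Data is mutable

Panel

A Panel is a three-dimensional data structure with heterogeneous data. It is hard to draw a Panel graphically, but it can be described as a container of DataFrames.

Key points

Heterogeneous data
Size-mutable
Data is mutable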

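II. Series

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index.

pandas.Series

A pandas Series can be created with the following constructor:

pandas.Series(data, index, dtype, copy)

The constructor parameters are as follows:

No.	Parameter	Description
1	`data`	The data, in one of several forms such as an `ndarray`, a `list`, a `dict`, or a scalar constant.
2	`index`	Index values must be hashable and the same length as the data; duplicates are allowed but usually best avoided. Defaults to `np.arange(n)` if no index is passed.
3	`dtype`	The data type. If omitted, the data type is inferred.
4	`copy`	Whether to copy the input data. Defaults to `False`.

A Series can be created from various inputs, such as:

An array
A dict
A scalar value or constant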
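
The list above mentions dicts, but the examples that follow only cover arrays and scalars. The short sketch below fills that gap; the dict contents and the variable names `s` and `s2` are made up for illustration.

```python
import pandas as pd

# Dictionary keys become the index labels when no index is passed.
data = {'a': 0.0, 'b': 1.0, 'c': 2.0}
s = pd.Series(data)

# With an explicit index, values are looked up by label and returned in
# the order of that index; labels missing from the dict ('d') get NaN.
s2 = pd.Series(data, index=['b', 'c', 'd', 'a'])

print(s)
print(s2)
```

Passing an index alongside a dict is a convenient way to select and reorder entries in one step.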

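Creating an empty Series

The most basic Series that can be created is an empty one.

Example

# import the pandas library under the alias pd
import pandas as pd

# give the dtype explicitly; without it, recent pandas versions
# create an empty Series of dtype object instead of float64
s = pd.Series(dtype='float64')
print(s)

Running the example above produces the following output:

Series([], dtype: float64)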

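Creating a Series from an ndarray

If the data is an ndarray, any index passed must have the same length as the array. If no index is passed, the default index is range(n), where n is the length of the array, i.e. [0, 1, 2, ..., len(array) - 1].

Example 1

# import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Running the example above produces the following output: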

0   a
1   b
2   c
3   d
dtype: object


No index was passed here, so by default the Series was given the index 0 to len(data)-1, i.e. 0 to 3.

Example 2

# import the pandas library under the alias pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print(s)

Running the example above produces the following output:

100  a
101  b
102  c
103  d
dtype: object


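An index was passed here, so the custom index values appear in the output.

Creating a Series from a scalar

If the data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

# import the pandas library under the alias pd
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Running the example above produces the following output: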

0  5
1  5
2  5
3  5
dtype: int64

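Accessing data in a Series by position

Data in a Series can be accessed much like data in an ndarray.

Example 1

Retrieve the first element. Positions are counted from zero, so the first element is stored at position 0, the second at position 1, and so on.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element; .iloc makes the positional lookup explicit,
# since plain s[0] on a label-indexed Series is deprecated in recent pandas
print(s.iloc[0])

Running the example above produces the following output: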

1


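Example 2

Retrieve the first three elements of the Series. With a slice of the form `:n`, all items before position n are extracted; with `n:`, all items from position n onward. With two bounds separated by `:`, the items between the two positions are returned, excluding the stop position.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first three elements
print(s[:3])

Running the example above produces the following output: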

a  1
b  2
c  3
dtype: int64


Example 3

Retrieve the last three elements:

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the last three elements
print(s[-3:])

Running the example above produces the following output:

c  3
d  4
e  5
dtype: int64
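
The three positional examples can also be written with the explicit `.iloc` indexer, which avoids any ambiguity between positions and labels. The sketch below reuses the same Series and adds one label-based slice with `.loc` for contrast; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]          # element at position 0
head3 = s.iloc[:3]         # positions 0, 1, 2 (stop position excluded)
tail3 = s.iloc[-3:]        # the last three positions
by_label = s.loc['a':'c']  # label slice: both end labels are included

print(first)
print(head3)
print(tail3)
print(by_label)
```

Note the asymmetry: positional slices exclude the stop position, while label slices include the stop label.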

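Retrieving data by label (index)

A Series is like a fixed-size dict: values can be read and set by index label.

Example 1

Retrieve a single element by its index label.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve a single element
print(s['a'])

Running the example above produces the following output: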

1


Example 2

Retrieve multiple elements with a list of index labels.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve multiple elements
print(s[['a','c','d']])

Running the example above produces the following output:

a  1
c  3
d  4
dtype: int64


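Example 3

If a label is not present in the index, an exception is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# look up a label that does not exist in the index
print(s['f'])

Running the example above produces the following output: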

…
KeyError: 'f'
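
When a missing label is an expected case rather than a bug, the `KeyError` can be avoided. The sketch below shows three common options on the same Series: `Series.get()` with a default, a membership test with `in`, and catching the exception. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

value = s.get('f')        # returns None instead of raising
fallback = s.get('f', 0)  # returns the supplied default
has_f = 'f' in s          # `in` tests the index labels, not the values

try:
    s['f']
    missing = False
except KeyError:
    missing = True

print(value, fallback, has_f, missing)
```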

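III. DataFrame

A DataFrame is a two-dimensional data structure: the data is arranged in a table of rows and columns.

Features of a DataFrame:

Columns can be of different types
Size-mutable
Labeled axes (rows and columns)
Arithmetic operations can be performed on rows and columns

pandas.DataFrame

A pandas DataFrame can be created with the following constructor:

pandas.DataFrame(data, index, columns, dtype, copy)

The constructor parameters are as follows:

No.	Parameter	Description
1	`data`	The data, in one of several forms such as an `ndarray`, a `Series`, a `dict`, a `list`, a constant, or another `DataFrame`.
2	`index`	The row labels of the resulting frame. Optional; defaults to `np.arange(n)` if no index is passed.
3	`columns`	The column labels. Optional; defaults to `np.arange(n)` if no column labels are passed.
4	`dtype`	The data type to force on the columns; if omitted, it is inferred.
5	`copy`	Whether to copy the input data. In the version this tutorial was written for, the default is `False`.

Creating a DataFrame

A pandas DataFrame can be created from various inputs, such as:

Lists
Dicts
Series
NumPy ndarrays
Another DataFrame

The following subsections show how to create a DataFrame from each of these inputs.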

Creating an empty DataFrame

The most basic DataFrame that can be created is an empty one.

Example

# import the pandas library under the alias pd
import pandas as pd
df = pd.DataFrame()
print(df)

Running the example above produces the following output:

Empty DataFrame
Columns: []
Index: []

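Creating a DataFrame from lists

A DataFrame can be created from a single list or from a list of lists.

Example 1

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Running the example above produces the following output:

   0
0  1
1  2
2  3
3  4
4  5

Example 2

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Running the example above produces the following output: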

      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13


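Example 3

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
# cast only the Age column to float; passing dtype=float to the constructor
# fails on recent pandas because the Name strings cannot be converted
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)

Running the example above produces the following output:

      Name     Age
0     Alex     10.0
1     Bob      12.0
2     Clarke   13.0

Note: the cast changed the type of the Age column to floating point.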

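Creating a DataFrame from a dict of ndarrays/lists

All the ndarrays must have the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, the default index is range(n), where n is the array length.

Example 1

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Running the example above produces the following output:

       Name   Age
0       Tom    28
1      Jack    34
2     Steve    29
3     Ricky    42

Note: the values 0, 1, 2, 3 are the default index assigned to the rows by range(n).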

Example 2

Create a DataFrame with an explicit index.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Running the example above produces the following output:

          Name   Age
rank1      Tom    28
rank2     Jack    34
rank3    Steve    29
rank4    Ricky    42

Note: the index parameter assigns a label to each row.

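Creating a DataFrame from a list of dicts

A list of dicts can be passed as input data to create a DataFrame. The dictionary keys are used as column names by default.

Example 1

The following example creates a DataFrame by passing a list of dicts.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Running the example above produces the following output: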

    a    b      c
0   1   2     NaN
1   5   10   20.0


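Note: NaN (Not a Number) is inserted wherever a value is missing.

Example 2

The following example creates a DataFrame by passing a list of dicts together with row labels.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Running the example above produces the following output: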

        a   b       c
first   1   2     NaN
second  5   10   20.0


Example 3

The following example creates a DataFrame from a list of dicts, row labels, and column labels.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# two column labels that both match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# two column labels, one of which ('b1') matches no dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Running the example above produces the following output:

#df1 output
         a  b
first    1  2
second   5  10

#df2 output
         a  b1
first    1  NaN
second   5  NaN


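Note: df2 was created with a column label ('b1') that is not a dictionary key, so that column is filled with NaN. df1 was created with column labels that match the dictionary keys, so it contains no NaN.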

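Creating a DataFrame from a dict of Series

A dict of Series can be passed to form a DataFrame. The resulting index is the union of the indexes of all the Series passed.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Running the example above produces the following output: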

      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4


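Note: the first Series has no label 'd', so in the result the 'd' row of column one is filled with NaN.

The following examples cover column selection, addition, and deletion.

Column selection

The example below selects a single column from a DataFrame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])

Running the example above produces the following output: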

a     1.0
b     2.0
c     3.0
d     NaN
Name: one, dtype: float64


Column addition

The example below adds new columns to an existing DataFrame.

Example

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column to an existing DataFrame by assigning a Series to a new column label
print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

# Add a new column computed from existing columns
print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']

print(df)

Running the example above produces the following output:

Adding a new column by passing as Series:
     one   two   three
a    1.0    1    10.0
b    2.0    2    20.0
c    3.0    3    30.0
d    NaN    4    NaN

Adding a new column using the existing columns in DataFrame:
      one   two   three    four
a     1.0    1    10.0     11.0
b     2.0    2    20.0     22.0
c     3.0    3    30.0     33.0
d     NaN    4     NaN     NaN

Column deletion

Columns can be deleted or popped; the example below shows both.

Example

# Using the previous DataFrame, we will delete a column
# using the del statement
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# using the pop method
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Running the example above produces the following output:

Our dataframe is:
      one   two   three
a     1.0    1    10.0
b     2.0    2    20.0
c     3.0    3    30.0
d     NaN    4     NaN

Deleting the first column using DEL function:
      two   three
a      1    10.0
b      2    20.0
c      3    30.0
d      4     NaN

Deleting another column using POP function:
   three
a  10.0
b  20.0
c  30.0
d  NaN
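
Both `del` and `pop()` modify the DataFrame in place. When the original frame should stay intact, `DataFrame.drop(columns=...)` returns a new frame instead. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop() returns a new DataFrame; df itself still has all three columns.
trimmed = df.drop(columns=['one'])

# pop() removes the column from df in place and returns it as a Series.
popped = df.pop('two')

print(trimmed)
print(df)
print(popped)
```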

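Row selection, addition, and deletion

The following examples cover row selection, addition, and deletion, starting with selection.

Selection by label

Rows can be selected by passing a row label to the loc indexer. See the following example:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])
# the older df.ix['b'] accessor did the same but was removed in pandas 1.0

Running the example above produces the following output: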

one 2.0
two 2.0
Name: b, dtype: float64

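The result is a Series whose labels are the column names of the DataFrame, and whose name is the label that was retrieved.

Selection by integer position

Rows can be selected by passing an integer position to the iloc indexer. See the following example:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])

Running the example above produces the following output: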

one   3.0
two   3.0
Name: c, dtype: float64

Row slicing

Multiple rows can be selected with the `:` operator. See the following example:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

Running the example above produces the following output:

      one    two
c     3.0     3
d     NaN     4
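
The bare `df[2:4]` slice is positional. The same rows can be requested more explicitly with `iloc`, and a label-based slice with `loc` behaves slightly differently because it includes the stop label. The sketch below uses the same frame; the variable names are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_pos = df.iloc[2:4]       # positions 2 and 3 -> rows 'c' and 'd'
by_label = df.loc['b':'c']  # labels 'b' through 'c', both included

print(by_pos)
print(by_label)
```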

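Appending rows

New rows can be added to the end of a DataFrame. Older pandas versions provided DataFrame.append() for this; it was removed in pandas 2.0, so the example uses pd.concat(), which works on both old and new versions.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

# stack df2 below df; the original row labels are kept
df = pd.concat([df, df2])
print(df)

Running the example above produces the following output:

   a  b
0  1  2
1  3  4
0  5  6
1  7  8

Deleting rows

Rows are deleted from a DataFrame by index label. If a label is duplicated, multiple rows are dropped.

In the example above the labels 0 and 1 each appear twice. The example below drops label 0 to show how many rows are removed.

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])

# Drop rows with label 0
df = df.drop(0)

print(df)

Running the example above produces the following output:

  a b
1 3 4
1 7 8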
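
Two rows disappeared above because both frames carried the label 0. If that is not intended, one option is to renumber the rows while combining the frames, so that every label is unique. A minimal sketch using `pd.concat(..., ignore_index=True)`:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)

# With unique labels, dropping label 0 removes exactly one row.
remaining = combined.drop(0)

print(combined)
print(remaining)
```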

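IV. Basic pandas Functionality

Series basic functionality

No.	Attribute or method	Description
1	`axes`	Returns a list of the row axis labels.
2	`dtype`	Returns the data type (`dtype`) of the object.
3	`empty`	Returns `True` if the Series is empty.
4	`ndim`	Returns the number of dimensions of the underlying data; by definition `1`.
5	`size`	Returns the number of elements in the underlying data.
6	`values`	Returns the Series as an `ndarray`.
7	`head()`	Returns the first `n` rows.
8	`tail()`	Returns the last `n` rows.

The examples below create a Series and demonstrate each of the attributes and methods listed above.

Example

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

Running the example above produces the following output: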

0   0.967853
1  -0.148368
2  -1.395906
3  -1.758394
dtype: float64


axes example

Returns the list of axis labels of the Series. See the following example:

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("The axes are:")
print(s.axes)

Running the example above produces the following output:

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


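The result is a compact representation of the labels from 0 up to, but not including, 4, i.e. [0, 1, 2, 3].

empty example

Returns a Boolean indicating whether the object is empty. True means the object is empty.

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("Is the Object empty?")
print(s.empty)

Running the example above produces the following output: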

Is the Object empty?
False


ndim example

Returns the number of dimensions of the object. By definition, a Series is a 1D data structure. See the following example:

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print("The dimensions of the object:")
print(s.ndim)

Running the example above produces the following output:

0   0.175898
1   0.166197
2  -0.609712
3  -1.377000
dtype: float64

The dimensions of the object:
1


size example

Returns the size (length) of the Series. See the following example:

import pandas as pd
import numpy as np

# Create a Series with 2 random numbers
s = pd.Series(np.random.randn(2))
print(s)
print("The size of the object:")
print(s.size)

Running the example above produces the following output:

0   3.078058
1  -1.207803
dtype: float64

The size of the object:
2


values example

Returns the actual data in the Series as an array.

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)

print("The actual data series is:")
print(s.values)

Running the example above produces the following output:

0   1.787373
1  -0.605159
2   0.180477
3  -0.140922
dtype: float64

The actual data series is:
[ 1.78737302 -0.60515881 0.18047664 -0.1409218 ]


head() and tail() examples

To view a small sample of a Series or DataFrame object, use the head() and tail() methods.

head() returns the first n rows (observe the index values). The default number of elements shown is 5, but a custom number can be passed.

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The first two rows of the data series:")
print(s.head(2))

Running the example above produces the following output:

The original series is:
0   0.720876
1  -0.765898
2   0.479221
3  -0.139547
dtype: float64

The first two rows of the data series:
0   0.720876
1  -0.765898
dtype: float64


tail() returns the last n rows (observe the index values). The default number of elements shown is 5, but a custom number can be passed. See the following example:

import pandas as pd
import numpy as np

# Create a Series with 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The last two rows of the data series:")
print(s.tail(2))

Running the example above produces the following output:

The original series is:
0 -0.655091
1 -0.881407
2 -0.608592
3 -2.341413
dtype: float64

The last two rows of the data series:
2 -0.608592
3 -2.341413
dtype: float64

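DataFrame basic functionality

The following table lists the important attributes and methods that make up the basic functionality of a DataFrame.

No.	Attribute or method	Description
1	`T`	Transposes rows and columns.
2	`axes`	Returns a list whose only members are the row axis labels and the column axis labels.
3	`dtypes`	Returns the data types (`dtypes`) in this object.
4	`empty`	Returns `True` if the `NDFrame` is entirely empty (no items), i.e. if any axis has length `0`.
5	`ndim`	Number of axes / array dimensions.
6	`shape`	Returns a tuple representing the dimensionality of the `DataFrame`.
7	`size`	Number of elements in the `NDFrame`.
8	`values`	NumPy representation of the `NDFrame`.
9	`head()`	Returns the first `n` rows.
10	`tail()`	Returns the last `n` rows.

The examples below create a DataFrame and use the attributes and methods above.

Example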

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data series is:")
print(df)

Running the example above produces the following output:

Our data series is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80


T (transpose) example

Returns the transpose of the DataFrame: rows and columns are swapped. See the following example:

import pandas as pd
import numpy as np

# Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print("The transpose of the data series is:")
print(df.T)

Running the example above produces the following output:

The transpose of the data series is:
         0     1       2      3      4      5       6
Name     Tom   James   Ricky  Vin    Steve  Minsu   Jack
Age      25    26      25     23     30     29      23
Rating   4.23  3.24    3.98   2.56   3.2    4.6     3.8


axes example

Returns the list of row axis labels and column axis labels. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Row axis labels and column axis labels are:")
print(df.axes)

Running the example above produces the following output:

Row axis labels and column axis labels are:

[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]


dtypes example

Returns the data type of each column. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("The data types of each column are:")
print(df.dtypes)

Running the example above produces the following output:

The data types of each column are:
Name    object
Age     int64
Rating  float64
dtype: object


empty example

Returns a Boolean indicating whether the object is empty; True means the object is empty.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Is the object empty?")
print(df.empty)

Running the example above produces the following output:

Is the object empty?
False


ndim example

Returns the number of dimensions of the object. By definition, a DataFrame is a 2D object. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The dimension of the object is:")
print(df.ndim)

Running the example above produces the following output:

Our object is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80

The dimension of the object is:
2


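shape example

Returns a tuple representing the dimensionality of the DataFrame. In the tuple (a, b), a is the number of rows and b is the number of columns.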

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The shape of the object is:")
print(df.shape)

Running the example above produces the following output:

Our object is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80

The shape of the object is:
(7, 3)


size example

Returns the number of elements in the DataFrame. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The total number of elements in our object is:")
print(df.size)

Running the example above produces the following output:

Our object is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80

The total number of elements in our object is:
21


values example

Returns the actual data in the DataFrame as an ndarray. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The actual data in our data frame is:")
print(df.values)

Running the example above produces the following output:

Our object is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80
The actual data in our data frame is:
[['Tom' 25 4.23]
['James' 26 3.24]
['Ricky' 25 3.98]
['Vin' 23 2.56]
['Steve' 30 3.2]
['Minsu' 29 4.6]
['Jack' 23 3.8]]


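head() and tail() examples

To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of rows shown is 5, but a custom number can be passed. See the following example: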

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The first two rows of the data frame is:")
print(df.head(2))

Running the example above produces the following output:

Our data frame is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80

The first two rows of the data frame is:
   Name   Age   Rating
0  Tom    25    4.23
1  James  26    3.24


tail() returns the last n rows (observe the index values). The default number of rows shown is 5, but a custom number can be passed.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]), 
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)
print("The last two rows of the data frame is:")
print(df.tail(2))

Running the example above produces the following output:

Our data frame is:
    Name    Age   Rating
0   Tom     25    4.23
1   James   26    3.24
2   Ricky   25    3.98
3   Vin     23    2.56
4   Steve   30    3.20
5   Minsu   29    4.60
6   Jack    23    3.80

The last two rows of the data frame is:
    Name    Age   Rating
5   Minsu   29    4.6
6   Jack    23    3.8

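V. Descriptive Statistics in pandas

A large number of methods compute descriptive statistics and related operations on a DataFrame. Most of them are aggregations such as sum() and mean(), but some, such as cumsum(), produce an object of the same size. In general these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer:

DataFrame: "index" (axis=0, the default), "columns" (axis=1)

The DataFrame created below is used to demonstrate all the operations in this chapter.

Example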

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)

Running the example above produces the following output:

    Name    Age  Rating
0   Tom     25   4.23
1   James   26   3.24
2   Ricky   25   3.98
3   Vin     23   2.56
4   Steve   30   3.20
5   Minsu   29   4.60
6   Jack    23   3.80
7   Lee     34   3.78
8   David   40   2.98
9   Gasper  30   4.80
10  Betina  51   4.10
11  Andres  46   3.65


sum() method

Returns the sum of the values along the requested axis. By default the axis is the index (axis=0).

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())

Running the example above produces the following output:

Name     TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
Age                                                    382
Rating                                               44.92
dtype: object

Each column is summed individually (strings are concatenated).

axis=1 example

With axis=1 the values are summed across each row, giving the output shown below. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
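# numeric_only=True skips the Name column; recent pandas versions
# no longer drop non-numeric columns automatically
print(df.sum(axis=1, numeric_only=True))

Running the example above produces the following output: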

0    29.23
1    29.24
2    28.98
3    25.56
4    33.20
5    33.60
6    26.80
7    37.78
8    42.98
9    34.80
10   55.10
11   49.65
dtype: float64


mean() example
Returns the mean. See the following example:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
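# numeric_only=True restricts the mean to the numeric columns
print(df.mean(numeric_only=True))

Running the example above produces the following output: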

Age       31.833333
Rating     3.743333
dtype: float64


std() example

Returns the Bessel-corrected (sample) standard deviation of the numeric columns.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
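# numeric_only=True restricts the standard deviation to the numeric columns
print(df.std(numeric_only=True))

Running the example above produces the following output: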

Age       9.232682
Rating    0.661628
dtype: float64


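Functions and descriptions

The following table lists the important descriptive statistics functions in pandas:

No.	Function	Description
1	`count()`	Number of non-null observations
2	`sum()`	Sum of the values
3	`mean()`	Mean of the values
4	`median()`	Median of the values
5	`mode()`	Mode of the values
6	`std()`	Standard deviation of the values
7	`min()`	Minimum value
8	`max()`	Maximum value
9	`abs()`	Absolute value
10	`prod()`	Product of the values
11	`cumsum()`	Cumulative sum
12	`cumprod()`	Cumulative product

Note: because a DataFrame is a heterogeneous data structure, generic operations do not work with every function.

Functions such as sum() and cumsum() work with both numeric and character (string) data without raising an error. Character aggregation is rarely used in practice, even though these functions do not raise an exception.
Functions such as abs() and cumprod() raise an exception when the DataFrame contains character or string data, because such operations cannot be performed.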
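
One way to sidestep these exceptions is to restrict the frame to its numeric columns before calling a numeric-only function. The sketch below uses a three-row subset of the data above; `select_dtypes()` and the `numeric_only` argument are standard pandas, while the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Keep only the numeric columns, then apply numeric-only operations.
numeric = df.select_dtypes(include='number')
running_total = numeric.cumsum()
magnitudes = numeric.abs()

# Aggregations can skip non-numeric columns directly.
col_means = df.mean(numeric_only=True)

print(running_total)
print(col_means)
```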

Summarizing data

The describe() function computes a summary of statistics for the columns of a DataFrame.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())

Running the example above produces the following output:

               Age         Rating
count    12.000000      12.000000
mean     31.833333       3.743333
std       9.232682       0.661628
min      23.000000       2.560000
25%      25.000000       3.230000
50%      29.500000       3.790000
75%      35.500000       4.132500
max      51.000000       4.800000

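The function reports the mean, the standard deviation, and the quartiles (from which the IQR follows). It excludes character columns and summarizes only the numeric ones. The include parameter specifies which columns to consider in the summary. It takes a list of values; by default, only numeric columns are included.

object - summarizes string columns
number - summarizes numeric columns
all - summarizes all columns together (pass it as the string 'all', not as a list value)

Now use the following statement in the program and check the output: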

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))

Running the example above produces the following output:

          Name
count       12
unique      12
top      Ricky
freq         1


Now use the following statement and look at the output:

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))

Running the example above produces the following output:

          Name         Age       Rating
count       12   12.000000    12.000000
unique      12         NaN          NaN
top      Ricky         NaN          NaN
freq         1         NaN          NaN
mean       NaN   31.833333     3.743333
std        NaN    9.232682     0.661628
min        NaN   23.000000     2.560000
25%        NaN   25.000000     3.230000
50%        NaN   29.500000     3.790000
75%        NaN   35.500000     4.132500
max        NaN   51.000000     4.800000

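pandas Function Application

To apply your own functions, or functions from another library, to pandas objects, there are three important methods to know, discussed below. Which one to use depends on whether the function expects to operate on an entire DataFrame, on rows or columns, or on individual elements.

Table-wise function application: pipe()
Row- or column-wise function application: apply()
Element-wise function application: applymap()

Table-wise function application

Custom operations can be performed by passing a function and the appropriate number of arguments to pipe(). The operation is then applied to the whole DataFrame.

For example, to add the value 2 to every element of a DataFrame:

The adder function

The adder function takes two values as arguments, adds them, and returns the sum.

def adder(ele1,ele2):
   return ele1+ele2

Now use the custom function to operate on the DataFrame.

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder,2)

The complete program: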

import pandas as pd
import numpy as np

def adder(ele1,ele2):
   return ele1+ele2

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
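# pipe() returns a new DataFrame and leaves df unchanged, so print the result
print(df.pipe(adder,2))

Running the example above produces the following output: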

        col1       col2       col3
0   2.176704   2.219691   1.509360
1   2.222378   2.422167   3.953921
2   2.241096   1.135424   2.696432
3   2.355763   0.376672   1.182570
4   2.308743   2.714767   2.130288


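
The main appeal of pipe() is that table-wise steps can be chained and read left to right. The sketch below chains the adder with a second, hypothetical helper named `scale`; the data is fixed rather than random so the result is easy to check by hand.

```python
import pandas as pd

def adder(frame, amount):
    # Add a constant to every element of the frame.
    return frame + amount

def scale(frame, factor):
    # Hypothetical second step: multiply every element by a factor.
    return frame * factor

df = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]})

# Equivalent to scale(adder(df, 2), factor=10), but reads left to right.
result = df.pipe(adder, 2).pipe(scale, factor=10)

print(result)
```

Each pipe() call passes the DataFrame as the first argument of the function and forwards the remaining positional and keyword arguments unchanged.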

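Row- or column-wise function application

The apply() method applies an arbitrary function along an axis of a DataFrame; like the descriptive statistics methods, it takes an optional axis argument. By default the operation is performed column by column, with each column passed to the function as an array-like Series.

Example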

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
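# apply() returns the result and leaves df unchanged, so print the result
print(df.apply(np.mean))

Running the example above produces the following output: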

col1   -0.366338
col2    0.406637
col3   -0.417213
dtype: float64

Passing the axis argument performs the operation row by row instead.

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# axis=1 computes the mean of each row
print(df.apply(np.mean,axis=1))

Running the example above produces the following output:

0   -0.322927
1    0.339565
2   -0.263882
3    0.319440
4   -0.700388
dtype: float64

Example 3

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# range (max minus min) of each column
print(df.apply(lambda x: x.max() - x.min()))

Running the example above produces the following output:

col1    1.439610
col2    1.276750
col3    2.099601
dtype: float64
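
Because the examples above use random data, the effect of the axis argument is hard to verify by eye. The sketch below repeats the max-minus-min computation on a small fixed frame (made-up values), once per column and once per row.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 4, 2], 'col2': [10, 30, 20]})

# Default axis=0: the function receives one column at a time.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)   # col1: 4 - 1, col2: 30 - 10
print(row_range)   # row 0: 10 - 1, row 1: 30 - 4, row 2: 20 - 2
```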

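Element-wise function application

Not all functions can be vectorized (that is, accept a NumPy array and return another array or a value). The applymap() method on DataFrame, and analogously map() on Series, accept any Python function that takes a single value and returns a single value.

Example 1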

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# apply a custom function to every element of col1 and print the result
print(df['col1'].map(lambda x:x*100))

Running the example above produces the following output:

0   -103.554722
1      2.356706
2     40.406299
3    -31.896847
4    -90.480527
Name: col1, dtype: float64

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# apply a custom function to every element of the DataFrame and print the result
print(df.applymap(lambda x:x*100))

Running the example above produces the following output:

         col1        col2        col3
0 -103.554722   45.310438  -38.633791
1    2.356706  100.003736   -0.490864
2   40.406299   24.167938 -143.738792
3  -31.896847   61.507665   66.221266
4  -90.480527  -27.671272  -91.964477
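
In pandas 2.1 and later, applymap() is deprecated in favor of the equivalent DataFrame.map(). The sketch below is one possible way to write element-wise code that runs on both older and newer versions; the `elementwise` name is just a local alias chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Prefer DataFrame.map where it exists (pandas >= 2.1),
# and fall back to applymap on older versions.
elementwise = df.map if hasattr(pd.DataFrame, 'map') else df.applymap

scaled = elementwise(lambda x: x * 100)
print(scaled)
```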

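VI. Reindexing in pandas

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Several operations can be accomplished through reindexing:

Reorder the existing data to match a new set of labels.
Insert missing value (NA) markers at label positions where no data exists for the label.

Example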

import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print (df_reindexed)


Running the code above produces the following result (column 'B' does not exist in df, so it is filled with NaN):

            A    C     B
0  2016-01-01  Low   NaN
2  2016-01-03  High  NaN
5  2016-01-06  Low   NaN

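Reindexing to align with another object

Sometimes you may want to take an object and reindex it so that its axes are labeled the same as another object's. Consider the following example.

Example

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])

# Give df1 the same row and column labels as df2
df1 = df1.reindex_like(df2)
print(df1)

Running the code above produces the following result: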

          col1         col2         col3
0    -2.467652    -1.211687    -0.391761
1    -0.287396     0.522350     0.562512
2    -0.255409    -0.483250     1.866258
3    -1.150467    -0.646493    -0.222462
4     0.152768    -2.056643     1.877233
5    -1.155997     1.528719    -1.343719
6    -1.015606    -1.245936    -0.295275
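The extra rows were dropped during alignment.

Note - Here the df1 DataFrame is changed and reindexed like df2. The column names should match; otherwise the entire column under the unmatched label is filled with NaN.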
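The effect of a mismatched column name can be sketched as follows. The frames left and right and the column name other are made up for this illustration; they do not appear in the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'right' has a column name that 'left' lacks
left = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["col1", "col2", "col3"])
right = pd.DataFrame(np.zeros((2, 3)), columns=["col1", "col2", "other"])

# 'left' takes on the row and column labels of 'right'
aligned = left.reindex_like(right)
print(aligned)
```

In this sketch the matching columns keep their values for rows 0 and 1, col3 is dropped, and the unmatched other column contains only NaN.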

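Filling while reindexing

reindex() takes an optional parameter, method, which selects a fill method with one of the following values:

pad/ffill - fill values forward
bfill/backfill - fill values backward
nearest - fill from the nearest index value

Example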

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Without a fill method, the missing rows are padded with NaN
print(df2.reindex_like(df1))

# Now fill the NaNs with the preceding values
print("Data Frame with Forward Fill:")
print(df2.reindex_like(df1,method='ffill'))

Running the code above produces the following result:

         col1        col2       col3
0    1.311620   -0.707176   0.599863
1   -0.423455   -0.700265   1.133371
2         NaN         NaN        NaN
3         NaN         NaN        NaN
4         NaN         NaN        NaN
5         NaN         NaN        NaN

Data Frame with Forward Fill:
         col1        col2        col3
0    1.311620   -0.707176    0.599863
1   -0.423455   -0.700265    1.133371
2   -0.423455   -0.700265    1.133371
3   -0.423455   -0.700265    1.133371
4   -0.423455   -0.700265    1.133371
5   -0.423455   -0.700265    1.133371


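Note - the last four rows were filled.

Limits on filling while reindexing

The limit parameter gives extra control over filling while reindexing. It specifies the maximum number of consecutive matches, that is, how many consecutive missing rows may be filled. Consider the following example:

Example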

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Without a fill method, the missing rows are padded with NaN
print(df2.reindex_like(df1))

# Now forward-fill, but fill at most one consecutive missing row
print("Data Frame with Forward Fill limiting to 1:")
print(df2.reindex_like(df1,method='ffill',limit=1))

Running the code above produces the following result:

         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2         NaN         NaN         NaN
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN

Data Frame with Forward Fill limiting to 1:
         col1        col2        col3
0    0.247784    2.128727    0.702576
1   -0.055713   -0.021732   -0.174577
2   -0.055713   -0.021732   -0.174577
3         NaN         NaN         NaN
4         NaN         NaN         NaN
5         NaN         NaN         NaN


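Note - only the third row (index 2) is filled from the row before it (index 1). The remaining rows are left as they are.

Renaming

The rename() method lets you relabel an axis based on a mapping (a dict or a Series) or an arbitrary function.
Consider the following example.

Example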

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print(df1)

print("After renaming the rows and columns:")
print(df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
                 index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

Running the code above produces the following result:

         col1        col2        col3
0    0.486791    0.105759    1.540122
1   -0.990237    1.007885   -0.217896
2   -0.483855   -1.645027   -1.194113
3   -0.122316    0.566277   -0.366028
4   -0.231524   -0.721172   -0.112007
5    0.438810    0.000225    0.435479

After renaming the rows and columns:
                c1          c2        col3
apple     0.486791    0.105759    1.540122
banana   -0.990237    1.007885   -0.217896
durian   -0.483855   -1.645027   -1.194113
3        -0.122316    0.566277   -0.366028
4        -0.231524   -0.721172   -0.112007
5         0.438810    0.000225    0.435479


The rename() method provides an inplace named parameter, which defaults to False and returns a renamed copy of the data. Passing inplace=True renames the data in place instead.
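A small sketch of the difference between the two modes, using a made-up two-column frame (the data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Default (inplace=False): a renamed copy is returned, df is untouched
renamed = df.rename(columns={"col1": "c1"})

# inplace=True: df itself is modified and the call returns None
result = df.rename(columns={"col2": "c2"}, inplace=True)

print(list(renamed.columns))  # ['c1', 'col2']
print(list(df.columns))       # ['col1', 'c2']
print(result)                 # None
```

Because the in-place call returns None, assigning its result back to df would discard the DataFrame.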

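VII. Iteration in pandas

The behavior of basic iteration over pandas objects depends on the type. A Series is treated as array-like when iterated, so basic iteration produces its values. Other data structures, such as DataFrame and Panel, follow the dict-like convention of iterating over the keys of the object.

In short, basic iteration (for i in object) produces:

Series - values
DataFrame - column labels
Panel - item labels

Iterating over a DataFrame

Iterating over a DataFrame yields the column names. Consider the following example.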

import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
    'x': np.linspace(0,stop=N-1,num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low','Medium','High'],N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
    })

for col in df:
   print (col)


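Running the code above produces the following result (recent pandas versions keep the columns in the dict's insertion order):

A
x
y
C
D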

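To iterate over the contents of a DataFrame, the following functions are available:

iteritems() - iterate over (key, value) pairs, one per column (named items() in newer pandas)
iterrows() - iterate over rows as (index, Series) pairs
itertuples() - iterate over rows as namedtuples

iteritems() example

Iterates over the columns as key-value pairs, with the column label as the key and the column's values as a Series object.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() does the same as the older iteritems(), which was removed in pandas 2.0
for key,value in df.items():
   print(key,value)

Running the code above produces the following result: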

col1 0    0.802390
1    0.324060
2    0.256811
3    0.839186
Name: col1, dtype: float64

col2 0    1.624313
1   -1.033582
2    1.796663
3    1.856277
Name: col2, dtype: float64

col3 0   -0.022142
1   -0.230820
2    1.160691
3   -0.830279
Name: col3, dtype: float64

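Here key is the column label and value is that column as a Series.

Observe that each column is iterated separately as a key-value pair whose value is a Series.

iterrows() example

iterrows() returns an iterator that yields each index value together with a Series containing the data of that row.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print(row_index,row)

Running the code above produces the following result: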

0  col1    1.529759
   col2    0.762811
   col3   -0.634691
Name: 0, dtype: float64

1  col1   -0.944087
   col2    1.420919
   col3   -0.507895
Name: 1, dtype: float64

2  col1   -0.077287
   col2   -0.858556
   col3   -0.663385
Name: 2, dtype: float64

3  col1   -1.638578
   col2    0.059866
   col3    0.493482
Name: 3, dtype: float64

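Note - Because iterrows() returns each row as a Series, data types are not preserved across the row. In the output, 0, 1, 2 and 3 are the row index values, and col1, col2, col3 are the column labels.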
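The dtype issue is easier to see with mixed column types. The following sketch uses a made-up frame with one integer and one float column (an assumption for illustration); the integer value comes back as a float from iterrows(), while itertuples() keeps it as an integer.

```python
import pandas as pd

# Hypothetical frame with an integer column and a float column
df = pd.DataFrame({"n": [1, 2], "x": [1.5, 2.5]})

# iterrows(): each row becomes one Series with a single dtype,
# so the integer in column 'n' is upcast to float
first_index, first_row = next(df.iterrows())
print(first_row.dtype)  # float64
print(first_row["n"])   # 1.0

# itertuples(): each field keeps the type of its own column
first_tuple = next(df.itertuples())
print(first_tuple.n)
```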

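itertuples() example

The itertuples() method returns an iterator that yields a namedtuple for each row of the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row's values.

Example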

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)

Running the code above produces the following result:

Pandas(Index=0, col1=1.5297586201375899, col2=0.76281127433814944, col3=-0.6346908238310438)

Pandas(Index=1, col1=-0.94408735763808649, col2=1.4209186418359423, col3=-0.50789517967096232)

Pandas(Index=2, col1=-0.07728664756791935, col2=-0.85855574139699076, col3=-0.6633852507207626)

Pandas(Index=3, col1=0.65734942534106289, col2=-0.95057710432604969, col3=0.80344487462316527)

Note - Do not try to modify any object while iterating over it. Iteration is meant for reading: the iterator typically returns a copy of the original data rather than a view, so changes are not reflected in the original object.

Example code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

for index, row in df.iterrows():
   row['a'] = 10
print (df)

Running the code above produces the following result:

        col1       col2       col3
0  -1.739815   0.735595  -0.295589
1   0.635485   0.106803   1.527922
2  -0.939064   0.547095   0.038585
3  -1.016509  -0.116580  -0.523158

Observe that the modification is not reflected in the result.
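To actually change the data, one option is to write to the DataFrame itself instead of to the row copy. The sketch below shows two common ways, using a small made-up frame (an assumption for illustration): a whole-column assignment and a label-based update through loc.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0]})

# Vectorized assignment: adds column 'a' to every row at once
df["a"] = 10

# Label-based update through the DataFrame, one row at a time
for index in df.index:
    df.loc[index, "col1"] = df.loc[index, "col1"] * 2

print(df)
```

The column-wide form is usually preferable; the loop is shown only to contrast it with writing to the row returned by iterrows().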

VIII. Sorting in pandas

pandas supports two kinds of sorting:

By label
By actual value

Consider first an unsorted example and its output.

import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)

Running the code above produces the following result:

       col2      col1
1  1.069838  0.096230
4 -0.542406 -0.219829
6 -0.071661  0.392091
2  1.399976 -0.472169
3  0.428372 -0.624630
5  0.471875  0.966560
9 -0.131851 -1.254495
8  1.180651  0.199548
0  0.906202  0.418524
7  0.124800  2.011962


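In unsorted_df, neither the labels nor the values are sorted. Let's see how to sort by label.

Sorting by label

The sort_index() method sorts a DataFrame by its labels; the axis argument and the sort order can be passed to it. By default, the row labels are sorted in ascending order.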

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index()
print (sorted_df)

Running the code above produces the following result:

       col2      col1
0  0.431384 -0.401538
1  0.111887 -0.222582
2 -0.166893 -0.237506
3  0.476472  0.508397
4  0.670838  0.406476
5  2.065969 -0.324510
6 -0.441630  1.060425
7  0.735145  0.972447
8 -0.051904 -1.112292
9  0.134108  0.759698

Sort order

The sort order can be controlled by passing a Boolean value to the ascending parameter. Consider the following example.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)


Running the code above produces the following result:

       col2      col1
9  0.750452  1.754815
8  0.945238  2.079394
7  0.345238 -0.162737
6 -0.512060  0.887094
5  1.163144  0.595402
4 -0.063584 -0.185536
3 -0.275438 -2.286831
2 -1.504792 -1.222394
1  1.031234 -1.848174
0 -0.615083  0.784086

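Sorting by column labels

The axis argument (0 or 1) selects which labels are sorted; passing axis=1 sorts the column labels. By default axis=0, which sorts by the row labels. Consider the following example.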

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index(axis=1)

print (sorted_df)

Running the code above produces the following result:

       col1      col2
1 -0.997962  0.736707
4  1.196464  0.703710
6 -0.387800  1.207803
2  1.614043  0.356389
3 -0.057181 -0.551742
5  1.034451 -0.731490
9 -0.564355  0.892203
8 -0.763526  0.684207
0 -1.213615  1.268649
7  0.316543 -1.450784


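Sorting by value

Just as sort_index() sorts by label, sort_values() sorts by value. It accepts a by parameter, which takes the name of the DataFrame column whose values are to be sorted.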

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

print (sorted_df)

Running the code above produces the following result:

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1

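Note: In the output above, the col1 values are sorted, and the corresponding col2 values and row index labels move along with col1. That is why they look unsorted.

The by parameter can also take a list of column names, as in the following example: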

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by=['col1','col2'])

print (sorted_df)
# Similar to a multi-column sort in Excel

Running the code above produces the following result:

   col1  col2
2     1     2
1     1     3
3     1     4
0     2     1
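When sorting by several columns, the direction can also differ per column. As a sketch under that assumption, the same data is sorted below by col1 ascending and, within equal col1 values, by col2 descending; the ascending list has one entry per column named in by.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# One ascending flag per column listed in 'by'
sorted_df = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])

print(sorted_df)
# Row index order in the result: 3, 1, 2, 0
```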

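Sorting algorithm

sort_values() provides an option, kind, to choose the algorithm from mergesort, heapsort and quicksort. Of these three, mergesort is the only stable algorithm. Consider the following example: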

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1' ,kind='mergesort')

print (sorted_df)

Running the code above produces the following result:

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
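What stability means in practice can be sketched with two passes over the same data: first sort by col2, then do a stable sort by col1. Rows with equal col1 values keep the col2 order from the first pass. The two-pass approach is only an illustration of stability, not something the original example does.

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# First pass: order by col2 (row index order becomes 0, 2, 1, 3)
by_col2 = df.sort_values(by='col2')

# Second pass: a stable sort by col1 keeps the col2 order among ties
stable = by_col2.sort_values(by='col1', kind='mergesort')

print(stable)
# Row index order in the result: 2, 1, 3, 0
```

The result matches sorting by ['col1', 'col2'] in a single call.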

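IX. Strings and text data in pandas

This chapter covers string operations on a basic Series/Index. Later chapters show how to apply these string functions to a DataFrame.

pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.

Almost all of these methods mirror Python's string methods (see: http://docs.python.org/3/library/stdtypes.html#string-methods ). They are reached through the str accessor of a Series (for example s.str.lower()), which applies the operation to each string element.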
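As a minimal sketch of both points (NaN values are skipped, and the same accessor works on a DataFrame column), the example below uses a made-up people frame; the names and the upper column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing name
people = pd.DataFrame({"name": ["Tom", np.nan, "Katie"]})

# A DataFrame column is a Series, so the str accessor applies directly;
# the missing value stays missing instead of raising an error
people["upper"] = people["name"].str.upper()

print(people)
```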

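The table below lists each operation with a short description.

No.	Function	Description
1	`lower()`	Converts the strings in the `Series/Index` to lowercase.
2	`upper()`	Converts the strings in the `Series/Index` to uppercase.
3	`len()`	Computes the length of each string.
4	`strip()`	Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5	`split(' ')`	Splits each string by the given pattern.
6	`cat(sep=' ')`	Concatenates the Series/Index elements with the given separator.
7	`get_dummies()`	Returns a DataFrame with one-hot encoded values.
8	`contains(pattern)`	Returns the Boolean `True` for each element that contains the substring, otherwise `False`.
9	`replace(a,b)`	Replaces the value `a` with the value `b`.
10	`repeat(value)`	Repeats each element the specified number of times.
11	`count(pattern)`	Returns the number of occurrences of the pattern in each element.
12	`startswith(pattern)`	Returns `True` if the element in the Series/Index starts with the pattern.
13	`endswith(pattern)`	Returns `True` if the element in the Series/Index ends with the pattern.
14	`find(pattern)`	Returns the position of the first occurrence of the pattern; -1 means it was not found.
15	`findall(pattern)`	Returns a list of all occurrences of the pattern.
16	`swapcase()`	Swaps the case of the letters.
17	`islower()`	Checks whether all characters in each string in the Series/Index are lowercase; returns a Boolean.
18	`isupper()`	Checks whether all characters in each string in the Series/Index are uppercase; returns a Boolean.
19	`isnumeric()`	Checks whether all characters in each string in the Series/Index are numeric; returns a Boolean.

Now let's create a Series and see how the functions above work.

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveMinsu'])

print(s)

Running the code above produces the following result: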

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveMinsu
dtype: object


1. lower() example

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveMinsu'])

print(s.str.lower())

Running the code above produces the following result:

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      steveminsu
dtype: object


2. upper() example

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveMinsu'])

print(s.str.upper())

Running the code above produces the following result:

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVEMINSU
dtype: object


3. len() example

import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveMinsu'])
print(s.str.len())

Running the code above produces the following result:

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64


4. strip() example

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("=========== After Stripping ================")
print(s.str.strip())

Running the code above produces the following result:

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
=========== After Stripping ================
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object


5. split(pattern) example

import pandas as pd
import numpy as np
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("================= Split Pattern: ==================")
print(s.str.split(' '))

Running the code above produces the following result:

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
================= Split Pattern: ==================
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object


6. cat(sep=pattern) example

import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.cat(sep=' <=> '))

Running the code above produces the following result:

Tom  <=>  William Rick <=> John <=> Alber@t


7. get_dummies() example

import pandas as pd
import numpy as np

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# One indicator column per distinct string, 1 where the row matches
print(s.str.get_dummies())

Running the code above produces the following result:

    William Rick  Alber@t  John  Tom 
0              0        0     0     1
1              1        0     0     0
2              0        0     1     0
3              0        1     0     0


8. contains() example

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.contains(' '))

Running the code above produces the following result:

0     True
1     True
2    False
3    False
dtype: bool
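A common use of contains() is as a filter mask. When the Series holds missing values, the na parameter decides what the mask contains at those positions. The sketch below uses a made-up Series with one NaN (an assumption for illustration) and treats missing entries as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a missing value
s = pd.Series(['Tom ', np.nan, 'John'])

# na=False makes missing entries count as "does not contain"
mask = s.str.contains(' ', na=False)

# The Boolean mask can then select the matching elements
print(s[mask])
```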


9. replace(a,b) example

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s)
print("After replacing @ with $: ============== ")
print(s.str.replace('@','$'))

Running the code above produces the following result:

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
After replacing @ with $: ============== 
0             Tom 
1     William Rick
2             John
3          Alber$t
dtype: object


10. repeat(value) example

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.repeat(2))

Running the code above produces the following result:

0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object


11. count(pattern) example

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("The number of 'm's in each string:")
print(s.str.count('m'))

Running the code above produces the following result:

The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64


12. startswith(pattern) example

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print("Strings that start with 'T':")
print(s.str.startswith('T'))

Running the code above produces the following result:

Strings that start with 'T':
0     True
1    False
2    False
3    False
dtype: bool


13. endswith(pattern) example

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print("Strings that end with 't':")
print(s.str.endswith('t'))

Running the code above produces the following result:

Strings that end with 't':
0    False
1    False
2    False
3     True
dtype: bool


14. find(pattern) example

import pandas as pd
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.find('e'))

Running the code above produces the following result:

0   -1
1   -1
2   -1
3    3
dtype: int64


Note: -1 means the pattern does not occur in that element.

15. findall(pattern) example

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print(s.str.findall('e'))

Running the code above produces the following result:

0     []
1     []
2     []
3    [e]
dtype: object


An empty list ([]) means the pattern does not occur in that element.

16. swapcase() example

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.swapcase())

Running the code above produces the following result:

0             tOM
1    wILLIAM rICK
2            jOHN
3         aLBER@T
dtype: object


17. islower() example

import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.islower())

Running the code above produces the following result:

0    False
1    False
2    False
3    False
dtype: bool


18. isupper() example

import pandas as pd

s = pd.Series(['TOM', 'William Rick', 'John', 'Alber@t'])

print(s.str.isupper())

Running the code above produces the following result:

0     True
1    False
2    False
3    False
dtype: bool


19. isnumeric() example

import pandas as pd
s = pd.Series(['Tom', '1199','William Rick', 'John', 'Alber@t'])
print(s.str.isnumeric())

Running the code above produces the following result:

0    False
1     True
2    False
3    False
4    False
dtype: bool