Pandas in Python

Latest recommended article published on 2024-05-01 22:51:03

yaoqinghao

Tags: Python, Pandas

Article link: https://blog.csdn.net/weixin_44440552/article/details/102766027

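pandas is an open-source data analysis library for Python.
1. pandas is built on NumPy and works together with NumPy, SciPy (scientific computing), and Matplotlib (plotting).
2. pandas website: https://pandas.pydata.org/
3. pandas source code: https://github.com/pandas-dev/pandas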

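Why choose pandas?

1. Python lets you write readable, clean code with few defects.
2. pandas covers the five typical steps of data processing and analysis:
  loading, preparation, manipulation, modeling, and analysis.
3. pandas provides the fast and efficient Series and DataFrame data structures.
4. pandas data structures are built on NumPy arrays, and NumPy's core is implemented in C, which makes it fast.
5. It can load data from many different file formats into memory.
6. It handles data alignment and missing data.
7. It supports label-based (index-based) subscripting and slicing, and can handle large datasets.
8. It groups data for aggregation and transformation.
9. It offers high-performance merging and joining of data.
10. It supports time series functionality.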

# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# plt.rcParams['font.sans-serif'] = ['SimHei']    # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False        # render the minus sign correctly
sns.set_style('darkgrid', {'font.sans-serif': ['SimHei', 'Arial']})

import warnings
warnings.filterwarnings('ignore')       # suppress some warning messages

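1. Overview of pandas data structures

There are two main data structures:
the one-dimensional Series
the two-dimensional DataFrame
Both carry an index (labels), and a DataFrame is made up of Series.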

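1.1 The one-dimensional data structure: Series

Understanding the Series data structure
(1) A Series is a one-dimensional array object with labels.
(2) It can hold any data type.
(3) A Series object contains two arrays: the data and the index (labels) of the data.
(4) The data part is a NumPy array (ndarray).
Creating a Series object

pandas.Series(data, index, dtype, ...)
(1) data is the data part of the Series; it can be a list, a NumPy array, a scalar (constant), or a dict.
(2) index is the index part of the Series; it has the same length as the data and defaults to the integers 0 to n-1, as in np.arange(n).
(3) dtype sets the data type. If it is omitted, the data type is inferred.

# Create a one-dimensional Series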
apples = pd.Series([[3,2,1,0],{2:4,5:6},'hello',8])
apples
0    [3, 2, 1, 0]
1    {2: 4, 5: 6}
2           hello
3               8
dtype: object

apples = pd.Series([3,2,0,1],index = ['a','b','c','d'])
apples
a    3
b    2
c    0
d    1
dtype: int64

apples = pd.Series([3,2,0,1],index = ['a','b','b','d'])
apples
a    3
b    2
b    0
d    1
dtype: int64

apples['b']
b    2
b    0
dtype: int64

data = {'a':3,'b':2,'c':0,'d':1}
apples = pd.Series(data)
apples
a    3
b    2
c    0
d    1
dtype: int64

data = {'a':3,'b':2,'c':0,'b':1}    # for a duplicate key, the dict keeps the last value
apples = pd.Series(data)
apples
a    3
b    1
c    0
dtype: int64

apples = pd.Series([3,2,0,1],index=['a','b','c'])    # the lengths must match, otherwise an error is raised
ValueError: Length of passed values is 4, index implies 3
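The constructor description above also mentions scalar data and the dtype parameter, which the examples so far do not exercise. A minimal sketch of both follows; the variable names filled and prices are made up for illustration.

```python
import numpy as np
import pandas as pd

# A scalar is repeated for every label in the index.
filled = pd.Series(5, index=['a', 'b', 'c'])

# A NumPy array as data, with an explicit dtype instead of the inferred one.
prices = pd.Series(np.array([3, 2, 0, 1]),
                   index=['a', 'b', 'c', 'd'],
                   dtype='float64')

# The two parts of a Series: the values (an ndarray) and the index.
print(filled.tolist())
print(prices.dtype)
print(type(prices.to_numpy()))
print(list(prices.index))
```

Passing dtype explicitly is mainly useful when the inferred type (int64 here) is not the one you want to compute with.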

1.2 Accessing Series data

Accessing data with a label slice

data = {'a':3,'b':2,'c':0,'d':1}
apples = pd.Series(data)
apples['a':'c']
a    3
b    2
c    0
dtype: int64

Accessing data with a position slice

apples[:3]
a    3
b    2
c    0
dtype: int64

Accessing data with a boolean array

b = [True,False,True,False]
apples[b]
a    3
c    0
dtype: int64

apples[True,False,True,False]  # raises an error: the booleans must be passed as a list

Accessing Series data with fancy indexing

apples[['b','d']]
b    2
d    1
dtype: int64
# Note: apples['b','d'] is not valid here

apples[[1,3]]
b    2
d    1
dtype: int64
# Note: apples[1,3] is not valid here; in recent pandas versions, prefer apples.iloc[[1,3]] for positional access
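Because the [ ] operator accepts both labels and positions, it can be clearer to state the intent with the loc (label) and iloc (position) accessors, which Series also provides. A small sketch using the same apples data; the result variable names are made up for illustration.

```python
import pandas as pd

apples = pd.Series({'a': 3, 'b': 2, 'c': 0, 'd': 1})

# A label slice includes both endpoints.
by_label = apples.loc['a':'c']

# A position slice excludes the end position.
by_position = apples.iloc[:3]

# A list of positions (fancy indexing by position).
picked = apples.iloc[[1, 3]]

# A boolean mask computed from the data itself.
large = apples[apples > 1]

print(by_label.tolist(), by_position.tolist())
print(picked.index.tolist(), large.tolist())
```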

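2. The two-dimensional data structure: DataFrame

The DataFrame data structure
(1) A two-dimensional table object made up of several Series.
(2) Each column can have a different data type.
(3) Rows and columns are labeled axes.
(4) Rows and columns are mutable: they can be added and removed.

2.1 Creating a DataFrame object

The DataFrame constructor has the following form:
    pandas.DataFrame(data, index, columns, dtype, ...)
 
 (1) data is the data part of the DataFrame; it can be a list, a NumPy array, a dict, a Series object, or another DataFrame object.
 (2) index is the row index (the row labels); it defaults to the integers 0 to n-1, as in np.arange(n).
 (3) columns is the column index (the column labels); it defaults to the integers 0 to n-1, as in np.arange(n).
 (4) dtype sets the data type. If it is omitted, the data type is inferred.

Creating a DataFrame from a list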

L = [[1,2,4],[3,7,3],[6,2,1],[5,1,5]]
df = pd.DataFrame(L,columns=['apples','oranges','bananas'])
df
   apples  oranges  bananas
0       1        2        4
1       3        7        3
2       6        2        1
3       5        1        5

Creating a DataFrame from a dict

data ={
       'apples':[3,5,1,4],
       'oranges':[6,2,6,1],
       'bananas':[2,6,1,1]
       }
df = pd.DataFrame(data)
df
   apples  oranges  bananas
0       3        6        2
1       5        2        6
2       1        6        1
3       4        1        1

Creating a DataFrame from a list of dicts

data = [{'apples':3,'oranges':0,'bananas':1},
        {'apples':2,'oranges':1,'bananas':2},
        {'apples':0,'oranges':2,'bananas':1},
        {'apples':1,'oranges':3,'bananas':0}]
df = pd.DataFrame(data)
df
   apples  oranges  bananas
0       3        0        1
1       2        1        2
2       0        2        1
3       1        3        0

Creating a DataFrame from a dict of Series

data = {
        'apples':pd.Series([3,2,0,1]),
        'oranges':pd.Series([0,1,2,3]),
        'bananas':pd.Series([1,2,1,0])}
df = pd.DataFrame(data)
df
   apples  oranges  bananas
0       3        0        1
1       2        1        2
2       0        2        1
3       1        3        0

data = {
        'apples':pd.Series([3,2,0,1], index=['June','Robert','Lily','David']),
        'oranges':pd.Series([0,1,2,3], index=['June','Robert','Lily','David']),
        'bananas':pd.Series([1,2,1,0], index=['June','Robert','Lily','David'])}
df = pd.DataFrame(data)
df
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
----------------------------------------------------------------

data = {
        'apples':pd.Series([3,2,0,1]),
        'oranges':pd.Series([0,1,2,3]),
        'bananas':pd.Series([1,2,1,0])
        }
df = pd.DataFrame(data, index=['June','Robert','Lily','David'])  # these labels do not exist in the Series' integer index, so every value is NaN
df

        apples  oranges  bananas
June       NaN      NaN      NaN
Robert     NaN      NaN      NaN
Lily       NaN      NaN      NaN
David      NaN      NaN      NaN
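The all-NaN result comes from index alignment: when the data are Series, the index argument selects labels from them rather than relabeling them. One way to get named rows from integer-indexed Series is to build the frame first and assign the labels afterwards. This is a sketch of that approach, not the only option; the names list is just the row labels used above.

```python
import pandas as pd

names = ['June', 'Robert', 'Lily', 'David']
data = {
    'apples': pd.Series([3, 2, 0, 1]),
    'oranges': pd.Series([0, 1, 2, 3]),
}

# Passing index= looks up 'June', 'Robert', ... in the Series' 0..3 index:
# nothing matches, so every cell is NaN.
misaligned = pd.DataFrame(data, index=names)

# Building with the default index and then replacing the labels keeps the values.
df = pd.DataFrame(data)
df.index = names

print(misaligned)
print(df)
```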

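2.2 Accessing DataFrame columns

The [ ] operator accesses DataFrame columns in two main forms:
(1) A single label, which returns the column as a Series object.
(2) A list (or array) of labels, which returns a DataFrame containing those columns.

Note: [ ] selects columns by label only, not by position.

Accessing a column with a single label

# With the default column labels: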
L = [[3,0,1],
     [2,1,2],
     [0,2,1],
     [1,3,0]]
df = pd.DataFrame(L)
df
   0  1  2
0  3  0  1
1  2  1  2
2  0  2  1
3  1  3  0

df[0]
0    3
1    2
2    0
3    1
Name: 0, dtype: int64


# Once the columns are named, the default integer labels no longer work
df = pd.DataFrame(L,columns=['apples','oranges','bananas'])
df
   apples  oranges  bananas
0       3        0        1
1       2        1        2
2       0        2        1
3       1        3        0

df[0]   #KeyError: 0


# Example of indexing with a single label
data = {
        'apples':[3,2,0,1],
        'oranges':[0,1,2,3],
        'bananas':[1,2,1,0]
        }
df = pd.DataFrame(data, index=['June','Robert','Lily','David'])
df
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0

df['apples']
June      3
Robert    2
Lily      0
David     1
Name: apples, dtype: int64

Accessing columns with several labels

# With the default column labels (before the columns are named):
df[[0,2]]
   0  2
0  3  1
1  2  2
2  0  1
3  1  0

# After the columns are named:
df[[0,2]]  #KeyError

df[['apples','bananas']]
        apples  bananas
June         3        1
Robert       2        2
Lily         0        1
David        1        0

2.3 Accessing DataFrame rows

The [ ] operator can also access DataFrame rows, in two main forms:
slices
boolean arrays

Accessing rows with a slice

# Slicing by row position
L = [[3,0,1],
     [2,1,2],
     [0,2,1],
     [1,3,0]]
df = pd.DataFrame(L)
df[0:3]  # a position slice excludes the end row
   0  1  2
0  3  0  1
1  2  1  2
2  0  2  1
-----------------------------------------------------------------
# Slicing by row label (the end label is included)
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0

df['June':'Lily']
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1

df[0:3]
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1

Accessing rows with a boolean array

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-----------------------------------------------------------------
df[[True, False, True, False]]
      apples  oranges  bananas
June       3        0        1
Lily       0        2        1

Accessing rows with the query method

df.query("apples > 1")
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2

df.query("apples > 1 and bananas <= 2")
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2

df.query("apples > 1 or bananas < 1")
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
David        1        3        0
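A query expression selects the same rows as a boolean mask built from the columns. The sketch below rebuilds the fruit table used in this section and compares the two forms; by_query and by_mask are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3], 'bananas': [1, 2, 1, 0]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# The string expression is evaluated against the column names.
by_query = df.query("apples > 1 and bananas <= 2")

# The equivalent boolean mask; note & instead of `and`, and the parentheses.
by_mask = df[(df['apples'] > 1) & (df['bananas'] <= 2)]

print(by_query)
print(by_query.equals(by_mask))
```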

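Accessing rows with the head and tail methods

head(n) returns the first n rows; if n is omitted, it returns the first 5 rows.
tail(n) returns the last n rows; if n is omitted, it returns the last 5 rows.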
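A short sketch of head and tail on a ten-row frame; the column name n and the data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

first_default = df.head()    # n omitted: first 5 rows
first_three = df.head(3)     # first 3 rows
last_two = df.tail(2)        # last 2 rows

print(first_default)
print(first_three)
print(last_two)
```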

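2.4 DataFrame accessors

DataFrame.loc[n, m]
n can be a single row label, an array (or list) of row labels, a row-label slice, or a boolean array.
m can be a single column label, an array (or list) of column labels, a column-label slice, or a boolean array.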

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-------------------------------------------------------------------
df.loc['David','apples']
1

df.loc[['David','Robert'],'apples']
David     1
Robert    2
Name: apples, dtype: int64

type(df.loc['David','apples'])
numpy.int64

b = [True, False, True, False]
df.loc[b,'apples']
June    3
Lily    0
Name: apples, dtype: int64

df.loc[b,'apples':]
      apples  oranges  bananas
June       3        0        1
Lily       0        2        1

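DataFrame.iloc[n, m]
iloc[ ] works like loc[ ], except that all of its arguments are positions.
n can be a single row position, an array (or list) of row positions, a row-position slice, or a boolean array.
m can be a single column position, an array (or list) of column positions, a column-position slice, or a boolean array.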

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-------------------------------------------------------------------
df.iloc[3,0]
1

df.iloc[[3,1],0]
David     1
Robert    2
Name: apples, dtype: int64

... (further examples omitted)
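For reference, here is a sketch of the remaining iloc argument forms (position slices, position lists, and a boolean list) on the same fruit table; the result variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3], 'bananas': [1, 2, 1, 0]},
    index=['June', 'Robert', 'Lily', 'David'],
)

first_two_rows = df.iloc[0:2, :]                    # row-position slice, all columns
last_col = df.iloc[:, -1]                           # negative positions count from the end
block = df.iloc[[0, 2], 1:3]                        # list of rows, slice of columns
masked = df.iloc[[True, False, True, False], 0]     # boolean list on the rows

print(first_two_rows)
print(last_col)
print(block)
print(masked)
```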

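DataFrame.at[ ] and DataFrame.iat[ ]
The at[ ] and iat[ ] accessors access a single value in a DataFrame object.

DataFrame.at[n, m]
n is a row label and m is a column label.

DataFrame.iat[idx_n, idx_m]
idx_n is a row position and idx_m is a column position.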

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-------------------------------------------------------------------

df.at['Robert','apples']
2

df.iat[1,0]
2
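The same accessors can also write a single value. A minimal sketch on the fruit table, with arbitrary replacement values chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3], 'bananas': [1, 2, 1, 0]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# Write by label: row 'Robert', column 'apples'.
df.at['Robert', 'apples'] = 5

# Write by position: row 3 ('David'), column 2 ('bananas').
df.iat[3, 2] = 9

print(df)
```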

2.5 Adding and removing DataFrame rows

# Adding rows to a DataFrame: pd.concat() (DataFrame.append() was removed in pandas 2.0)
data = {
        'apples':[3,2,0,1],
        'oranges':[0,1,2,3],
        'bananas':[1,2,1,0]
        }
df = pd.DataFrame(data, index=['June','Robert','Lily','David'])
df
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0


L = [[1,3,1],
     [2,4,1]]
df2 = pd.DataFrame(L,columns=['apples','oranges','bananas'],index=['Jack','Tom'])
df2
      apples  oranges  bananas
Jack       1        3        1
Tom        2        4        1


df3 = pd.concat([df, df2])
df3
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
Jack         1        3        1
Tom          2        4        1

# Removing rows from a DataFrame: drop(labels=None)
df4 = df3.drop('Robert')
df4
       apples  oranges  bananas
June        3        0        1
Lily        0        2        1
David       1        3        0
Jack        1        3        1
Tom         2        4        1
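Two variations that often come up with row addition and removal: dropping several labels at once, and renumbering the rows when concatenating. The sketch below uses pd.concat and drop; the extra frame and its row label are made up for illustration.

```python
import pandas as pd

cols = ['apples', 'oranges', 'bananas']
df = pd.DataFrame([[3, 0, 1], [2, 1, 2], [0, 2, 1], [1, 3, 0]],
                  columns=cols, index=['June', 'Robert', 'Lily', 'David'])
extra = pd.DataFrame([[1, 3, 1]], columns=cols, index=['Jack'])

# Stack the rows of both frames, keeping the original row labels.
combined = pd.concat([df, extra])

# drop() accepts a list of labels and returns a new frame; combined is unchanged.
trimmed = combined.drop(['Robert', 'Jack'])

# ignore_index=True discards the labels and renumbers the rows 0..n-1.
renumbered = pd.concat([df, extra], ignore_index=True)

print(combined)
print(trimmed)
print(renumbered)
```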

2.6 Adding and removing DataFrame columns

# Adding a column to a DataFrame
data = {
        'apples':[3,2,0,1],
        'oranges':[0,1,2,3]
        }
df = pd.DataFrame(data)
df
   apples  oranges
0       3        0
1       2        1
2       0        2
3       1        3

df['bananas'] = pd.Series([1,2,1,0])
df
   apples  oranges  bananas
0       3        0        1
1       2        1        2
2       0        2        1
3       1        3        0

# Removing a column from a DataFrame
# The del statement
del df['bananas']
df
   apples  oranges
0       3        0
1       2        1
2       0        2
3       1        3

# The pop method removes the selected column and returns it
df.pop('apples')
0    3
1    2
2    0
3    1
Name: apples, dtype: int64

df
   oranges
0        0
1        1
2        2
3        3
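Both del and pop modify the frame in place. When the original should stay untouched, drop(columns=...) returns a new frame instead. The sketch below also adds a column computed from existing ones; the column name total is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3]})

# A plain list works as a new column when its length matches the row count.
df['bananas'] = [1, 2, 1, 0]

# A column computed from existing columns.
df['total'] = df['apples'] + df['oranges'] + df['bananas']

# drop(columns=...) returns a new frame; df keeps all four columns.
slim = df.drop(columns=['bananas'])

print(df)
print(slim)
```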

2.7 Renaming labels (rename)

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-------------------------------------------------------------
df = df.rename(columns={'apples':'苹果'})
df
        苹果  oranges  bananas
June     3        0        1
Robert   2        1        2
Lily     0        2        1
David    1        3        0


df = df.rename({'David':'戴维'})
df
        苹果  oranges  bananas
June     3        0        1
Robert   2        1        2
Lily     0        2        1
戴维      1        3        0
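rename also accepts index= and columns= in the same call, and either argument can be a function instead of a dict. A small sketch; the replacement label 'Dave' is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3], 'bananas': [1, 2, 1, 0]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# A dict renames selected row labels; a function is applied to every column label.
renamed = df.rename(index={'David': 'Dave'}, columns=str.upper)

print(renamed)
print(df.index.tolist())   # rename returns a new frame; df itself is unchanged
```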

3. The Index object

3.1 The single-level Index object

# Getting the Index object from a Series
data = {'a':3,'b':2,'c':0,'d':1}
s = pd.Series(data)
s
a    3
b    2
c    0
d    1
dtype: int64

s.index
Index(['a', 'b', 'c', 'd'], dtype='object')


# Getting the Index objects from a DataFrame
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
------------------------------------------------------------
df.index
Index(['June', 'Robert', 'Lily', 'David'], dtype='object')
df.columns
Index(['apples', 'oranges', 'bananas'], dtype='object')

3.2 Creating an Index object

# Using an Index object in a Series
labels = pd.Index(['a','b','c','d','e'])
s = pd.Series(np.arange(5), index=labels)
s
a    0
b    1
c    2
d    3
e    4
dtype: int32

# Using Index objects in a DataFrame
data = [[3,2,0,1],
        [0,1,2,3],
        [1,2,1,0]]

df = pd.DataFrame(data)
df
   0  1  2  3
0  3  2  0  1
1  0  1  2  3
2  1  2  1  0

row_labels = pd.Index(['June','Robert','Lily'])
col_labels = pd.Index(['apples','oranges','bananas','peaches'])

df = pd.DataFrame(data, index=row_labels, columns=col_labels)
df
        apples  oranges  bananas  peaches
June         3        2        0        1
Robert       0        1        2        3
Lily         1        2        1        0

3.3 Reindexing (reindex)

# Reindexing a Series
labels = pd.Index(['a','b','c','d','e'])
s = pd.Series(np.arange(5), index=labels)
s
a    0
b    1
c    2
d    3
e    4
dtype: int32

s1 = s.reindex(['a','b','c'])
s1
a    0
b    1
c    2
dtype: int32

s2 = s.reindex(['a','b','c','d','e','f'])
s2
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    NaN
dtype: float64

# Reindexing a DataFrame
        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
--------------------------------------------------------------
df2 = df.reindex(['June','Lily'])
df2
      apples  oranges  bananas
June       3        0        1
Lily       0        2        1

df3 = df.reindex(columns=['bananas','apples','oranges'])
df3
        bananas  apples  oranges
June          1       3        0
Robert        2       2        1
Lily          1       0        2
David         0       1        3
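When reindex introduces a label that did not exist, the default is NaN, which also turns integer data into floats (as in s2 above). The fill_value parameter supplies a replacement instead. A sketch, with the new labels 'f' and 'Tom' made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list('abcde'))

# Default: the new label 'f' gets NaN and the dtype becomes float64.
with_nan = s.reindex(list('abcdef'))

# fill_value=0: the new label gets 0 and the integer dtype is kept.
with_zero = s.reindex(list('abcdef'), fill_value=0)

df = pd.DataFrame(
    {'apples': [3, 2], 'oranges': [0, 1]},
    index=['June', 'Robert'],
)
# The same parameter works for DataFrame rows.
df2 = df.reindex(['June', 'Tom'], fill_value=0)

print(with_nan)
print(with_zero)
print(df2)
```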

4. The MultiIndex object

4.1 Creating a multi-level index object

Ways to create a MultiIndex object:
	pandas.MultiIndex.from_arrays: from arrays.
	pandas.MultiIndex.from_product: from a Cartesian product.
	pandas.MultiIndex.from_tuples: from tuples.
	pandas.MultiIndex.from_frame: from a DataFrame object.

keys = [('June','apples'),('June','oranges'),('June','bananas'),
        ('Robert','apples'),('Robert','oranges'),('Robert','bananas'),
        ('Lily','apples'),('Lily','oranges'),('Lily','bananas'),
        ('David','apples'),('David','oranges'),('David','bananas')]

keys
[('June', 'apples'),
 ('June', 'oranges'),
 ('June', 'bananas'),
 ('Robert', 'apples'),
 ('Robert', 'oranges'),
 ('Robert', 'bananas'),
 ('Lily', 'apples'),
 ('Lily', 'oranges'),
 ('Lily', 'bananas'),
 ('David', 'apples'),
 ('David', 'oranges'),
 ('David', 'bananas')]

index = pd.MultiIndex.from_tuples(keys, names=['names','fruits'])
index
MultiIndex(levels=[['David', 'June', 'Lily', 'Robert'], ['apples', 'bananas', 'oranges']],
           labels=[[1, 1, 1, 3, 3, 3, 2, 2, 2, 0, 0, 0], [0, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1]],
           names=['names', 'fruits'])

data = [3,0,1,
        2,1,2,
        0,2,1,
        1,3,0]
s = pd.Series(data, index=index)
s
names   fruits 
June    apples     3
        oranges    0
        bananas    1
Robert  apples     2
        oranges    1
        bananas    2
Lily    apples     0
        oranges    2
        bananas    1
David   apples     1
        oranges    3
        bananas    0
dtype: int64
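The example above uses from_tuples. The same twelve-entry index can also be built with from_product or from_arrays, as this sketch shows; the idx_* variable names are made up for illustration.

```python
import pandas as pd

names = ['June', 'Robert', 'Lily', 'David']
fruits = ['apples', 'oranges', 'bananas']
level_names = ['names', 'fruits']

# from_tuples: one (name, fruit) tuple per entry.
idx_tuples = pd.MultiIndex.from_tuples(
    [(n, f) for n in names for f in fruits], names=level_names)

# from_product: the Cartesian product of the two label lists.
idx_product = pd.MultiIndex.from_product([names, fruits], names=level_names)

# from_arrays: one array per level, each as long as the index.
outer = [n for n in names for _ in fruits]
inner = fruits * len(names)
idx_arrays = pd.MultiIndex.from_arrays([outer, inner], names=level_names)

print(list(idx_product) == list(idx_tuples))
print(list(idx_arrays) == list(idx_tuples))
```

from_product is the shortest form when every combination of labels is present; from_tuples and from_arrays also cover indexes with missing combinations.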

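4.2 Converting between rows and columns with a multi-level index

From a multi-level index to a regular index

The unstack() method converts a Series with a multi-level index into a DataFrame with a regular index.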

s.unstack()
fruits  apples  bananas  oranges
names                           
David        1        0        3
June         3        1        0
Lily         0        1        2
Robert       2        2        1

From a regular index to a multi-level index

The stack() method converts a DataFrame with a regular index into a Series with a multi-level index.

        apples  oranges  bananas
June         3        0        1
Robert       2        1        2
Lily         0        2        1
David        1        3        0
-----------------------------------------------------------------

df.stack()
June    apples     3
        oranges    0
        bananas    1
Robert  apples     2
        oranges    1
        bananas    2
Lily    apples     0
        oranges    2
        bananas    1
David   apples     1
        oranges    3
        bananas    0
dtype: int64
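stack() and unstack() are inverse operations on this table: stacking and then unstacking recovers the original values. The sketch below checks this on the fruit table; the comparison realigns rows and columns first, since the label order after unstack() is not assumed here.

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 1, 2, 3], 'bananas': [1, 2, 1, 0]},
    index=['June', 'Robert', 'Lily', 'David'],
)

stacked = df.stack()          # Series with a two-level index (row label, column label)
restored = stacked.unstack()  # back to a DataFrame with a regular index

# Realign to the original label order before comparing values.
same = (restored.loc[df.index, df.columns].values == df.values).all()

print(stacked.index.nlevels)
print(same)
```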

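4.3 Accessing data with a multi-level index

Data under a multi-level index can be accessed with the [ ] operator or with accessors such as loc.
Syntax: [level-1 label (or slice), level-2 label (or slice), ..., level-n label (or slice)]
The loc accessor also accepts lists of labels.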

s
names   fruits 
June    apples     3
        oranges    0
        bananas    1
Robert  apples     2
        oranges    1
        bananas    2
Lily    apples     0
        oranges    2
        bananas    1
David   apples     1
        oranges    3
        bananas    0
dtype: int64
----------------------------------------------------------

s[:,'apples']
names
June      3
Robert    2
Lily      0
David     1
dtype: int64

s.loc[:,'apples']
names
June      3
Robert    2
Lily      0
David     1
dtype: int64

s['June','apples']
3

s['June',:]
fruits
apples     3
oranges    0
bananas    1
dtype: int64
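The label-list form of loc mentioned above is not shown in the examples, so here is a sketch of it, together with the xs method as another way to take a cross-section on an inner level. The Series is rebuilt with from_product so the block is self-contained.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['June', 'Robert', 'Lily', 'David'], ['apples', 'oranges', 'bananas']],
    names=['names', 'fruits'])
s = pd.Series([3, 0, 1, 2, 1, 2, 0, 2, 1, 1, 3, 0], index=index)

# A list of first-level labels selects all entries under those labels.
subset = s.loc[['June', 'Lily']]

# A tuple addresses one entry across both levels.
single = s.loc[('Robert', 'bananas')]

# xs() takes a cross-section on a named level.
apples_only = s.xs('apples', level='fruits')

print(subset)
print(single)
print(apples_only)
```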

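5. Reading and writing data

Data formats fall into text formats and binary formats.
Text formats: CSV, HTML, JSON, and so on.
Binary formats: Excel, Pickle, HDF5, and so on.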

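5.1 Reading an Excel file

Excel files are read with the pandas.read_excel() function, which returns a DataFrame object. Its form is:

  pandas.read_excel(io, sheet_name=0, header=0, index_col=None, skiprows=None, skipfooter=0)
The main parameters are:
io: the input Excel file. It can be a string, a file object, or an ExcelFile object, and can refer to a local file or a network URL.
sheet_name: the worksheet to read. It can be a string, an integer (a 0-based worksheet position), or a list (to select several worksheets).
header: the row number to use for the DataFrame column labels. The default is 0 (the first row); if set to None, no row is used as column labels.
index_col: the column number to use for the DataFrame row labels. The default is None.
skiprows: the number of rows to skip at the start of the file. The default is None.
skipfooter: the number of rows to skip at the end of the file. The default is 0.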

import pandas as pd
file_path = 'data\\'

df10 = pd.read_excel(file_path+'xxxx.xls')
df10
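The effect of header, index_col, skiprows, and skipfooter can be tried without an Excel file, because pandas.read_csv accepts parameters with the same names. The sketch below therefore swaps in read_csv on an in-memory string; the data, including the title and total lines, is made up for illustration. With read_csv, skipfooter requires engine='python'.

```python
import io
import pandas as pd

# Hypothetical file content: a title line, a header row, data rows, and a footer row.
lines = [
    'fruit counts (illustrative data)',
    'name,apples,oranges',
    'June,3,0',
    'Robert,2,1',
    'Lily,0,2',
    'total,5,3',
]
raw = '\n'.join(lines)

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,       # skip the title line
    header=0,         # the first remaining row holds the column labels
    index_col=0,      # the first column holds the row labels
    skipfooter=1,     # drop the trailing 'total' row
    engine='python',  # needed by read_csv for skipfooter
)

print(df)
```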
