Data Analysis and Visualization: Introduction to Pandas

Rocky_96

Category: Data Science. Tags: data analysis, python

Article link: https://blog.csdn.net/weixin_42093587/article/details/122217428


1 Introducing Pandas
2 The Series type
3 The DataFrame type
- Creating a DataFrame
4 Pandas data manipulation
5 Pandas data computation
- Arithmetic operations
- Comparison operations

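1 Introducing Pandas

Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools. It is built on NumPy and is often used together with NumPy and Matplotlib.
Importing pandas: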

import pandas as pd

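Pandas compared with NumPy

NumPy	Pandas
Focuses on how data is structured	Focuses on how data is used
Basic data type	Extended data types
Dimensions: relationships among data	Relationships between data and index

Pandas mainly provides two data types, Series and DataFrame. The next two sections describe them in detail.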

2 The Series type

A Series consists of a set of data together with its associated index.
Example:

import pandas as pd

# Default index: integers starting from 0
a = pd.Series([9, 8, 7, 6, 5])
print(a)
'''
[output]:
0    9
1    8
2    7
3    6
4    5
dtype: int64
'''
# Custom index labels
a = pd.Series([9, 8, 7, 6, 5], index=['a', 'b', 'c', 'd', 'e'])
print(a)
'''
[output]:
a    9
b    8
c    7
d    6
e    5
dtype: int64
'''

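As the output shows, a Series has one column of index labels and one column of data. By default the index is a sequence of integers starting from 0.

Creating a Series

A Series can be created from a Python list, a scalar value, a Python dict, an ndarray, and so on.
The examples above already show creation from a list.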

import numpy as np
import pandas as pd

# From a scalar: the index determines the length
a = pd.Series(1, index=['a', 'b', 'c', 'd', 'e'])
print(a)
'''
[output]:
a    1
b    1
c    1
d    1
e    1
dtype: int64
'''
# From a dict: each index label is looked up in the dict; labels with no matching key become NaN
a = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['a', 'b', 'c', 'd', 'e'])
print(a)
'''
[output]:
a    1.0
b    2.0
c    3.0
d    NaN
e    NaN
dtype: float64
'''
# From an ndarray
a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
print(a)
'''
[output]:
a    0
b    1
c    2
d    3
e    4
dtype: int32
'''
# The integer dtype follows NumPy's platform default, so this may print as int64 instead.

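Note: when a Series is created from a Python list, the index must have as many labels as the list has elements. When it is created from a scalar, the index determines the size of the Series. When it is created from a dict, the keys of the key-value pairs become the index, and the index argument selects entries from the dict. When it is created from an ndarray, both the index and the data can be ndarrays, and they must have the same number of elements.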
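The rules in the note above can be checked with a short sketch. The variable names (`mismatch_raised`, `s_scalar`, `s_dict`) are made up for illustration and do not come from the original article.

```python
import pandas as pd

# From a list: the index must have exactly as many labels as there are values.
try:
    pd.Series([9, 8, 7], index=['a', 'b'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# From a scalar: the index alone decides the length of the Series.
s_scalar = pd.Series(1, index=['a', 'b', 'c'])

# From a dict: the index argument selects keys (in the given order);
# a label with no matching key ('z') becomes NaN, an unlisted key ('b') is dropped.
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['c', 'a', 'z'])

print(mismatch_raised)
print(len(s_scalar))
print(s_dict)
```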

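Indexing and slicing

A Series is indexed the same way as an ndarray, using []. NumPy operations and functions can also be applied to a Series. A Series can be selected or sliced by its custom index labels (including a list of labels) or by the automatic positional index; when a custom index exists, it is sliced together with the data.
Example: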

import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: the custom labels are sliced along with the data
a[:3]
'''
[output]:
a    0
b    1
c    2
dtype: int32
'''
# Boolean mask: keep the values greater than the median (2.0)
a[a > a.median()]
'''
[output]:
d    3
e    4
dtype: int32
'''
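As a complement to the slicing examples, here is a minimal sketch of label-based selection and of applying a NumPy function to a Series. The variable names are illustrative only; `np.square` is chosen here as an arbitrary NumPy function.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

by_label = a['b']            # a single label returns a scalar
by_labels = a[['a', 'c']]    # a list of labels returns a Series that keeps those labels
label_slice = a['b':'d']     # a label slice includes both endpoints
pos_slice = a[1:3]           # a positional slice excludes the stop position
squared = np.square(a)       # a NumPy ufunc returns a Series with the same index

print(by_label)
print(by_labels)
print(label_slice)
print(pos_slice)
print(squared)
```

Note the difference between the two slices: the label slice `'b':'d'` returns three rows, while the positional slice `1:3` returns two.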

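in

The in operator tests whether a label is present in the index of a Series. It checks index labels, not data values.

import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
'a' in a
# True: 'a' is one of the index labels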
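Because `in` looks at the index, testing for a data value needs a different expression. The sketch below shows one way to do it through the `.values` array; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

label_found = 'a' in a          # True: 'a' is an index label
value_as_label = 3 in a         # False: 3 is a data value, not a label
value_found = 3 in a.values     # True: membership test on the underlying ndarray

print(label_found, value_as_label, value_found)
```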

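.get()

Returns the value stored under a given index label. If the label is missing, it returns a default value (None unless one is supplied) instead of raising an error.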
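A minimal sketch of `.get()` on the same Series used in the earlier examples. The fallback value 100 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

present = a.get('c')          # value stored under label 'c'
missing = a.get('f')          # label not in the index: returns None
fallback = a.get('f', 100)    # label not in the index: returns the supplied default

print(present, missing, fallback)
```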

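Alignment

When two Series are combined in an operation, values that share the same index label are operated on together. A label that appears in only one of the two Series yields NaN in the result.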

import pandas as pd

a = pd.Series([1, 2, 3], ['a', 'b', 'c'])
b = pd.Series([4, 5, 6, 7], ['a', 'b', 'd', 'e'])
a + b
'''
output:
a    5.0
b    7.0
c    NaN
d    NaN
e    NaN
dtype: float64
'''
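Two points about alignment may be easier to see in code: matching is done by label rather than by position, and the method form `.add()` accepts a `fill_value` for labels that exist on only one side. This is a sketch with made-up Series (`x`, `y`, `z`), not an example from the original article.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['c', 'b', 'a'])   # same labels, different order

# Values are paired by label: 'a' pairs 1 with 30, 'c' pairs 3 with 10.
by_label = x + y

# With fill_value, a label missing on one side is treated as 0 instead of producing NaN.
z = pd.Series([4, 5, 6, 7], index=['a', 'b', 'd', 'e'])
filled = x.add(z, fill_value=0)

print(by_label)
print(filled)
```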

Modifying elements

import pandas as pd
a=pd.Series([1,2,3],['a','b','c'])
a['a']=10
a
'''
output:
a    10
b     2
c     3
dtype: int64
'''

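Basic Series operations resemble those of an ndarray and of a dict, and operations align on the index.

3 The DataFrame type

A DataFrame consists of a group of columns that share the same index.
A DataFrame is a tabular data type in which each column may hold a different value type. It has both a row index and a column index. It is typically used for two-dimensional data, but it can also represent higher-dimensional data.

Creating a DataFrame

A DataFrame can be created from a two-dimensional ndarray; from a dict of one-dimensional ndarrays, lists, dicts, tuples, or Series; from a Series; or from another DataFrame.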

import numpy as np
import pandas as pd

# Create from a 2-D ndarray
d = pd.DataFrame(np.arange(10).reshape(2, 5))
d
'''
[output]:

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9
'''

# Create from a dict of Series; a row label missing from a Series gives NaN in that column
dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])}
d = pd.DataFrame(dt, index=['a', 'b', 'c', 'd'])
d
'''
[output]:
	one	two
a	1.0	4
b	2.0	5
c	3.0	6
d	NaN	7
'''

# Create from a dict of lists; the dict keys become the column names
d1 = {'city': ['a', 'b', 'c', 'd', 'e'],
      'price': [100, 200, 300, 400, 500],
      'base': [50, 60, 70, 80, 90]}
d = pd.DataFrame(d1, index=['c1', 'c2', 'c3', 'c4', 'c5'])
d
'''
[output]:
	city	price	base
c1	a	100	50
c2	b	200	60
c3	c	300	70
c4	d	400	80
c5	e	500	90
'''

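A DataFrame is a two-dimensional "labeled" array. Its basic operations resemble those of a Series and work through the row and column indexes.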
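To make index-based access concrete, the sketch below selects a column, a row, and a single cell from the city/price/base table. The `.loc` and `.iloc` accessors are standard pandas but are not mentioned in the original article; the variable names are illustrative.

```python
import pandas as pd

d1 = {'city': ['a', 'b', 'c', 'd', 'e'],
      'price': [100, 200, 300, 400, 500],
      'base': [50, 60, 70, 80, 90]}
d = pd.DataFrame(d1, index=['c1', 'c2', 'c3', 'c4', 'c5'])

price_col = d['price']          # a column label returns a Series indexed by the row labels
row_c2 = d.loc['c2']            # a row label returns a Series indexed by the column names
cell = d.loc['c2', 'price']     # row label plus column label returns one value
first_two = d.iloc[:2]          # positional selection of the first two rows

print(price_col)
print(row_c2)
print(cell)
print(first_two)
```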

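4 Pandas data manipulation

Reindexing

The .reindex() method changes or reorders the index of a Series or DataFrame and returns the result as a new object.
General form:

.reindex(index=None, columns=None,…)

Main parameters: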

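Parameter	Meaning
index, columns	New custom row and column indexes
fill_value	Value used to fill positions that are missing after reindexing
method	Fill method: ffill propagates the previous value forward, bfill fills backward from the next value
limit	Maximum number of consecutive positions to fill
copy	Defaults to True and creates a new object; with False, no copy is made when the new and old indexes are equal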
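The fill-related parameters in the table can be tried on a small Series. This is a sketch with made-up data (`s` with labels 0, 2, 4); `method` requires the existing index to be sorted, which holds here.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# fill_value: new labels 1, 3, 5 get a constant
filled_const = s.reindex(range(6), fill_value=0)

# method='ffill': new labels take the value of the closest earlier label
filled_ffill = s.reindex(range(6), method='ffill')

# limit=1: at most one consecutive new label is filled, so label 6 stays NaN
filled_limited = s.reindex(range(7), method='ffill', limit=1)

print(filled_const)
print(filled_ffill)
print(filled_limited)
```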

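The Index type

The indexes of Series and DataFrame are of type Index. An Index object is immutable, so the methods below return a new Index rather than changing the existing one.
Common methods:

Method	Description
.append(idx)	Concatenates another Index object, producing a new Index object
.difference(idx)	Computes the set difference, producing a new Index object
.intersection(idx)	Computes the intersection
.union(idx)	Computes the union
.delete(loc)	Deletes the element at position loc
.insert(loc, e)	Inserts the element e at position loc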

import pandas as pd

d1 = {'city': ['a', 'b', 'c', 'd', 'e'],
      'price': [100, 200, 300, 400, 500],
      'base': [50, 60, 70, 80, 90]}
d = pd.DataFrame(d1, index=['c1', 'c2', 'c3', 'c4', 'c5'])

nc = d.columns.delete(1)       # new column Index without position 1 ('price')
ni = d.index.insert(0, 'c0')   # new row Index with 'c0' inserted at position 0
nd = d.reindex(index=ni, columns=nc)
nd
'''
[output]:
	city	base
c0	NaN	NaN
c1	a	50.0
c2	b	60.0
c3	c	70.0
c4	d	80.0
c5	e	90.0
'''
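The immutability of Index and its set-style methods can be seen in a short sketch. The two Index objects here are made up for illustration; `.difference()` is the pandas method that computes a set difference between two Index objects.

```python
import pandas as pd

idx1 = pd.Index(['c1', 'c2', 'c3'])
idx2 = pd.Index(['c3', 'c4'])

# Assigning to an element is not allowed on an Index.
try:
    idx1[0] = 'x'
    mutated = True
except TypeError:
    mutated = False

union = idx1.union(idx2)            # labels found in either Index
inter = idx1.intersection(idx2)     # labels found in both
diff = idx1.difference(idx2)        # labels in idx1 that are not in idx2
appended = idx1.append(idx2)        # plain concatenation, duplicates are kept

print(mutated)
print(list(union), list(inter), list(diff), list(appended))
```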

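Dropping specific rows or columns

.drop() removes the rows or columns with the specified labels and returns a new object. By default it drops rows.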

d1={'city':['a','b','c','d','e'],
'price':[100,200,300,400,500],
'base':[50,60,70,80,90]}
d=pd.DataFrame(d1,index=['c1','c2','c3','c4','c5'])
# Drop row c1
d.drop('c1')
'''
[output]:
 city price base
c2	b	200	60
c3	c	300	70
c4	d	400	80
c5	e	500	90
'''
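The example above drops a row. Dropping a column uses the same method with `axis=1` or the `columns` keyword, as in this sketch (variable names are illustrative):

```python
import pandas as pd

d1 = {'city': ['a', 'b', 'c', 'd', 'e'],
      'price': [100, 200, 300, 400, 500],
      'base': [50, 60, 70, 80, 90]}
d = pd.DataFrame(d1, index=['c1', 'c2', 'c3', 'c4', 'c5'])

no_price = d.drop('price', axis=1)               # drop one column by label
only_city = d.drop(columns=['price', 'base'])    # keyword form, several columns
middle_rows = d.drop(['c1', 'c5'])               # a list of row labels

print(no_price)
print(only_city)
print(middle_rows)
print(d.shape)   # the original DataFrame is unchanged
```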

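5 Pandas data computation

Arithmetic operations

Arithmetic operations align both operands on their row and column indexes before computing. Labels missing on one side are padded with NaN (missing value), which is why the results are floating-point. Binary operations written with + - * / produce a new object.
Operations in method form:

Method	Description
.add(d, **kwargs)	Addition between the two objects, with optional arguments such as fill_value and axis
.sub(d, **kwargs)	Subtraction between the two objects, with optional arguments
.mul(d, **kwargs)	Multiplication between the two objects, with optional arguments
.div(d, **kwargs)	Division between the two objects, with optional arguments

Example: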

import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
a + b
'''
[output]:

	0	1	2	3	4
0	0.0	2.0	4.0	6.0	NaN
1	9.0	11.0	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN
'''
# Addition in method form: positions missing from a are treated as 100
b.add(a, fill_value=100)
'''
[output]:

	0	1	2	3	4
0	0.0	2.0	4.0	6.0	104.0
1	9.0	11.0	13.0	15.0	109.0
2	18.0	20.0	22.0	24.0	114.0
3	115.0	116.0	117.0	118.0	119.0
'''
# Subtraction in method form: positions missing from a are treated as 10
b.sub(a, axis=0, fill_value=10)
'''
[output]:

	0	1	2	3	4
0	0.0	0.0	0.0	0.0	-6.0
1	1.0	1.0	1.0	1.0	-1.0
2	2.0	2.0	2.0	2.0	4.0
3	5.0	6.0	7.0	8.0	9.0
'''
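The `axis` argument matters most when a DataFrame is combined with a Series. The sketch below, using a made-up Series `c`, contrasts the default behaviour with `axis=0`.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# Default: the Series index is matched against the column labels,
# so c is subtracted from every row; column 4 has no partner and becomes NaN.
by_columns = b - c

# axis=0: the Series index is matched against the row labels,
# so c[i] is subtracted from every element of row i.
by_rows = b.sub(c, axis=0)

print(by_columns)
print(by_rows)
```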

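Comparison operations

Comparison operations only compare elements that have the same index labels and do not pad missing labels, so the two objects being compared must have the same shape and labels. Comparisons between a two-dimensional and a one-dimensional object, or between a one-dimensional object and a scalar (zero-dimensional), are broadcast. Binary operations written with > < >= <= == != produce Boolean objects.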

import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(12).reshape(3, 4))
a > b
'''
[output]:
	0	1	2	3
0	False	False	False	False
1	False	False	False	False
2	False	False	False	False
'''
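A short sketch of the two cases described in this section: a comparison between differently shaped DataFrames fails, while a comparison with a scalar or with a Series whose labels match the columns is broadcast. The Series `row` and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))

# Different labels: no padding is done, so the comparison raises an error.
try:
    a > b
    shape_error = False
except ValueError:
    shape_error = True

# Scalar (zero-dimensional): compared against every element.
scalar_mask = a > 5

# One-dimensional: the Series labels 0..3 match a's columns, so each row is compared with it.
row = pd.Series([0, 5, 10, 15])
row_mask = a > row

print(shape_error)
print(scalar_mask)
print(row_mask)
```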

Based on a course from China University MOOC (中国大学MOOC).
That's all.
