Data Analysis 3: Basic Usage of Pandas

第五本日记

Category: Python Data Analysis

Article link: https://blog.csdn.net/weixin_46197111/article/details/116698342

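1. Introduction

Pandas is built on top of NumPy and provides many tools for manipulating and analyzing data.
Installation:
pip install pandas
By convention, pandas is imported as follows (NumPy is imported as well, because later examples use np):

import pandas as pd
import numpy as np

The two most commonly used data types are:
Series
DataFrame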

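2. The Series Type

A Series is similar to a one-dimensional NumPy array: it is a set of values combined with a set of labels (the index), that is, a labeled one-dimensional array.

Creating a Series
A Series can be created from a list or other iterable, an ndarray, a dict, or a scalar.
You can think of a Series as a group of values in which each value carries a label, or simply as an ndarray with labels.
A Series behaves like a dict: the labels play the role of the keys and the data the role of the values.
A Series supports only one-dimensional data.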

a = pd.Series([1,2,3,4])
b = pd.Series(np.array([1,2,3,4]))
c = pd.Series({"a":1,"b":2})
d = pd.Series(1)
print(a)
print(b)
print(c)
print(d)

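Output (the int32 shown for the ndarray case depends on the platform and NumPy version; on many systems it is int64):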

0    1
1    2
2    3
3    4
dtype: int64
0    1
1    2
2    3
3    4
dtype: int32
a    1
b    2
dtype: int64
0    1
dtype: int64

Attributes

index
The index attribute holds the labels of a Series. If no index is given explicitly, the labels 0, 1, 2, ... are generated automatically.

a = pd.Series([1,2,3,4])
b = pd.Series([1,2,3,4],index = ["a","b","c","d"])
print(a)
print(b)

Output:

0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64

The index attribute can also be assigned after the Series has been created:

a = pd.Series([1,2,3,4])
print(a)
a.index = ["a","b","c","d"]
print(a)

Output:

0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64

An Index object can also be created and passed to the index parameter:

new_index =pd.Index( ["a","b","c","d"])
a = pd.Series([1,2,3,4],index=new_index)
print(a)

Output:

a    1
b    2
c    3
d    4
dtype: int64

The benefit of creating an Index object: when several Series need the same index, the one Index object can be assigned to each of them instead of giving every Series its own list. In addition, when the labels need to change, only the definition of the Index object has to be changed, not every Series.
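A minimal sketch of sharing one Index object between several Series. The names shared_index, prices, and stock are illustrative and do not come from the original article.

```python
import pandas as pd

# One Index object, defined once and reused by both Series.
shared_index = pd.Index(["a", "b", "c", "d"], name="key")

prices = pd.Series([1, 2, 3, 4], index=shared_index)
stock = pd.Series([10, 20, 30, 40], index=shared_index)

# Both Series carry the same labels, so arithmetic aligns label by label.
print(prices.index.equals(stock.index))
total = prices * stock
print(total)
```

Because the labels are defined in one place, changing the list passed to pd.Index changes the labels of every Series built from it.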

values
values is similar to the values method of a dict, except that here it is an attribute. It returns the data held by the Series.

a = pd.Series([1,2,3])
print(a.values)

Output:

[1 2 3]

shape
Returns the shape of the Series data.

a = pd.Series([1,2,3])
print(a.shape)

Output:

(3,)

size
Returns the number of elements in the Series.

a = pd.Series([1,2,3])
print(a.size)

Output:

3

dtype
Returns the data type of the elements.

a = pd.Series([1,2,3])
print(a.dtype)

Output:

int64

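name
Both a Series and its index (the label object) have a name attribute.
The name of a Series can be set through the name parameter when the Series is created.
The name is shown when the Series is printed, but it is used for more than display: for example, it serves as the row or column label when the Series is combined into a DataFrame.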

a  =pd.Series([1,2,3],index=["a","b","c"],name = "Series1")
print(a)

Output:

a    1
b    2
c    3
Name: Series1, dtype: int64

The name of a Series's index (which is itself an object) can be specified when the Index object is created.
It can also be changed later.

my_index = pd.Index(["a","b","c"],name="列1")
a = pd.Series([1,2,3],index =my_index)
print(a)
my_index.name = "新列"
print(a)

Output:

列1
a    1
b    2
c    3
dtype: int64
新列
a    1
b    2
c    3
dtype: int64

head and tail
Show at most the first n / last n elements. If no argument is given, n defaults to 5.

a = pd.Series([i for i in range(12)])
print(a.head())
print(a.tail(3))

Output:

0    0
1    1
2    2
3    3
4    4
dtype: int64
9      9
10    10
11    11
dtype: int64

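Elements of both an ndarray and a Series can be accessed by index, but the two differ.
An ndarray, like a Python list, is accessed by position. Once the array is created, each element's position is fixed and we cannot choose the index ourselves.
A Series, like a Python dict, is accessed by key (label), and we can set or change the keys ourselves.

Series operations
In use, a Series resembles a NumPy array in the following ways:
It supports broadcasting and vectorized operations
It supports indexing and slicing
It supports selecting elements with integer arrays and boolean arrays

A Series supports arithmetic with scalars and vectorized arithmetic with other Series, but its rules differ from those of ndarray.
An ndarray computes element by element according to position, whereas a Series first aligns the operands by index label. A label that appears in only one operand produces a missing value (NaN) in the result.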

s = pd.Series([1,2,3])
s1 = pd.Series([1,2,3],index=[1,2,3])
print(s+s1)

Output:

0    NaN
1    3.0
2    5.0
3    NaN
dtype: float64

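Series also provides methods for arithmetic, such as add and multiply. They are less concise than operators but offer more options: with the fill_value parameter, a label that is missing from one operand is treated as fill_value instead of producing NaN.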

s = pd.Series([1,2,3])
s1 = pd.Series([1,2,3],index=[1,2,3])
print(s.add(s1,fill_value = 10))

Output:

0    11.0
1     3.0
2     5.0
3    13.0
dtype: float64

isnull treats both None and NaN as missing values.

s = pd.Series([1,2,3,float("NaN"),None,np.nan])
print(s.isnull())

Output:

0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool

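Handling missing values in calculations
NumPy-style statistical functions such as sum and mean are also available on Series. ndarray and Series treat missing values differently: the result for an ndarray becomes NaN, whereas a Series skips missing values by default.

a = np.array([1,2,3,np.nan])
b = pd.Series([1,2,3,np.nan])
print(a.sum())
print(b.sum())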

Output:

nan
6.0
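Either default can be reversed when needed. The sketch below uses np.nansum to make NumPy ignore NaN and skipna=False to make the Series propagate it; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, np.nan])
ser = pd.Series([1, 2, 3, np.nan])

nan_safe_total = np.nansum(arr)         # NumPy sum that ignores NaN
strict_total = ser.sum(skipna=False)    # Series sum that propagates NaN
mean_of_present = ser.mean()            # mean over the three non-missing values

print(nan_safe_total, strict_total, mean_of_present)
```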

Getting elements by index and by slice

a = pd.Series([1,2,3])
print(a[0])
print(a[:2])

Output:

1
0    1
1    2
dtype: int64

Selecting elements with a label array or a boolean array

s = pd.Series([1,2,3])
print(s[[0,2]])
print(s[[True,False,True]])
print(s[s>2])

Output:

0    1
2    3
dtype: int64
0    1
2    3
dtype: int64
2    3
dtype: int64

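Indexing
If the index of a Series is non-numeric, s[key] accepts either a label or a position, which can cause confusion (recent pandas versions deprecate the positional fallback). To be explicit, use:
loc: access by label only
iloc: access by position only

Series also supports indexing with integer arrays and boolean arrays. As with NumPy arrays, both return a copy of the original data.
Unlike integer-array indexing on an ndarray, an integer array used with a Series may be interpreted as labels or as positions, depending on the index.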

a = pd.Series([1,2,3,4,5],index=["a","b","e","d","c"])
print(a.loc["a"])
print(a.iloc[0])
print(a.loc["a":"c"])
print(a.iloc[:4])

Output:

1
1
a    1
b    2
e    3
d    4
c    5
dtype: int64
a    1
b    2
e    3
d    4
dtype: int64

Note:
A positional slice includes the start and excludes the end (the same as ndarray slicing).
A label slice includes both the start and the end.
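The difference is easiest to see when the labels are integers that do not match the positions. A small sketch, with labels 10 to 40 chosen only for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=[10, 20, 30, 40])

by_label = s.loc[10:30]      # labels 10, 20, 30: the end label is included
by_position = s.iloc[0:2]    # positions 0 and 1: the end position is excluded

print(by_label.tolist())
print(by_position.tolist())
print(s.loc[10], s.iloc[3])  # label 10 versus position 3
```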

Getting, modifying, adding, and deleting Series elements
Because a Series is essentially a mapping from labels to values, it behaves much like a dict.

a = pd.Series([1,2,3],index=["a","b","c"])
print(a.loc["a"])
a.loc["d"]=4
print(a)

Output:

1
a    1
b    2
c    3
d    4
dtype: int64

Deleting values with the drop method creates a new object and does not modify the original Series.
To delete several values, pass a list of labels.

a = pd.Series([1,2,3],index=["a","b","c"])
b=a.drop(["b"])
c =a.drop(["a","c"])
print(a)
print(b)
print(c)

Output:

a    1
b    2
c    3
dtype: int64
a    1
c    3
dtype: int64
b    2
dtype: int64
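For comparison, a short sketch of the dict-style operations that do change the Series in place, next to drop, which does not. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

dropped = s.drop("b")    # returns a new Series; s itself is unchanged
print(len(s), len(dropped))

s.loc["b"] = 20          # modify the value stored under an existing label
del s["c"]               # remove a label in place, as with a dict
print(s)
```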

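3. The DataFrame Type

A DataFrame is a two-dimensional data type. It can be thought of as tabular data similar to an Excel sheet, made up of multiple columns, each of which may have a different type.
Because a DataFrame is two-dimensional, it has both a row index and a column index.

Creating a DataFrame
Common ways to create (initialize) a DataFrame object:
From a two-dimensional structure (nested list, ndarray, DataFrame, etc.).
From a dict whose keys are column names and whose values are one-dimensional structures (list, ndarray, Series, etc.).

Notes
If the row and column indexes are not given explicitly, integer indexes starting from 0 are generated automatically. They can be specified with the index and columns parameters when the DataFrame is created.
head and tail return the first / last N rows (records).

Creating a DataFrame from a two-dimensional list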

df = pd.DataFrame([[1,2,3],[4,5,6]])
df2 = pd.DataFrame([[100,200,300],[400,500,600]])
display(df)
display(df2)

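Output:

[Image: rendered tables for df and df2]
The display function renders the output as a formatted table in Jupyter Notebook; it is provided by IPython.

Creating a DataFrame from a NumPy array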

x = np.arange(12).reshape(3,4)
y = pd.DataFrame(x)
display(y)

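Output:
[Image: rendered table for y]

A DataFrame can also be created from a dict. Each key-value pair represents one column: the key gives the column label and the value gives the column's data. A scalar value is broadcast automatically, but array-like values of different lengths raise an error.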

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9],"D":10})
display(df)

Output:
[Image: rendered table for df]
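The two cases for dict values, scalar broadcasting and mismatched lengths, can be sketched as follows. The error_message variable is illustrative.

```python
import pandas as pd

# A scalar value is broadcast to every row.
df = pd.DataFrame({"A": [1, 2, 3], "D": 10})
print(df["D"].tolist())

# Lists of different lengths cannot be lined up and raise ValueError.
try:
    pd.DataFrame({"A": [1, 2, 3], "B": [4, 5]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```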

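When a DataFrame is created without an explicit index, the index defaults to integers starting from 0 and increasing by 1.
The row index can be set explicitly with the index parameter, and the column index with the columns parameter.
After the DataFrame has been created, the row (column) index can also be set by assigning to its index (columns) attribute.

df = pd.DataFrame([[1,2,3],[4,5,6]],index = ["row1","row2"],columns = ["column1","column2","column3"])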
display(df)
df.index = ["new_row1","new_row2"]
display(df)

Output:
[Image: rendered tables for df before and after the index change]
head / tail show the first / last N rows (records) of a DataFrame. N defaults to 5.

df = pd.DataFrame(np.random.random((10,5)))
display(df.tail())

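Output:
[Image: rendered table for df.tail()]
sample returns N randomly chosen rows, that is, it performs random sampling. The number of rows can be specified and defaults to 1. By default the sampling is without replacement, so the sample size cannot exceed the number of rows. Passing replace=True samples with replacement.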

df = pd.DataFrame(np.random.random((10,5)))
display(df.sample(5))
display(df.sample(10,replace = True))

Output:
[Image: rendered tables for the two samples]
Setting the random_state parameter makes a sample reproducible. The parameter is the random seed, comparable to random.seed.

df = pd.DataFrame(np.random.random((10,5)))
display(df.sample(5,random_state=1))
display(df.sample(5,random_state=1))

Output:
[Image: rendered tables for the two identical samples]
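A quick way to check both properties of sample described above, reproducibility with random_state and the size limit without replacement. The sketch uses a deterministic frame built with np.arange instead of random data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50).reshape(10, 5))

# The same random_state yields the same rows in the same order.
first = df.sample(5, random_state=1)
second = df.sample(5, random_state=1)
print(first.equals(second))

# Without replacement, asking for more rows than exist raises ValueError.
try:
    df.sample(11)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False
print(oversample_allowed)

# With replacement, the same request succeeds.
print(len(df.sample(11, replace=True, random_state=0)))
```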
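Attributes
index: the row index of the DataFrame
columns: the column index of the DataFrame
values: the data held by the DataFrame (a two-dimensional ndarray)
shape: the shape of the DataFrame
ndim: the number of dimensions of the DataFrame
dtypes: the type of each column (returned as a Series)

index gives access to the row index, columns to the column index, and values to the data. index and columns can also be assigned (modified).
The index and columns objects of a DataFrame can each be given a name.
A DataFrame cannot hold data of more than two dimensions.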

df = pd.DataFrame(np.random.random((5,5)))
print(df.index,type(df.index))
print(df.columns,type(df.columns))
print(df.values,type(df.values))
print(df.shape)
print(df.ndim)
print(df.dtypes)

Output:

RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
[[0.33251005 0.8510337  0.71788137 0.77381074 0.65425201]
 [0.99578585 0.29798949 0.69886801 0.32140075 0.70113964]
 [0.74224716 0.44235554 0.68032993 0.68581262 0.89862978]
 [0.74005128 0.25788449 0.00407884 0.19108407 0.43086349]
 [0.02749532 0.3026536  0.52314652 0.61849358 0.71502121]] <class 'numpy.ndarray'>
(5, 5)
2
0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

Setting names for the row index and the column index

df = pd.DataFrame(np.random.random((5,5)))
df.index.name = "行"
df.columns.name ="列"
display(df)

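Output:
[Image: rendered table for df with named row and column indexes]
DataFrame operations
Assume df is a DataFrame object.
Column operations:
Getting a column:
df[column label]: preferable to df.column_label, because attribute access fails when the label is a Python keyword or is not a valid identifier
df.column_label
Adding or modifying a column: df[column label] = column data
Deleting columns:
del df[column label]
df.pop(column label)
df.drop(column label or list of labels, axis=1)

Row operations:
df.loc: index by label
df.iloc: index by position
df.ix: mixed indexing. It looks up by label first and falls back to position if the label is not found (provided the labels are not numeric). [Deprecated, and removed in pandas 1.0.]
Adding rows: append [calling append repeatedly costs more than a single concatenation, so consider pd.concat instead; DataFrame.append was removed in pandas 2.0.]
Deleting rows:
df.drop(row label or list of labels)

Mixed row and column operations:
Get the row first, then the column
Get the column first, then the row

Notes:
drop can delete either rows or columns; the axis parameter selects the direction. [It can modify the object in place or return the modified result.]
df[label] operates on columns.
df[slice] operates on rows. [A slice of labels selects by label and includes the end label; a slice of integers selects by position.]
A boolean array operates on rows.
A list of labels operates on columns.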
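The rules above can be sketched on a small frame. The frame and the variable names are illustrative; the comments state which axis each form of df[...] acts on.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                  index=["row1", "row2", "row3"])

col = df["A"]                  # single label: one column, returned as a Series
cols = df[["A", "C"]]          # list of labels: columns, returned as a DataFrame
rows = df[0:2]                 # integer slice: the first two rows, by position
mask_rows = df[df["A"] > 1]    # boolean Series: the rows where the condition holds

popped = df.pop("C")               # removes column C in place and returns it
without_b = df.drop("B", axis=1)   # returns a new DataFrame; df still has B

print(rows.index.tolist(), df.columns.tolist(), without_b.columns.tolist())
```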

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]})
display(df)
display(df["A"])
df["D"]=[10,11,12]
display(df)
df.drop(["D"],axis ="columns",inplace = True)
display(df)

Output:
[Image: rendered output of the column operations above]

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index = ["row1","row2","row3"])
display(df.loc["row1"])
display(df.iloc[1])

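Output:
[Image: rendered output of df.loc["row1"] and df.iloc[1]]
When a Series is passed to append, the Series must have a name; the name becomes the label of the new row. Note that DataFrame.append was removed in pandas 2.0, so the append examples here require an earlier version.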

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index = ["row1","row2","row3"])
row_1 = pd.Series([10,11,12],name = "row4",index = ["A","B","C"])
display(row_1)
df2=df.append(row_1)
display(df2)

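Output:
[Image: rendered output of row_1 and df2]
If ignore_index is set to True in append, a new default index is generated, starting from 0 and increasing by 1. In that case the Series does not need a name.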

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index = ["row1","row2","row3"])
row_1 = pd.Series([10,11,12],index = ["A","B","C"])
display(row_1)
df2=df.append(row_1,ignore_index=True)
display(df2)

Output:
[Image: rendered output of row_1 and df2 with the regenerated index]
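On pandas 2.0 and later, where DataFrame.append no longer exists, the same two results can be obtained with pd.concat. This is a sketch of one possible replacement, not the original author's code; row_4, df2, and df3 are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                  index=["row1", "row2", "row3"])
row_4 = pd.Series([10, 11, 12], index=["A", "B", "C"], name="row4")

# Turn the Series into a one-row DataFrame, then concatenate along the rows.
# The Series name becomes the label of the new row.
df2 = pd.concat([df, row_4.to_frame().T])

# ignore_index=True discards the row labels and renumbers from 0.
df3 = pd.concat([df, row_4.to_frame().T], ignore_index=True)

print(df2.index.tolist())
print(df3.index.tolist())
```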
When adding multiple rows, concat is recommended. [concat performs better than repeated calls to append.]

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index = ["row1","row2","row3"])
df1 = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index = ["row4","row5","row6"])
display(df)
display(df1)
display(pd.concat((df,df1),axis = 0))
display(pd.concat((df,df1),axis = 1,ignore_index=True))

Output:
[Image: rendered output of df, df1, and the two concatenations]
Deleting with drop

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index=["row1","row2","row3"])
df.drop("row1",axis=0,inplace = True)
display(df)

Output:
[Image: rendered table for df after dropping row1]

Several rows can also be deleted at once:

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index=["row1","row2","row3"])
df.drop(["row1","row2"],axis=0,inplace = True)
display(df)

Output:
[Image: rendered table for df after dropping row1 and row2]
Mixed row and column operations

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index=["row1","row2","row3"])
display(df.loc["row1"].loc["A"])
display(df["A"].loc["row1"])

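Output:
[Image: rendered output of the two lookups]
Pay attention to the return type when selecting rows and columns. Selecting a single row (or column) returns a Series, but selecting with a slice returns a DataFrame, even when the slice covers only one row (or column).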

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index=["row1","row2","row3"])
print(type(df.iloc[0]))
print(type(df.iloc[0:1]))

Output:

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

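Structure of a DataFrame
Each row and each column of a DataFrame is a Series. For a row, the name of the Series is the row label and the labels of its elements are the column labels. For a column, the name of the Series is the column label and the labels of its elements are the row labels.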

df = pd.DataFrame({"A":[1,2,3],"B":[4,5,6],"C":[7,8,9]},index=["row1","row2","row3"])
print(df.iloc[0])
print(df["A"])

Output:

A    1
B    4
C    7
Name: row1, dtype: int64
row1    1
row2    2
row3    3
Name: A, dtype: int64

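DataFrame arithmetic
Each row and each column of a DataFrame is a Series, so a DataFrame can roughly be regarded as a collection of row or column Series. Many operations supported by Series also apply to DataFrame; see the Series operations described earlier.
Transpose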

df = pd.DataFrame(np.arange(24).reshape(4,6))
display(df)
display(df.T)

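Output:
[Image: rendered tables for df and df.T]

DataFrame arithmetic aligns the operands on both the row index and the column index. Where the labels do not match, the result is a missing value (NaN). To avoid missing values, use the arithmetic methods provided by DataFrame instead of the operators and pass a fill value through the fill_value parameter. The fill value is applied only when exactly one of the two operands is missing; if both are missing, the result is still NaN.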

df = pd.DataFrame(np.arange(12).reshape(3, 4))
df2 = pd.DataFrame(np.arange(12, 24).reshape(3, 4), index=[0, 1, 3], columns=[0, 1, 2, 4])
display(df)
display(df2)
display(df+df2)
display(df.add(df2,fill_value=10))

Output:
[Image: rendered tables for df, df2, df+df2, and df.add(df2, fill_value=10)]
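The three possible cases for a cell (present in both operands, present in one, missing in both) can be checked directly. The sketch reuses the two frames from the example; result is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))
df2 = pd.DataFrame(np.arange(12, 24).reshape(3, 4),
                   index=[0, 1, 3], columns=[0, 1, 2, 4])

result = df.add(df2, fill_value=10)

print(result.loc[0, 0])   # present in both frames: 0 + 12
print(result.loc[2, 0])   # present only in df: 8 + the fill value 10
print(result.loc[2, 4])   # missing in both frames: stays NaN
```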

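When a DataFrame is combined with a Series, the labels of the Series are matched against the column labels of the DataFrame by default.
To match the row labels instead, use the arithmetic method and set axis to "index" or 0.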

df = pd.DataFrame(np.arange(12).reshape(3, 4))
s = pd.Series([100, 200, 300], index=[0, 1, 2])
df.add(s, axis="index")

Output:
[Image: rendered table for df.add(s, axis="index")]
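A side-by-side sketch of the two alignment directions. The column labels x, y, z and the Series names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])
by_column = pd.Series([100, 200, 300], index=["x", "y", "z"])
by_row = pd.Series([10, 20], index=[0, 1])

col_aligned = df + by_column                  # Series labels match the column labels
row_aligned = df.add(by_row, axis="index")    # Series labels match the row labels

print(col_aligned)
print(row_aligned)
```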
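Sorting
Sorting by index: Series and DataFrame objects can be sorted by index with the sort_index method. For a DataFrame, the axis parameter selects whether the row index or the column index is sorted, and the ascending parameter selects ascending or descending order.
Sorting by value: Series and DataFrame objects can be sorted by value with the sort_values method.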

df = pd.DataFrame([[1, -5, 3], [-5, 2, 4], [9, 6, -3]], index=[-2, 1, -3], columns=[-2, 1, -3])
display(df)
df1=df.sort_index(axis=1, ascending=False)
display(df1)
df.sort_values(-2, axis="index", inplace=True)
display(df)

Output:

[Image: rendered tables for df, df1, and df after sorting by values]
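The same methods on a frame with named columns may be easier to read. A minimal sketch with hypothetical column names name and score:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [2, 3, 1]})

by_score = df.sort_values("score", ascending=False)   # rows ordered by the score column
by_columns = df.sort_index(axis=1, ascending=False)   # column labels in descending order
restored = by_score.sort_index()                      # rows back in index order

print(by_score["name"].tolist())
print(by_columns.columns.tolist())
print(restored.index.tolist())
```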

Index objects
The index of a Series or DataFrame, and the columns of a DataFrame, are index objects.
An index object can be indexed like an array:

df = pd.DataFrame([[1, 2], [3, 4]])
df.index[0]

Output:

0

An index object is immutable:

df.index[0] = "a"

This raises a TypeError. The whole index can be reassigned instead:

df = pd.DataFrame([[1, 2], [3, 4]])
df.index = ["a", "b"]
display(df)

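Output:
[Image: rendered table for df with the new index]
Related methods
mean / sum / count
max / min
cumsum / cumprod
argmax / argmin: integer position of the largest / smallest value
idxmax / idxmin: label of the largest / smallest value. In older pandas versions Series.argmax behaved like idxmax and its use was discouraged; since pandas 1.0 argmax returns the position.
var / std: variance and standard deviation
corr / cov: correlation and covariance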

df = pd.DataFrame(np.random.rand(10, 5))
df.corr()

Output:
[Image: rendered correlation matrix]
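Returning to the methods listed above, the following sketch shows the difference between idxmax and argmax on a Series with string labels, assuming pandas 1.0 or later. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([3, 9, 4], index=["a", "b", "c"])

largest_label = s.idxmax()        # label of the largest value
largest_position = s.argmax()     # integer position of the largest value
running_total = s.cumsum().tolist()

print(largest_label, largest_position, running_total)
```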
unique: removes duplicate elements but does not sort the result (unlike numpy.unique, which returns sorted values).

s = pd.Series([1, 10, -2, -5, 20, 10, -5])
display(s.unique())

Output:

array([ 1, 10, -2, -5, 20], dtype=int64)
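The ordering difference between the pandas method and the NumPy function can be seen side by side:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 10, -2, -5, 20, 10, -5])

in_order_of_appearance = s.unique().tolist()    # pandas keeps first-appearance order
sorted_unique = np.unique(s.values).tolist()    # NumPy returns sorted values

print(in_order_of_appearance)
print(sorted_unique)
```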

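value_counts: counts how often each value occurs. By default the counts are sorted in descending order; passing ascending=True sorts them in ascending order.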

s = pd.Series([1, 10, -2, -5, 20, 10, -5])
s.value_counts(ascending=True)

Output:

 1     1
-2     1
 20    1
 10    2
-5     2
dtype: int64
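Two further options of value_counts, sketched on the same data: the default descending order and normalize=True for relative frequencies. The names counts and shares are illustrative.

```python
import pandas as pd

s = pd.Series([1, 10, -2, -5, 20, 10, -5])

counts = s.value_counts()                  # most frequent values first
shares = s.value_counts(normalize=True)    # fractions of the total instead of counts

print(counts.loc[10], counts.loc[1])
print(round(shares.loc[-5], 3))
```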
