Artificial Intelligence: pandas

Latest recommended article published on 2023-02-15 12:30:04

n南x星

Article tags: pandas, artificial intelligence

Article link: https://blog.csdn.net/weixin_50517509/article/details/125728649

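This article introduces the basics of the data analysis library pandas, including the Series and DataFrame data structures and operations such as reading and writing data, filtering, merging, and grouping. Through examples it shows how to create, process, and analyze data, including handling missing values and applying statistical functions.

Summary generated automatically by CSDN.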

Artificial Intelligence: Learning pandas

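When learning data analysis, pandas is an essential library. It provides a large number of functions and methods for processing data quickly and conveniently. Its features include:

Reading data files and text data, and writing data back to text
Indexing, selection, and filtering
Arithmetic operations and data alignment
Function application and mapping
Hierarchical indexing
Sorting
Grouping and aggregation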

First, a quick look at the structure of a data table:
[Figure: DataFrame]

This is the two-dimensional data structure, DataFrame.

I. Data structures

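Data structures like the one in the figure above are especially important. pandas has mainly offered the following:

Series: one-dimensional
DataFrame: two-dimensional
Panel: three-dimensional (removed from recent pandas releases; a DataFrame with a hierarchical index is the usual replacement)
PanelND: N-dimensional (likewise no longer available in recent pandas releases)
Time Series: a Series indexed by time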
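
As a rough sketch of what the structures that are still in everyday use look like, the following builds a Series, a DataFrame, and a time-indexed Series. The variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# One-dimensional: a Series is a labelled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Two-dimensional: a DataFrame is a table whose columns are Series
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])

# Time series: a Series whose index is a DatetimeIndex
ts = pd.Series(np.arange(3), index=pd.date_range("2023-01-01", periods=3, freq="D"))

print(s.ndim, df.ndim)           # 1 2
print(type(ts.index).__name__)   # DatetimeIndex
```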

Installation

pip install pandas

Import

import pandas as pd

II. Function usage

Series

# Creation (the ndarray example also needs: import numpy as np)
1. From a list       --> pd.Series([1, 2, 3, 4, 5])
2. From an ndarray   --> pd.Series(np.arange(3), index=['a', 'b', 'c'])
3. From a dict       --> pd.Series({'a': 1, 'b': 2, 'c': 3})
4. From a scalar value
In [1]: s4 = pd.Series(1., index=list("abc"))
        s4
Out[1]: a    1.0
        b    1.0
        c    1.0
        dtype: float64

# Missing values
1. Series.isna()      detect missing values
2. Series.fillna()    fill missing values
3. Series.dropna()    drop missing values
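
A minimal sketch of the three methods on a small Series (the sample data is made up). Each call returns a new object and leaves the original Series unchanged.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mask = s.isna()          # boolean Series, True where a value is missing
filled = s.fillna(0.0)   # new Series with NaN replaced by 0.0
dropped = s.dropna()     # new Series without the missing entries

print(mask.tolist())     # [False, True, False]
print(filled.tolist())   # [1.0, 0.0, 3.0]
print(dropped.tolist())  # [1.0, 3.0]
```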

DataFrame

# Creation (the ndarray example also needs: import numpy as np)
1. From a two-dimensional ndarray -->
pd.DataFrame(np.random.randn(6, 4), index=list("abcdef"), columns=list('ABCD'))
2. From a dict
In [1]: data = {"name": ["Tom", "Jack", "Mike"], "age": [28, 34, 29]}
        Frame = pd.DataFrame(data, index=list("abc"))
        Frame
Out[1]:
           name  age
        a   Tom   28
        b  Jack   34
        c  Mike   29

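# Basic attributes
shape            shape as (number of rows, number of columns)
dtypes           data type of each column
ndim             number of dimensions
index            row index
columns          column index
values           the underlying values as an array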
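
The attributes above can be inspected on a small frame. This is only an illustrative sketch; the frame reuses the name/age sample data.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Tom", "Jack", "Mike"], "age": [28, 34, 29]},
                     index=list("abc"))

print(frame.shape)             # (3, 2)
print(frame.ndim)              # 2
print(frame.index.tolist())    # ['a', 'b', 'c']
print(frame.columns.tolist())  # ['name', 'age']
print(frame.dtypes)            # one dtype per column
print(frame.values.shape)      # (3, 2)
```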

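# Inspection
head()           show the first few rows
tail()           show the last few rows
info()           overview: number of rows and columns, column index, non-null count per column, column types, memory usage
describe()       quick summary statistics: count, mean, standard deviation, minimum, quartiles, maximum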
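
A short sketch of these inspection methods on a ten-row frame (the column name "value" is an arbitrary choice for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(10)})

print(df.head(3))   # first 3 rows (the default is 5)
print(df.tail(2))   # last 2 rows
df.info()           # prints the overview directly and returns None

stats = df.describe()
print(stats.loc["mean", "value"])  # 4.5
```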

# Merging
merge()          combine two DataFrames on shared columns
join()           combine two DataFrames on their row index
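
To make the difference concrete, here is a minimal sketch with two made-up frames that share a "key" column. By default merge() performs an inner join and join() performs a left join.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# merge: match rows on the shared column "key" (inner join by default)
merged = pd.merge(left, right, on="key")
print(merged)

# join: match rows on the row index (left join by default),
# so "c" is kept and its y value is missing
joined = left.set_index("key").join(right.set_index("key"))
print(joined)
```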

# Grouping
groupby()
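
A minimal grouping sketch, assuming a made-up table of teams and scores: rows are split by the "team" column and each group is then aggregated.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "score": [1, 2, 3, 4],
})

# Sum of scores within each team
totals = df.groupby("team")["score"].sum()
print(totals["red"], totals["blue"])  # 4 6

# Several aggregations at once
summary = df.groupby("team")["score"].agg(["mean", "max"])
print(summary)
```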

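# Missing values
dropna()            drop rows or columns that contain missing values
fillna()            fill missing values
isna(), isnull()    detect missing values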

# Indexing
loc[]            select by label
iloc[]           select by integer position
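
Both indexers are used with square brackets rather than called like functions. A small sketch on a made-up 4x5 frame:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(20).reshape(4, 5),
                 index=list("abcd"), columns=list("ABCDE"))

print(a.loc["b", "C"])                 # 7, selected by label
print(a.iloc[1, 2])                    # 7, selected by position
print(a.loc["a":"c", "A"].tolist())    # [0, 5, 10]  label slices include the end
print(a.iloc[0:2, 0].tolist())         # [0, 5]      position slices exclude the end
```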

III. Code examples

In [1]: import pandas as pd
In [2]: # Create a Series with a hierarchical (multi-level) index
		li1 = [["first", "first", "second", "second", "third", "third"],
		          ["one", "two", "one", "two", "one", "two"]]
		li2 = list(range(6))
		s = pd.Series(li2, index=li1)
		s
Out[2]: first   one    0
		        two    1
		second  one    2
		        two    3
		third   one    4
		        two    5
		dtype: int64

In [3]: # Create a DataFrame with hierarchical row and column indexes
		li3 = [["first", "first", "second", "second", "third", "third"],
		          ["o", "t", "o", "t", "o", "t"]]
		li4 = [["yes", "yes", "yes", "no", "no", "no", "ok", "ok", "ok"],
		       ["A", "B", "C", "A", "B", "C", "A", "B", "C"]]
		li5 = [[i for i in range(9)],
		          [i for i in range(9, 18)],
		          [i for i in range(18, 27)],
		          [i for i in range(27, 36)],
		          [i for i in range(36, 45)],
		          [i for i in range(45, 54)]]
		d = pd.DataFrame(li5, index=li3, columns=li4)
		d
Out[3]:          yes          no          ok        
		           A   B   C   A   B   C   A   B   C
		first  o   0   1   2   3   4   5   6   7   8
		       t   9  10  11  12  13  14  15  16  17
		second o  18  19  20  21  22  23  24  25  26
		       t  27  28  29  30  31  32  33  34  35
		third  o  36  37  38  39  40  41  42  43  44
		       t  45  46  47  48  49  50  51  52  53
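
Selecting from a hierarchically indexed frame can be sketched as follows. The frame is rebuilt here with np.arange so the block runs on its own; that construction is a shortcut chosen for this example, not the listing's own code.

```python
import numpy as np
import pandas as pd

index = [["first", "first", "second", "second", "third", "third"],
         ["o", "t", "o", "t", "o", "t"]]
columns = [["yes"] * 3 + ["no"] * 3 + ["ok"] * 3,
           ["A", "B", "C"] * 3]
d = pd.DataFrame(np.arange(54).reshape(6, 9), index=index, columns=columns)

print(d.loc["first"])   # the two rows under the outer row label "first"
print(d["no"])          # the three columns under the outer column label "no"

# A full (outer, inner) key on both axes selects a single cell
print(d.loc[("second", "t"), ("ok", "B")])  # 34
```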
		       
In [4]: import numpy as np
		a = pd.DataFrame(np.arange(20).reshape(4,5), index = list("abcd"))
		print(a)
Out[4]:     0   1   2   3   4
		a   0   1   2   3   4
		b   5   6   7   8   9
		c  10  11  12  13  14
		d  15  16  17  18  19

In [5]: print("-" * 50)
		print(a.cumsum())  # running sum of the first 1, 2, ..., n values in each column
Out[5]: --------------------------------------------------
		    0   1   2   3   4
		a   0   1   2   3   4
		b   5   7   9  11  13
		c  15  18  21  24  27
		d  30  34  38  42  46

In [6]: print("-" * 50)
		print(a.cumprod())  # running product of the first 1, 2, ..., n values in each column
Out[6]: --------------------------------------------------
		   0     1     2     3     4
		a  0     1     2     3     4
		b  0     6    14    24    36
		c  0    66   168   312   504
		d  0  1056  2856  5616  9576

In [7]: print("-" * 50)
		print(a.cummax())  # running maximum of the first 1, 2, ..., n values in each column
Out[7]: --------------------------------------------------
		    0   1   2   3   4
		a   0   1   2   3   4
		b   5   6   7   8   9
		c  10  11  12  13  14
		d  15  16  17  18  19

In [8]: print("-" * 50)
		print(a.cummin())  # running minimum of the first 1, 2, ..., n values in each column
Out[8]: --------------------------------------------------
		   0  1  2  3  4
		a  0  1  2  3  4
		b  0  1  2  3  4
		c  0  1  2  3  4
		d  0  1  2  3  4
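
The cumulative functions above accumulate down each column, which is the default axis=0. As a small sketch, passing axis=1 accumulates across each row instead:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(20).reshape(4, 5), index=list("abcd"))

# axis=1: running sum from left to right within each row
row_sums = a.cumsum(axis=1)
print(row_sums.loc["a"].tolist())  # [0, 1, 3, 6, 10]
print(row_sums.loc["b"].tolist())  # [5, 11, 18, 26, 35]

# With the default axis=0, the last row of cumsum() equals the column totals
print(a.cumsum().iloc[-1].tolist())  # [30, 34, 38, 42, 46]
```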

In [9]: df = pd.DataFrame(np.random.randn(4, 5), index=["one", "two", "three", "four"], columns=list("ABCDE"))
		df.iloc[0, 2] = np.nan
		df.iloc[[1, 3], 4] = np.nan
		df
Out[9]:               A         B         C         D         E
		one    0.412442 -0.709743       NaN -0.821877  0.914725
		two   -0.411968 -1.805219  0.312372 -1.500475       NaN
		three  0.769233 -0.685669 -0.117410  0.144312 -0.458729
		four  -1.985896  0.767057  0.203740  2.295602       NaN
      		  
In [10]: print(df.dropna(how="any"))
Out[10]:              A         B        C         D         E
		three  0.769233 -0.685669 -0.11741  0.144312 -0.458729
		        
In [11]: print(df.fillna(value=df.mean()))
Out[11]:              A         B         C         D         E
		one    0.412442 -0.709743  0.132900 -0.821877  0.914725
		two   -0.411968 -1.805219  0.312372 -1.500475  0.227998
		three  0.769233 -0.685669 -0.117410  0.144312 -0.458729
		four  -1.985896  0.767057  0.203740  2.295602  0.227998

In [12]: print(pd.isnull(df))
Out[12]:           A      B      C      D      E
		one    False  False   True  False  False
		two    False  False  False  False   True
		three  False  False  False  False  False
		four   False  False  False  False   True
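
A few further variations on missing-value handling, sketched on a frame with the same NaN positions as above. Fixed values replace the random ones so that the printed results are reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20, dtype=float).reshape(4, 5),
                  index=["one", "two", "three", "four"], columns=list("ABCDE"))
df.iloc[0, 2] = np.nan
df.iloc[[1, 3], 4] = np.nan

print(df.isna().sum().tolist())               # [0, 0, 1, 0, 2]  missing count per column
print(df.dropna(axis=1).columns.tolist())     # ['A', 'B', 'D']  drop columns instead of rows
print(df.dropna(thresh=5).index.tolist())     # ['three']        keep rows with >= 5 non-missing values
print(df.fillna(df.mean()).loc["one", "C"])   # 12.0             column mean of 7, 12, 17
```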