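Pandas is another essential library for machine learning; it is the main tool for data preprocessing and data cleaning.
What does pandas offer over NumPy, and which features does it add on top of NumPy?
1. pandas automatically aligns data by index label, in whatever order you define, which avoids mistakes caused by misaligned data during processing.
2. pandas handles missing data flexibly: a missing value can be filled with the mean of the observed values, or with any value you choose.
3. pandas supports join operations similar to those in SQL.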
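The three points above can be sketched in a few lines. This is a minimal illustration rather than a complete tour: the variable names and sample values are made up for the example, and merge is used as one common way to express an SQL-style join.

```python
import pandas as pd

# 1. Label alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
aligned_sum = a + b  # x: 1 + 30, y: 2 + 20, z: 3 + 10

# 2. Missing data: fill a gap with the mean of the observed values.
s = pd.Series([1.0, None, 3.0])
filled = s.fillna(s.mean())  # mean() skips NaN, so the gap becomes 2.0

# 3. SQL-style join: combine two tables on a shared key column.
cities = pd.DataFrame({"city": ["beijing", "shanghai"], "pop": [1.5, 2.0]})
codes = pd.DataFrame({"city": ["shanghai", "beijing"], "code": ["sh", "bj"]})
joined = cities.merge(codes, on="city", how="inner")

print(aligned_sum)
print(filled)
print(joined)
```

Note that the two Series in step 1 list their labels in opposite orders, yet the sum still pairs x with x and z with z.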
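Installing pandas
pandas must be installed before it can be used.
1) Open a cmd terminal and run pip install pandas. (This may fail because of a timeout; if it does, try pip --default-timeout=10000 install pandas.)
2) My own installation failed many times along the way, so I looked for other solutions. If method 1 does not work for you, try method 2, which I found through a Baidu search (original article: http://www.wsmee.com/post/6, license: non-commercial, no derivatives, attribution required).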
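Basic Series operations
A Series is the pandas one-dimensional array.
from pandas import Series, DataFrame  # Series and DataFrame are the two core data structures
import pandas as pd  # a shorter alias; in most machine learning code, pd refers to pandas
obj = Series([4, 5, 6, -7])
print(obj)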
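Extracting the index or the values on their own
from pandas import Series, DataFrame
import pandas as pd
obj = Series([4, 5, 6, -7])
print(obj)
print(obj.index)   # the index labels
print(obj.values)  # the underlying values as an array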
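Note: unlike dictionary keys, pandas index labels may repeat.
A dictionary looks up its keys by hashing.
Hashing: a hash function maps a key, even a single character, to a fixed-size hash value, and equal keys always produce the same hash value.
For {'a': 1, 'b': 2, 'c': 3}, each key is hashed to a number that determines where its entry is stored in memory. A newly added key is hashed in the same way; if it equals a key already in the dictionary, the new value overwrites the existing one. That is why dictionary keys cannot repeat.
Usable as dictionary keys: int, float, string, tuple (provided every element of the tuple is itself hashable).
Not usable as dictionary keys: list, set. (Q: Why not? A: Their contents are mutable. If the elements of a list were reassigned, its hash value would change, and the key could no longer be used to find its value, so Python treats these types as unhashable.)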
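The difference between dictionary keys and pandas index labels can be checked directly. The snippet below is a small sketch with made-up values; it only uses built-in dictionary behavior and the is_unique attribute of a pandas index.

```python
import pandas as pd

# Immutable values are hashable, so they can be dictionary keys.
d = {1: "int", 2.5: "float", "a": "str", (1, 2): "tuple"}

# Assigning to an existing key overwrites the old value; no duplicate key appears.
d["a"] = "replaced"

# A list is mutable and therefore unhashable: using it as a key raises TypeError.
try:
    d[[1, 2]] = "list"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# A pandas index, by contrast, may hold the same label more than once.
dup = pd.Series([4, 5, 6], index=["a", "a", "b"])
print(dup["a"])             # a repeated label selects every matching row
print(dup.index.is_unique)  # reports whether all labels are distinct
print(list_key_allowed)
```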
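Specifying the index manually
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
The output follows the order in which the index was defined.
Assigning a value through index c
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
obj2['c'] = 8  # overwrite the value stored under label 'c'
print(obj2)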
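Note
The value at c has changed.
A Series can be used like a dictionary: the in operator returns True if the label is in the index and False otherwise.
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
obj2['c'] = 8
print(obj2)
print('a' in obj2)  # membership is tested against the index labels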
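Converting a dictionary to a Series
If the data is already stored in a dictionary, can it be converted to a Series easily?
from pandas import Series, DataFrame
import pandas as pd
stade = {'beijing': 2100, 'shanghai': 3500, 'guangzhou': 9088, 'xuzhou': 3908}
obj3 = Series(stade)
print(obj3)
The dictionary keys become the Series index, and the dictionary values become the corresponding Series values.
Modifying the index
from pandas import Series, DataFrame
import pandas as pd
stade = {'beijing': 2100, 'shanghai': 3500, 'guangzhou': 9088, 'xuzhou': 3908}
obj3 = Series(stade)
print(obj3)
# Replace the index with abbreviations
obj3.index = ['bj', 'sh', 'gz', 'xz']
print(obj3)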
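Basic DataFrame operations
A DataFrame resembles a spreadsheet.
It holds tabular, multi-column data.
Q: How do you create a DataFrame?
A: Usually by passing in lists of equal length, or a NumPy array.
Here we create a DataFrame from a dictionary of equal-length lists.
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
print(frame)
The column labels are the keys of data, the row index is generated automatically, and each list fills the column under its key.
The result looks much like a spreadsheet.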
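The other route mentioned above, passing a NumPy array, can be sketched as follows. The array values and the column names are illustrative assumptions; they are not taken from the example data.

```python
import numpy as np
import pandas as pd

# A 2-D array: each row becomes a table row, each column a table column.
# The array has a single dtype (float here), so the years are stored as floats.
arr = np.array([[2016, 1.1], [2017, 1.5], [2018, 2.3]])

# Column names are supplied separately; the row index is generated automatically.
frame_np = pd.DataFrame(arr, columns=["year", "pop"])
print(frame_np)
print(frame_np.shape)
```

Because a NumPy array carries no column names, a dictionary of lists is often the more convenient choice when the columns have different types.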
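So which operations do people commonly perform on a spreadsheet?
Displaying the columns in a specified order
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])  # columns lists the desired order
print(frame)
print(frame2)
Compare the output before and after reordering the columns.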
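A DataFrame is a two-dimensional table; selecting a single column yields one-dimensional data (a Series).
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
print(frame)
print(frame2)
print(frame2['city'])  # select a column by key
print(frame2.city)     # the same column through attribute access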
Adding a new column
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
frame2['new'] = 100  # a scalar is repeated for every row
print(frame2)
Creating a new column from a computation on the table
The new column is True where the city is shanghai and False otherwise.
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
frame2['SH'] = frame2.city == 'shanghai'  # element-wise comparison gives a boolean column
print(frame2)
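Creating a DataFrame from nested dictionaries
The outer keys become the column labels and the inner keys become the row index.
from pandas import Series, DataFrame
import pandas as pd
pop = {'beijing': {2008: 1.5, 2020: 2.0},
       'shanghai': {2008: 2.0, 2020: 3.0}}
frame3 = DataFrame(pop)
print(frame3)
Swapping rows and columns (transposing the table)
from pandas import Series, DataFrame
import pandas as pd
pop = {'beijing': {2008: 1.5, 2020: 2.0},
       'shanghai': {2008: 2.0, 2020: 3.0}}
frame3 = DataFrame(pop)
print(frame3)
print(frame3.T)  # .T returns the transpose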
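Reindexing with reindex (demonstrated here on a Series; DataFrame provides reindex as well)
from pandas import Series, DataFrame
import pandas as pd
obj4 = Series([4, 5, 6, -7], index=['c', 'd', 'b', 'a'])
obj5 = obj4.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj5)
The value under e is NaN, which shows that label e did not exist in the original index.
Missing values like this usually have to be handled during data cleaning.
Filling the missing values
from pandas import Series, DataFrame
import pandas as pd
obj4 = Series([4, 5, 6, -7], index=['c', 'd', 'b', 'a'])
obj5 = obj4.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # new labels get 0 instead of NaN
print(obj5)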
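Filling missing values from neighboring values
Without a fill method, the new labels 1, 3 and 5 are left as NaN:
from pandas import Series, DataFrame
import pandas as pd
obj6 = Series(['red', 'yellow', 'blue'], index=[0, 2, 4])
print(obj6.reindex(range(6)))
With method='ffill', each gap takes the value before it (forward fill):
from pandas import Series, DataFrame
import pandas as pd
obj6 = Series(['red', 'yellow', 'blue'], index=[0, 2, 4])
print(obj6.reindex(range(6), method='ffill'))
With method='bfill', each gap takes the value after it (backward fill):
print(obj6.reindex(range(6), method='bfill'))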
Dropping missing data from a Series
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data = Series([1, NA, 2])
print(data)
print(data.dropna())
Dropping missing data from a DataFrame
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, 5, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, 2]])
print(data2.dropna())
By default, dropna removes every row that contains at least one NA.
Dropping only the rows that are entirely missing, and keeping rows that are partially missing
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, 5, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2.dropna(how='all'))
Dropping the columns that are entirely missing
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))  # axis=1 applies the rule to columns
Filling missing values with 0
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))
print(data2.fillna(0))  # fillna returns a new object; data2 itself is unchanged
Modifying the data in place
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))
data2.fillna(0, inplace=True)  # modifies data2 directly and returns None, so do not print the call
print(data2)
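Filling with the mean, mentioned at the start as an alternative to a fixed value, follows the same pattern. This is a minimal sketch on a small made-up table; the column names a and b are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# df.mean() computes one mean per column, skipping NaN;
# fillna then matches those means to the columns by label.
mean_filled = df.fillna(df.mean())
print(mean_filled)
```

Each gap receives the mean of its own column, so column a is filled with 2.0 and column b with 15.0, while df itself is left unchanged.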
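Hierarchical indexing
Data can be extracted according to the levels of the index.
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA
data3 = Series(np.random.random(10),
               index=[['q', 'w', 'e', 'r', 't', 'q', 'q', 'a', 'a', 'f'],
                      [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])
print(data3)
Extracting the rows under outer label a
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA
data3 = Series(np.random.random(10),
               index=[['q', 'w', 'e', 'r', 't', 'q', 'q', 'a', 'a', 'f'],
                      [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])
print(data3)
print(data3['a'])
Selecting a range of outer labels
A label slice on a hierarchical index requires the index to be sorted, and the outer labels here are not in order, so sort first:
print(data3.sort_index().loc['a':'q'])  # every outer label from a through q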
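Because data3 holds random numbers, the selections are easier to follow on a small fixed example. The sketch below uses made-up values; xs is shown as one way to select by the inner level, which the examples above do not cover.

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30, 40, 50],
    index=[["q", "a", "q", "f", "a"], [1, 2, 3, 4, 5]],
)

# Sort once so that both single-label lookups and label slices are supported.
ordered = s.sort_index()

outer_a = ordered.loc["a"]      # rows under outer label 'a', indexed by inner label
a_to_f = ordered.loc["a":"f"]   # every outer label from 'a' through 'f'
inner_3 = ordered.xs(3, level=1)  # rows whose inner label is 3

print(outer_a)
print(a_to_f)
print(inner_3)
```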
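Knowing how to drop and fill missing data in Series and DataFrame structures is the core of data preprocessing and a key step in data cleaning.
With the operations shown above, the data can be brought into a clean format, ready for plotting and modeling later on.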