Pandas Study Notes

Author: 机器学习别搞我！！

Last modified 2023-05-12 14:38:35


First published 2021-08-12 21:40:26

Article link: https://blog.csdn.net/weixin_52663081/article/details/119642076


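Pandas is a Python package for data analysis; among other things, it is well suited to processing time series.

Introduction to Pandas data structures

Series: a one-dimensional, array-like object. It consists of a set of data (stored with a NumPy dtype) and an associated set of labels, the index.

DataFrame: a tabular data structure containing an ordered set of columns. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same row index.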
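
To make the "dictionary of Series" view concrete, here is a minimal sketch. The column names and values are made up for illustration and do not come from the notes.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative data)
height = pd.Series([1.62, 1.75], index=["tom", "ann"])
weight = pd.Series([55, 70], index=["tom", "ann"])

# A dict of Series becomes a DataFrame: the dict keys become column names
df = pd.DataFrame({"height": height, "weight": weight})

# Selecting a column works like a dict lookup and returns a Series
col = df["weight"]
print(type(col).__name__)
print(list(df.columns))
print(list(df.index))
```

The row labels of the two Series are aligned automatically, which is why both columns end up with the same index.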

1. Series

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
arr = np.array([1, 2, 3, 4])
# Build a Series from the array
series01 = Series(arr)
print("Like a dict, keyed by index label\n", series01)
# Access the element whose index label is 0
print("Element with index label 0")
print(series01[0])
print("Index", series01.index)
print("Values", series01.values)
print("dtype", series01.dtype)

Like a dict, keyed by index label
 0    1
1    2
2    3
3    4
dtype: int32
Element with index label 0
1
Index RangeIndex(start=0, stop=4, step=1)
Values [1 2 3 4]
dtype int32

2. Changing the index and testing the values in a Series

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
arr = np.array([1, 2, 3])
series01 = Series(arr)
# Replace the default integer index with custom labels
series01.index = ("tom", "ad", "cc")
print("After changing the index")
print(series01)
# The index can also be set when the Series is created
series02 = Series(arr, index=("a", "b", "c"))
print(series02)

arr2 = Series({"a": 100, "b": 32, "c": 22})
new_index = ["a", "b", "d", "c"]
# Reindexing: label "d" has no value in arr2, so it becomes NaN
series03 = Series(arr2, index=new_index)
print(series03)

series04 = pd.isnull(series03)
series05 = pd.notnull(series03)
series06 = series03 > 10
print("Is missing\n")
print(series04)
print("Is not missing\n")
print(series05)
print("Is greater than 10\n")
print(series06)

After changing the index
tom    1
ad     2
cc     3
dtype: int32
a    1
b    2
c    3
dtype: int32
a    100.0
b     32.0
d      NaN
c     22.0
dtype: float64
Is missing

a    False
b    False
d     True
c    False
dtype: bool
Is not missing

a     True
b     True
d    False
c     True
dtype: bool
Is greater than 10

a     True
b     True
d    False
c     True
dtype: bool

3. Combining two Series

from pandas import Series, DataFrame
series01 = Series({"P1": 23, "P2": 33, "P3": 24, "P4": 36})
series02 = Series({"P2": 29, "P4": 26, "P5": 67})
# Addition aligns on index labels; a label present in only one Series gives NaN
series03 = series01 + series02
series03.name = "temperature"
print(series03)

P1     NaN
P2    62.0
P3     NaN
P4    62.0
P5     NaN
Name: temperature, dtype: float64
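
If the NaN values for labels that appear in only one Series are not wanted, one option is `Series.add` with `fill_value`. This is a small sketch using the same two Series; treating a missing label as 0 is an assumption that may or may not suit real data.

```python
import pandas as pd

series01 = pd.Series({"P1": 23, "P2": 33, "P3": 24, "P4": 36})
series02 = pd.Series({"P2": 29, "P4": 26, "P5": 67})

# fill_value=0 treats a label missing from one side as 0 instead of NaN
total = series01.add(series02, fill_value=0)
print(total)
```

With this call, P1, P3 and P5 keep the value from the one Series that has them, while P2 and P4 are summed as before.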

4. DataFrame: for two-dimensional data only

A DataFrame is essentially a two-dimensional table filled with data.

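import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Built from a nested list: rows and columns get default integer labels
dataFrame01 = DataFrame([[1, 2, 3, 4, 5], [1, 4, 5, 6, 8], [6, 4, 7, 3, 5]])
print("Default integer labels are assigned\n")
print(dataFrame01)
arr = np.array([[1, 2, 3, 4, 5], [1, 4, 5, 6, 8], [6, 4, 7, 3, 5]])

dataFrame02 = DataFrame(arr, index=["a", "b", "c"], columns=["n", "d", "g", "s", "v"])
print("index sets the row labels, columns sets the column labels")
print("With row and column names\n", dataFrame02)

Default integer labels are assigned

   0  1  2  3  4
0  1  2  3  4  5
1  1  4  5  6  8
2  6  4  7  3  5
index sets the row labels, columns sets the column labels
With row and column names
    n  d  g  s  v
a  1  2  3  4  5
b  1  4  5  6  8
c  6  4  7  3  5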

5. Summary information

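import numpy as np
import pandas as pd
from pandas import Series, DataFrame
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
series01 = Series(arr)

# count() counts non-NA values (zeros are counted too)
print("Number of non-NA values\n", series01.count())
print("Maximum\n", series01.max())
print("Minimum\n", series01.min())
print("Mean\n", series01.mean())
# argmax and describe are methods: without () the bound method itself is printed
print("Position of the maximum\n", series01.argmax())
print("Index label of the maximum\n", series01.idxmax())
print("Sum\n", series01.sum())
print("Median\n", series01.median())
# Series.mad() was removed in pandas 2.0, so the mean absolute deviation is computed by hand
print("Mean absolute deviation\n", (series01 - series01.mean()).abs().mean())
print("Variance\n", series01.var())
print("Standard deviation\n", series01.std())
print("Summary statistics\n", series01.describe())

Number of non-NA values
 8
Maximum
 8
Minimum
 1
Mean
 4.5
Position of the maximum
 7
Index label of the maximum
 7
Sum
 36
Median
 4.5
Mean absolute deviation
 2.0
Variance
 6.0
Standard deviation
 2.449489742783178
Summary statistics
 count    8.00000
mean     4.50000
std      2.44949
min      1.00000
25%      2.75000
50%      4.50000
75%      6.25000
max      8.00000
dtype: float64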

6. Working with tables (correlation, covariance, missing values, filling with 0)

from pandas import Series, DataFrame
dataFrame01 = DataFrame({"aa": [1, 2, 3, 4], "bb": [3, 1, 5, 2], "cc": [7, 5, 3, 5]})
print("Correlation coefficients:")
print(dataFrame01.corr())
# The diagonal (a column with itself) is the variance. Covariance measures how two
# variables vary together: positive if they trend the same way, negative if opposite.
print("Covariance:")
print(dataFrame01.cov())
# isnull/notnull test for missing values (NaN), not for zeros
print("Missing / not missing")
print(dataFrame01.isnull())
print(dataFrame01.notnull())
# This table has no missing values, so dropna and fillna return it unchanged
print("Drop rows with missing values\n", dataFrame01.dropna())
print("Fill NaN with 0\n", dataFrame01.fillna(0))

Correlation coefficients:
          aa        bb        cc
aa  1.000000  0.075593 -0.632456
bb  0.075593  1.000000 -0.478091
cc -0.632456 -0.478091  1.000000
Covariance:
          aa        bb        cc
aa  1.666667  0.166667 -1.333333
bb  0.166667  2.916667 -1.333333
cc -1.333333 -1.333333  2.666667
Missing / not missing
      aa     bb     cc
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
     aa    bb    cc
0  True  True  True
1  True  True  True
2  True  True  True
3  True  True  True
Drop rows with missing values
     aa  bb  cc
0   1   3   7
1   2   1   5
2   3   5   3
3   4   2   5
Fill NaN with 0
     aa  bb  cc
0   1   3   7
1   2   1   5
2   3   5   3
3   4   2   5
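
Because the table above contains no missing values, dropna and fillna leave it unchanged. The sketch below uses a small made-up DataFrame that does contain NaN, so the effect of each call is visible; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data with one NaN in each column
df = pd.DataFrame({"aa": [1.0, np.nan, 3.0], "bb": [4.0, 5.0, np.nan]})

missing = df.isnull()   # True where a value is missing
dropped = df.dropna()   # keeps only rows that have no missing value
filled = df.fillna(0)   # replaces every NaN with 0

print(missing)
print(dropped)
print(filled)
```

Only row 0 survives dropna here, since rows 1 and 2 each contain a NaN. All three calls return new objects and leave `df` itself untouched.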

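7. Pandas functions

1. Use read_csv to read a data file

# The path and the column list are placeholders: usecols takes the positions (or names) of the columns to load
data = pd.read_csv("path/to/data.csv", usecols=[0, 1])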
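
The following is a self-contained sketch of `usecols`. To avoid depending on a file on disk, it reads from an in-memory string through `io.StringIO`; the CSV content is made up for illustration.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a real file
csv_text = "NuRooms,Alley,Price\n3,pave,1314\n2,,10933\n"

# usecols=[0, 2] loads only the first and third columns
data = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
print(data)
```

Column names work as well, for example `usecols=["NuRooms", "Price"]`.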

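2. get_dummies: one-hot encodes categorical columns. With dummy_na=True, missing values (NA) get an indicator column of their own, so a column such as Alley is split into two new columns.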

inputs = pd.get_dummies(inputs,dummy_na=True)

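Original data:

   NuRooms Alley  Price
0      NaN  pave   1314
1      2.0   NaN  10933
2      4.0   NaN  13423

Result of get_dummies on the NuRooms and Alley columns (the missing NuRooms value appears to have been filled beforehand with the column mean, 3.0, in a step not shown in these notes):

   NuRooms  Alley_pave  Alley_nan
0      3.0           1          0
1      2.0           0          1
2      4.0           0          1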
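
A runnable sketch that reproduces the tables above is given here. The mean-fill step is an assumption inferred from the 3.0 in the result, and `dtype=int` is added so that the indicator columns print as 0/1 (recent pandas versions return True/False by default).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "NuRooms": [np.nan, 2.0, 4.0],
    "Alley": ["pave", np.nan, np.nan],
    "Price": [1314, 10933, 13423],
})

# Use only the feature columns as inputs
inputs = data[["NuRooms", "Alley"]].copy()

# Assumed preprocessing step: fill the numeric NaN with the column mean
inputs["NuRooms"] = inputs["NuRooms"].fillna(inputs["NuRooms"].mean())

# dummy_na=True adds an indicator column for missing Alley values
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)
```

The numeric column NuRooms passes through unchanged; only the text column Alley is expanded into Alley_pave and Alley_nan.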

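3. to_excel: write a DataFrame to an Excel file

import pandas as pd

# Without a with-statement, writer.close() must be called explicitly
# writer = pd.ExcelWriter("../data/test.xlsx")   # creates the xlsx file at a relative path
writer = pd.ExcelWriter("test.xlsx")      # creates a new xlsx file in the working directory
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])    # build a DataFrame
df1.to_excel(writer, sheet_name='Sheet1', index=False)  # write to the worksheet named Sheet1

# close() saves the file; the separate writer.save() call was removed in pandas 2.0
writer.close()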

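4. Modifying values in a DataFrame

1. Modify by integer position (iloc)

label_df.iloc[0,1] = 1

The cell being set is the one at row 0, column 1 of the table (positions are zero-based).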

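5. Modify data directly by label (loc)

label_df.loc['100000','label_0'] = 10001

Here the value is located by two labels: the first is the row's index label, which is unique in this data, and the second is the column name.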

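6. Modify values by condition

# A single .loc call avoids chained assignment, which may silently modify a copy
label_df.loc[label_df.label_2 == 0, 'label_0'] = 0

This changes values according to a condition: label_0 is set to 0 in every row of label_df where label_2 equals 0.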
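
The three ways of modifying values can be tried together in the sketch below. The contents of `label_df` here are hypothetical stand-ins, since the real data is not shown in these notes; only the column names and the row label '100000' are taken from the snippets above.

```python
import pandas as pd

# Hypothetical stand-in for label_df (values are made up)
label_df = pd.DataFrame(
    {"label_0": [5, 6, 7], "label_1": [0, 0, 0], "label_2": [0, 1, 0]},
    index=["100000", "100001", "100002"],
)

# By position: row 0, column 1 (the label_1 column)
label_df.iloc[0, 1] = 1

# By label: row label first, then column name
label_df.loc["100000", "label_0"] = 10001

# By condition: set label_0 to 0 wherever label_2 equals 0
label_df.loc[label_df["label_2"] == 0, "label_0"] = 0

print(label_df)
```

Note that the conditional assignment runs last, so it overwrites the 10001 written to row '100000', because that row also has label_2 equal to 0.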