Pandas
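Pandas: provides data types that are convenient for analysis, along with a wide range of data-analysis functions
import pandas as pd
pandas is built on NumPy and is commonly used together with NumPy and Matplotlib
Data types provided: Series (one-dimensional labeled data) and DataFrame (two-dimensional labeled data; higher dimensions can be represented with hierarchical indexes)
Both are extended data types built on ndarray: ndarray expresses the structure of the data (its dimensions), while the pandas types express how the data is used (the relationship between data and index)
They are index-based data structures: operations on the data are carried out through operations on the index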
Series
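Series type: a set of data together with the index associated with it
The index is either generated automatically (0, 1, 2, ...) or supplied by the user
Creating a Series: several methods (Python list, scalar value, dict, ndarray)
Basic operations: a.index a.values
a["a"] (by label) a[0] (by position; newer pandas versions prefer a.iloc[0])
Slicing: a[1:]
Testing whether a label is in the Series index: "c" in a
Alignment between two or more Series: values are aligned on matching index labels
Modifying a Series: a["a"] = 9; the change is made in place and takes effect immediately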
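import pandas as pd
import numpy as np
# Create from a Python list
a = pd.Series([1,2,3,4], index=["a","b","c","d"])
print(a)
# Create from a scalar value; index cannot be omitted
b = pd.Series(2, index=["a","b","c","d"])
print(b)
# Create from a dict
c = pd.Series({"a":1,"b":2})
print(c)
# index picks values out of the dict by key; keys missing from the dict give NaN
# (a list is used because an unordered set is not accepted as an index)
d = pd.Series({"a":1,"b":2}, index=["b", "a", "c"])
print(d)
# Create from an ndarray
e = pd.Series(np.arange(5))
print(e)
f = pd.Series(np.arange(5), index=np.arange(9,4,-1))# values first, then index
print(f)
# Basic operations: reading the index and the values
a = pd.Series([1,2,3,4], ["a","b","c","d"])
print(a.index)
print(a.values)
print(a["b"])# label-based and positional indexing can each be used on their own, but not mixed
print(a[1:3])
"c" in a
a.get("f",100)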
a    1
b    2
c    3
d    4
dtype: int64
a    2
b    2
c    2
d    2
dtype: int64
a    1
b    2
dtype: int64
b    2.0
a    1.0
c    NaN
dtype: float64
0    0
1    1
2    2
3    3
4    4
dtype: int32
9    0
8    1
7    2
6    3
5    4
dtype: int32
Index(['a', 'b', 'c', 'd'], dtype='object')
[1 2 3 4]
2
b    2
c    3
dtype: int64
100
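The alignment behavior mentioned in the notes (operations between Series line up on matching index labels) is not shown in the cell above. A minimal sketch, assuming two made-up Series `s1` and `s2` whose indexes only partly overlap:

```python
import pandas as pd

# Two illustrative Series (names and values are made up for this sketch).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on index labels, not on position.
total = s1 + s2
print(total)

# Labels present in only one operand ("a" and "d") have no partner,
# so the result at those labels is NaN and the dtype becomes float64.
```

Only "b" and "c" exist in both Series, so only those two labels carry a numeric result.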
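The DataFrame type
A table made up of several columns that share one index
index (row index) and columns (column index), both numbered from 0 by default
Creation: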
import pandas as pd
import numpy as np
# Create from a 2-D ndarray
a = pd.DataFrame(np.arange(10).reshape(2,5))
print(a)
# Create from a dict of Series
b = {"one":pd.Series([1,2,3],index=["a","b","c"]),
     "two":pd.Series([6,7,8,9], index=["a","b","c","d"])}
c = pd.DataFrame(b)
print(c)
print(pd.DataFrame(b, index=["a","d"],columns=["one"]))
# Create from a dict of lists
dl = {"one":[1,2,3],"two":[6,7,8]}
d = pd.DataFrame(dl, index=["a","b","c"])
print(d)
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9
   one  two
a  1.0    6
b  2.0    7
c  3.0    8
d  NaN    9
   one
a  1.0
d  NaN
   one  two
a    1    6
b    2    7
c    3    8
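The notes describe a DataFrame as columns sharing one row index, but the cell above only creates frames. The following sketch shows one way to read the two indexes and pick out a column, a row, and a single cell. The use of `.loc` for label-based row access is a choice made for this example, not something the notes prescribe:

```python
import pandas as pd

d = pd.DataFrame({"one": [1, 2, 3], "two": [6, 7, 8]}, index=["a", "b", "c"])

print(d.index)    # row labels
print(d.columns)  # column labels
print(d.values)   # underlying 2-D ndarray

col = d["one"]            # a column, selected by label, comes back as a Series
row = d.loc["b"]          # a row, selected by label, also comes back as a Series
cell = d.loc["b", "two"]  # a single value, by row label and column label
print(col, row, cell, sep="\n")
```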
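Operating on pandas data types
Changing the structure: add or reorder labels by reindexing with .reindex()
Deleting: .drop()
fill_value: the value used to fill positions that are missing after reindexing
Operations on Index objects: .append(idx): concatenate another Index object
.difference(idx): compute the set difference, producing a new Index object
……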
import numpy as np
import pandas as pd
dl = {"one":[1,2,3],"two":[6,7,8],"three":[4,5,9]}
d = pd.DataFrame(dl, index=["a","b","c"])
print(d)
print(d.drop("a"))
print(d.drop("one",axis=1))# axis=1 drops a column instead of a row
d = d.reindex(index=["b","c","a"])# reorder the rows
print(d)
d = d.reindex(columns=["three","one","two"])# reorder the columns
print(d)
# f = d.columns.insert(3,"new")
# f = d.reindex(columns=f, fill_value=20)
# print(f)
# Operations on Index objects
nc = d.columns.delete(2)
print(nc)
ni = d.index.insert(3,"m")
print(ni)
nd = d.reindex(index=ni,columns=nc)
print(nd)
n = pd.Series([1,2,3,4],index=["j","k","l","o"])
print(n)
print(n.drop(["j"]))# .drop returns a new Series and leaves the original unchanged
print(n)
   one  two  three
a    1    6      4
b    2    7      5
c    3    8      9
   one  two  three
b    2    7      5
c    3    8      9
   two  three
a    6      4
b    7      5
c    8      9
   one  two  three
b    2    7      5
c    3    8      9
a    1    6      4
   three  one  two
b      5    2    7
c      9    3    8
a      4    1    6
Index(['three', 'one'], dtype='object')
Index(['b', 'c', 'a', 'm'], dtype='object')
   three  one
b    5.0  2.0
c    9.0  3.0
a    4.0  1.0
m    NaN  NaN
j    1
k    2
l    3
o    4
dtype: int64
k    2
l    3
o    4
dtype: int64
j    1
k    2
l    3
o    4
dtype: int64
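The notes list `fill_value` and the Index set operations without running them. A small sketch of both follows, assuming a made-up column label `"extra"` and two made-up Index objects `idx1` and `idx2`:

```python
import pandas as pd

d = pd.DataFrame({"one": [1, 2, 3], "two": [6, 7, 8]}, index=["a", "b", "c"])

# Add a column label, then reindex; fill_value fills the newly created cells.
new_cols = d.columns.insert(2, "extra")  # "extra" is a hypothetical label
filled = d.reindex(columns=new_cols, fill_value=20)
print(filled)

# Index objects are immutable, so these methods return new Index objects.
idx1 = pd.Index(["a", "b", "c"])
idx2 = pd.Index(["c", "d"])
joined = idx1.append(idx2)         # concatenation; the duplicate "c" is kept
only_in_1 = idx1.difference(idx2)  # labels in idx1 that are not in idx2
print(joined)
print(only_in_1)
```

Without `fill_value`, the new column would hold NaN, as in the `nd` example above.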
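Arithmetic in pandas
Broadcasting: operands of different dimensions or sizes are first aligned on their labels; positions with no match are padded with NaN, so the result there is NaN
The four arithmetic operations come in two forms: operators (+, -, *, /) and methods (.add(), .sub(), .mul(), .div())
Series with DataFrame: by default the Series takes part along axis=1, that is, its index is matched against the column labels
Comparison: operands of the same dimension must have the same shape; for operands of different dimensions the comparison is applied along axis 1 by default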
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
print(a)
b = pd.DataFrame(np.arange(20).reshape(4,5))
print(b)
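print(a+b)# unmatched positions are padded with NaN
# The method form of the arithmetic operations accepts extra parameters
print(a.add(b,fill_value=10))# replace a missing operand with a fixed value
c = pd.Series(np.arange(4))
print(c)
print(b-c)# the Series takes part along axis=1 by default
# Comparison
# print(a>b) # raises an error: the two frames have different shapes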
print(a>c)
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
      0     1     2     3     4
0   0.0   2.0   4.0   6.0  14.0
1   9.0  11.0  13.0  15.0  19.0
2  18.0  20.0  22.0  24.0  24.0
3  25.0  26.0  27.0  28.0  29.0
0    0
1    1
2    2
3    3
dtype: int32
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN
       0      1      2      3
0  False  False  False  False
1   True   True   True   True
2   True   True   True   True
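Since a Series takes part along axis=1 by default, it may help to see the other direction as well. The sketch below rebuilds the same `b` and `c` so that it runs on its own, and uses the method form `.sub()` with `axis=0` to match the Series against the row labels instead:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# Default (axis=1): c's index is matched against b's column labels,
# so column 4 has no partner in c and becomes NaN.
by_columns = b.sub(c)  # equivalent to b - c
print(by_columns)

# axis=0: c's index is matched against b's row labels,
# so c[i] is subtracted from every element of row i.
by_rows = b.sub(c, axis=0)
print(by_rows)
```

Here `c` has four entries and `b` has four rows, so the `axis=0` result contains no NaN.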
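Sorting data
Sort by index along a given axis with .sort_index(): ascending by default; works on axis 0 (the row labels) by default, pass axis=1 to sort the column labels
Sort by values along a given axis with .sort_values(): ascending by default; works on axis 0 by default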
import pandas as pd
import numpy as np
# Sorting by index
a = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(a)
b = a.sort_index(ascending=False)# works on axis 0 by default
print(b)
c = a.sort_index(axis=1, ascending=False)
print(c)
# Sorting by value
d = a.sort_values(2, ascending=False)# sort the rows by the column labeled 2
print(d)
e = a.sort_values("a", axis=1, ascending=False)# sort the columns by the row labeled "a"
print(e)
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
   0  1   2   3
c  8  9  10  11
b  4  5   6   7
a  0  1   2   3
    3   2  1  0
a   3   2  1  0
b   7   6  5  4
c  11  10  9  8
   0  1   2   3
c  8  9  10  11
b  4  5   6   7
a  0  1   2   3
    3   2  1  0
a   3   2  1  0
b   7   6  5  4
c  11  10  9  8
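The cell above sorts DataFrames only. As a complement, here is a short sketch of `.sort_values()` on a Series that contains a missing value. The Series `s` is made up for illustration; NaN landing at the end in both directions reflects the default `na_position="last"`:

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value.
s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["a", "b", "c", "d"])

asc = s.sort_values()                  # ascending (the default)
desc = s.sort_values(ascending=False)  # descending
print(asc)
print(desc)
```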
Basic statistical analysis
Some functions: .sum()......
.describe(): bundles several summary statistics (count, mean, std, min, quartiles, max)
import pandas as pd
import numpy as np
# series
a = pd.Series(np.arange(3), index=["a","b","c"])
print(a)
print(a.describe())# the result is itself a Series, so individual statistics can be read by label
print(a.describe()["mean"])
# DataFrame
b = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(b.describe())
print(b.describe()[2])
a    0
b    1
c    2
dtype: int32
count    3.0
mean     1.0
std      1.0
min      0.0
25%      0.5
50%      1.0
75%      1.5
max      2.0
dtype: float64
1.0
         0    1     2     3
count  3.0  3.0   3.0   3.0
mean   4.0  5.0   6.0   7.0
std    4.0  4.0   4.0   4.0
min    0.0  1.0   2.0   3.0
25%    2.0  3.0   4.0   5.0
50%    4.0  5.0   6.0   7.0
75%    6.0  7.0   8.0   9.0
max    8.0  9.0  10.0  11.0
count     3.0
mean      6.0
std       4.0
min       2.0
25%       4.0
50%       6.0
75%       8.0
max      10.0
Name: 2, dtype: float64
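The notes name `.sum()` and leave the rest of the list elided. The sketch below runs only three reductions that certainly exist (`.sum()`, `.mean()`, `.max()`) on the same frame as above, and shows how the `axis` argument changes the direction of the reduction:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(12).reshape(3, 4), index=["a", "b", "c"])

col_sum = b.sum()        # axis 0 by default: one value per column
row_sum = b.sum(axis=1)  # axis=1: one value per row
col_mean = b.mean()
col_max = b.max()
print(col_sum, row_sum, col_mean, col_max, sep="\n")
```

The column means printed here match the `mean` row of `b.describe()` above.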
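Cumulative statistics
Cumulative sum (and other cumulative operations) over the first n elements
Window calculations: rolling computations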
import numpy as np
import pandas as pd
b = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(b)
print(b.cumsum())# axis 0 by default
print(b.cummin())
print(b.rolling(2).sum())# windows with too few adjacent elements give NaN
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
    0   1   2   3
a   0   1   2   3
b   4   6   8  10
c  12  15  18  21
   0  1  2  3
a  0  1  2  3
b  0  1  2  3
c  0  1  2  3
      0     1     2     3
a   NaN   NaN   NaN   NaN
b   4.0   6.0   8.0  10.0
c  12.0  14.0  16.0  18.0
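Correlation analysis
Covariance > 0 indicates positive correlation: .cov() gives the covariance matrix
Pearson correlation coefficient: .corr() gives the correlation matrix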
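The notes give no code for this section, so here is a minimal sketch of `.cov()` and `.corr()` on both Series and DataFrame. The three Series `x`, `y`, `z` are made-up data, chosen so that `y` rises with `x` and `z` falls as `x` rises:

```python
import pandas as pd

# Illustrative data (not from the notes).
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])  # rises with x
z = pd.Series([5, 4, 3, 2, 1])   # falls as x rises

print(x.cov(y))   # covariance of two Series: positive here
print(x.corr(y))  # Pearson coefficient by default: close to 1
print(x.corr(z))  # close to -1

# On a DataFrame the same methods return a column-by-column matrix.
df = pd.DataFrame({"x": x, "y": y, "z": z})
cov_matrix = df.cov()
corr_matrix = df.corr()
print(cov_matrix)
print(corr_matrix)
```

The sign of the covariance gives the direction of the relationship, while the correlation coefficient also normalizes its strength to the range from -1 to 1.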