Learning pandas with Python

O_tongwandou

Article link: https://blog.csdn.net/O_tongwandou/article/details/79836826

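Learning pandas

Getting started with pandas

Understanding the pandas library:

Two data types: Series (one-dimensional data) and DataFrame (two-dimensional, tabular data; higher-dimensional data can be represented with a hierarchical index)

NumPy (the base data type) focuses on the data structure

pandas (the extended data types) focuses on applying the data, mainly through indexes

Selection:

.loc label-based indexing ('a', 'b', ...)

.iloc position-based indexing (0, 1, 2, ...)

.ix mixed label and position indexing (tries labels first, then falls back to positions); it was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead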
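The difference between label-based and position-based selection is easiest to see when the labels are integers that do not match the positions. The following is a minimal sketch; the data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: integer labels stored in reverse order of position.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # label 0 is the last element
by_position = s.iloc[0]  # position 0 is the first element
print(by_label, by_position)

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])
cell_by_label = df.loc["b", "y"]   # row label "b", column label "y"
cell_by_position = df.iloc[1, 1]   # second row, second column (the same cell)
print(cell_by_label, cell_by_position)
```

Being explicit with .loc or .iloc avoids the ambiguity that .ix had with integer labels.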

The Series type in pandas

A Series consists of a set of data and the index associated with it.

import pandas as pd
a = pd.Series([9, 8, 7, 6])
print(a)
print(a.dtype)

Custom index:

# Custom index
import pandas as pd
a = pd.Series([9, 8, 7, 6], index=["a", 'b', 'c', 'd'])  # the index= keyword can be omitted
print(a)
print(a.dtype)

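Creating a Series

<1> From a dict

# <1> From a dict
import pandas as pd
d = pd.Series({'a': 8, 'b': 6, 'c': 5})  # a dict already has a Series-like key/value structure
print(d)
# Passing index= together with a dict selects entries by key instead of relabeling them:
# 'c' keeps its value 5, while 'd', 'f', and 'p' are not keys of the dict and become NaN
d = pd.Series({'a': 8, 'b': 6, 'c': 5}, index=['c', 'd', 'f', 'p'])
print(d)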

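<2> From an ndarray

# <2> From an ndarray
import numpy as np
import pandas as pd
n = pd.Series(np.arange(5))
# The index needs one label per value: np.arange(9, 4, -1) gives 9, 8, 7, 6, 5
m = pd.Series(np.arange(5), index=np.arange(9, 4, -1))
print(n)
print(m)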

<3> From a list

import pandas as pd
a = pd.Series([9, 8, 7, 6], index=["a", 'b', 'c', 'd'])  # the index= keyword can be omitted
print(a)
print(a.dtype)

<4> From a function such as range()
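The original gives no example for this case, so here is a minimal sketch of building a Series from range() and from a list comprehension. The variable names are illustrative only.

```python
import pandas as pd

# A range object is accepted directly as the data.
r = pd.Series(range(5), index=list("abcde"))
print(r)

# Values computed by a function can be collected in a list first.
squares = pd.Series([x * x for x in range(4)])
print(squares)
```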

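Basic Series operations

# Basic Series operations
import numpy as np
import pandas as pd
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(b.index)  # get the index
print(b.values)  # get the data
print(b["b"])  # get one value by label
print(b[['a', 'b', 'c']])  # get several values with a list of labels
# Slicing
print("*************************")
print(b[:3])
print(b[1:4])
# Computation
print("+++++++++++++++++++++++++")
print(np.sin(b))  # the result is still a Series
# Check whether a label is in the Series index
print('c' in b)  # the label exists, so this prints True
print(3 in b)  # `in` checks labels, not values or positions, so this prints False
print(b.get('f', 100))  # returns the value at label 'f' if it exists, otherwise the default 100
# Set the name attributes of the Series
print("################################")
b.name = "Series"
b.index.name = "索引列"  # "index column"
print(b)
# Change the name and assign new values
b.name = "new_name"
b[['c', 'd']] = [777, 666]  # use a list of labels; a bare tuple key b['c', 'd'] is not accepted on a flat index in current pandas
print(b)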

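The DataFrame type in pandas

A DataFrame consists of a set of columns that share the same index (index + multiple columns of data).

Each column can hold a different data type.

A DataFrame has both a row index (index) and a column index (columns).

A DataFrame can be created from a 2-D ndarray, from a dict of 1-D objects (ndarrays, lists, dicts, tuples, or Series), from Series, or from another DataFrame.

From a 2-D ndarray

From a dict of 1-D objects (Series in the example)

From a dict of lists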

# From a 2-D ndarray
import numpy as np
import pandas as pd
d = pd.DataFrame(np.arange(1, 21, 1).reshape(4, 5))
print(d)
# From a dict of 1-D objects (Series here)
dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}
d = pd.DataFrame(dt)  # where the data is incomplete, the missing entries are filled with NaN
print(d)
# From a dict of lists
dl = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}
d = pd.DataFrame(dl, index=['a', 'b', 'c', 'd'])
print(d)
print("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
print(d['one'])  # get one column as a Series
print(d.loc['d'])  # get one row by label as a Series (.ix was removed in pandas 1.0)
print(d['one']['d'])  # get the value where column 'one' and row 'd' meet

Operating on pandas data types

Modifying Series and DataFrame objects:

Reindexing: .reindex() changes or reorders the index of a Series or DataFrame

# Reindexing with .reindex()
import pandas as pd
# Columns: 城市 (city), 环比 (month-over-month), 同比 (year-over-year), 定基 (fixed-base index)
dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [102.5, 101.2, 334.6, 90.2, 100.1],
      '同比': [120.3, 89.3, 132.4, 110.7, 100.1],
      '定基': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d)
# Reorder the columns so that 城市 (city) comes first
d = d.reindex(columns=['城市', '同比', '环比', '定基'])
print(d)

Parameters of .reindex(): index, columns, fill_value, method, limit, copy
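To make the fill_value, method, and limit parameters concrete, here is a small sketch on a Series with gaps in its index. The data is invented for illustration; method-based filling assumes the original index is sorted.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

# fill_value: labels that did not exist before get a constant.
filled = s.reindex(index=list(range(6)), fill_value=0.0)

# method="ffill": new labels copy the value of the previous existing label.
ffilled = s.reindex(index=list(range(6)), method="ffill")

# limit=1: at most one consecutive new label is forward-filled.
limited = s.reindex(index=list(range(8)), method="ffill", limit=1)

print(filled)
print(ffilled)
print(limited)
```

Without fill_value or method, the new labels would simply hold NaN.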

Adding a column:

import pandas as pd
dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [102.5, 101.2, 334.6, 90.2, 100.1],
      '同比': [120.3, 89.3, 132.4, 110.7, 100.1],
      '定基': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d)
# Reorder the columns so that 城市 (city) comes first
d = d.reindex(columns=['城市', '同比', '环比', '定基'])
print(d)
# Add a column named 新增 ("new") at position 4 and fill it with 200
newc = d.columns.insert(4, "新增")
newd = d.reindex(columns=newc, fill_value=200)
print(newd)

The DataFrame Index type:

# Index type
import pandas as pd
dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [102.5, 101.2, 334.6, 90.2, 100.1],
      '同比': [120.3, 89.3, 132.4, 110.7, 100.1],
      '定基': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d.index)  # row index
print(d.columns)  # column index

Using the Index type (inserting and deleting labels):

# Using the DataFrame Index type
import pandas as pd
dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [102.5, 101.2, 334.6, 90.2, 100.1],
      '同比': [120.3, 89.3, 132.4, 110.7, 100.1],
      '定基': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
d = d.reindex(columns=['城市', '同比', '环比', '定基'])
print(d)
nc = d.columns.delete(2)  # remove the third column label (position 2)
ni = d.index.insert(5, "c0")  # add the row label 'c0' at position 5
nd = d.reindex(index=ni, columns=nc)  # the new row 'c0' is filled with NaN
print(nd)

Dropping entries by index label:

# Dropping entries by index label
# <1> Series
import pandas as pd
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(a)
print("111111111111111111")
print(a.drop(['b', 'c']))
# <2> DataFrame
dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [102.5, 101.2, 334.6, 90.2, 100.1],
      '同比': [120.3, 89.3, 132.4, 110.7, 100.1],
      '定基': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
d = d.reindex(columns=['城市', '同比', '环比', '定基'])
print(d)
print("22222222222222222")
print(d.drop('c5'))  # drop by row label (index)
print("333333333333333333")
print(d.drop("同比", axis=1))  # drop by column label (axis=1 means columns)

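Arithmetic operations

Arithmetic operations first align the operands on their row and column indexes and then compute; whenever alignment introduces NaN, the result is floating point.

Labels missing from one operand are filled with NaN during alignment.

Operations between 2-D and 1-D objects, and between a 1-D object and a scalar, are broadcast (similar to MATLAB's element-wise rules).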

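# Addition
import numpy as np
import pandas as pd
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
print('---------------------------')
print(a + b)  # element-wise on matching labels; positions missing from either side become NaN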
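The same alignment rule applies to Series. A minimal sketch with two invented Series that share only some labels:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2
print(total)        # labels "a" and "d" exist on only one side, so they become NaN
print(total.dtype)  # the NaN values force a floating-point result
```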

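Method-form operations (as opposed to the operators + - * /)

Note: the method forms accept additional optional parameters that the operators do not.

.add(d, **kwargs) addition between objects, with optional parameters

.sub(d, **kwargs) subtraction between objects, with optional parameters

.mul(d, **kwargs) multiplication between objects, with optional parameters

.div(d, **kwargs) division between objects, with optional parameters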

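# Method-form operations (as opposed to the operators + - * /)
# fill_value sets the value used where a position is missing from one operand
import numpy as np
import pandas as pd
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
print(a.add(b, fill_value=100))  # missing positions are filled with 100 before adding
print('_______________________')
print(b.mul(a, fill_value=0))  # missing positions are filled with 0 before multiplying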

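Operations across dimensions (broadcasting)

# Operations across dimensions (broadcasting)
import numpy as np
import pandas as pd
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
c = pd.Series(np.arange(4))
print(c)
print(c + 10)  # adds 10 to every element
print(b.sub(c, axis=0))  # axis=0: c is aligned with b's row index and subtracted from every column
print(b.sub(c))  # default axis=1: c is aligned with b's column labels and subtracted from every row; column 4 has no match and becomes NaN
print(b.sub(a, fill_value=100))  # DataFrame minus DataFrame: positions missing from a are treated as 100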

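Comparison operations

Note: the comparison operators only compare objects with identical labels; nothing is aligned or filled, and comparing two differently labeled DataFrames (or two differently labeled Series) raises a ValueError.

Comparisons between 2-D and 1-D objects, and between a 1-D object and a scalar, are broadcast.

Binary operations with the operators < > <= >= == != produce boolean objects.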

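# Comparison operations
import numpy as np
import pandas as pd
a = pd.DataFrame(np.arange(12).reshape(3, 4))
d = pd.DataFrame(np.arange(12, 0, -1).reshape(3, 4))
print(d)
print(a > d)  # element-wise comparison; the result is a boolean DataFrame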
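The boolean objects produced by a comparison are mostly used as masks for selection. A short sketch, reusing the shape of the example above (the thresholds are arbitrary):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))

mask = a > 5                   # boolean DataFrame with the same shape as a
n_true = int(mask.sum().sum()) # number of cells greater than 5
print(a[mask])                 # cells where the mask is False are shown as NaN

rows = a[a[0] > 0]             # a boolean Series selects whole rows
print(rows)
```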

Data feature analysis

Sorting data

.sort_index() sorts along the given axis by index labels, ascending by default.

Signature: .sort_index(axis=0, ascending=True)

# Sorting by index labels
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4, 5),
                 index=['c', 'a', 'b', 'd'])
print(b)
print(b.sort_index())  # axis=0 by default: sorts the rows by index label; pass axis=1 to sort the columns
print(b.sort_index(ascending=False))  # ascending=False sorts the row labels in descending order
c = b.sort_index(axis=1, ascending=False)  # sort the columns in descending order
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
print(c)
print(c.sort_index())  # sort the rows in ascending order

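Sorting data by value

.sort_values() sorts along the given axis by values, ascending by default.

Series.sort_values(axis=0, ascending=True)

DataFrame.sort_values(by, axis=0, ascending=True) # sorts by the values found under one label (or several) on the given axis, ascending or descending

by: a label or list of labels; a column label when axis=0, a row label when axis=1

Note the relationship between axis and rows/columns: axis=0 reorders the rows, axis=1 reorders the columns.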

# Sorting by value
import pandas as pd
import numpy as np
b = pd.DataFrame(np.random.rand(20).reshape(4, 5), index=['c', 'a', 'b', 'd'])
print(b)
c = b.sort_values(2, ascending=False)  # axis=0 by default: reorders the rows by the values in column 2
print(c)
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
print(c.sort_values('a', axis=1, ascending=False))  # axis=1: reorders the columns by the values in row 'a'

NaN handling when sorting: by default NaN values are placed at the end, whether the sort is ascending or descending.
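A quick check of the NaN rule, as a minimal sketch on an invented Series. The na_position parameter of sort_values controls the placement if the default is not wanted.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                       # NaN goes last
desc = s.sort_values(ascending=False)       # NaN still goes last
first = s.sort_values(na_position="first")  # NaN goes first on request

print(asc)
print(desc)
print(first)
```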

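Basic statistics

<1> Applies to both Series and DataFrame

.sum() sum of the data, computed along axis 0 by default (the same holds for the methods below)

.count() number of non-NaN values

.mean() .median() arithmetic mean and median

.var() .std() variance and standard deviation

.min() .max() minimum and maximum

<2> Applies to Series only

.argmin() .argmax() integer position of the minimum and maximum (automatic index)

.idxmin() .idxmax() index label of the minimum and maximum (custom index); these two are also available on DataFrame

<3> Applies to both Series and DataFrame

.describe() summary statistics along axis 0 (per column)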
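The difference between the positional and the label-based variants shows up as soon as the index is not the default one. A minimal sketch with the Series used earlier in this post:

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

print(s.sum(), s.count(), s.mean())
pos_max = s.argmax()    # integer position of the maximum
label_max = s.idxmax()  # index label of the maximum
pos_min = s.argmin()
label_min = s.idxmin()
print(pos_max, label_max, pos_min, label_min)
```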

# Basic statistics
# <1> 1-D Series
import pandas as pd
import numpy as np
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(a)
print(a.describe())  # describe() returns a Series, so its entries can be read by label
print(a.describe()['count'])
print(a.describe()['25%'])
# <2> 2-D DataFrame
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['a', 'b', 'c', 'd'])
print(b)
print(b.describe())
print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")
# Get one row of the summary; the result is a Series
print(b.describe().loc['max'])  # .ix was removed in pandas 1.0, so .loc is used
# Get one column of the summary; the result is a Series
print(b.describe()[2])

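Cumulative statistics

.cumsum() running sum of the first 1, 2, 3, ..., n values

.cumprod() running product of the first n values

.cummax() running maximum of the first n values

.cummin() running minimum of the first n values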
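The four cumulative methods can be compared side by side on a small invented Series:

```python
import pandas as pd

s = pd.Series([1, 3, 2, 4])

running_sum = s.cumsum()    # 1, 1+3, 1+3+2, 1+3+2+4
running_prod = s.cumprod()  # 1, 1*3, 1*3*2, 1*3*2*4
running_max = s.cummax()    # largest value seen so far
running_min = s.cummin()    # smallest value seen so far

print(pd.DataFrame({"sum": running_sum, "prod": running_prod,
                    "max": running_max, "min": running_min}))
```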

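Correlation analysis

Positive correlation, negative correlation, no correlation

Criteria:

(1) Covariance: covariance > 0 means the two variables are positively correlated, covariance < 0 means they are negatively correlated, and covariance = 0 means they are linearly uncorrelated (which does not by itself imply independence). Use .cov() to compute the covariance matrix.

(2) Pearson correlation coefficient (r ranges over [-1, 1]), judged by |r|:

0.8 to 1.0: very strong correlation

0.6 to 0.8: strong correlation

0.4 to 0.6: moderate correlation

0.2 to 0.4: weak correlation

0.0 to 0.2: very weak or no correlation

Use .corr() to compute the correlation matrix; it supports the Pearson, Spearman, and Kendall coefficients.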

# Correlation analysis
# Relationship between the growth rate of housing prices and the growth rate of RMB issuance (m2)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=['2008', '2009', '2010', '2011', '2012'])
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=['2008', '2009', '2010', '2011', '2012'])
print(hprice.corr(m2))  # r = 0.5239, a moderate correlation
# Draw a scatter plot to inspect the correlation visually
plt.scatter(hprice, m2)
plt.show()
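The same figures can also be placed in a DataFrame, where .cov() and .corr() return full matrices rather than a single number. This is a sketch that reuses the data above; the column names are chosen here for illustration.

```python
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]
df = pd.DataFrame({"hprice": [3.04, 22.93, 12.75, 22.6, 12.33],
                   "m2": [8.18, 18.38, 9.13, 7.82, 6.69]}, index=years)

cov = df.cov()    # covariance matrix
corr = df.corr()  # Pearson correlation matrix (the default method)
r = corr.loc["hprice", "m2"]

print(cov)
print(corr)
print(round(r, 4))
```

The off-diagonal entry of the correlation matrix matches the value returned by hprice.corr(m2).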

Handling missing data in pandas

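# Handling NaN data
# <1> Dropping
import pandas as pd
import numpy as np
# Pass the labels directly: columns=[np.arange(5)] would build a one-level MultiIndex instead of a plain integer index
df = pd.DataFrame(np.random.randn(4, 5) + 10,
                  index=['a', 'b', 'c', 'd'], columns=np.arange(5))

# .loc  label-based indexing ('a', 'b', ...)
# .iloc position-based indexing (0, 1, 2, ...)
# .ix   mixed label/position indexing (removed in pandas 1.0; .loc is used below instead)

df.loc['c', 3] = np.nan
df.loc['c', 2] = np.nan
df.iloc[3, 4] = np.nan
print(df)
print('_____________________________________')
# how is 'any' or 'all': 'any' drops a row (column) as soon as it contains one NaN,
# 'all' drops it only when every value in it is NaN
print(df.dropna(axis=0, how='any'))
print(df)  # dropna returns a new object; df itself is unchanged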

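# <2> Replacing NaN with a value
import pandas as pd
import numpy as np
# Pass the labels directly: columns=[np.arange(5)] would build a one-level MultiIndex instead of a plain integer index
df = pd.DataFrame(np.random.randn(4, 5) + 10,
                  index=['a', 'b', 'c', 'd'], columns=np.arange(5))

# .loc  label-based indexing ('a', 'b', ...)
# .iloc position-based indexing (0, 1, 2, ...)
# .ix   mixed label/position indexing (removed in pandas 1.0; .loc is used below instead)

df.loc['c', 3] = np.nan
df.loc['c', 2] = np.nan
df.iloc[3, 4] = np.nan
print(df)
print('_____________________________________')
print(df.fillna(value=0))  # replace every NaN with 0
print(df)  # fillna returns a new object; df itself is unchanged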

# Check whether the table has any missing values (useful when it is too large to inspect by eye)
print(df.isnull().values.any())  # True if at least one value is missing
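The difference between how='any' and how='all' is hard to see on random data, so here is a small deterministic sketch. The frame and the choice of filling with column means are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [np.nan, np.nan, 6.0]})

any_dropped = df.dropna(how="any")  # drops rows 0 and 1 (each has at least one NaN)
all_dropped = df.dropna(how="all")  # drops only row 1 (every value is NaN)
filled = df.fillna(df.mean())       # fill each column with its own mean
nan_counts = df.isnull().sum()      # number of NaN values per column

print(any_dropped)
print(all_dropped)
print(filled)
print(nan_counts)
```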

Merging data

"""
Concatenating with concat
"""
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
print(df1)
print(df2)
print(df3)
# axis=0 stacks vertically (row-wise); axis=1 joins horizontally (column-wise)
# ignore_index=True discards the original index and renumbers from 0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(res)

# join: 'inner' or 'outer'
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df1)
print(df2)
# The default is join='outer' (missing parts are filled with NaN); join='inner' keeps only the shared columns
res = pd.concat([df1, df2], join='inner')  # ignore_index=True can also be passed to renumber the index
print(res)

# join_axes parameter (removed from pd.concat in pandas 1.0; reindex the result instead)

"""
The append method, similar to concat
"""
# To be continued...

"""
Merging with merge
"""
# To be continued...