Table of Contents
Preface
Python Basics 8
This section covers the essentials of pandas.
1. What is pandas?
pandas is a data-analysis toolkit built on top of NumPy: it provides the functions and methods needed to process data quickly and conveniently, and it incorporates idioms from many libraries and standard data models.
2. The Series object
A Series is the most basic pandas object. It is similar to a one-dimensional array and much like a NumPy array, with one key difference: a Series can carry custom labels for its data, called the index, and values can then be accessed through those labels.
2.1 Common Series operations
Creation:
#imports
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
#Create a Series without specifying an index
sel=Series([1,2,3,4])
print(sel)
#Usually you supply your own index
sel=Series(data=[1,2,3,4],index=["a","b","c","d"])
sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)
Output:
0 1
1 2
2 3
3 4
dtype: int64
a 1
b 2
c 3
d 4
dtype: int64
Accessing data:
sel=Series(data=[1,2,3,4],index=["a","b","c","d"])
sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)
#Get the values
print(sel.values)
#Get the index
print(sel.index)
#Get (index, value) pairs
print(list(sel.items()))#items() replaced iteritems(), which was removed in pandas 2.0
Output:
a 1
b 2
c 3
d 4
dtype: int64
[1 2 3 4]
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 1), ('b', 2), ('c', 3), ('d', 4)]
Converting a dict to a Series:
color_dict={"red":100,"black":400,"green":300,"pink":900}#renamed to avoid shadowing the built-in dict
sel=Series(color_dict)
print(sel)
Output:
red 100
black 400
green 300
pink 900
dtype: int64
Retrieving data from a Series:
sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)
print("=========")
#A Series supports both positional and label-based access
print("label index",sel["c"])
print("=========")
print("position index",sel[2])
print("=========")
#Fancy indexing retrieves non-contiguous data
print("label index",sel[["a","c"]])
print("=========")
print("position index",sel[[1,3]])
print("=========")
#Slices work as well; note that a label slice includes its endpoint
print("position slice",sel[1:3])
print("=========")
print("label slice",sel["b":"d"])
Output:
a 1
b 2
c 3
d 4
dtype: int64
=========
label index 3
=========
position index 3
=========
label index a 1
c 3
dtype: int64
=========
position index b 2
d 4
dtype: int64
=========
position slice b 2
c 3
dtype: int64
=========
label slice b 2
c 3
d 4
dtype: int64
Reassigning the index:
sel.index=list("dcba")
print(sel)
#reindex returns a new Series, reordered to the new labels; labels that are not found are filled with NaN
print(sel.reindex(["b","a","c","d","e"]))
Output:
d 1
c 2
b 3
a 4
dtype: int64
b 3.0
a 4.0
c 2.0
d 1.0
e NaN
dtype: float64
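If NaN is not the filler you want, reindex also accepts a fill_value parameter. A minimal sketch (reusing the labels from the example above):

```python
import pandas as pd

sel = pd.Series([1, 2, 3, 4], index=list("abcd"))
# fill_value fills labels that were missing, instead of the default NaN
filled = sel.reindex(["b", "a", "c", "d", "e"], fill_value=0)
print(filled)
```

Because no NaN is introduced, the result keeps its integer dtype.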
Dropping items on a given axis with drop:
sel=pd.Series(range(10,15))
print(sel)
print(sel.drop([2,3]))
Output:
0 10
1 11
2 12
3 13
4 14
dtype: int64
0 10
1 11
4 14
dtype: int64
2.2 Arithmetic on Series
Arithmetic on Series is aligned on the index. Two Series can be combined with +, -, *, and /; pandas matches values by index label and stores the result as floats to avoid losing precision. Wherever a label appears in only one of the two Series, the result at that label is NaN.
#Arithmetic on Series
series1=pd.Series([1,2,3,4],["London","Hongkong","Humbai","lagos"])
series2=pd.Series([1,3,6,4],["London","Accra","lagos","Delhi"])
print(series1-series2)
print("===============")
print(series1+series2)
print("===============")
print(series1*series2)
print("===============")
#NumPy-style array operations are also supported
sel=Series(data=[1,6,3,5],index=list("abcd"))
print(sel[sel>3])#boolean filtering
print("===============")
print(sel*2)#scalar multiplication
Output:
Accra NaN
Delhi NaN
Hongkong NaN
Humbai NaN
London 0.0
lagos -2.0
dtype: float64
===============
Accra NaN
Delhi NaN
Hongkong NaN
Humbai NaN
London 2.0
lagos 10.0
dtype: float64
===============
Accra NaN
Delhi NaN
Hongkong NaN
Humbai NaN
London 1.0
lagos 24.0
dtype: float64
===============
b 6
d 5
dtype: int64
===============
a 2
b 12
c 6
d 10
dtype: int64
3. DataFrame
A DataFrame is a two-dimensional data structure: data stored as a table, divided into rows and columns. Common operations include selecting and replacing row or column data, reshaping the table, modifying the index, and multi-condition filtering. A DataFrame can be thought of as a collection of Series that share a common index. The DataFrame constructor converts data in many formats into a DataFrame object; its three parameters data, index, and columns supply the data, the row index, and the column labels.
3.1 Creating a DataFrame
From a two-dimensional array
#From a two-dimensional array
df1=DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=["a","b","c","d"])
print(df1)
Output:
a b c d
1 3 7 9 2
2 7 3 1 6
3 2 1 6 0
4 2 1 0 6
From a dict (the column labels come from the dict keys; the row index is supplied separately or defaults to 0..n-1)
dict={
"province":["Guangdong","Beijing","Qinghai","Fujian"],
"pop":[1.3,2.5,1.1,0.7] ,"year":[2018,2018,2018,2018]}
df2=pd.DataFrame(dict,index=[1,2,3,4])
print(df2)
Output:
province pop year
1 Guangdong 1.3 2018
2 Beijing 2.5 2018
3 Qinghai 1.1 2018
4 Fujian 0.7 2018
Creating with from_dict
dict2={"a":[1,2,3],"b":[4,5,6]}
df6=pd.DataFrame.from_dict(dict2)
print(df6)
Output:
a b
0 1 4
1 2 5
2 3 6
#When the Series have different indexes, the DataFrame aligns on their union; missing entries are filled with NaN
data={
"name":pd.Series(["zs","ls","we"],index=["a","b","c"]),
"Age":pd.Series(["10","20","30","40"],index=["a","b","c","d"]),
"country":pd.Series(["中国","日本","韩国"],index=["a","c","b"])}
df=pd.DataFrame(data)
print(df)
print("=====================")
#to_dict() converts the DataFrame into a dict
d=df.to_dict()
print(d)
Output:
name Age country
a zs 10 中国
b ls 20 韩国
c we 30 日本
d NaN 40 NaN
=====================
{'name': {'a': 'zs', 'b': 'ls', 'c': 'we', 'd': nan}, 'Age': {'a': '10', 'b': '20', 'c': '30', 'd': '40'}, 'country': {'a': '中国', 'b': '韩国', 'c': '日本', 'd': nan}}
3.2 Common DataFrame attributes
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
#Common DataFrame attributes
df_dict={
"name":["James","Curry","Iversion"],
"age":["18","20","19"],
"national":["us","china","us"]}
df=pd.DataFrame(data=df_dict,index=["0","1","2"])
print(df)
print("=====================")
#Number of rows and columns
print(df.shape)
print("=====================")
#Row index as a list
print(df.index.tolist())
print("=====================")
#Column labels as a list
print(df.columns.tolist())
print("=====================")
#dtype of each column
print(df.dtypes)
print("=====================")
#values returns the data as a 2-D ndarray
print(df.values)
print("=====================")
#Summary of the DataFrame
print(df.info())
print("=====================")
#First n rows, default 5
print(df.head(2))
print("=====================")
#Last n rows
print(df.tail(1))
print("=====================")
#Select a column
print(df["name"])
print("=====================")
#Selecting several columns returns a DataFrame:
print(df[["name","age"]])
print(type(df[["name","age"]]))
print("=====================")
#Select one row
print(df[0:1])
print("=====================")
#Select several rows
print(df[1:3])
print("=====================")
#A column subset of several rows
print(df[1:3][["name","age"]])
#Note: df[] can select rows or columns, but not both at once
print("=====================")
Output:
name age national
0 James 18 us
1 Curry 20 china
2 Iversion 19 us
=====================
(3, 3)
=====================
['0', '1', '2']
=====================
['name', 'age', 'national']
=====================
name object
age object
national object
dtype: object
=====================
[['James' '18' 'us']
['Curry' '20' 'china']
['Iversion' '19' 'us']]
=====================
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 0 to 2
Data columns (total 3 columns):
name 3 non-null object
age 3 non-null object
national 3 non-null object
dtypes: object(3)
memory usage: 96.0+ bytes
None
=====================
name age national
0 James 18 us
1 Curry 20 china
=====================
name age national
2 Iversion 19 us
=====================
0 James
1 Curry
2 Iversion
Name: name, dtype: object
=====================
name age
0 James 18
1 Curry 20
2 Iversion 19
<class 'pandas.core.frame.DataFrame'>
=====================
name age national
0 James 18 us
=====================
name age national
1 Curry 20 china
2 Iversion 19 us
=====================
name age
1 Curry 20
2 Iversion 19
=====================
Continuing:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
#DataFrame row access
df_dict={
"name":["James","Curry","Iversion"],
"age":["18","20","19"],
"national":["us","china","us"]}
df=pd.DataFrame(data=df_dict,index=["0","1","2"])
#Label-based row access: df.loc
#Position-based row access: df.iloc
#A single cell
print(df.loc["0","name"])
#One row, all columns
print(df.loc["0",:])
print("=====================")
#One row, several columns
print(df.loc["0",["name","age"]])
print("=====================")
#Non-adjacent rows and columns
print(df.loc[["0","2"],["name","national"]])
print("=====================")
#One row by position
print(df.iloc[1])
print("=====================")
#Several rows by position
print(df.iloc[[0,2],:])
#One column
print("=====================")
print(df.iloc[:,1])
print("=====================")
#A single value
print(df.iloc[1,0])
print("=====================")
#Modify a value
df.iloc[0,0]="panda"
print(df)
print("=====================")
#Sorting a DataFrame
df=df.sort_values(by="age",ascending=False)#ascending=False sorts descending; the default is ascending
print(df)
Output:
James
name James
age 18
national us
Name: 0, dtype: object
=====================
name James
age 18
Name: 0, dtype: object
=====================
name national
0 James us
2 Iversion us
=====================
name Curry
age 20
national china
Name: 1, dtype: object
=====================
name age national
0 James 18 us
2 Iversion 19 us
=====================
0 18
1 20
2 19
Name: age, dtype: object
=====================
Curry
=====================
name age national
0 panda 18 us
1 Curry 20 china
2 Iversion 19 us
=====================
name age national
1 Curry 20 china
2 Iversion 19 us
0 panda 18 us
3.3 Modifying a DataFrame's index and columns
#Modifying a DataFrame's index and columns
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
df1=pd.DataFrame(np.arange(9).reshape(3,3),index=["bj","sh","gz"],columns=["a","b","c"])
print(df1)
#Modify the index of df1
print(df1.index)#print the current index, which can then be reassigned
print("=====================")
df1.index=["beijing","shanghai","guangzhou"]
print(df1)
#A custom map function (x is the original row or column label)
def test_map(x):
    return x+"_ABC"
print(df1.rename(index=test_map,columns=test_map,inplace=True))
print("=====================")
print(df1)
#inplace is a boolean, default False, controlling whether a new DataFrame is returned; when True the original DataFrame is modified in place and the return value is None.
#rename can also take a dict to rename individual labels.
Output:
a b c
bj 0 1 2
sh 3 4 5
gz 6 7 8
Index(['bj', 'sh', 'gz'], dtype='object')
=====================
a b c
beijing 0 1 2
shanghai 3 4 5
guangzhou 6 7 8
None
=====================
a_ABC b_ABC c_ABC
beijing_ABC 0 1 2
shanghai_ABC 3 4 5
guangzhou_ABC 6 7 8
Another example:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
df1=pd.DataFrame(np.arange(9).reshape(3,3),index=["bj","sh","gz"],columns=["a","b","c"])
print(df1)
print("=====================")
#rename can also take a dict to rename individual labels
df3=df1.rename(index={"bj":"beijing"},columns={"a":"aa"})
print(df3)
Output:
a b c
bj 0 1 2
sh 3 4 5
gz 6 7 8
=====================
aa b c
beijing 0 1 2
sh 3 4 5
gz 6 7 8
And:
#Turning a column into the index
df1=pd.DataFrame({"x":range(5),"Y":range(5),"s":list("abcde"),"z":[1,1,2,2,2]})
print(df1)
#Make one column the index; drop=False keeps that column in the frame as well
result=df1.set_index("s",drop=False)
result.index.name=None
print(result)
#Note: reset_index does the reverse, moving the index back into a column
Output:
x Y s z
0 0 0 a 1
1 1 1 b 1
2 2 2 c 2
3 3 3 d 2
4 4 4 e 2
x Y s z
a 0 0 a 1
b 1 1 b 1
c 2 2 c 2
d 3 3 d 2
e 4 4 e 2
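As noted above, reset_index is the inverse of set_index. A quick sketch (reusing a smaller version of the frame above):

```python
import pandas as pd

df1 = pd.DataFrame({"x": range(5), "s": list("abcde")})
indexed = df1.set_index("s")      # "s" becomes the index
restored = indexed.reset_index()  # reset_index turns the index back into a column
print(restored)
```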
And:
#Using a row as the column labels
result=df1.set_axis(df1.iloc[0],axis=1)#the inplace parameter of set_axis was removed in pandas 2.0; it now always returns a copy
result.columns.name=None
print(result)
Output:
0 0 a 1
0 0 0 a 1
1 1 1 b 1
2 2 2 c 2
3 3 3 d 2
4 4 4 e 2
3.4 Adding data
#Adding a column
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
df1["score"]=[80,98,67,90]#append a score column at the end; the list must supply one value per row
print(df1)
Output:
name gender age score
0 snow M 22 80
1 Tyrion M 32 98
2 sansa F 18 67
3 Arya F 14 90
Inserting data:
#The original DataFrame
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
df1["score"]=[80,98,67,90]#append a score column at the end; the list must supply one value per row
print(df1)
print("===================================")
#Insert a column into the original DataFrame
col_name=df1.columns.tolist()#take all column names as a list
col_name.insert(2,"city")#insert a column named city at position 2; it stays all NaN until assigned
df1=df1.reindex(columns=col_name)#DataFrame.reindex conforms the frame to the new row and column labels
print(df1)
Output:
name gender age score
0 snow M 22 80
1 Tyrion M 32 98
2 sansa F 18 67
3 Arya F 14 90
===================================
name gender city age score
0 snow M NaN 22 80
1 Tyrion M NaN 32 98
2 sansa F NaN 18 67
3 Arya F NaN 14 90
Reassigning the city column:
df1["city"]=["北京","山西","湖北","澳门"]
print(df1)
Inserting a column with DataFrame.insert:
"""
DataFrame.insert inserts a single column:
df.insert(loc,column,value)
loc: the position at which to insert
column: the column name
value: the values
"""
df1.insert(2,"scoress",[80,98,67,90])
print(df1)
Output:
name gender scoress city age score
0 snow M 80 北京 22 80
1 Tyrion M 98 山西 32 98
2 sansa F 67 湖北 18 67
3 Arya F 90 澳门 14 90
Inserting a row (assigning to an existing position overwrites it; a true insert takes the split-and-concatenate approach):
row=["111","222","333","444","555","666"]#one value per column; df1 now has six columns
df1.iloc[1]=row#the original df1.iloc:[1]=row is invalid syntax, which is why it raised an error
print(df1)
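The split-and-concatenate approach mentioned above does insert a row without overwriting anything: slice the frame around the target position and concatenate the pieces with the new row between them. A sketch, reusing the columns from the example above:

```python
import pandas as pd

df1 = pd.DataFrame([["snow", "M", 22], ["Tyrion", "M", 32],
                    ["sansa", "F", 18], ["Arya", "F", 14]],
                   columns=["name", "gender", "age"])
new_row = pd.DataFrame([["lisa", "F", 19]], columns=["name", "gender", "age"])
# Insert before position 1: rows [0:1] + the new row + rows [1:]
df1 = pd.concat([df1.iloc[:1], new_row, df1.iloc[1:]], ignore_index=True)
print(df1)
```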
Appending a row at the end with append:
#Add a row
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
new=pd.DataFrame({"name":"lisa",
"gender":"F",
"age":"19"
},index=[0])
print(new)
print("======================")
##Append a new row at the end of the DataFrame with the append method
df1=df1.append(new,ignore_index=True)#ignore_index=True discards the original index and renumbers from 0
print(df1)
Output:
name gender age
0 lisa F 19
======================
name gender age
0 snow M 22
1 Tyrion M 32
2 sansa F 18
3 Arya F 14
4 lisa F 19
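Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same row addition is written with pd.concat:

```python
import pandas as pd

df1 = pd.DataFrame([["snow", "M", 22], ["Tyrion", "M", 32]],
                   columns=["name", "gender", "age"])
new = pd.DataFrame({"name": "lisa", "gender": "F", "age": 19}, index=[0])
# pd.concat([...], ignore_index=True) is the modern equivalent of
# df1.append(new, ignore_index=True)
df1 = pd.concat([df1, new], ignore_index=True)
print(df1)
```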
Concatenation with concat:
objs: the objects to concatenate
axis: the axis to concatenate along (0, the default, stacks frames vertically; 1 places them side by side)
ignore_index: whether to ignore the original index
#Concatenation
df1=pd.DataFrame(np.arange(6).reshape(3,2),columns=["four","five"])
df2=pd.DataFrame(np.arange(6).reshape(2,3),columns=["one","two","three"])
print(df1)
print("===================")
print(df2)
Output:
four five
0 0 1
1 2 3
2 4 5
===================
one two three
0 0 1 2
1 3 4 5
Concatenating side by side (axis=1):
result=pd.concat([df1,df2],axis=1)
print(result)
Output:
four five one two three
0 0 1 0.0 1.0 2.0
1 2 3 3.0 4.0 5.0
2 4 5 NaN NaN NaN
Stacking vertically (axis=0):
result=pd.concat([df1,df2],axis=0,ignore_index=True)
print(result)
Output:
five four one three two
0 1.0 0.0 NaN NaN NaN
1 3.0 2.0 NaN NaN NaN
2 5.0 4.0 NaN NaN NaN
3 NaN NaN 0.0 2.0 1.0
4 NaN NaN 3.0 5.0 4.0
Deleting from a DataFrame
df2=pd.DataFrame(np.arange(9).reshape(3,3),columns=["one","two","three"])
print(df2)
df3=df2.drop(["one"],axis=1,inplace=True)
print(df2)
print(df3)
#labels: the labels to drop;
#axis: 0 drops rows (the default), 1 drops columns;
#inplace: whether to modify df in place (when True, the return value is None);
Output:
one two three
0 0 1 2
1 3 4 5
2 6 7 8
two three
0 1 2
1 4 5
2 7 8
None
4. Data cleaning
Dropping missing data with dropna():
from numpy import nan as NaN
#dropna() filters out missing data
se=pd.Series([4,NaN,8,NaN,5])
print(se)
print(se.dropna())
print(se.notnull())
print(se.isnull())
Output:
0 4.0
1 NaN
2 8.0
3 NaN
4 5.0
dtype: float64
0 4.0
2 8.0
4 5.0
dtype: float64
0 True
1 False
2 True
3 False
4 True
dtype: bool
0 False
1 True
2 False
3 True
4 False
dtype: bool
A boolean mask filters just as well:
print(se[se.notnull()])
Output:
0 4.0
2 8.0
4 5.0
dtype: float64
Handling DataFrame objects
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================")
#By default, drop every row that contains a NaN:
print(df1.dropna())
print("================")
#Pass how="all" to drop only rows that are entirely NaN:
print(df1.dropna(how="all"))#the default, how="any", drops a row as soon as it contains any NaN
print("================")
#Pass axis=1 to drop columns instead:
print(df1.dropna(axis=1,how="all"))
print("================")
#Pass thresh=n to keep rows with at least n non-NaN values:
print(df1.dropna(thresh=1))
Output:
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
================
0 1 2
0 1.0 2.0 3.0
================
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
3 8.0 8.0 NaN
================
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
================
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
3 8.0 8.0 NaN
Filling missing data:
#Filling missing data
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
#Fill with a constant using fillna
print(df1.fillna(0))
print("================2")
#Pass inplace=True to modify the original object directly
df1.fillna(0,inplace=True)
print(df1)
print("================3")
Output:
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
================1
0 1 2
0 1.0 2.0 3.0
1 0.0 0.0 2.0
2 0.0 0.0 0.0
3 8.0 8.0 0.0
================2
0 1 2
0 1.0 2.0 3.0
1 0.0 0.0 2.0
2 0.0 0.0 0.0
3 8.0 8.0 0.0
================3
#Filling missing data:
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
#Fill each column with a different constant via a dict
print(df1.fillna({0:10,1:20,2:30}))
print("================4")
#Fill with the column means
print(df1.fillna(df1.mean()))
Output:
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
================1
0 1 2
0 1.0 2.0 3.0
1 10.0 20.0 2.0
2 10.0 20.0 30.0
3 8.0 8.0 30.0
================4
0 1 2
0 1.0 2.0 3.0
1 4.5 5.0 2.0
2 4.5 5.0 2.5
3 8.0 8.0 2.5
Filling a single column:
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
#Filling only one column
print(df1.iloc[:,1].fillna(5,inplace=True))#prints None; note that recent pandas may not propagate this chained call back to df1, so prefer df1[1]=df1[1].fillna(5)
print(df1)
Output:
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
================1
None
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 2.0
2 NaN 5.0 NaN
3 8.0 8.0 NaN
Forward and backward filling:
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("====================1")
#Pass method= to change the fill strategy
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
Output:
0 1 2
0 1.0 2.0 3.0
1 NaN NaN 2.0
2 NaN NaN NaN
3 8.0 8.0 NaN
====================1
0 1 2 3 4
0 1 7 5 1.0 2.0
1 1 2 0 NaN 5.0
2 0 1 6 NaN NaN
3 8 8 2 NaN NaN
4 7 7 9 9.0 2.0
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("=====================")
#ffill propagates the previous value forward; bfill fills with the next value
print(df2.fillna(method="ffill"))
print("=====================")
Output:
0 1 2 3 4
0 0 0 6 6.0 2.0
1 4 1 5 NaN 2.0
2 1 0 1 NaN NaN
3 0 6 5 NaN NaN
4 4 6 8 0.0 6.0
=====================
0 1 2 3 4
0 0 0 6 6.0 2.0
1 4 1 5 6.0 2.0
2 1 0 1 6.0 2.0
3 0 6 5 6.0 2.0
4 4 6 8 0.0 6.0
=====================
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("=====================")
#Pass limit= to cap how many consecutive NaNs are filled
print(df2.fillna(method="bfill",limit=1))
Output:
0 1 2 3 4
0 2 8 8 4.0 4.0
1 0 8 0 NaN 1.0
2 0 1 7 NaN NaN
3 1 6 4 NaN NaN
4 7 0 4 3.0 4.0
=====================
0 1 2 3 4
0 2 8 8 4.0 4.0
1 0 8 0 NaN 1.0
2 0 1 7 NaN NaN
3 1 6 4 3.0 4.0
4 7 0 4 3.0 4.0
#Pass axis= to change the fill direction
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("===========================")
print(df2.fillna(method="ffill",limit=1,axis=1))
Output:
0 1 2 3 4
0 8 6 7 2.0 1.0
1 8 8 2 NaN 0.0
2 7 3 8 NaN NaN
3 6 8 0 NaN NaN
4 3 9 9 4.0 5.0
===========================
0 1 2 3 4
0 8.0 6.0 7.0 2.0 1.0
1 8.0 8.0 2.0 2.0 0.0
2 7.0 3.0 8.0 8.0 NaN
3 6.0 8.0 0.0 0.0 NaN
4 3.0 9.0 9.0 4.0 5.0
Removing duplicate rows:
df1=pd.DataFrame({"A":[1,1,1,2,2,3,1],"B":list("aabbbca")})
print(df1)
print("==========")
#Rows are often duplicated in a DataFrame; duplicated() reports, for each row, whether it duplicates an earlier one (True means duplicate)
print(df1.duplicated())
Output:
A B
0 1 a
1 1 a
2 1 b
3 2 b
4 2 b
5 3 c
6 1 a
==========
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
Dropping all duplicate rows:
print(df1.drop_duplicates())
Output:
A B
0 1 a
2 1 b
3 2 b
5 3 c
Dropping duplicates based on specific columns:
print(df1.drop_duplicates(["A"]))
Output:
A B
0 1 a
3 2 b
5 3 c
Keeping the last of each set of duplicates:
print(df1.drop_duplicates(["A"],keep="last"))
Output:
A B
4 2 b
5 3 c
6 1 a
Dropping duplicates while modifying the DataFrame in place:
df1.drop_duplicates(["A","B"],inplace=True)
print(df1)
Output:
A B
0 1 a
2 1 b
3 2 b
5 3 c
5. Combining data
#Combining data: a simple join
import pandas as pd
df3=pd.DataFrame({"Red":[1,3,5],"Green":[5,0,3]},index=list("abc"))
df4=pd.DataFrame({"Blue":[1,9,8],"Yellow":[6,6,7]},index=list("cde"))
print(df3)
print(df4)
A simple join (the default is a left join, based on the left frame df3)
df3.join(df4,how="left")
Output:
Red Green
a 1 5
b 3 0
c 5 3
Blue Yellow
c 1 6
d 9 6
e 8 7
Red Green Blue Yellow
a 1 5 NaN NaN
b 3 0 NaN NaN
c 5 3 1.0 6.0
Right join:
#right join
df3.join(df4,how="right")
Output:
Red Green
a 1 5
b 3 0
c 5 3
Blue Yellow
c 1 6
d 9 6
e 8 7
Red Green Blue Yellow
c 5.0 3.0 1 6
d NaN NaN 9 6
e NaN NaN 8 7
Outer join:
#outer join
df3.join(df4,how="outer")
Output:
Red Green Blue Yellow
a 1.0 5.0 NaN NaN
b 3.0 0.0 NaN NaN
c 5.0 3.0 1.0 6.0
d NaN NaN 9.0 6.0
e NaN NaN 8.0 7.0
#Joining multiple DataFrame objects
df5=pd.DataFrame({"Brown":[3,4,5],"White":[1,1,2]},index=list("aed"))
print(df5)
df3.join([df4,df5])
Output:
Brown White
a 3 1
e 4 1
d 5 2
Red Green Blue Yellow Brown White
a 1 5 NaN NaN 3.0 1.0
b 3 0 NaN NaN NaN NaN
c 5 3 1.0 6.0 NaN NaN
Merging with merge:
#merge focuses on combining on column values
df1=pd.DataFrame({"名字":list("ABCDE"),"性别":["男","女","男","男","女"],
"职称":["副教授","讲师","助教","教授","助教"]},
index=range(1001,1006))
df1.columns.name="学院老师"
df1.index.name="编号"
print(df1)
print("==========================")
df2=pd.DataFrame({"名字":list("ABDAX"),"课程":["C++","计算机导论","汇编","数据结构","马克思原理"],
"职称":["副教授","讲师","教授","副教授","讲师"]},index=[1001,1002,1004,1001,3001])
df2.columns.name="课程"
df2.index.name="编号"
print(df2)
print("==========================")
#By default merge joins on the columns the two frames have in common, with how="inner"
print(pd.merge(df1,df2))
print("==========================")
Output:
学院老师 名字 性别 职称
编号
1001 A 男 副教授
1002 B 女 讲师
1003 C 男 助教
1004 D 男 教授
1005 E 女 助教
==========================
课程 名字 课程 职称
编号
1001 A C++ 副教授
1002 B 计算机导论 讲师
1004 D 汇编 教授
1001 A 数据结构 副教授
3001 X 马克思原理 讲师
==========================
名字 性别 职称 课程
0 A 男 副教授 C++
1 A 男 副教授 数据结构
2 B 女 讲师 计算机导论
3 D 男 教授 汇编
==========================
Merging on a named column:
pd.merge(df1,df2,on="名字",suffixes=["_1","_2"])#returns the matching rows; suffixes disambiguates overlapping column names
Output:
名字 性别 职称_1 课程 职称_2
0 A 男 副教授 C++ 副教授
1 A 男 副教授 数据结构 副教授
2 B 女 讲师 计算机导论 讲师
3 D 男 教授 汇编 教授
Left join, based on the left frame:
pd.merge(df1,df2,how="left")
Output:
名字 性别 职称 课程
0 A 男 副教授 C++
1 A 男 副教授 数据结构
2 B 女 讲师 计算机导论
3 C 男 助教 NaN
4 D 男 教授 汇编
5 E 女 助教 NaN
Right join, based on the right frame:
pd.merge(df1,df2,how="right")
Output:
名字 性别 职称 课程
0 A 男 副教授 C++
1 A 男 副教授 数据结构
2 B 女 讲师 计算机导论
3 D 男 教授 汇编
4 X NaN 讲师 马克思原理
Outer join, keeping all rows:
pd.merge(df1,df2,how="outer")
Output:
名字 性别 职称 课程
0 A 男 副教授 C++
1 A 男 副教授 数据结构
2 B 女 讲师 计算机导论
3 C 男 助教 NaN
4 D 男 教授 汇编
5 E 女 助教 NaN
6 X NaN 讲师 马克思原理
Merging on multiple keys:
pd.merge(df1,df2,on=["职称","名字"])
Output:
名字 性别 职称 课程
0 A 男 副教授 C++
1 A 男 副教授 数据结构
2 B 女 讲师 计算机导论
3 D 男 教授 汇编
Extension:
#Axis-wise concatenation: concat
#Concatenating Series objects
s1=pd.Series([1,2],index=list("ab"))
s2=pd.Series([3,4,5],index=list("bde"))
print(s1)
print(s2)
pd.concat([s1,s2])
Output:
a 1
b 2
dtype: int64
b 3
d 4
e 5
dtype: int64
a 1
b 2
b 3
d 4
e 5
dtype: int64
Horizontal concatenation:
pd.concat([s1,s2],axis=1)
Output:
0 1
a 1.0 NaN
b 2.0 3.0
d NaN 4.0
e NaN 5.0
Using join="inner" keeps only the intersection (concat accepts join="inner" and join="outer"):
pd.concat([s1,s2],axis=1,join="inner")
Output:
0 1
b 2 3
Concatenating against a specified index:
pd.concat([s1,s2],axis=1,join_axes=[list("abc")])#join_axes only exists in older pandas; it was removed in pandas 1.0
Output:
0 1
a 1.0 NaN
b 2.0 3.0
c NaN NaN
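Since join_axes is gone from recent pandas, the same result comes from concatenating and then reindexing. A sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=list("ab"))
s2 = pd.Series([3, 4, 5], index=list("bde"))
# Equivalent of the old join_axes=[list("abc")]: concat, then conform
# the result to the desired index
result = pd.concat([s1, s2], axis=1).reindex(list("abc"))
print(result)
```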
Creating a hierarchical index with keys:
pd.concat([s1,s2],keys=["A","B"])
Output:
A a 1
b 2
B b 3
d 4
e 5
dtype: int64
#With axis=1, the keys become the column names
pd.concat([s1,s2],keys=["A","D"],axis=1)
Output:
A D
a 1.0 NaN
b 2.0 3.0
d NaN 4.0
e NaN 5.0
#Concatenating DataFrame objects
df3=pd.DataFrame({"Red":[1,3,5],"Green":[5,0,3]},index=list("abd"))
df4=pd.DataFrame({"Blue":[1,9],"Yellow":[6,6]},index=list("ce"))
print(df3)
print(df4)
pd.concat([df3,df4])
pd.concat([df3,df4],axis=1,keys=["A","B"])
Output:
Red Green
a 1 5
b 3 0
d 5 3
Blue Yellow
c 1 6
e 9 6
A B
Red Green Blue Yellow
a 1.0 5.0 NaN NaN
b 3.0 0.0 NaN NaN
c NaN NaN 1.0 6.0
d 5.0 3.0 NaN NaN
e NaN NaN 9.0 6.0
A dict builds the same hierarchical columns:
pd.concat({"A":df3,"B":df4},axis=1)
Output:
A B
Red Green Blue Yellow
a 1.0 5.0 NaN NaN
b 3.0 0.0 NaN NaN
c NaN NaN 1.0 6.0
d 5.0 3.0 NaN NaN
e NaN NaN 9.0 6.0
7. Time series
7.1 Generating a date range
#Time series
import pandas as pd
import numpy as np
date=pd.date_range(start="20190501",end="20190530")
print(date)
Output:
DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04','2019-05-05', '2019-05-06', '2019-05-07', '2019-05-08', '2019-05-09', '2019-05-10', '2019-05-11', '2019-05-12', '2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16','2019-05-17', '2019-05-18', '2019-05-19', '2019-05-20','2019-05-21', '2019-05-22', '2019-05-23', '2019-05-24', '2019-05-25', '2019-05-26', '2019-05-27', '2019-05-28', '2019-05-29', '2019-05-30'],dtype='datetime64[ns]', freq='D')
date_range generates a fixed-frequency time index. At least two of the parameters start, end, and periods must be given (freq defaults to "D"), otherwise it raises an error.
Parameters:
freq: the date offset, a string, default "D", e.g. freq="1h30min", freq="10D";
periods: the number of periods to generate, an integer or None;
date=pd.date_range(start="20190501",periods=10,freq="10D")
print(date)
#The closed parameter controls whether the endpoints are included: closed=None (the default) includes both, "left" includes the start but not the end, and "right" is the opposite.
date=pd.date_range(start="2019-01-09",end="2019-01-14",closed="left")
print(date)
Output:
DatetimeIndex(['2019-05-01', '2019-05-11', '2019-05-21', '2019-05-31',
'2019-06-10', '2019-06-20', '2019-06-30', '2019-07-10',
'2019-07-20', '2019-07-30'],
dtype='datetime64[ns]', freq='10D')
DatetimeIndex(['2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12',
'2019-01-13'],
dtype='datetime64[ns]', freq='D')
Frequency aliases:
D | every calendar day |
---|---|
B | every business day |
H | every hour |
T or min | every minute |
S | every second |
L or ms | every millisecond |
U | every microsecond |
M | last calendar day of the month |
BM | last business day of the month |
MS | first calendar day of the month |
BMS | first business day of the month |
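A couple of the aliases above in action, as a minimal sketch:

```python
import pandas as pd

# B skips weekends: 2019-05-04 and 2019-05-05 fall on Sat/Sun
b = pd.date_range("2019-05-01", periods=5, freq="B")
print(b)
# MS lands on the first calendar day of each month
ms = pd.date_range("2019-01-01", periods=3, freq="MS")
print(ms)
```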
7.2 Time series as a DataFrame index
Dates can serve as the index:
#Dates can serve as the index
index=pd.date_range(start="20190101",periods=10)
df=pd.Series(np.random.randint(0,10,size=10),index=index)
print(df)
Output:
2019-01-01 5
2019-01-02 9
2019-01-03 6
2019-01-04 1
2019-01-05 8
2019-01-06 1
2019-01-07 1
2019-01-08 5
2019-01-09 3
2019-01-10 8
Freq: D, dtype: int32
truncate filters out all values before the date given by before and all values after the date given by after.
after=df.truncate(after="2019-01-08")
print(after)
Output:
2019-01-01 5
2019-01-02 9
2019-01-03 6
2019-01-04 1
2019-01-05 8
2019-01-06 1
2019-01-07 1
2019-01-08 5
long_ts=pd.Series(np.random.randn(1000),index=pd.date_range("1/1/2019",periods=1000))
#Select by year
result=long_ts["2020"]
print(result)
#Slice with date strings:
result=long_ts["2020-05-01":"2020-05-06"]
print(result)
#between_time returns the rows whose time of day falls within the given window
index=pd.date_range("2018-03-17","2018-03-30",freq="2H")
ts=pd.Series(np.random.randn(157),index=index)
print(ts.between_time("7:00","17:00"))
These operations also work on a DataFrame:
index=pd.date_range("1/1/2019",periods=100)
df=pd.DataFrame(np.random.randn(100,4),index=index)
print(df.loc["2019-04"])
Output:
2019-04-01 0.533238 -1.437635 0.278330 -1.578619
2019-04-02 0.564459 0.576964 0.307198 -0.530719
2019-04-03 0.048564 -1.496561 0.839416 0.341760
2019-04-04 0.090401 -1.947177 -0.085777 -0.222991
2019-04-05 1.995785 -0.352821 1.503189 -0.632829
2019-04-06 -1.400590 -0.754488 -1.119926 -0.021565
2019-04-07 -0.285587 -0.269059 0.388311 1.007480
2019-04-08 1.445099 0.964586 -0.577791 0.718433
2019-04-09 0.382878 -0.256596 1.430164 0.235034
2019-04-10 -0.171697 -1.684310 -0.834606 -0.269438
7.3 Shifting dates
ts=pd.Series(np.random.randn(5),index=pd.date_range("1/1/2019",periods=5))
print(ts)
#Shift the data while keeping the index; vacated slots are filled with NaN by default
#periods: the number of positions to shift; a negative value shifts backward
#fill_value: the value used to fill the vacated slots
ts.shift(periods=2,fill_value=100)
Output:
2019-01-01 0.484380
2019-01-02 0.849366
2019-01-03 -1.331064
2019-01-04 -0.521108
2019-01-05 -1.868577
Freq: D, dtype: float64
2019-01-01 100.000000
2019-01-02 100.000000
2019-01-03 0.484380
2019-01-04 0.849366
2019-01-05 -1.331064
Freq: D, dtype: float64
tshift shifts the index itself by the given number of periods:
ts.tshift(2)#tshift was deprecated in pandas 1.1 and later removed
Output:
2019-01-03 0.484380
2019-01-04 0.849366
2019-01-05 -1.331064
2019-01-06 -0.521108
2019-01-07 -1.868577
Freq: D, dtype: float64
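On current pandas, the replacement for tshift is shift with a freq argument, which moves the index the same way:

```python
import pandas as pd
import numpy as np

ts = pd.Series(np.arange(5.0), index=pd.date_range("1/1/2019", periods=5))
# Passing freq to shift moves the index instead of the values: each value
# stays attached to a date two days later
shifted = ts.shift(2, freq="D")
print(shifted)
```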
Converting a Unix timestamp to a Timestamp:
pd.to_datetime(1554970740000,unit="ms")
Output:
Timestamp('2019-04-11 08:19:00')
UTC is Coordinated Universal Time, and time zones are expressed as offsets from UTC. Note that setting utc=True makes the resulting pandas object timezone-aware; converting a whole column without handling the time zone can produce wrong results.
import pytz
print(pytz.common_timezones)
pd.to_datetime(1554970740000,unit="ms").tz_localize("UTC").tz_convert("Asia/Shanghai")
Output:
…(the time-zone list is omitted)
Timestamp('2019-04-11 16:19:00+0800', tz='Asia/Shanghai')
Converting a whole column:
df=pd.DataFrame([1554970740000,1554970800000,1554970860000],columns=["time_stamp"])
pd.to_datetime(df["time_stamp"],unit="ms").dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")#localize to UTC first, then convert to UTC+8; note the capital S, since time-zone names are case-sensitive
Output:
0 2019-04-11 16:19:00+08:00
1 2019-04-11 16:20:00+08:00
2 2019-04-11 16:21:00+08:00
Name: time_stamp, dtype: datetime64[ns, Asia/Shanghai]
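The utc=True flag mentioned above folds the localization into the parse itself, so no separate tz_localize step is needed. A sketch of the same column conversion:

```python
import pandas as pd

df = pd.DataFrame([1554970740000, 1554970800000], columns=["time_stamp"])
# utc=True parses straight into tz-aware UTC timestamps
ts = pd.to_datetime(df["time_stamp"], unit="ms", utc=True)
converted = ts.dt.tz_convert("Asia/Shanghai")
print(converted)
```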
8. Grouping and aggregation
#First create a DataFrame
import pandas as pd
import numpy as np
df=pd.DataFrame({"name":["Boss","Lilei","Lilei","Han","Boss","Boss","Han","Boss"],
"Year":[2016,2016,2016,2016,2017,2017,2017,2017],
"Salary":[999999,20000,25000,3000,9999999,999999,3500,999999],
"Bonus":[100000,20000,20000,5000,200000,300000,3000,400000]
})
print(df)
Output:
name Year Salary Bonus
0 Boss 2016 999999 100000
1 Lilei 2016 20000 20000
2 Lilei 2016 25000 20000
3 Han 2016 3000 5000
4 Boss 2017 9999999 200000
5 Boss 2017 999999 300000
6 Han 2017 3500 3000
7 Boss 2017 999999 400000
Group by the name column:
group_by_name=df.groupby("name")
print(type(group_by_name))
Output: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Inspecting the groups
print(group_by_name.groups)
Output: {'Boss': Int64Index([0, 4, 5, 7], dtype='int64'), 'Han': Int64Index([3, 6], dtype='int64'), 'Lilei': Int64Index([1, 2], dtype='int64')}
Counting the rows in each group:
print(group_by_name.count())
Output:
Year Salary Bonus
name
Boss 4 4 4
Han 2 2 2
Lilei 2 2 2
Iterating over the groups:
for name,group in group_by_name:
    print(name)#the group key
    print(group)#the group's rows
Output:
Boss
name Year Salary Bonus
0 Boss 2016 999999 100000
4 Boss 2017 9999999 200000
5 Boss 2017 999999 300000
7 Boss 2017 999999 400000
Han
name Year Salary Bonus
3 Han 2016 3000 5000
6 Han 2017 3500 3000
Lilei
name Year Salary Bonus
1 Lilei 2016 20000 20000
2 Lilei 2016 25000 20000
After the loop, group still holds the last group:
print(group)#inspect its contents
name Year Salary Bonus
1 Lilei 2016 20000 20000
2 Lilei 2016 25000 20000
Selecting a single group:
print(group_by_name.get_group("Boss"))
Output:
name Year Salary Bonus
0 Boss 2016 999999 100000
4 Boss 2017 9999999 200000
5 Boss 2017 999999 300000
7 Boss 2017 999999 400000
Grouping one column by another: use the name column as the key and group the Year column
group_by_name=df["Year"].groupby(df["name"])
print(group_by_name.count())
Output:
name
Boss 4
Han 2
Lilei 2
Name: Year, dtype: int64
Grouping by multiple columns
group_by_name_year=df.groupby(["name","Year"])
for name,group in group_by_name_year:
    print(name)
    print(group)
Output:
('Boss', 2016)
name Year Salary Bonus
0 Boss 2016 999999 100000
('Boss', 2017)
name Year Salary Bonus
4 Boss 2017 9999999 200000
5 Boss 2017 999999 300000
7 Boss 2017 999999 400000
('Han', 2016)
name Year Salary Bonus
3 Han 2016 3000 5000
('Han', 2017)
name Year Salary Bonus
6 Han 2017 3500 3000
('Lilei', 2016)
name Year Salary Bonus
1 Lilei 2016 20000 20000
2 Lilei 2016 25000 20000
Selecting a single group:
print(group_by_name_year.get_group(("Boss",2016)))
Output:
name Year Salary Bonus
0 Boss 2016 999999 100000
Binning a numeric column into value ranges for groupby operations
df=pd.DataFrame({"Age":np.random.randint(20,70,100),
"Sex":np.random.choice(["M","F"],100)
})
age_groups=pd.cut(df["Age"],bins=[19,45,60,100])
print(age_groups)
Output:
0 (19, 45]
1 (19, 45]
2 (19, 45]
3 (19, 45]
4 (19, 45]
5 (45, 60]
6 (60, 100]
7 (60, 100]
8 (60, 100]
9 (60, 100]
10 (19, 45]
11 (60, 100]
12 (60, 100]
13 (19, 45]
14 (19, 45]
15 (60, 100]
16 (19, 45]
17 (19, 45]
18 (45, 60]
19 (19, 45]
20 (19, 45]
21 (45, 60]
22 (19, 45]
23 (45, 60]
24 (19, 45]
25 (45, 60]
26 (60, 100]
27 (60, 100]
28 (19, 45]
29 (19, 45]
...
70 (19, 45]
71 (45, 60]
72 (45, 60]
73 (19, 45]
74 (60, 100]
75 (19, 45]
76 (60, 100]
77 (19, 45]
78 (45, 60]
79 (19, 45]
80 (45, 60]
81 (19, 45]
82 (19, 45]
83 (19, 45]
84 (60, 100]
85 (45, 60]
86 (19, 45]
87 (19, 45]
88 (60, 100]
89 (19, 45]
90 (19, 45]
91 (45, 60]
92 (19, 45]
93 (45, 60]
94 (19, 45]
95 (19, 45]
96 (19, 45]
97 (45, 60]
98 (19, 45]
99 (19, 45]
Name: Age, Length: 100, dtype: category
Categories (3, interval[int64]): [(19, 45] < (45, 60] < (60, 100]]
Cross-tabulating the age bins against Sex
pd.crosstab(age_groups,df["Sex"])
Output:
Sex F M
Age
(19, 45] 31 24
(45, 60] 15 10
(60, 100] 8 12
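The same binned Series also plugs straight into groupby, e.g. counting rows per age band. A sketch (the seed is an assumption, added only to make the output reproducible):

```python
import pandas as pd
import numpy as np

np.random.seed(0)  # assumed seed, just for reproducibility
df = pd.DataFrame({"Age": np.random.randint(20, 70, 100),
                   "Sex": np.random.choice(["M", "F"], 100)})
age_groups = pd.cut(df["Age"], bins=[19, 45, 60, 100])
# The binned Series can be used directly as a groupby key
counts = df.groupby(age_groups, observed=False)["Sex"].count()
print(counts)
```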
Aggregation functions:
Example:
df1=pd.DataFrame({"Data1":np.random.randint(0,10,5),
"Data2":np.random.randint(10,20,5),
"key1":list("aabba"),
"key2":list("xyyxy")})
print(df1)
#Group by key1 and aggregate
#Note: non-numeric columns are dropped when doing a numeric aggregation
print(df1.groupby("key1").sum())
Output:
Data1 Data2 key1 key2
0 1 18 a x
1 4 11 a y
2 6 15 b y
3 9 17 b x
4 5 10 a y
Data1 Data2
key1
a 10 39
b 15 32
Computing only Data1:
print(df1["Data1"].groupby(df1["key1"]).sum())
print(df1.groupby("key1")["Data1"].sum())
Output:
key1
a 10
b 15
Name: Data1, dtype: int32
key1
a 10
b 15
Name: Data1, dtype: int32
Aggregating with the agg() function
print(df1.groupby("key1").agg("sum"))
Output:
Data1 Data2
key1
a 10 39
b 15 32
Several aggregations at once
print(df1.groupby("key1").agg(["sum","mean","std"]))
Output:
Data1 Data2
sum mean std sum mean std
key1
a 10 3.333333 2.081666 39 13 4.358899
b 15 7.500000 2.121320 32 16 1.414214
A custom function can be passed into the agg method: grouped.agg(func)
def peak_range(df):
    '''Return the range of the values (max minus min).'''
    return df.max()-df.min()
print(df1.groupby("key1").agg(peak_range))
Output:
Data1 Data2
key1
a 4 8
b 3 2
#Apply several aggregate functions at the same time
print(df1.groupby("key1").agg(["count","mean","std",peak_range]))
print("==========")
print(df1.groupby("key1").agg(["count","mean","std",("range",peak_range)]))
Output:
Data1 Data2
count mean std peak_range count mean std peak_range
key1
a 3 3.333333 2.081666 4 3 13 4.358899 8
b 2 7.500000 2.121320 3 2 16 1.414214 2
==========
Data1 Data2
count mean std range count mean std range
key1
a 3 3.333333 2.081666 4 3 13 4.358899 8
b 2 7.500000 2.121320 3 2 16 1.414214 2
#Apply a different aggregation to each column
dict_mapping={
"Data1":["mean","max"],
"Data2":"sum"}
df1.groupby("key1").agg(dict_mapping)
Output:
Data1 Data2
mean max sum
key1
a 3.333333 5 39
b 7.500000 9 32
Extension: the apply function
#apply is the most flexible function in pandas
df1=pd.DataFrame({
"sex":list("FFMFMMF"),
"smoker":list("YNYYNYY"),
"age":[21,30,17,37,40,18,26],
"weight":[120,100,132,140,94,89,123]})
print(df1)
def bin_age(age):
    if age>=18:
        return 1
    else:
        return 0
Output:
sex smoker age weight
0 F Y 21 120
1 F N 30 100
2 M Y 17 132
3 F Y 37 140
4 M N 40 94
5 M Y 18 89
6 F Y 26 123
#Map age to an adult flag (1 if 18 or older):
print(df1["age"].apply(bin_age))
df1["age"]=df1["age"].apply(bin_age)
print(df1)
Output:
0 1
1 1
2 0
3 1
4 1
5 1
6 1
Name: age, dtype: int64
sex smoker age weight
0 F Y 1 120
1 F N 1 100
2 M Y 0 132
3 F Y 1 140
4 M N 1 94
5 M Y 1 89
6 F Y 1 123
#Top three weights among smokers and non-smokers:
def top(smoker,col,n=5):
    return smoker.sort_values(by=col)[-n:]
df1.groupby("smoker").apply(top,col="weight",n=3)
Output:
sex smoker age weight
smoker
N 4 M N 1 94
1 F N 1 100
Y 6 F Y 1 123
2 M Y 0 132
3 F Y 1 140
9. A grouping case study
Finally, a case study to round off this Python series:
Grouping case study (the imported CSV file lives on my own machine)
import numpy as np
import pandas as pd
data=pd.read_csv("movie_metadata.csv")
print(data.head())
#Handle missing values
data=data.dropna(how="any")
print(data.head())
#Box-office statistics
#Director vs. total gross
group_director=data.groupby(by="director_name")["gross"].sum()
#ascending controls the sort order; True means ascending
result=group_director.sort_values()
print(type(result))
print(result)
Output:
Far too long to paste here; there is also a multi-level-index extension I have not written up.
That about wraps up Python for now; keep learning and revisiting.
Practice and polish these techniques on real cases to learn how each one actually behaves.