A Beginner Learns Python, Part 8


Preface

Python Basics 8
This installment covers the main features of the pandas library.

I. What is pandas?

pandas is a data-analysis toolkit built on top of NumPy. It provides functions and methods that make processing data fast and convenient, and it incorporates many library features and standard data models.

II. The Series object

A Series is the most basic pandas object. It behaves like a one-dimensional array, much like a NumPy array, with one key difference: a Series lets you attach custom labels to the data, called the index, and then access elements through those labels.

2.1 Common Series operations

Creating a Series:

# Creating Series objects
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# Create a Series without giving an index (a default 0..n-1 index is assigned)
sel=Series([1,2,3,4])
print(sel)

# Usually you supply your own index
sel=Series(data=[1,2,3,4],index=["a","b","c","d"])
sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)

Output:
0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64

Accessing the contents:

sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)
# Get the values
print(sel.values)
# Get the index
print(sel.index)
# Get (index, value) pairs; iteritems() was removed in pandas 2.0, so use items()
print(list(sel.items()))
Output:
a    1
b    2
c    3
d    4
dtype: int64
[1 2 3 4]
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 1), ('b', 2), ('c', 3), ('d', 4)]

Converting a dict to a Series:

d={"red":100,"black":400,"green":300,"pink":900}# avoid naming the variable dict, which shadows the built-in
sel=Series(d)
print(sel)

Output:
red      100
black    400
green    300
pink     900
dtype: int64

Retrieving data from a Series:

sel=Series(data=[1,2,3,4],index=list("abcd"))
print(sel)
print("=========")
# A Series supports both label-based and position-based access
print("by label",sel["c"])
print("=========")
print("by position",sel[2])
print("=========")
# Fetching non-adjacent entries
print("by label",sel[["a","c"]])
print("=========")
print("by position",sel[[1,3]])
print("=========")
# Slices work too
print("positional slice",sel[1:3])
print("=========")
print("label slice",sel["b":"d"])
Output:
a    1
b    2
c    3
d    4
dtype: int64
=========
by label 3
=========
by position 3
=========
by label a    1
c    3
dtype: int64
=========
by position b    2
d    4
dtype: int64
=========
positional slice b    2
c    3
dtype: int64
=========
label slice b    2
c    3
d    4
dtype: int64
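Positional access like sel[2] on a labeled Series is ambiguous (and emits a FutureWarning on recent pandas); the explicit .loc/.iloc accessors make the intent unambiguous. A minimal sketch:

```python
import pandas as pd

sel = pd.Series(data=[1, 2, 3, 4], index=list("abcd"))

# .loc is always label-based, .iloc always position-based
print(sel.loc["c"])               # 3
print(sel.iloc[2])                # 3
print(sel.iloc[1:3].tolist())     # [2, 3]   (positional slice excludes the stop)
print(sel.loc["b":"d"].tolist())  # [2, 3, 4] (label slice includes the stop)
```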

Reassigning the index:

sel.index=list("dcba")
print(sel)
# reindex builds a new Series: entries are rearranged to match the new index, and labels missing from the data are filled with NaN
print(sel.reindex(["b","a","c","d","e"]))

Output:
d    1
c    2
b    3
a    4
dtype: int64
b    3.0
a    4.0
c    2.0
d    1.0
e    NaN
dtype: float64
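reindex can also substitute a chosen constant for missing labels instead of NaN, via the fill_value parameter:

```python
import pandas as pd

sel = pd.Series([1, 2, 3, 4], index=list("abcd"))
# "e" is not in the original index, so it gets fill_value instead of NaN
r = sel.reindex(["b", "a", "e"], fill_value=0)
print(r)
```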

drop removes entries by label:

sel=pd.Series(range(10,15))
print(sel)
print(sel.drop([2,3]))

Output:
0    10
1    11
2    12
3    13
4    14
dtype: int64
0    10
1    11
4    14
dtype: int64

2.2 Arithmetic on Series

Arithmetic on Series is aligned by index. Two Series can be combined with operators such as +, -, *, and /; pandas matches entries by their index labels and computes element-wise. Results are stored as floats to avoid losing precision. Wherever a label exists in only one of the two Series, the result at that label is NaN.

# Arithmetic on Series
series1=pd.Series([1,2,3,4],["London","Hongkong","Humbai","lagos"])
series2=pd.Series([1,3,6,4],["London","Accra","lagos","Delhi"])
print(series1-series2)
print("===============")
print(series1+series2)
print("===============")
print(series1*series2)
print("===============")
# NumPy-style array operations also work
sel=Series(data=[1,6,3,5],index=list("abcd"))
print(sel[sel>3])# boolean filtering
print("===============")
print(sel*2)# scalar multiplication

Output:

Accra       NaN
Delhi       NaN
Hongkong    NaN
Humbai      NaN
London      0.0
lagos      -2.0
dtype: float64
===============
Accra        NaN
Delhi        NaN
Hongkong     NaN
Humbai       NaN
London       2.0
lagos       10.0
dtype: float64
===============
Accra        NaN
Delhi        NaN
Hongkong     NaN
Humbai       NaN
London       1.0
lagos       24.0
dtype: float64
===============
b    6
d    5
dtype: int64
===============
a     2
b    12
c     6
d    10
dtype: int64
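When the NaN produced by non-overlapping labels is unwanted, the method forms (add, sub, mul, div) accept fill_value, which stands in for the missing side before the operation:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["London", "lagos"])
s2 = pd.Series([3, 6], index=["London", "Accra"])
# Labels present in only one Series are treated as 0 rather than producing NaN
print(s1.add(s2, fill_value=0))
```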

III. Creating a DataFrame

A DataFrame is a two-dimensional, tabular data structure: the data is stored in rows and columns. Common operations include selecting or replacing rows and columns, reshaping the table, changing the index, and multi-condition filtering. You can think of a DataFrame as a collection of Series that share the same index. The DataFrame constructor converts many input formats into a DataFrame object; its three main parameters, data, index, and columns, supply the data, the row labels, and the column labels.

3.1 Creating a DataFrame

From a 2-D array:

# From a 2-D array
df1=DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=["a","b","c","d"])
print(df1)
Output:
   a  b  c  d
1  3  7  9  2
2  7  3  1  6
3  2  1  6  0
4  2  1  0  6

From a dict (the row index is supplied separately; the column labels come from the dict keys):

d={
    "province":["Guangdong","Beijing","Qinghai","Fujian"],
    "pop":[1.3,2.5,1.1,0.7],"year":[2018,2018,2018,2018]}

df2=pd.DataFrame(d,index=[1,2,3,4])
print(df2)

Output:
    province  pop  year
1  Guangdong  1.3  2018
2    Beijing  2.5  2018
3    Qinghai  1.1  2018
4     Fujian  0.7  2018

Using from_dict:

dict2={"a":[1,2,3],"b":[4,5,6]}
df6=pd.DataFrame.from_dict(dict2)
print(df6)
Output:
  a  b
0  1  4
1  2  5
2  3  6

# When the column Series carry different indexes, pandas aligns them on the
# union of the indexes; labels missing from a column are filled with NaN

data={
   "name":pd.Series(["zs","ls","we"],index=["a","b","c"]),
    "Age":pd.Series(["10","20","30","40"],index=["a","b","c","d"]),
    "country":pd.Series(["中国","日本","韩国"],index=["a","c","b"])}
df=pd.DataFrame(data)
print(df)
print("=====================")
# to_dict() converts a DataFrame into a dict
d=df.to_dict()
print(d)

Output:
  name Age country
a   zs  10      中国
b   ls  20      韩国
c   we  30      日本
d  NaN  40     NaN
=====================
{'name': {'a': 'zs', 'b': 'ls', 'c': 'we', 'd': nan}, 'Age': {'a': '10', 'b': '20', 'c': '30', 'd': '40'}, 'country': {'a': '中国', 'b': '韩国', 'c': '日本', 'd': nan}}

3.2 Common DataFrame attributes

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# Common DataFrame attributes
df_dict={
    "name":["James","Curry","Iversion"],
    "age":["18","20","19"],
    "national":["us","china","us"]}
df=pd.DataFrame(data=df_dict,index=["0","1","2"])
print(df)
print("=====================")
# Number of rows and columns
print(df.shape)
print("=====================")
# Row index as a list
print(df.index.tolist())
print("=====================")
# Column index as a list
print(df.columns.tolist())
print("=====================")
# Data type of each column
print(df.dtypes)
print("=====================")
# The values attribute returns the data as a 2-D ndarray
print(df.values)
print("=====================")
# A summary of the frame
print(df.info())
print("=====================")
# First few rows (5 by default)
print(df.head(2))
print("=====================")
# Last few rows
print(df.tail(1))
print("=====================")
# Select a column
print(df["name"])
print("=====================")
# Selecting several columns returns a DataFrame:
print(df[["name","age"]])
print(type(df[["name","age"]]))
print("=====================")
# Select one row
print(df[0:1])
print("=====================")
# Select several rows
print(df[1:3])
print("=====================")
# Select certain columns of several rows
print(df[1:3][["name","age"]])
# Note: df[] selects either rows or columns, not both at once
print("=====================")

Output:
       name age national
0     James  18       us
1     Curry  20    china
2  Iversion  19       us
=====================
(3, 3)
=====================
['0', '1', '2']
=====================
['name', 'age', 'national']
=====================
name        object
age         object
national    object
dtype: object
=====================
[['James' '18' 'us']
 ['Curry' '20' 'china']
 ['Iversion' '19' 'us']]
=====================
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 0 to 2
Data columns (total 3 columns):
name        3 non-null object
age         3 non-null object
national    3 non-null object
dtypes: object(3)
memory usage: 96.0+ bytes
None
=====================
    name age national
0  James  18       us
1  Curry  20    china
=====================
       name age national
2  Iversion  19       us
=====================
0       James
1       Curry
2    Iversion
Name: name, dtype: object
=====================
       name age
0     James  18
1     Curry  20
2  Iversion  19
<class 'pandas.core.frame.DataFrame'>
=====================
    name age national
0  James  18       us
=====================
       name age national
1     Curry  20    china
2  Iversion  19       us
=====================
       name age
1     Curry  20
2  Iversion  19
=====================
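Since df[] selects either rows or columns but not both at once, .loc does a combined row-and-column selection in one step; a small sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["James", "Curry", "Iversion"],
                   "age": ["18", "20", "19"]}, index=["0", "1", "2"])
# Rows "1" through "2" and two columns in a single call
# (label slices with .loc include the stop label)
print(df.loc["1":"2", ["name", "age"]])
```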

Continuing:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# Label- and position-based row access
df_dict={
    "name":["James","Curry","Iversion"],
    "age":["18","20","19"],
    "national":["us","china","us"]}
df=pd.DataFrame(data=df_dict,index=["0","1","2"])
# Access rows by label: df.loc
# Access rows by position: df.iloc
# One cell: a given row and column
print(df.loc["0","name"])
# One row, all columns
print(df.loc["0",:])
print("=====================")
# One row, several columns
print(df.loc["0",["name","age"]])
print("=====================")
# Non-adjacent rows and columns
print(df.loc[["0","2"],["name","national"]])
print("=====================")
# One row by position
print(df.iloc[1])
print("=====================")
# Several rows (here the non-adjacent positions 0 and 2)
print(df.iloc[[0,2],:])
# One column by position
print("=====================")
print(df.iloc[:,1])
print("=====================")
# A single value
print(df.iloc[1,0])
print("=====================")
# Modify a value
df.iloc[0,0]="panda"
print(df)
print("=====================")
# Sorting a DataFrame
df=df.sort_values(by="age",ascending=False)# ascending=False sorts descending; the default is ascending
print(df)

Output:

James
name        James
age            18
national       us
Name: 0, dtype: object
=====================
name    James
age        18
Name: 0, dtype: object
=====================
       name national
0     James       us
2  Iversion       us
=====================
name        Curry
age            20
national    china
Name: 1, dtype: object
=====================
       name age national
0     James  18       us
2  Iversion  19       us
=====================
0    18
1    20
2    19
Name: age, dtype: object
=====================
Curry
=====================
       name age national
0     panda  18       us
1     Curry  20    china
2  Iversion  19       us
=====================
       name age national
1     Curry  20    china
2  Iversion  19       us
0     panda  18       us
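sort_values also accepts a list of keys for tie-breaking, and sort_index restores index order; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 1], "b": [9, 3, 5]})
# Sort by column a first, then break ties with column b
s = df.sort_values(by=["a", "b"])
print(s)
# sort_index puts the rows back in index order
print(s.sort_index())
```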

3.3 Changing a DataFrame's index and columns

# Changing a DataFrame's index/columns
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

df1=pd.DataFrame(np.arange(9).reshape(3,3),index=["bj","sh","gz"],columns=["a","b","c"])
print(df1)
# Change df1's index

print(df1.index)# the index can be printed, and reassigned as a whole
print("=====================")
df1.index=["beijing","shanghai","guangzhou"]
print(df1)

# Using a custom mapping function (x is the original row/column label)
def test_map(x):
    return x+"_ABC"
print(df1.rename(index=test_map,columns=test_map,inplace=True))
print("=====================")
print(df1)
# inplace is a boolean, False by default, controlling whether a new DataFrame is returned;
# with inplace=True the original DataFrame is modified and the return value is None
# (which is why None appears in the output below).
# rename can also take a dict to rename individual labels.

Output:
    a  b  c
bj  0  1  2
sh  3  4  5
gz  6  7  8
Index(['bj', 'sh', 'gz'], dtype='object')
=====================
           a  b  c
beijing    0  1  2
shanghai   3  4  5
guangzhou  6  7  8
None
=====================
               a_ABC  b_ABC  c_ABC
beijing_ABC        0      1      2
shanghai_ABC       3      4      5
guangzhou_ABC      6      7      8

Also:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

df1=pd.DataFrame(np.arange(9).reshape(3,3),index=["bj","sh","gz"],columns=["a","b","c"])
print(df1)
print("=====================")
# rename also accepts dicts to rename individual index or column labels
df3=df1.rename(index={"bj":"beijing"},columns={"a":"aa"})
print(df3)

Output:
    a  b  c
bj  0  1  2
sh  3  4  5
gz  6  7  8
=====================
         aa  b  c
beijing   0  1  2
sh        3  4  5
gz        6  7  8

Also:

# Turning a column into the index
df1=pd.DataFrame({"x":range(5),"Y":range(5),"s":list("abcde"),"z":[1,1,2,2,2]})
print(df1)
# Use a column as the index; drop=False keeps that column in the frame as well
result=df1.set_index("s",drop=False)
result.index.name=None
print(result)
# Note: reset_index does the reverse, turning the index back into a column
Output:
   x  Y  s  z
0  0  0  a  1
1  1  1  b  1
2  2  2  c  2
3  3  3  d  2
4  4  4  e  2
   x  Y  s  z
a  0  0  a  1
b  1  1  b  1
c  2  2  c  2
d  3  3  d  2
e  4  4  e  2
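The reset_index counterpart mentioned above reverses set_index, turning the index back into an ordinary column:

```python
import pandas as pd

df = pd.DataFrame({"x": range(3), "s": list("abc")}).set_index("s")
back = df.reset_index()  # "s" moves from the index back to a regular column
print(back)
```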

And:

# Turning a row into the column labels
result=df1.set_axis(df1.iloc[0],axis=1)# older pandas needed inplace=False here; the parameter was removed in 2.0
result.columns.name=None
print(result)
Output:
   0  0  a  1
0  0  0  a  1
1  1  1  b  1
2  2  2  c  2
3  3  3  d  2
4  4  4  e  2

3.4 Adding data

# Adding a column
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
df1["score"]=[80,98,67,90]# appends a score column; the list length must equal the number of rows
print(df1)

Output:
     name gender  age  score
0    snow      M   22     80
1  Tyrion      M   32     98
2   sansa      F   18     67
3    Arya      F   14     90

Inserting a column:

# The original frame
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
df1["score"]=[80,98,67,90]# appends a score column; the list length must equal the number of rows
print(df1)
print("===================================")

# Insert a column into the frame
col_name=df1.columns.tolist()# take all column names out into a list
col_name.insert(2,"city")# insert a column name "city" at position 2; the new column starts out as all NaN
df1=df1.reindex(columns=col_name)# DataFrame.reindex rebuilds the frame against the new row/column labels
print(df1)
Output:
     name gender  age  score
0    snow      M   22     80
1  Tyrion      M   32     98
2   sansa      F   18     67
3    Arya      F   14     90
===================================
     name gender  city  age  score
0    snow      M   NaN   22     80
1  Tyrion      M   NaN   32     98
2   sansa      F   NaN   18     67
3    Arya      F   NaN   14     90

Reassign the city column:

df1["city"]=["北京","山西","湖北","澳门"]
print(df1)

Inserting a column with df.insert:

"""
df.insert inserts a column:
df.insert(loc,column,value)
loc: the integer position to insert at
column: the column name
value: the values
"""
df1.insert(2,"scoress",[80,98,67,90])
print(df1)
Output:
     name gender  scoress city  age  score
0    snow      M       80   北京   22     80
1  Tyrion      M       98   山西   32     98
2   sansa      F       67   湖北   18     67
3    Arya      F       90   澳门   14     90

Overwriting a row (iloc assignment replaces an existing row; to truly insert one in the middle, split the frame and concatenate):

row=["111","222","333","444","555","666"]# one value per column
df1.iloc[1]=row
print(df1)
# The original snippet df1.iloc:[1]=row failed because iloc must be indexed
# with square brackets, and the assigned list must match the number of columns.

Appending a row at the end with append:

# Appending a row
df1=pd.DataFrame([["snow","M",22],["Tyrion","M",32],["sansa","F",18],["Arya","F",14]],columns=["name","gender","age"])
new=pd.DataFrame({"name":"lisa",
                  "gender":"F",
                  "age":"19"
                 },index=[0])
print(new)
print("======================")
# Append the new row at the end of the frame with the append method
# (append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement)
df1=df1.append(new,ignore_index=True)# ignore_index=True discards the original index and renumbers from 0
print(df1)

Output:
name gender age
0  lisa      F  19
======================
     name gender age
0    snow      M  22
1  Tyrion      M  32
2   sansa      F  18
3    Arya      F  14
4    lisa      F  19
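On pandas 2.x, where DataFrame.append no longer exists, the same row-wise append is done with pd.concat:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["snow", "Tyrion"], "age": [22, 32]})
new = pd.DataFrame({"name": ["lisa"], "age": [19]})
# ignore_index=True discards the incoming indexes and renumbers from 0
df1 = pd.concat([df1, new], ignore_index=True)
print(df1)
```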

Concatenation with pd.concat:
objs: the objects to concatenate
axis: which way to concatenate (0, the default, stacks the frames vertically; 1 places them side by side)
ignore_index: whether to discard the existing index

# Concatenation
df1=pd.DataFrame(np.arange(6).reshape(3,2),columns=["four","five"])
df2=pd.DataFrame(np.arange(6).reshape(2,3),columns=["one","two","three"])
print(df1)
print("===================")
print(df2)

Output:
   four  five
0     0     1
1     2     3
2     4     5
===================
   one  two  three
0    0    1      2
1    3    4      5

Side-by-side concatenation (axis=1):

result=pd.concat([df1,df2],axis=1)
print(result)
Output:
   four  five  one  two  three
0     0     1  0.0  1.0    2.0
1     2     3  3.0  4.0    5.0
2     4     5  NaN  NaN    NaN

Vertical concatenation (axis=0):

result=pd.concat([df1,df2],axis=0,ignore_index=True)
print(result)
Output:
   five  four  one  three  two
0   1.0   0.0  NaN    NaN  NaN
1   3.0   2.0  NaN    NaN  NaN
2   5.0   4.0  NaN    NaN  NaN
3   NaN   NaN  0.0    2.0  1.0
4   NaN   NaN  3.0    5.0  4.0

Deleting from a DataFrame:

df2=pd.DataFrame(np.arange(9).reshape(3,3),columns=["one","two","three"])
print(df2)
df3=df2.drop(["one"],axis=1,inplace=True)
print(df2)
print(df3)
# labels: the labels to drop
# axis: 0 (the default) drops rows, 1 drops columns
# inplace: whether to perform the operation on df itself;
# with inplace=True, drop returns None, which is why df3 prints as None below

Output:
   one  two  three
0    0    1      2
1    3    4      5
2    6    7      8
   two  three
0    1      2
1    4      5
2    7      8
None

IV. Data processing

Filtering out missing data with dropna():

from numpy import nan as NaN
# dropna() drops missing entries
se=pd.Series([4,NaN,8,NaN,5])
print(se)
print(se.dropna())
print(se.notnull())
print(se.isnull())

Output:
0    4.0
1    NaN
2    8.0
3    NaN
4    5.0
dtype: float64
0    4.0
2    8.0
4    5.0
dtype: float64
0     True
1    False
2     True
3    False
4     True
dtype: bool
0    False
1     True
2    False
3     True
4    False
dtype: bool

A boolean mask filters them out as well:

print(se[se.notnull()])
Output:
0    4.0
2    8.0
4    5.0
dtype: float64

Handling DataFrame objects:

df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================")
# By default every row containing a NaN is dropped:
print(df1.dropna())
print("================")
# Passing how="all" drops only rows that are entirely NaN:
print(df1.dropna(how="all"))# the default how="any" drops a row as soon as it contains any NaN
print("================")
# Passing axis=1 drops columns instead:
print(df1.dropna(axis=1,how="all"))
print("================")
# Passing thresh=n keeps only rows with at least n non-NaN values:
print(df1.dropna(thresh=1))

Output:
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
================
     0    1    2
0  1.0  2.0  3.0
================
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
3  8.0  8.0  NaN
================
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
================
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
3  8.0  8.0  NaN

Filling in missing data:

# Filling in missing data
df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
# Fill with a constant using fillna
print(df1.fillna(0))
print("================2")
# Pass inplace=True to modify the original object directly
df1.fillna(0,inplace=True)
print(df1)
print("================3")
Output:
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
================1
     0    1    2
0  1.0  2.0  3.0
1  0.0  0.0  2.0
2  0.0  0.0  0.0
3  8.0  8.0  0.0
================2
     0    1    2
0  1.0  2.0  3.0
1  0.0  0.0  2.0
2  0.0  0.0  0.0
3  8.0  8.0  0.0
================3

Filling in missing data:

df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
# A dict fills each column with its own constant
print(df1.fillna({0:10,1:20,2:30}))
print("================4")
# Fill with the column means
print(df1.fillna(df1.mean()))

Output:
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
================1
      0     1     2
0   1.0   2.0   3.0
1  10.0  20.0   2.0
2  10.0  20.0  30.0
3   8.0   8.0  30.0
================4
     0    1    2
0  1.0  2.0  3.0
1  4.5  5.0  2.0
2  4.5  5.0  2.5
3  8.0  8.0  2.5

Filling in missing data:

df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("================1")
# Filling a single column in place; the call itself returns None.
# (Newer pandas warns about this chained inplace pattern and may not
# propagate the fill back to df1; prefer df1[1]=df1[1].fillna(5).)
print(df1.iloc[:,1].fillna(5,inplace=True))
print(df1)
Output:
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
================1
None
     0    1    2
0  1.0  2.0  3.0
1  NaN  5.0  2.0
2  NaN  5.0  NaN
3  8.0  8.0  NaN

Filling in missing data:

df1=pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
print(df1)
print("====================1")
# method="" selects a fill strategy; first build a frame with gaps
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)

Output:
     0    1    2
0  1.0  2.0  3.0
1  NaN  NaN  2.0
2  NaN  NaN  NaN
3  8.0  8.0  NaN
====================1
   0  1  2    3    4
0  1  7  5  1.0  2.0
1  1  2  0  NaN  5.0
2  0  1  6  NaN  NaN
3  8  8  2  NaN  NaN
4  7  7  9  9.0  2.0

df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("=====================")

# ffill fills with the preceding value, bfill with the following value
# (fillna(method=...) is deprecated in pandas 2.x; use df2.ffill()/df2.bfill())
print(df2.fillna(method="ffill"))
print("=====================")

Output:
   0  1  2    3    4
0  0  0  6  6.0  2.0
1  4  1  5  NaN  2.0
2  1  0  1  NaN  NaN
3  0  6  5  NaN  NaN
4  4  6  8  0.0  6.0
=====================
   0  1  2    3    4
0  0  0  6  6.0  2.0
1  4  1  5  6.0  2.0
2  1  0  1  6.0  2.0
3  0  6  5  6.0  2.0
4  4  6  8  0.0  6.0
=====================
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("=====================")
# limit restricts how many consecutive NaN are filled
print(df2.fillna(method="bfill",limit=1))
Output:
   0  1  2    3    4
0  2  8  8  4.0  4.0
1  0  8  0  NaN  1.0
2  0  1  7  NaN  NaN
3  1  6  4  NaN  NaN
4  7  0  4  3.0  4.0
=====================
   0  1  2    3    4
0  2  8  8  4.0  4.0
1  0  8  0  NaN  1.0
2  0  1  7  NaN  NaN
3  1  6  4  3.0  4.0
4  7  0  4  3.0  4.0
# axis changes the fill direction
df2=pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3]=NaN
df2.iloc[2:4,4]=NaN
print(df2)
print("===========================")
print(df2.fillna(method="ffill",limit=1,axis=1))
Output:
   0  1  2    3    4
0  8  6  7  2.0  1.0
1  8  8  2  NaN  0.0
2  7  3  8  NaN  NaN
3  6  8  0  NaN  NaN
4  3  9  9  4.0  5.0
===========================
     0    1    2    3    4
0  8.0  6.0  7.0  2.0  1.0
1  8.0  8.0  2.0  2.0  0.0
2  7.0  3.0  8.0  8.0  NaN
3  6.0  8.0  0.0  0.0  NaN
4  3.0  9.0  9.0  4.0  5.0

Removing duplicate rows:

df1=pd.DataFrame({"A":[1,1,1,2,2,3,1],"B":list("aabbbca")})
print(df1)
print("==========")
# duplicated() reports, for each row, whether it repeats an earlier row (True means duplicate)
print(df1.duplicated())
Output:
   A  B
0  1  a
1  1  a
2  1  b
3  2  b
4  2  b
5  3  c
6  1  a
==========
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Dropping all duplicate rows:

print(df1.drop_duplicates())
Output:
   A  B
0  1  a
2  1  b
3  2  b
5  3  c

Dropping duplicates judged by a single column:

print(df1.drop_duplicates(["A"]))
Output:
   A  B
0  1  a
3  2  b
5  3  c

Keeping the last occurrence among duplicates:

print(df1.drop_duplicates(["A"],keep="last"))
Output:
   A  B
4  2  b
5  3  c
6  1  a

Dropping duplicates while modifying the DataFrame in place:

df1.drop_duplicates(["A","B"],inplace=True)
print(df1)
Output:
   A  B
0  1  a
2  1  b
3  2  b
5  3  c

V. Merging data

# Merging data: simple joins

import pandas as pd
df3=pd.DataFrame({"Red":[1,3,5],"Green":[5,0,3]},index=list("abc"))
df4=pd.DataFrame({"Blue":[1,9,8],"Yellow":[6,6,7]},index=list("cde"))
print(df3)
print(df4)

Simple join (the default is a left join, keyed on the left frame df3):
df3.join(df4,how="left")
Output:
   Red  Green
a    1      5
b    3      0
c    5      3
   Blue  Yellow
c     1       6
d     9       6
e     8       7
   Red  Green  Blue  Yellow
a    1      5   NaN     NaN
b    3      0   NaN     NaN
c    5      3   1.0     6.0

Right join:

# Right join
df3.join(df4,how="right")
Output:
   Red  Green
a    1      5
b    3      0
c    5      3
   Blue  Yellow
c     1       6
d     9       6
e     8       7
   Red  Green  Blue  Yellow
c  5.0    3.0     1       6
d  NaN    NaN     9       6
e  NaN    NaN     8       7

Outer join:

# Outer join
df3.join(df4,how="outer")
Output:
   Red  Green  Blue  Yellow
a  1.0    5.0   NaN     NaN
b  3.0    0.0   NaN     NaN
c  5.0    3.0   1.0     6.0
d  NaN    NaN   9.0     6.0
e  NaN    NaN   8.0     7.0

# Joining several DataFrame objects

df5=pd.DataFrame({"Brown":[3,4,5],"White":[1,1,2]},index=list("aed"))
print(df5)

df3.join([df4,df5])
Output:
   Brown  White
a      3      1
e      4      1
d      5      2
   Red  Green  Blue  Yellow  Brown  White
a    1      5   NaN     NaN    3.0    1.0
b    3      0   NaN     NaN    NaN    NaN
c    5      3   1.0     6.0    NaN    NaN

Merging with merge:

# merge focuses on joining on columns
df1=pd.DataFrame({"名字":list("ABCDE"),"性别":["男","女","男","男","女"],
                  "职称":["副教授","讲师","助教","教授","助教"]},
                 index=range(1001,1006))
df1.columns.name="学院老师"
df1.index.name="编号"
print(df1)
print("==========================")
df2=pd.DataFrame({"名字":list("ABDAX"),"课程":["C++","计算机导论","汇编","数据结构","马克思原理"],
                  "职称":["副教授","讲师","教授","副教授","讲师"]},index=[1001,1002,1004,1001,3001])
df2.columns.name="课程"
df2.index.name="编号"
print(df2)
print("==========================")
# By default merge joins on the columns that appear in both frames, with how="inner"
print(pd.merge(df1,df2))
print("==========================")

Output:
学院老师 名字 性别   职称
编号             
1001  A  男  副教授
1002  B  女   讲师
1003  C  男   助教
1004  D  男   教授
1005  E  女   助教
==========================
课程   名字     课程   职称
编号                 
1001  A    C++  副教授
1002  B  计算机导论   讲师
1004  D     汇编   教授
1001  A   数据结构  副教授
3001  X  马克思原理   讲师
==========================
  名字 性别   职称     课程
0  A  男  副教授    C++
1  A  男  副教授   数据结构
2  B  女   讲师  计算机导论
3  D  男   教授     汇编
==========================

Merging on a named column:

pd.merge(df1,df2,on="名字",suffixes=["_1","_2"])# returns the matching rows; suffixes disambiguates the duplicated 职称 column
Output:
  名字 性别 职称_1    课程 职称_2
0  A  男  副教授    C++  副教授
1  A  男  副教授   数据结构  副教授
2  B  女   讲师  计算机导论   讲师
3  D  男   教授     汇编   教授

Left join, keyed on the left frame:

pd.merge(df1,df2,how="left")
Output:
  名字 性别   职称     课程
0  A  男  副教授    C++
1  A  男  副教授   数据结构
2  B  女   讲师  计算机导论
3  C  男   助教    NaN
4  D  男   教授     汇编
5  E  女   助教    NaN

Right join, keyed on the right frame:

pd.merge(df1,df2,how="right")
Output:
  名字   性别   职称     课程
0  A    男  副教授    C++
1  A    男  副教授   数据结构
2  B    女   讲师  计算机导论
3  D    男   教授     汇编
4  X  NaN   讲师  马克思原理

Outer join, keeping all keys:

pd.merge(df1,df2,how="outer")
Output:
  名字   性别   职称     课程
0  A    男  副教授    C++
1  A    男  副教授   数据结构
2  B    女   讲师  计算机导论
3  C    男   助教    NaN
4  D    男   教授     汇编
5  E    女   助教    NaN
6  X  NaN   讲师  马克思原理

Joining on several keys:

pd.merge(df1,df2,on=["职称","名字"])
Output:
  名字 性别   职称     课程
0  A  男  副教授    C++
1  A  男  副教授   数据结构
2  B  女   讲师  计算机导论
3  D  男   教授     汇编
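When an outer merge produces unexpected NaN, indicator=True adds a _merge column recording which side each row came from; a sketch with toy data (the column names here are illustrative, not from the example above):

```python
import pandas as pd

left = pd.DataFrame({"k": ["A", "B", "C"], "v1": [1, 2, 3]})
right = pd.DataFrame({"k": ["A", "D"], "v2": [9, 8]})
# _merge is "both", "left_only", or "right_only" for each row
out = pd.merge(left, right, on="k", how="outer", indicator=True)
print(out)
```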

Going further:

# Axis-wise concatenation with concat
# Concatenating Series objects
s1=pd.Series([1,2],index=list("ab"))
s2=pd.Series([3,4,5],index=list("bde"))
print(s1)
print(s2)
pd.concat([s1,s2])

Output:
a    1
b    2
dtype: int64
b    3
d    4
e    5
dtype: int64
a    1
b    2
b    3
d    4
e    5
dtype: int64

Horizontal concatenation:

pd.concat([s1,s2],axis=1)
Output:
     0    1
a  1.0  NaN
b  2.0  3.0
d  NaN  4.0
e  NaN  5.0

An inner join keeps only the intersection (concat's join parameter accepts "inner" or "outer"; the default is "outer"):

pd.concat([s1,s2],axis=1,join="inner")
Output:
   0  1
b  2  3

Restricting the result to a chosen index (the join_axes parameter once used for this was removed in pandas 1.0; reindex gives the same result):

pd.concat([s1,s2],axis=1).reindex(list("abc"))
Output:
     0    1
a  1.0  NaN
b  2.0  3.0
c  NaN  NaN

Creating a hierarchical index:

pd.concat([s1,s2],keys=["A","B"])
Output:
A  a    1
   b    2
B  b    3
   d    4
   e    5
dtype: int64

# With axis=1, keys become the column names

pd.concat([s1,s2],keys=["A","D"],axis=1)
Output:
     A    D
a  1.0  NaN
b  2.0  3.0
d  NaN  4.0
e  NaN  5.0

# Concatenating DataFrame objects

df3=pd.DataFrame({"Red":[1,3,5],"Green":[5,0,3]},index=list("abd"))
df4=pd.DataFrame({"Blue":[1,9],"Yellow":[6,6]},index=list("ce"))
print(df3)
print(df4)
pd.concat([df3,df4])
pd.concat([df3,df4],axis=1,keys=["A","B"])# in a notebook only this last expression is displayed
Output:
   Red  Green
a    1      5
b    3      0
d    5      3
   Blue  Yellow
c     1       6
e     9       6
     A           B
   Red  Green  Blue  Yellow
a  1.0    5.0   NaN     NaN
b  3.0    0.0   NaN     NaN
c  NaN    NaN   1.0     6.0
d  5.0    3.0   NaN     NaN
e  NaN    NaN   9.0     6.0

A dict builds the same hierarchical index:

pd.concat({"A":df3,"B":df4},axis=1)

Output:
     A           B
   Red  Green  Blue  Yellow
a  1.0    5.0   NaN     NaN
b  3.0    0.0   NaN     NaN
c  NaN    NaN   1.0     6.0
d  5.0    3.0   NaN     NaN
e  NaN    NaN   9.0     6.0

VI. Time series

6.1 Generating a date range

# Time series
import pandas as pd
import numpy as np
date=pd.date_range(start="20190501",end="20190530")
print(date)

Output:
DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-04',
               '2019-05-05', '2019-05-06', '2019-05-07', '2019-05-08',
               '2019-05-09', '2019-05-10', '2019-05-11', '2019-05-12',
               '2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16',
               '2019-05-17', '2019-05-18', '2019-05-19', '2019-05-20',
               '2019-05-21', '2019-05-22', '2019-05-23', '2019-05-24',
               '2019-05-25', '2019-05-26', '2019-05-27', '2019-05-28',
               '2019-05-29', '2019-05-30'],
              dtype='datetime64[ns]', freq='D')

date_range generates a fixed-frequency DatetimeIndex. You must supply at least two of the three parameters start, end, and periods, otherwise the call raises an error.

Parameters:

freq: the date offset, given as a string, "D" by default, e.g. freq="1h30min" or freq="10D"
periods: the number of periods to generate, an integer or None

date=pd.date_range(start="20190501",periods=10,freq="10D")
print(date)

# The closed parameter controls whether the endpoints are included: closed=None (the default)
# keeps both, "left" keeps the start but not the end, and "right" does the opposite.
# (closed was replaced by the inclusive parameter in pandas 1.4.)
date=pd.date_range(start="2019-01-09",end="2019-01-14",closed="left")
print(date)

Output:
DatetimeIndex(['2019-05-01', '2019-05-11', '2019-05-21', '2019-05-31',
               '2019-06-10', '2019-06-20', '2019-06-30', '2019-07-10',
               '2019-07-20', '2019-07-30'],
              dtype='datetime64[ns]', freq='10D')
DatetimeIndex(['2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12',
               '2019-01-13'],
              dtype='datetime64[ns]', freq='D')
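Compound offsets like the "1h30min" mentioned above resolve to a single fixed step; this sketch uses the equivalent "90min" spelling, which is accepted across pandas versions:

```python
import pandas as pd

# A fixed 90-minute step between consecutive timestamps
idx = pd.date_range(start="2019-05-01", periods=4, freq="90min")
print(idx)
```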

Time series frequency aliases:

D        every calendar day
B        every business day
H        hourly
T or min minutely
S        secondly
L or ms  every millisecond
U        every microsecond
M        calendar month end
BM       business month end
MS       calendar month start
BMS      business month start

6.2 Time series as a DataFrame index

Timestamps can serve as the index:

# Use dates as the index
index=pd.date_range(start="20190101",periods=10)
df=pd.Series(np.random.randint(0,10,size=10),index=index)
print(df)

Output:
2019-01-01    5
2019-01-02    9
2019-01-03    6
2019-01-04    1
2019-01-05    8
2019-01-06    1
2019-01-07    1
2019-01-08    5
2019-01-09    3
2019-01-10    8
Freq: D, dtype: int32

truncate drops everything before the date passed as before, and everything after the date passed as after:

after=df.truncate(after="2019-01-08")
print(after)
Output:
2019-01-01    5
2019-01-02    9
2019-01-03    6
2019-01-04    1
2019-01-05    8
2019-01-06    1
2019-01-07    1
2019-01-08    5
long_ts=pd.Series(np.random.randn(1000),index=pd.date_range("1/1/2019",periods=1000))
# Select by year
result=long_ts["2020"]
print(result)
# Or with a slice:
result=long_ts["2020-05-01":"2020-05-06"]
print(result)
# between_time returns the rows that fall inside a time-of-day window
index=pd.date_range("2018-03-17","2018-03-30",freq="2H")
ts=pd.Series(np.random.randn(157),index=index)
print(ts.between_time("7:00","17:00"))

The same operations work on a DataFrame:

index=pd.date_range("1/1/2019",periods=100)
df=pd.DataFrame(np.random.randn(100,4),index=index)
print(df.loc["2019-04"])
Output:
2019-04-01  0.533238 -1.437635  0.278330 -1.578619
2019-04-02  0.564459  0.576964  0.307198 -0.530719
2019-04-03  0.048564 -1.496561  0.839416  0.341760
2019-04-04  0.090401 -1.947177 -0.085777 -0.222991
2019-04-05  1.995785 -0.352821  1.503189 -0.632829
2019-04-06 -1.400590 -0.754488 -1.119926 -0.021565
2019-04-07 -0.285587 -0.269059  0.388311  1.007480
2019-04-08  1.445099  0.964586 -0.577791  0.718433
2019-04-09  0.382878 -0.256596  1.430164  0.235034
2019-04-10 -0.171697 -1.684310 -0.834606 -0.269438

6.3 Shifting dates

ts=pd.Series(np.random.randn(5),index=pd.date_range("1/1/2019",periods=5))
print(ts)
# shift moves the data while the index stays put; vacated slots default to NaN
# periods: how many positions to move; a negative value moves the other way
# fill_value: the value placed in the vacated slots
ts.shift(periods=2,fill_value=100)
Output:
2019-01-01    0.484380
2019-01-02    0.849366
2019-01-03   -1.331064
2019-01-04   -0.521108
2019-01-05   -1.868577
Freq: D, dtype: float64
2019-01-01    100.000000
2019-01-02    100.000000
2019-01-03      0.484380
2019-01-04      0.849366
2019-01-05     -1.331064
Freq: D, dtype: float64

tshift shifts the index itself by the given number of periods
(tshift was removed in pandas 2.0; ts.shift(2,freq="D") is the modern equivalent):

ts.tshift(2)
Output:
2019-01-03    0.484380
2019-01-04    0.849366
2019-01-05   -1.331064
2019-01-06   -0.521108
2019-01-07   -1.868577
Freq: D, dtype: float64

Converting a Unix timestamp to a datetime:

pd.to_datetime(1554970740000,unit="ms")
Output:
Timestamp('2019-04-11 08:19:00')

UTC is Coordinated Universal Time; time zones are expressed as offsets from UTC. Localizing to UTC first is what makes the pandas object time-zone aware; converting a whole column the same way without the .dt accessor results in an error.

import pytz
print(pytz.common_timezones)
pd.to_datetime(1554970740000,unit="ms").tz_localize("UTC").tz_convert("Asia/Shanghai")

Output:
... (list of time zone names omitted)
Timestamp('2019-04-11 16:19:00+0800', tz='Asia/Shanghai')

Converting a whole column:

df=pd.DataFrame([1554970740000,1554970800000,1554970860000],columns=["time_stamp"])
pd.to_datetime(df["time_stamp"],unit="ms").dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")# localize to UTC first, then convert to UTC+8
Output:
0   2019-04-11 16:19:00+08:00
1   2019-04-11 16:20:00+08:00
2   2019-04-11 16:21:00+08:00
Name: time_stamp, dtype: datetime64[ns, Asia/Shanghai]

VII. Grouping and aggregation

# First build a DataFrame
import pandas as pd
import numpy as np
df=pd.DataFrame({"name":["Boss","Lilei","Lilei","Han","Boss","Boss","Han","Boss"],
                "Year":[2016,2016,2016,2016,2017,2017,2017,2017],
                 "Salary":[999999,20000,25000,3000,9999999,999999,3500,999999],
                 "Bonus":[100000,20000,20000,5000,200000,300000,3000,400000]
                                 })
print(df)

Output:
    name  Year   Salary   Bonus
0   Boss  2016   999999  100000
1  Lilei  2016    20000   20000
2  Lilei  2016    25000   20000
3    Han  2016     3000    5000
4   Boss  2017  9999999  200000
5   Boss  2017   999999  300000
6    Han  2017     3500    3000
7   Boss  2017   999999  400000

Group by the name column:

group_by_name=df.groupby("name")
print(type(group_by_name))
Output: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

Inspecting the groups:

print(group_by_name.groups)
Output: {'Boss': Int64Index([0, 4, 5, 7], dtype='int64'), 'Han': Int64Index([3, 6], dtype='int64'), 'Lilei': Int64Index([1, 2], dtype='int64')}

Counting the rows in each group:

print(group_by_name.count())
Output:
       Year  Salary  Bonus
name                      
Boss      4       4      4
Han       2       2      2
Lilei     2       2      2

Iterating over the groups:

for name,group in group_by_name:
    print(name)# the group's name
    print(group)# the group's rows
Output:
Boss
   name  Year   Salary   Bonus
0  Boss  2016   999999  100000
4  Boss  2017  9999999  200000
5  Boss  2017   999999  300000
7  Boss  2017   999999  400000
Han
  name  Year  Salary  Bonus
3  Han  2016    3000   5000
6  Han  2017    3500   3000
Lilei
    name  Year  Salary  Bonus
1  Lilei  2016   20000  20000
2  Lilei  2016   25000  20000

After the loop, group still holds the rows of the last group:

print(group)
 name  Year  Salary  Bonus
1  Lilei  2016   20000  20000
2  Lilei  2016   25000  20000

A single group can be selected directly:

print(group_by_name.get_group("Boss"))
Output:
   name  Year   Salary   Bonus
0  Boss  2016   999999  100000
4  Boss  2017  9999999  200000
5  Boss  2017   999999  300000
7  Boss  2017   999999  400000  

Grouping one column by another: here the name column is the key and the Year column is grouped

group_by_name=df["Year"].groupby(df["name"])
print(group_by_name.count())
Output:
name
Boss     4
Han      2
Lilei    2
Name: Year, dtype: int64

Grouping by several columns:

group_by_name_year=df.groupby(["name","Year"])
for name,group in group_by_name_year:
    print(name)
    print(group)

Output:
('Boss', 2016)
   name  Year  Salary   Bonus
0  Boss  2016  999999  100000
('Boss', 2017)
   name  Year   Salary   Bonus
4  Boss  2017  9999999  200000
5  Boss  2017   999999  300000
7  Boss  2017   999999  400000
('Han', 2016)
  name  Year  Salary  Bonus
3  Han  2016    3000   5000
('Han', 2017)
  name  Year  Salary  Bonus
6  Han  2017    3500   3000
('Lilei', 2016)
    name  Year  Salary  Bonus
1  Lilei  2016   20000  20000
2  Lilei  2016   25000  20000
    

A group can again be selected directly, now with a tuple key:

print(group_by_name_year.get_group(("Boss",2016)))
Output:
   name  Year  Salary   Bonus
0  Boss  2016  999999  100000

将某列数据按照数值分成不同范围进行分组(groupby)运算

df=pd.DataFrame({"Age":np.random.randint(20,70,100),
                 "Sex":np.random.choice(["M","F"],100)
                 })
age_groups=pd.cut(df["Age"],bins=[19,45,60,100])
print(age_groups)
Output:
0      (19, 45]
1      (19, 45]
2      (19, 45]
3      (19, 45]
4      (19, 45]
5      (45, 60]
6     (60, 100]
7     (60, 100]
8     (60, 100]
9     (60, 100]
10     (19, 45]
11    (60, 100]
12    (60, 100]
13     (19, 45]
14     (19, 45]
15    (60, 100]
16     (19, 45]
17     (19, 45]
18     (45, 60]
19     (19, 45]
20     (19, 45]
21     (45, 60]
22     (19, 45]
23     (45, 60]
24     (19, 45]
25     (45, 60]
26    (60, 100]
27    (60, 100]
28     (19, 45]
29     (19, 45]
        ...    
70     (19, 45]
71     (45, 60]
72     (45, 60]
73     (19, 45]
74    (60, 100]
75     (19, 45]
76    (60, 100]
77     (19, 45]
78     (45, 60]
79     (19, 45]
80     (45, 60]
81     (19, 45]
82     (19, 45]
83     (19, 45]
84    (60, 100]
85     (45, 60]
86     (19, 45]
87     (19, 45]
88    (60, 100]
89     (19, 45]
90     (19, 45]
91     (45, 60]
92     (19, 45]
93     (45, 60]
94     (19, 45]
95     (19, 45]
96     (19, 45]
97     (45, 60]
98     (19, 45]
99     (19, 45]
Name: Age, Length: 100, dtype: category
Categories (3, interval[int64]): [(19, 45] < (45, 60] < (60, 100]]
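The pd.cut result above is a categorical Series, and it can itself serve as a groupby key. A minimal sketch (same column names as above) counting members per age range; exact counts will vary because the data is random:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": np.random.randint(20, 70, 100),
                   "Sex": np.random.choice(["M", "F"], 100)})
age_groups = pd.cut(df["Age"], bins=[19, 45, 60, 100])
# Use the binned categorical as the groupby key to count members per range
counts = df.groupby(age_groups)["Age"].count()
print(counts)
```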

Cross-tabulate the binned Age ranges against Sex:

pd.crosstab(age_groups,df["Sex"])
Output:
Sex	        F	M
Age		
(19, 45]	31	24
(45, 60]	15	10
(60, 100]	8	12
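As an aside, crosstab can also append row and column totals via margins=True. A small sketch reusing the same random frame (counts will differ from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": np.random.randint(20, 70, 100),
                   "Sex": np.random.choice(["M", "F"], 100)})
age_groups = pd.cut(df["Age"], bins=[19, 45, 60, 100])
# margins=True adds an "All" row/column holding the totals
table = pd.crosstab(age_groups, df["Sex"], margins=True)
print(table)
```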

Aggregation functions:
(figure omitted; common built-in aggregations include count, sum, mean, median, std, var, min, and max)

Example:

df1=pd.DataFrame({"Data1":np.random.randint(0,10,5),
                   "Data2":np.random.randint(10,20,5),
                   "key1":list("aabba"),
                   "key2":list("xyyxy")})
print(df1)
#Group by key1 and aggregate
#Note: when a numeric aggregation is applied, non-numeric columns are dropped
print(df1.groupby("key1").sum())

Output:
   Data1  Data2 key1 key2
0      1     18    a    x
1      4     11    a    y
2      6     15    b    y
3      9     17    b    x
4      5     10    a    y
      Data1  Data2
key1              
a        10     39
b        15     32
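A caveat, assuming a recent pandas version: since pandas 2.0 non-numeric columns are no longer always dropped silently, and aggregations such as mean raise a TypeError on string columns unless you pass numeric_only=True. A minimal sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Data1": np.random.randint(0, 10, 5),
                    "Data2": np.random.randint(10, 20, 5),
                    "key1": list("aabba"),
                    "key2": list("xyyxy")})
# Restrict the aggregation to numeric columns explicitly
means = df1.groupby("key1").mean(numeric_only=True)
print(means)
```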

Aggregate only Data1:

print(df1["Data1"].groupby(df1["key1"]).sum())
print(df1.groupby("key1")["Data1"].sum())
Output:
key1
a    10
b    15
Name: Data1, dtype: int32
key1
a    10
b    15
Name: Data1, dtype: int32
Aggregation can also be done with the agg() method:

print(df1.groupby("key1").agg("sum"))
Output:
      Data1  Data2
key1              
a        10     39
b        15     32

Multiple aggregations can be applied at once:

print(df1.groupby("key1").agg(["sum","mean","std"]))
Output:
 Data1                     Data2               
       sum      mean       std   sum mean       std
key1                                               
a       10  3.333333  2.081666    39   13  4.358899
b       15  7.500000  2.121320    32   16  1.414214

A custom function can also be passed to agg, as in grouped.agg(func):

def peak_range(df):
    '''Return the value range (max - min).'''
    return df.max()-df.min()
print(df1.groupby("key1").agg(peak_range))
Output:
      Data1  Data2
key1              
a         4      8
b         3      2
#Apply several aggregation functions at once; a (name, func) tuple renames the output column
print(df1.groupby("key1").agg(["count","mean","std",peak_range]))
print("==========")
print(df1.groupby("key1").agg(["count","mean","std",("range",peak_range)]))
Output:
 Data1                                Data2                          
     count      mean       std peak_range count mean       std peak_range
key1                                                                     
a        3  3.333333  2.081666          4     3   13  4.358899          8
b        2  7.500000  2.121320          3     2   16  1.414214          2
==========
     Data1                           Data2                     
     count      mean       std range count mean       std range
key1                                                           
a        3  3.333333  2.081666     4     3   13  4.358899     8
b        2  7.500000  2.121320     3     2   16  1.414214     2


#Apply a different aggregation to each column via a dict
dict_mapping={
            "Data1":["mean","max"],
            "Data2":"sum"}
df1.groupby("key1").agg(dict_mapping)

Output:
         Data1      Data2
          mean  max   sum
key1
a     3.333333    5    39
b     7.500000    9    32

Extension: the apply function

#apply is the most flexible function in pandas
df1=pd.DataFrame({
                  "sex":list("FFMFMMF"),
                  "smoker":list("YNYYNYY"),
                  "age":[21,30,17,37,40,18,26],
                  "weight":[120,100,132,140,94,89,123]})
print(df1)
def bin_age(age):
    if age>=18:
        return 1
    else:
        return 0
Output:
sex smoker  age  weight
0   F      Y   21     120
1   F      N   30     100
2   M      Y   17     132
3   F      Y   37     140
4   M      N   40      94
5   M      Y   18      89
6   F      Y   26     123

#Mark whether each person is an adult (age >= 18 becomes 1, otherwise 0):
print(df1["age"].apply(bin_age))
df1["age"]=df1["age"].apply(bin_age)
print(df1)
Output:
0    1
1    1
2    0
3    1
4    1
5    1
6    1
Name: age, dtype: int64
  sex smoker  age  weight
0   F      Y    1     120
1   F      N    1     100
2   M      Y    0     132
3   F      Y    1     140
4   M      N    1      94
5   M      Y    1      89
6   F      Y    1     123
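As a side note, this particular apply can also be written as a vectorized comparison, which is usually faster on large frames. A minimal sketch that writes the result into a new adult column (the column name is chosen here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"sex": list("FFMFMMF"),
                    "smoker": list("YNYYNYY"),
                    "age": [21, 30, 17, 37, 40, 18, 26],
                    "weight": [120, 100, 132, 140, 94, 89, 123]})
# Vectorized equivalent of apply(bin_age): boolean comparison cast to int
df1["adult"] = (df1["age"] >= 18).astype(int)
print(df1)
```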

#Take the top three weights among smokers and non-smokers:
def top(smoker,col,n=5):
    #sort ascending, then slice the last n rows (the n largest)
    return smoker.sort_values(by=col)[-n:]
df1.groupby("smoker").apply(top,col="weight",n=3)

Output:
          sex smoker  age  weight
smoker
N      4    M      N    1      94
       1    F      N    1     100
Y      6    F      Y    1     123
       2    M      Y    0     132
       3    F      Y    1     140
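The same top-n query can also be expressed with DataFrame.nlargest, which avoids the ascending sort-and-slice. A sketch (using the original ages, before the binarization above):

```python
import pandas as pd

df1 = pd.DataFrame({"sex": list("FFMFMMF"),
                    "smoker": list("YNYYNYY"),
                    "age": [21, 30, 17, 37, 40, 18, 26],
                    "weight": [120, 100, 132, 140, 94, 89, 123]})
# nlargest(3, "weight") keeps the three heaviest rows of each group
top3 = df1.groupby("smoker").apply(lambda g: g.nlargest(3, "weight"))
print(top3)
```

Note that the N group contributes only two rows, since it has only two members.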

9. A grouping case study

To wrap up this Python series, one final grouping case study.
(The input file movie_metadata.csv is on my own machine.)

import numpy as np
import pandas as pd
data=pd.read_csv("movie_metadata.csv")
print(data.head())

#Handle missing values
data=data.dropna(how="any")
print(data.head())

#Box-office statistics: total gross revenue per director
group_director=data.groupby(by="director_name")["gross"].sum()
#sort_values sorts the result; ascending=True (the default) means ascending order
result=group_director.sort_values()
print(type(result))
print(result)
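Since movie_metadata.csv is not included here, the following sketch uses a small hypothetical frame in its place to show the descending variant (ascending=False), which puts the highest-grossing directors first:

```python
import pandas as pd

# Hypothetical stand-in for movie_metadata.csv
data = pd.DataFrame({"director_name": ["A", "B", "A", "C"],
                     "gross": [100, 300, 250, 50]})
group_director = data.groupby("director_name")["gross"].sum()
# ascending=False sorts from the largest total downward
result = group_director.sort_values(ascending=False)
print(result.head())
```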

Output:
(figure omitted; a Series of directors sorted by total gross)

That is a lot of material already; one more extension on MultiIndex did not make it into this post.
This basically brings the Python series to a close. Keep learning and keep revisiting.
Only by practicing and polishing these techniques on real cases will you see how each one really behaves.
