Getting Started with pandas (with Examples)

Introduction to pandas

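pandas is a tool built on top of NumPy and created for data analysis tasks.
It brings together a number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently.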

Data structures

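Series: a one-dimensional labeled array, similar to a one-dimensional NumPy array and also close to Python's built-in list. A Series can hold values of different types: strings, booleans, numbers and so on (mixed values are stored with the object dtype).
Time series: a Series indexed by timestamps.
DataFrame: a two-dimensional tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel has been deprecated and removed from current pandas releases; a DataFrame with a MultiIndex is the usual replacement.
Panel4D: a four-dimensional container analogous to Panel (likewise removed from current pandas).
PanelND: a module with a set of factory functions for creating N-dimensional named containers like Panel4D (likewise removed from current pandas).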
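As a minimal sketch of the structures that are still available, the snippet below builds a Series, a time-indexed Series and a DataFrame assembled from two Series. The dates and visit counts are made up for illustration and do not come from the original article.

```python
import pandas as pd

# A Series: one-dimensional, labeled values.
ages = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"], name="age")

# A Series with mixed value types falls back to the generic object dtype.
mixed = pd.Series(["text", True, 3.14])

# A time series is a Series whose index is a DatetimeIndex.
# The dates and values here are hypothetical.
days = pd.date_range("2024-01-01", periods=3, freq="D")
visits = pd.Series([120, 95, 143], index=days, name="visits")

# A DataFrame behaves like a dict of Series that share one index.
cities = pd.Series(
    ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"], index=ages.index, name="city"
)
frame = pd.DataFrame({"age": ages, "city": cities})

print(mixed.dtype)
print(type(visits.index).__name__)
print(frame)
```

Selecting one column of `frame`, for example `frame["age"]`, gives back a Series, which is why a DataFrame is often described as a container of Series.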

------------- The examples below walk through the basic pandas commands -------------

Raw data

       age       city
name                 
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen

Creating the data

pd.Index() defines the index.
data can be given as a dict.
pd.DataFrame(data=,index=)  # builds the DataFrame

import numpy as np
import pandas as pd  # import the pandas package
index = pd.Index(data=["Tom","Bob","Mary","James"],name="name")
data = {"age":[18,30,25,40],"city":["BeiJing","ShangHai","GuangZhou","ShenZhen"]}
user_info = pd.DataFrame(data=data,index=index)  # build the DataFrame
print(user_info)
       age       city
name                 
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen

The same DataFrame can also be defined row by row, as follows

index = pd.Index(data=["Tom","Bob","Mary","James"],name='name')
data = [[18,"BeiJing"],
       [30,"ShangHai"],
       [25,"Guangzhou"],
       [40,"ShenZhen"]]
columns = ["age","city"]
user_info = pd.DataFrame(data=data,index=index,columns=columns)
print(user_info)
       age       city
name                 
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
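A further option, not shown in the original, is to start from a list of per-row dicts and promote one column to the index with set_index. The sketch below assumes the same four users; the variable names are illustrative only.

```python
import pandas as pd

# One dict per row; the keys become column names.
records = [
    {"name": "Tom", "age": 18, "city": "BeiJing"},
    {"name": "Bob", "age": 30, "city": "ShangHai"},
    {"name": "Mary", "age": 25, "city": "Guangzhou"},
    {"name": "James", "age": 40, "city": "ShenZhen"},
]

# set_index moves the "name" column into the row index.
df = pd.DataFrame(records).set_index("name")
print(df)
```

This form is convenient when rows arrive one at a time, for example from parsed JSON.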

Accessing data

print(user_info.loc["Tom"])  # select the row labeled Tom
age          18
city    BeiJing
Name: Tom, dtype: object
print(user_info.iloc[1:3])  # select rows at positions 1 and 2 (half-open range [1, 3))
      age       city
name                
Bob    30   ShangHai
Mary   25  Guangzhou
print(user_info.age)  # print a single column
name
Tom      18
Bob      30
Mary     25
James    40
Name: age, dtype: int64
print(user_info[["city","age"]])  # print two columns in the given order
            city  age
name                 
Tom      BeiJing   18
Bob     ShangHai   30
Mary   Guangzhou   25
James   ShenZhen   40
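The difference between label-based and position-based selection is easy to miss, so here is a small sketch under the same sample data. It also shows single-cell access and a boolean mask, which the original does not cover.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Label-based slicing with loc includes BOTH endpoints.
by_label = df.loc["Bob":"Mary"]

# Position-based slicing with iloc excludes the stop position.
by_position = df.iloc[1:3]

# A single cell: row label first, then column label.
toms_city = df.loc["Tom", "city"]

# Boolean mask: keep only the rows where the condition holds.
thirty_plus = df[df["age"] >= 30]

print(by_label.equals(by_position))
print(toms_city)
print(thirty_plus)
```

Here both slices return the rows for Bob and Mary, even though the loc slice names Mary explicitly while the iloc slice stops before position 3.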

Adding and deleting

user_info["sex"]="male"  # add a new column; the scalar is broadcast to every row
print(user_info)
       age       city   sex
name                       
Tom     18    BeiJing  male
Bob     30   ShangHai  male
Mary    25  Guangzhou  male
James   40   ShenZhen  male
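del user_info["sex"]  # delete the given column (modifies user_info in place)
user_info
       age       city
name                 
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
# drop() returns a new DataFrame and leaves user_info unchanged,
# so the first print still shows Tom and only the second one omits him
user_info.drop("Tom")
print(user_info)
print(user_info.drop("Tom"))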
       age       city
name                 
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
       age       city
name                 
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
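Because drop returns a copy, the result has to be assigned to a name if it should be kept. A minimal sketch, with variable names chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Dropping a row by label returns a new DataFrame; df keeps all four rows.
without_tom = df.drop("Tom")

# The columns= keyword drops a column instead of a row.
without_city = df.drop(columns="city")

# Several row labels can be dropped at once with index=.
kept = df.drop(index=["Tom", "Bob"])

print(len(df), len(without_tom))
print(list(without_city.columns))
print(list(kept.index))
```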

Inspecting data

user_info.info()  # show a summary of the DataFrame
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Tom to James
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     4 non-null      int64 
 1   city    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 256.0+ bytes
user_info.head(2)  # show the first two rows
       age      city
name                
Tom     18   BeiJing
Bob     30  ShangHai
user_info.tail(2)  # show the last two rows
       age       city
name                 
Mary    25  Guangzhou
James   40   ShenZhen
user_info.age.max()  # largest age
40
user_info.age.cumsum()  # cumulative sum of the ages
name
Tom       18
Bob       48
Mary      73
James    113
Name: age, dtype: int64
user_info.describe()  # summary statistics for the numeric columns
             age
count   4.000000
mean   28.250000
std     9.251126
min    18.000000
25%    23.250000
50%    27.500000
75%    32.500000
max    40.000000
user_info.describe(include=["object"])  # summary statistics for the non-numeric columns
            city
count          4
unique         4
top     ShenZhen
freq           1
user_info.city.value_counts()  # count how often each city occurs
ShenZhen     1
Guangzhou    1
ShangHai     1
BeiJing      1
Name: city, dtype: int64
user_info.age.idxmax()  # index label of the row with the largest age
'James'
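When only a few statistics are needed, agg can compute them in one call, and idxmax combines naturally with loc to fetch the whole row. The following is a sketch on the same sample data rather than part of the original walkthrough.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Several aggregations at once; the result is a Series keyed by function name.
summary = df["age"].agg(["min", "max", "mean"])

# normalize=True turns counts into relative frequencies.
shares = df["city"].value_counts(normalize=True)

# idxmax gives the label, loc turns it into the full row.
oldest = df.loc[df["age"].idxmax()]

print(summary)
print(shares)
print(oldest)
```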

Processing data

pd.cut(user_info.age,3)  # split the ages into three equal-width bins
name
Tom      (17.978, 25.333]
Bob      (25.333, 32.667]
Mary     (17.978, 25.333]
James      (32.667, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]
pd.cut(user_info.age,[1,18,30,50])  # use custom bin edges
name
Tom       (1, 18]
Bob      (18, 30]
Mary     (18, 30]
James    (30, 50]
Name: age, dtype: category
Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]
pd.cut(user_info.age,[1,18,30,50],labels=["childhood","youth","middle"])  # give each bin a label
name
Tom      childhood
Bob          youth
Mary         youth
James       middle
Name: age, dtype: category
Categories (3, object): ['childhood' < 'youth' < 'middle']
pd.qcut(user_info.age,3)  # split into three quantile-based bins, so each bin holds roughly the same number of values
name
Tom      (17.999, 25.0]
Bob        (25.0, 30.0]
Mary     (17.999, 25.0]
James      (30.0, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
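With only four ages, cut and qcut happen to group the users the same way. The sketch below uses a deliberately skewed, made-up series to show where they diverge, and then applies right=False to the article's ages to show how the bin boundaries shift.

```python
import pandas as pd

# Hypothetical skewed data: five small values and one outlier.
skewed = pd.Series([1, 2, 3, 4, 5, 100])

# cut: bins of equal WIDTH, so the outlier leaves the middle bin empty.
width_counts = pd.cut(skewed, 3).value_counts(sort=False)

# qcut: bins of (roughly) equal COUNT, with edges taken from the quantiles.
quantile_counts = pd.qcut(skewed, 3).value_counts(sort=False)

print(width_counts)
print(quantile_counts)

# right=False closes each interval on the left: [1, 18), [18, 30), [30, 50).
ages = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"], name="age")
left_closed = pd.cut(
    ages, [1, 18, 30, 50], right=False, labels=["childhood", "youth", "middle"]
)
print(left_closed)
```

With the default right=True, Tom (18) lands in "childhood" and Bob (30) in "youth"; with right=False they move up one bin each, so the choice of closed side matters whenever values sit exactly on an edge.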
user_info["sex"]=["male","male","female","male"]  # add a sex column
print(user_info)
       age       city     sex
name                         
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
user_info.sort_index()  # sort rows by index label
       age       city     sex
name                         
Bob     30   ShangHai    male
James   40   ShenZhen    male
Mary    25  Guangzhou  female
Tom     18    BeiJing    male
user_info.sort_index(axis=1,ascending=False)  # sort the column labels in descending order
          sex       city  age
name                         
Tom      male    BeiJing   18
Bob      male   ShangHai   30
Mary   female  Guangzhou   25
James    male   ShenZhen   40
user_info.sort_values(by="age")  # sort rows by age
       age       city     sex
name                         
Tom     18    BeiJing    male
Mary    25  Guangzhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male
user_info.sort_values(by=["age","city"])  # sort by age, then by city to break ties
       age       city     sex
name                         
Tom     18    BeiJing    male
Mary    25  Guangzhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male
user_info.age.nlargest(2)  # the two largest ages
name
James    40
Bob      30
Name: age, dtype: int64
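Since no two users share an age, the two-key sort above never has to break a tie. As a sketch of a multi-key sort that does change the order, the example below sorts by sex ascending and then by age descending; it also confirms that sorting returns a new object.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# One ascending flag per sort key: sex A-Z, then age from high to low.
by_sex_then_age = df.sort_values(by=["sex", "age"], ascending=[True, False])

# nsmallest is the counterpart of nlargest.
youngest_two = df["age"].nsmallest(2)

print(by_sex_then_age)
print(youngest_two)
print(list(df.index))  # the original row order is untouched
```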

Lambda expressions

user_info.age.map(lambda x:"yes" if x>=30 else "no")  # apply a function to every value
name
Tom       no
Bob      yes
Mary      no
James    yes
Name: age, dtype: object
city_map={
    "BeiJing":"north",
    "ShangHai":"south",
    "Guangzhou":"south",
    "ShenZhen":"south"
}
# pass a dict as the mapping
user_info.city.map(city_map)
name
Tom      north
Bob      south
Mary     south
James    south
Name: city, dtype: object
user_info.apply(lambda x:x.max(),axis=0)  # maximum of each column
age           40
city    ShenZhen
sex         male
dtype: object
user_info.apply(lambda x:x.min(),axis=0)  # minimum of each column
age          18
city    BeiJing
sex      female
dtype: object
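user_info.applymap(lambda x:str(x).lower())  # convert every cell to a lowercase string (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
       age       city     sex
name                         
Tom     18    beijing    male
Bob     30   shanghai    male
Mary    25  guangzhou  female
James   40   shenzhen    male
user_info.applymap(lambda x:str(x).upper())  # convert every cell to an uppercase string
       age       city     sex
name                         
Tom     18    BEIJING    MALE
Bob     30   SHANGHAI    MALE
Mary    25  GUANGZHOU  FEMALE
James   40   SHENZHEN    MALE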
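The element-wise calls above pass every cell through str(), so the age column ends up holding strings as well. One alternative, sketched here as an assumption about what is usually wanted, is to lowercase only the text columns with the .str accessor. The snippet also shows row-wise apply and what map does with keys missing from the dict.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Lowercase just the text columns; age stays numeric.
lowered = df.copy()
for column in ["city", "sex"]:
    lowered[column] = lowered[column].str.lower()

# axis=1 hands each ROW to the function as a Series.
labels = df.apply(lambda row: f"{row['city']}-{row['age']}", axis=1)

# Values without a matching key in the dict become NaN.
partial = df["city"].map({"BeiJing": "north"})

print(lowered)
print(labels)
print(partial)
```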

Renaming columns and index labels

user_info.rename(columns={"age":"Age","city":"City","sex":"Sex"})  # rename columns
       Age       City     Sex
name                         
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
user_info.rename(index={"Tom":"tom","Bob":"bob"})  # rename index labels
       age       city     sex
name                         
tom     18    BeiJing    male
bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
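Like drop, rename returns a new DataFrame and leaves the original untouched. The sketch below also passes a function instead of a dict and uses rename_axis to change the name of the index itself; the new names are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# A callable is applied to every column label.
upper_columns = df.rename(columns=str.upper)

# rename_axis changes the index NAME ("name" -> "user"), not the row labels.
renamed_axis = df.rename_axis("user")

print(list(upper_columns.columns))
print(renamed_axis.index.name)
print(list(df.columns))  # the original labels are unchanged
```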

Changing data types

user_info["age"].astype(float)  # convert the age column to float
name
Tom      18.0
Bob      30.0
Mary     25.0
James    40.0
Name: age, dtype: float64
user_info["height"]=["178","168","178","180cm"]
pd.to_numeric(user_info.height,errors="coerce")  # values that cannot be parsed become NaN; with the default errors="raise" an exception is raised instead
name
Tom      178.0
Bob      168.0
Mary     178.0
James      NaN
Name: height, dtype: float64
pd.to_numeric(user_info.height,errors="ignore")  # if conversion fails, return the original data unchanged (errors="ignore" is deprecated since pandas 2.2)
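Rather than discarding "180cm" as NaN or keeping the whole column as text, the unit suffix can be stripped before converting. This is a sketch that assumes "cm" is the only stray suffix in the data; it also shows an explicit try/except that behaves like the errors="ignore" option.

```python
import pandas as pd

heights = pd.Series(
    ["178", "168", "178", "180cm"],
    index=["Tom", "Bob", "Mary", "James"],
    name="height",
)

# errors="coerce": the unparseable value is replaced by NaN.
coerced = pd.to_numeric(heights, errors="coerce")

# Strip the literal "cm" suffix first, then every value converts cleanly.
cleaned = pd.to_numeric(heights.str.replace("cm", "", regex=False))

# Explicit fallback: keep the original strings if conversion fails.
try:
    converted = pd.to_numeric(heights)
except ValueError:
    converted = heights

print(coerced)
print(cleaned)
print(converted)
```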