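Introduction to pandas
pandas is a data analysis library built on top of NumPy. It brings together a number of libraries and standard data models, and provides the tools needed to work efficiently with large datasets. It also offers a large set of functions and methods for processing data quickly and conveniently.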
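Data structures
Series: a one-dimensional labeled array, similar to a one-dimensional NumPy array. Both are close to Python's built-in list. A Series can hold values of different types, such as strings, booleans, and numbers.
Time-Series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Much of its functionality resembles R's data.frame. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was removed in pandas 0.25 and is not available in current releases.
Panel4D: a four-dimensional container modeled on Panel. It was deprecated and removed even earlier than Panel.
PanelND: a module with a factory for creating N-dimensional named containers like Panel4D. It was deprecated and removed along with Panel4D.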
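As a rough illustration of the structures that remain in current pandas, the sketch below builds a mixed-type Series, a time-indexed Series, and a DataFrame assembled from two Series. The values and dates are made up for the example.

```python
import pandas as pd

# A Series can hold mixed Python objects; its dtype then becomes "object".
mixed = pd.Series(["text", True, 3.14])

# A time series is a Series indexed by timestamps (these dates are made up).
dates = pd.date_range("2024-01-01", periods=3, freq="D")
daily = pd.Series([1.0, 2.0, 3.0], index=dates)

# A DataFrame behaves like a dict of Series that share one index.
ages = pd.Series([18, 30], index=["Tom", "Bob"])
cities = pd.Series(["BeiJing", "ShangHai"], index=["Tom", "Bob"])
frame = pd.DataFrame({"age": ages, "city": cities})

print(mixed.dtype)         # object
print(type(frame["age"]))  # selecting one column returns a Series
```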
------------- The examples below walk through basic pandas commands -------------
Raw data
name   age  city
Tom    18   BeiJing
Bob    30   ShangHai
Mary   25   GuangZhou
James  40   ShenZhen
Creating data
pd.Index() defines the index, the data can be given as a dict, and pd.DataFrame(data=, index=) builds the DataFrame.
import numpy as np
import pandas as pd
index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
data = {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]}
user_info = pd.DataFrame(data=data, index=index)
print(user_info)
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
The same DataFrame can also be defined from a list of rows plus explicit column names:
index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
data = [[18, "BeiJing"],
        [30, "ShangHai"],
        [25, "Guangzhou"],
        [40, "ShenZhen"]]
columns = ["age", "city"]
user_info = pd.DataFrame(data=data, index=index, columns=columns)
print(user_info)
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
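To check that the two construction styles really describe the same table, one option is DataFrame.equals, which compares values, labels, and dtypes. The sketch below assumes identical spelling of the city names in both inputs.

```python
import pandas as pd

index = pd.Index(["Tom", "Bob", "Mary", "James"], name="name")

# Column-oriented input: one dict key per column.
from_dict = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=index,
)

# Row-oriented input: one inner list per row, column names given separately.
from_rows = pd.DataFrame(
    [[18, "BeiJing"], [30, "ShangHai"], [25, "Guangzhou"], [40, "ShenZhen"]],
    index=index,
    columns=["age", "city"],
)

same = from_dict.equals(from_rows)
print(same)
```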
Accessing data
print(user_info.loc["Tom"])  # select one row by index label
age          18
city    BeiJing
Name: Tom, dtype: object
print(user_info.iloc[1:3])  # select rows by position (end is excluded)
      age       city
name
Bob    30   ShangHai
Mary   25  Guangzhou
print(user_info.age)  # select one column as an attribute
name
Tom      18
Bob      30
Mary     25
James    40
Name: age, dtype: int64
print(user_info[["city", "age"]])  # select several columns, in the given order
            city  age
name
Tom      BeiJing   18
Bob     ShangHai   30
Mary   Guangzhou   25
James   ShenZhen   40
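One detail that is easy to miss when accessing data: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end. The sketch below rebuilds the example frame to show this, along with single-cell access and a boolean filter.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

by_label = user_info.loc["Bob":"Mary"]      # label slice: both ends included
by_position = user_info.iloc[1:3]           # position slice: end excluded
one_cell = user_info.loc["Tom", "city"]     # row label, then column label
over_25 = user_info[user_info["age"] > 25]  # boolean mask keeps matching rows

print(by_label.equals(by_position))  # True: both select Bob and Mary
print(one_cell)                      # BeiJing
print(list(over_25.index))           # ['Bob', 'James']
```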
Adding and deleting
user_info["sex"] = "male"  # a scalar is broadcast to every row
print(user_info)
       age       city   sex
name
Tom     18    BeiJing  male
Bob     30   ShangHai  male
Mary    25  Guangzhou  male
James   40   ShenZhen  male
del user_info["sex"]  # delete a column in place
user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
user_info.drop("Tom")  # returns a new DataFrame; user_info itself is unchanged
print(user_info)
print(user_info.drop("Tom"))
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
       age       city
name
Bob     30   ShangHai
Mary    25  Guangzhou
James   40   ShenZhen
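Because drop returns a new object rather than modifying the original, the result has to be kept in a variable to have any effect. A small sketch, which also shows dropping a column through the columns= keyword:

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

without_tom = user_info.drop("Tom")            # drop a row by index label
without_city = user_info.drop(columns="city")  # drop a column by name

print("Tom" in user_info.index)    # True: the original still has the row
print("Tom" in without_tom.index)  # False
print(list(without_city.columns))  # ['age']
```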
Inspecting data
user_info.info()  # index, columns, non-null counts, dtypes, memory usage
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Tom to James
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     4 non-null      int64
 1   city    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 256.0+ bytes
user_info.head(2)  # first two rows
      age      city
name
Tom    18   BeiJing
Bob    30  ShangHai
user_info.tail(2)  # last two rows
       age       city
name
Mary    25  Guangzhou
James   40   ShenZhen
user_info.age.max()
40
user_info.age.cumsum()  # running total
name
Tom       18
Bob       48
Mary      73
James    113
Name: age, dtype: int64
user_info.describe()  # summary statistics for numeric columns
             age
count   4.000000
mean   28.250000
std     9.251126
min    18.000000
25%    23.250000
50%    27.500000
75%    32.500000
max    40.000000
user_info.describe(include=["object"])  # summary for string columns
            city
count          4
unique         4
top     ShenZhen
freq           1
user_info.city.value_counts()  # how often each value occurs
ShenZhen     1
Guangzhou    1
ShangHai     1
BeiJing      1
Name: city, dtype: int64
user_info.age.idxmax()  # index label of the largest value
'James'
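The results of these inspection calls are ordinary pandas objects, so they can be indexed like any other. As a small sketch, the snippet below pulls one statistic out of describe() and uses idxmax() to fetch the full row of the oldest user.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

summary = user_info.describe()         # DataFrame indexed by statistic name
mean_age = summary.loc["mean", "age"]  # pick a single statistic

oldest_name = user_info["age"].idxmax()  # label of the row with the max age
oldest_row = user_info.loc[oldest_name]  # the whole row as a Series

print(mean_age)            # 28.25
print(oldest_name)         # James
print(oldest_row["city"])  # ShenZhen
```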
Processing data
pd.cut(user_info.age, 3)  # split the value range into 3 equal-width bins
name
Tom      (17.978, 25.333]
Bob      (25.333, 32.667]
Mary     (17.978, 25.333]
James      (32.667, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]
pd.cut(user_info.age, [1, 18, 30, 50])  # use explicit bin edges
name
Tom       (1, 18]
Bob      (18, 30]
Mary     (18, 30]
James    (30, 50]
Name: age, dtype: category
Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]
pd.cut(user_info.age, [1, 18, 30, 50], labels=["childhood", "youth", "middle"])  # name each bin
name
Tom      childhood
Bob          youth
Mary         youth
James       middle
Name: age, dtype: category
Categories (3, object): ['childhood' < 'youth' < 'middle']
pd.qcut(user_info.age, 3)  # split by quantiles so bins hold similar numbers of rows
name
Tom      (17.999, 25.0]
Bob        (25.0, 30.0]
Mary     (17.999, 25.0]
James      (30.0, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
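With only four ages the difference between cut and qcut is hard to see. The sketch below uses a made-up, skewed series (not from the example data) to show that cut produces equal-width bins while qcut produces bins with roughly equal counts.

```python
import pandas as pd

# Hypothetical skewed data: five small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 100])

equal_width = pd.cut(values, 2)   # two bins of equal width over [1, 100]
equal_count = pd.qcut(values, 2)  # two bins split at the median

# Count rows per bin, ordered from the lowest bin to the highest.
width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)  # [5, 1]: the outlier sits alone in the upper bin
print(count_counts)  # [3, 3]: each bin holds half of the rows
```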
user_info["sex"] = ["male", "male", "female", "male"]
print(user_info)
       age       city     sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
user_info.sort_index()  # sort rows by index label
       age       city     sex
name
Bob     30   ShangHai    male
James   40   ShenZhen    male
Mary    25  Guangzhou  female
Tom     18    BeiJing    male
user_info.sort_index(axis=1, ascending=False)  # sort columns by name, descending
          sex       city  age
name
Tom      male    BeiJing   18
Bob      male   ShangHai   30
Mary   female  Guangzhou   25
James    male   ShenZhen   40
user_info.sort_values(by="age")  # sort rows by one column
       age       city     sex
name
Tom     18    BeiJing    male
Mary    25  Guangzhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male
user_info.sort_values(by=["age", "city"])  # sort by age, break ties by city
       age       city     sex
name
Tom     18    BeiJing    male
Mary    25  Guangzhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male
user_info.age.nlargest(2)  # the two largest values
name
James    40
Bob      30
Name: age, dtype: int64
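In the example data no two users share an age, so sorting by ["age", "city"] looks the same as sorting by age alone. The sketch below sorts by sex first so the second key matters, and uses a per-column ascending list plus nsmallest as the counterpart to nlargest.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"],
     "sex": ["male", "male", "female", "male"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Sort by sex ascending, then by age descending within each sex.
by_sex_then_age = user_info.sort_values(by=["sex", "age"], ascending=[True, False])

# The two smallest ages, mirroring nlargest(2).
youngest_two = user_info["age"].nsmallest(2)

print(list(by_sex_then_age.index))  # ['Mary', 'James', 'Bob', 'Tom']
print(list(youngest_two.index))     # ['Tom', 'Mary']
```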
Lambda expressions
user_info.age.map(lambda x: "yes" if x >= 30 else "no")  # apply a function to each value
name
Tom       no
Bob      yes
Mary      no
James    yes
Name: age, dtype: object
city_map = {
    "BeiJing": "north",
    "ShangHai": "south",
    "Guangzhou": "south",
    "ShenZhen": "south"
}
user_info.city.map(city_map)  # map also accepts a dict as a lookup table
name
Tom      north
Bob      south
Mary     south
James    south
Name: city, dtype: object
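When map is given a dict, any value missing from the dict becomes NaN rather than raising an error. The sketch below uses a made-up city ("ChengDu") that is deliberately absent from the lookup table, then fills the gap with a placeholder.

```python
import pandas as pd

cities = pd.Series(["BeiJing", "ShenZhen", "ChengDu"])  # "ChengDu" is hypothetical
region_map = {"BeiJing": "north", "ShenZhen": "south"}

region = cities.map(region_map)     # values not in the dict become NaN
filled = region.fillna("unknown")   # replace the missing entries

print(region.isna().tolist())  # [False, False, True]
print(filled.tolist())         # ['north', 'south', 'unknown']
```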
user_info.apply(lambda x: x.max(), axis=0)  # axis=0: the function receives each column
age           40
city    ShenZhen
sex         male
dtype: object
user_info.apply(lambda x: x.min(), axis=0)
age          18
city    BeiJing
sex      female
dtype: object
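With axis=0 the function passed to apply receives one column at a time. Passing axis=1 instead hands it one row at a time, which is useful when a result depends on several columns. A minimal sketch that builds a label from two columns:

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# axis=1: the lambda receives each row as a Series keyed by column name.
labels = user_info.apply(lambda row: f"{row['city']}-{row['age']}", axis=1)

print(labels["Tom"])    # BeiJing-18
print(labels["James"])  # ShenZhen-40
```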
user_info.applymap(lambda x: str(x).lower())  # apply a function to every cell; deprecated since pandas 2.1 in favor of DataFrame.map
      age       city     sex
name
Tom    18    beijing    male
Bob    30   shanghai    male
Mary   25  guangzhou  female
James  40   shenzhen    male
user_info.applymap(lambda x: str(x).upper())
      age       city     sex
name
Tom    18    BEIJING    MALE
Bob    30   SHANGHAI    MALE
Mary   25  GUANGZHOU  FEMALE
James  40   SHENZHEN    MALE
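Since pandas 2.1 the element-wise method is available as DataFrame.map, and applymap is deprecated. The sketch below picks whichever exists in the installed version. Note that because the function calls str() on every cell, the age column also ends up holding strings.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

def to_lower(value):
    """Convert any cell to a lowercase string."""
    return str(value).lower()

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions.
if hasattr(pd.DataFrame, "map"):
    lowered = user_info.map(to_lower)
else:
    lowered = user_info.applymap(to_lower)

print(lowered.loc["Tom", "city"])       # beijing
print(repr(lowered.loc["Tom", "age"]))  # '18': now a string, not an integer
```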
Renaming rows, columns, and index labels
user_info.rename(columns={"age": "Age", "city": "City", "sex": "Sex"})  # rename columns
       Age       City     Sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
user_info.rename(index={"Tom": "tom", "Bob": "bob"})  # rename index labels
       age       city     sex
name
tom     18    BeiJing    male
bob     30   ShangHai    male
Mary    25  Guangzhou  female
James   40   ShenZhen    male
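Like drop, rename returns a new DataFrame and leaves the original untouched, so the result needs to be assigned. The sketch below also uses rename_axis to change the name of the index itself ("name" in the example), which rename does not do; the new name "user" is only an illustration.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Rename one column and one row label in a single call.
renamed = user_info.rename(columns={"age": "Age"}, index={"Tom": "tom"})

# rename_axis changes the index name, not the labels ("user" is illustrative).
renamed = renamed.rename_axis("user")

print(list(renamed.columns))  # ['Age', 'city']
print(renamed.index[0])       # tom
print(renamed.index.name)     # user
print(user_info.index.name)   # name: the original is unchanged
```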
Changing data types
user_info["age"].astype(float)  # returns a converted copy of the column
name
Tom      18.0
Bob      30.0
Mary     25.0
James    40.0
Name: age, dtype: float64
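astype does not modify the column in place, and it raises an error if any value cannot be converted. The sketch below assigns the converted column back, shows the dict form for converting selected columns of a DataFrame, and demonstrates the failure case with a made-up string value.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Dict form: convert only the listed columns, returning a new DataFrame.
converted = user_info.astype({"age": "float64"})
original_dtype = str(user_info["age"].dtype)  # still int64 at this point

# To make the change stick, assign the result back to the column.
user_info["age"] = user_info["age"].astype(float)

# astype is strict: a value such as "180cm" cannot be parsed as a float.
try:
    pd.Series(["178", "180cm"]).astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(original_dtype, converted["age"].dtype, user_info["age"].dtype)
print(conversion_failed)  # True
```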
user_info["height"] = ["178", "168", "178", "180cm"]
pd.to_numeric(user_info.height, errors="coerce")  # values that cannot be parsed become NaN
name
Tom      178.0
Bob      168.0
Mary     178.0
James      NaN
Name: height, dtype: float64
pd.to_numeric(user_info.height, errors="ignore")  # returns the input unchanged if parsing fails; deprecated since pandas 2.2
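Coercing to NaN discards the "180cm" value even though the number is recoverable. One possible approach, assuming the only noise in the column is a trailing "cm" unit, is to strip the unit before converting:

```python
import pandas as pd

height = pd.Series(
    ["178", "168", "178", "180cm"],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="height",
)

# errors="coerce" turns the unparseable entry into NaN.
coerced = pd.to_numeric(height, errors="coerce")

# Assumption: "cm" is the only non-numeric text, so remove it literally first.
cleaned = pd.to_numeric(height.str.replace("cm", "", regex=False))

print(coerced.isna().tolist())  # [False, False, False, True]
print(cleaned.tolist())         # [178, 168, 178, 180]
```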