Python Pandas Basics
By:小?
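References: Reference blog 1, Reference blog 2, Reference blog 3
Pandas is a Python data analysis package created to handle data analysis tasks.
Pandas incorporates a large number of libraries and standard data models and provides the tools needed to work with datasets efficiently.
Pandas offers many functions and methods for processing data quickly and conveniently.
Pandas data structures are dictionary-like and built on NumPy, which makes NumPy-centric applications simpler.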
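To make the "dictionary-like, built on NumPy" description concrete, here is a minimal sketch. The fruit prices are hypothetical sample data, not taken from the original post. It shows label-based access working much like a dictionary, and NumPy functions applying directly to a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
prices = pd.Series({"apple": 3.0, "banana": 1.5, "cherry": 8.0})

# Dictionary-style access: look up a value by its label.
apple_price = prices["apple"]

# Membership tests check the labels (the index), as with dict keys.
has_banana = "banana" in prices

# The data can be exposed as a NumPy array.
values = prices.to_numpy()

# NumPy functions accept a Series and return a Series with the same labels.
log_prices = np.log(prices)

print(apple_price)
print(has_banana)
print(type(values))
print(log_prices.index.tolist())
```

The labels survive the NumPy call, which is the main convenience over working with a bare array.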
Installing Pandas
pip3 install pandas
Importing Pandas
import pandas as pd
Data Structures
Series
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 3, np.nan, 5, 6])
print(s)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
dtype: float64
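The output above shows the integers printed as floats. A short sketch of why, and of how the missing value behaves in common operations, might look like this. The variable names `missing_count`, `mean_value`, and `cleaned` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6])

# np.nan is a float, so the integer inputs are upcast to float64.
print(s.dtype)

# isna() flags missing entries; summing the flags counts them.
missing_count = int(s.isna().sum())

# Reductions such as mean() skip NaN by default (skipna=True).
mean_value = s.mean()

# dropna() returns a new Series without the missing entry.
# The original labels are kept, so label 3 is simply absent.
cleaned = s.dropna()

print(missing_count)
print(mean_value)
print(cleaned.index.tolist())
```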
DataFrame
dates = pd.date_range('20180310', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
print(df['B'])
print("----------------\n----------------")
df_1 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20180310'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
print(df_1)
print(df_1.dtypes)
print(df_1.index)
print(df_1.columns)
print("----------------\n----------------")
print(df_1.values)
print(df_1.describe())
print(df_1.T)
print("----------------\n----------------")
print(df_1.sort_index(axis=1, ascending=False))
print(df_1.sort_values(by='E'))
A B C D
2018-03-10 0.872767 2.188739 0.766781 -0.001429
2018-03-11 0.218740 -0.556263 -0.047700 0.470347
2018-03-12 -0.816785 0.479690 1.722349 1.116260
2018-03-13 0.988138 -0.025760 -0.971384 -0.558211
2018-03-14 -0.581776 1.021027 -1.280569 1.022587
2018-03-15 0.061455 -1.647589 -1.568288 -0.467407
2018-03-10 2.188739
2018-03-11 -0.556263
2018-03-12 0.479690
2018-03-13 -0.025760
2018-03-14 1.021027
2018-03-15 -1.647589
Freq: D, Name: B, dtype: float64
----------------
----------------
A B C D E F
0 1.0 2018-03-10 1.0 3 test foo
1 1.0 2018-03-10 1.0 3 train foo
2 1.0 2018-03-10 1.0 3 test foo
3 1.0 2018-03-10 1.0 3 train foo
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Int64Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
----------------
----------------
[[1.0 Timestamp('2018-03-10 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2018-03-10 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2018-03-10 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2018-03-10 00:00:00') 1.0 3 'train' 'foo']]
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
0 1 2 \
A 1 1 1
B 2018-03-10 00:00:00 2018-03-10 00:00:00 2018-03-10 00:00:00
C 1 1 1
D 3 3 3
E test train test
F foo foo foo
3
A 1
B 2018-03-10 00:00:00
C 1
D 3
E train
F foo
----------------
----------------
F E D C B A
0 foo test 3 1.0 2018-03-10 1.0
1 foo train 3 1.0 2018-03-10 1.0
2 foo test 3 1.0 2018-03-10 1.0
3 foo train 3 1.0 2018-03-10 1.0
A B C D E F
0 1.0 2018-03-10 1.0 3 test foo
2 1.0 2018-03-10 1.0 3 test foo
1 1.0 2018-03-10 1.0 3 train foo
3 1.0 2018-03-10 1.0 3 train foo
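In the output above, the summary statistics cover only columns A, C, and D, and sorting by E places the "test" rows first. As a hedged sketch, the numeric columns can also be picked out explicitly with `select_dtypes`, and the sort order can be reversed with `ascending=False`. The names `numeric` and `sorted_desc` are illustrative and do not appear in the original post.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20180310'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
})

# Keep only the columns with a numeric dtype (float64, float32, int32 here).
numeric = df_1.select_dtypes(include='number')
print(numeric.columns.tolist())

# Sort by the categorical column in descending order: "train" rows come first.
sorted_desc = df_1.sort_values(by='E', ascending=False)
print(sorted_desc['E'].tolist())
```

Both calls return new objects, so `df_1` itself is left unchanged.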
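Selecting Data in Pandas
Selecting specific columns
Selecting specific rows
Selecting specific rows and columns
Selecting by integer position with iloc
Filtering by condition
MultiIndex (hierarchical indexing)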
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])
df
               a          b          c          d
one    73.506341  75.662735  74.675325   7.697207
two    73.055825  83.222481   4.777599  82.534340
three  89.156683  85.001712  47.443443  73.379189
four   95.648043  64.162408  26.731916  73.839172
Selecting specific columns
print(df["a"])
print("----------------\n----------------")
print(df[["a", "b"]])
print("----------------\n----------------")
print(df.loc[:, "b":"d"])
one 73.506341
two 73.055825
three 89.156683
four 95.648043
Name: a, dtype: float64
----------------
----------------
a b
one 73.506341 75.662735
two 73.055825 83.222481
three 89.156683 85.001712
four 95.648043 64.162408
----------------
----------------
b c d
one 75.662735 74.675325 7.697207
two 83.222481 4.777599 82.534340
three 85.001712 47.443443 73.379189
four 64.162408 26.731916 73.839172
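One detail in the last result is easy to miss: the label slice `"b":"d"` includes column d. The sketch below contrasts that with a position-based slice through `iloc`, and shows how single versus double brackets change the return type. It uses fixed values rather than random ones so the result is reproducible, which is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Fixed values (0..15) instead of random numbers, for reproducibility.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Label slice with loc: both endpoints are included.
by_label = df.loc[:, "b":"d"]

# Position slice with iloc: the stop position is excluded, as in Python lists.
by_position = df.iloc[:, 1:3]

# Single brackets with one label return a Series.
single = df["a"]

# A list of labels returns a DataFrame, even for one column.
frame = df[["a"]]

print(by_label.columns.tolist())
print(by_position.columns.tolist())
print(type(single).__name__, type(frame).__name__)
```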
Selecting specific rows
print(df.loc["one"]