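A Quick Introduction to Pandas
Pandas works on data held in memory. When a dataset is too large to load in one go, it has to be imported in chunks, and queries have to be adapted to work chunk by chunk.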
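As a minimal sketch of chunked importing, read_csv accepts a chunksize argument and then yields DataFrames piece by piece. The in-memory CSV text and the column names id and value are assumptions made up for illustration; a real case would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large to load at once.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
rows_seen = 0
# With chunksize set, read_csv returns an iterator of DataFrames (at most 4 rows each here).
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # The "query" becomes a partial result per chunk that is combined afterwards.
    total += int(chunk["value"].sum())
    rows_seen += len(chunk)

print(rows_seen, total)
```

Only one chunk is held in memory at a time, which is why aggregations have to be written as running totals rather than a single call on the whole table.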
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20121201', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
Date               A         B         C         D
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967
2012-12-02 -0.767541  0.072700  0.985450 -1.838382
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623
2012-12-04 -0.974840  0.114777  0.952938  2.034717
2012-12-05 -0.689099 -1.102233  0.227212  1.241322
2012-12-06 -0.288585  1.363764  0.230803 -1.884838
1.1 Selecting rows
rows = df[0:3]
print(rows)
Date               A         B         C         D
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967
2012-12-02 -0.767541  0.072700  0.985450 -1.838382
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623
1.2 Selecting columns
cols = df[['A', 'B', 'C']]
print(cols)
Date               A         B         C
2012-12-01  1.167517 -0.814091 -0.908612
2012-12-02 -0.767541  0.072700  0.985450
2012-12-03 -1.078247 -0.413168 -1.899446
2012-12-04 -0.974840  0.114777  0.952938
2012-12-05 -0.689099 -1.102233  0.227212
2012-12-06 -0.288585  1.363764  0.230803
1.3 Selecting a block, that is, a block of data made up of chosen rows and columns
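One way to pick rows and columns in a single step is with the loc (label-based) and iloc (position-based) indexers. The sketch below uses np.arange values instead of random numbers, purely as an assumption so that the result is predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20121201', periods=6)
# Deterministic values 0..23 (an assumption for illustration, not the random data above).
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based: rows by date range (both ends inclusive), columns by name.
block_by_label = df.loc['2012-12-01':'2012-12-03', ['A', 'B']]

# Position-based: first three rows, first two columns (end positions exclusive).
block_by_position = df.iloc[0:3, 0:2]

print(block_by_label)
```

Both expressions select the same 3 by 2 block here; loc is usually easier to read when the labels are meaningful, while iloc is convenient when only positions are known.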
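Pandas has two basic data structures: Series and DataFrame. A Series is a one-dimensional labeled array, such as a single row or column. A DataFrame is a two-dimensional block of data, in other words a matrix or table.
2 Series operations
s = pd.Series([1, 2, 3, 4])
print(s)
Output: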
0 1
1 2
2 3
3 4
dtype: int64
2.2 DataFrame
s = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
print(s)
(This prints a table of 6 rows and 4 columns, A to D, filled with random values. Because no index was passed, the row labels are the default integers 0 to 5 rather than dates.)
print(s.index)
Output:
RangeIndex(start=0, stop=6, step=1)
# Add derived columns to df; each new column is aligned on df.index
df['SumAB'] = pd.Series(df['A'] + df['B'], index=df.index)
df['10A'] = pd.Series(df['A'] * 10, index=df.index)
print(df)
Date               A         B         C         D     SumAB        10A
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967  0.353426  11.675168
2012-12-02 -0.767541  0.072700  0.985450 -1.838382 -0.694840  -7.675406
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623 -1.491414 -10.782469
2012-12-04 -0.974840  0.114777  0.952938  2.034717 -0.860063  -9.748398
2012-12-05 -0.689099 -1.102233  0.227212  1.241322 -1.791332  -6.890987
2012-12-06 -0.288585  1.363764  0.230803 -1.884838  1.075178  -2.885852
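The pd.Series(...) wrapper used when adding the columns is optional, since arithmetic on two columns already produces a Series aligned on the same index. A shorter equivalent, offered as a sketch rather than the author's method, looks like this; df2 is a name introduced only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20121201', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Column arithmetic already yields a Series aligned on df.index.
df['SumAB'] = df['A'] + df['B']

# assign() returns a new DataFrame and leaves df itself untouched.
# The dict form is needed because '10A' is not a valid Python keyword name.
df2 = df.assign(**{'10A': df['A'] * 10})

print(df2.columns.tolist())
```

Direct assignment modifies df in place, while assign() is handy when the original table should stay unchanged.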
2.3 Filtering rows by condition
s1 = df[(df.index >= '20121201') & (df.index <= '20121203')]
print(s1)
Date               A         B         C         D     SumAB        10A
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967  0.353426  11.675168
2012-12-02 -0.767541  0.072700  0.985450 -1.838382 -0.694840  -7.675406
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623 -1.491414 -10.782469
s2 = df[df['A'] > 0]
print(s2)
Date               A         B         C         D     SumAB        10A
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967  0.353426  11.675168
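Conditions on several columns can be combined in the same way as the two index comparisons. The sketch below uses small hand-picked values (an assumption, chosen so the result is predictable) and also shows two alternative spellings: a label slice with loc and a string expression with query().

```python
import pandas as pd

dates = pd.date_range('20121201', periods=6)
# Hand-picked values for illustration only.
df = pd.DataFrame(
    {'A': [1.0, -1.0, 2.0, -2.0, 3.0, -3.0],
     'B': [-1.0, 1.0, 2.0, -2.0, -3.0, 3.0]},
    index=dates,
)

# Combine conditions with & (and), | (or), ~ (not); each condition needs its own parentheses.
both_positive = df[(df['A'] > 0) & (df['B'] > 0)]

# On a DatetimeIndex, a label slice selects the same rows as two index comparisons.
first_three = df.loc['2012-12-01':'2012-12-03']

# query() expresses the same kind of filter as a string.
a_positive = df.query('A > 0')

print(both_positive)
```

The parentheses matter because & binds more tightly than the comparison operators in Python.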
2.4 Peeking at the data
df.head(5)
Date               A         B         C         D     SumAB        10A
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967  0.353426  11.675168
2012-12-02 -0.767541  0.072700  0.985450 -1.838382 -0.694840  -7.675406
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623 -1.491414 -10.782469
2012-12-04 -0.974840  0.114777  0.952938  2.034717 -0.860063  -9.748398
2012-12-05 -0.689099 -1.102233  0.227212  1.241322 -1.791332  -6.890987
df.tail(5)
Date               A         B         C         D     SumAB        10A
2012-12-02 -0.767541  0.072700  0.985450 -1.838382 -0.694840  -7.675406
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623 -1.491414 -10.782469
2012-12-04 -0.974840  0.114777  0.952938  2.034717 -0.860063  -9.748398
2012-12-05 -0.689099 -1.102233  0.227212  1.241322 -1.791332  -6.890987
2012-12-06 -0.288585  1.363764  0.230803 -1.884838  1.075178  -2.885852
df.values
array([[  1.16751676,  -0.81409105,  -0.90861201,  -0.59996719,
          0.35342571,  11.6751676 ],
       [ -0.76754063,   0.07270018,   0.98545024,  -1.83838166,
         -0.69484045,  -7.67540633],
       [ -1.07824687,  -0.41316755,  -1.89944615,  -0.15062331,
         -1.49141442, -10.78246866],
       [ -0.97483975,   0.11477693,   0.95293849,   2.03471652,
         -0.86006282,  -9.74839753],
       [ -0.68909873,  -1.10223307,   0.22721154,   1.24132162,
         -1.7913318 ,  -6.89098733],
       [ -0.2885852 ,   1.3637637 ,   0.23080346,  -1.88483769,
          1.0751785 ,  -2.88585202]])
df.sort_values(by='A')
Date               A         B         C         D     SumAB        10A
2012-12-03 -1.078247 -0.413168 -1.899446 -0.150623 -1.491414 -10.782469
2012-12-04 -0.974840  0.114777  0.952938  2.034717 -0.860063  -9.748398
2012-12-02 -0.767541  0.072700  0.985450 -1.838382 -0.694840  -7.675406
2012-12-05 -0.689099 -1.102233  0.227212  1.241322 -1.791332  -6.890987
2012-12-06 -0.288585  1.363764  0.230803 -1.884838  1.075178  -2.885852
2012-12-01  1.167517 -0.814091 -0.908612 -0.599967  0.353426  11.675168
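A few related calls may be useful when inspecting data. The sketch below uses a tiny hand-made table (an assumption for illustration) to show that sort_values returns a new, sorted copy, that the order can be reversed, and that to_numpy() gives the same array as df.values.

```python
import pandas as pd

# Tiny hand-made table; the labels x, y, z are illustrative only.
df = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]}, index=['x', 'y', 'z'])

by_a = df.sort_values(by='A')                        # ascending by default; df is unchanged
by_a_desc = df.sort_values(by='A', ascending=False)  # largest A first
restored = by_a.sort_index()                         # sort by row labels instead of values

values = df.to_numpy()                               # same data as df.values

print(by_a.index.tolist(), by_a_desc.index.tolist())
```

Because sort_values does not modify df, the result has to be assigned to a name if it is needed later.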
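3 Plotting
Used together with matplotlib, Pandas supports most common chart types.
First, enable inline display of charts (this magic command works in Jupyter/IPython notebooks):
%matplotlib inline
nd = pd.Series(np.random.randn(600))
nd.hist(bins=100)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f54c76043c8>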
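Outside a notebook the %matplotlib magic is not available. A rough equivalent for a plain script, assuming the non-interactive Agg backend is acceptable, is to keep the Axes object returned by hist() and save the figure explicitly. The in-memory buffer is used only to keep the sketch self-contained; a file name could be passed to savefig instead.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

nd = pd.Series(np.random.randn(600))
ax = nd.hist(bins=100)      # returns the matplotlib Axes that was drawn on
ax.set_xlabel('value')
ax.set_ylabel('count')

# Render the figure as PNG into memory.
buf = io.BytesIO()
ax.figure.savefig(buf, format='png')
plt.close(ax.figure)
```

The object printed in the notebook output is this same Axes; holding on to it is what allows labels and titles to be added afterwards.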
Notes on using the read_csv() function in Pandas:
import pandas as pd
# No names given: the first line of the file is used as the column names
data = pd.read_csv("iris_training.csv")
print(data)
'''
   120    4  setosa  versicolor  virginica
0  6.4  2.8     5.6         2.2          2
1  5.0  2.3     3.3         1.0          1
2  4.9  2.5     4.5         1.7          2
3  4.9  3.1     1.5         0.1          0
'''
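# Column names used below (taken from the header of the output that follows)
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
# names only: the file's own first line is kept as the first data row
data = pd.read_csv("iris_training.csv", names=CSV_COLUMN_NAMES)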
'''
   SepalLength  SepalWidth PetalLength  PetalWidth    Species
0        120.0         4.0      setosa  versicolor  virginica
1          6.4         2.8         5.6         2.2          2
2          5.0         2.3         3.3         1.0          1
'''
# names together with header=0: the file's first line is replaced by the given names
data = pd.read_csv("iris_training.csv", names=CSV_COLUMN_NAMES, header=0)
'''
   SepalLength  SepalWidth  PetalLength  PetalWidth  Species
0          6.4         2.8          5.6         2.2        2
1          5.0         2.3          3.3         1.0        1
2          4.9         2.5          4.5         1.7        2
3          4.9         3.1          1.5         0.1        0
'''
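header : int or list of ints, default 'infer'. The row number(s) to use as the column names, which also marks where the data starts. With the default, if no names are passed, the behavior is the same as header=0 and the column names are taken from the first line of the file; if names are passed explicitly, the behavior is the same as header=None. Pass header=0 explicitly when you want the names you supply to replace column names that already exist in the file.
header can also be a list such as [0, 1, 3]. The listed rows of the file are then used as column headers, so each column has several header levels, and rows in between that are not listed are skipped (row 2 in this example). In other words, the 1st, 2nd, and 4th lines become the multi-level header, the 3rd line is discarded, and the DataFrame's data starts at the 5th line.
Note: if skip_blank_lines=True, the header parameter ignores comment lines and blank lines, so header=0 means the first line of data rather than the first line of the file.
names : array-like, default None. The list of column names to use for the result. If the file contains no header row, you should explicitly pass header=None. By default the list must not contain duplicates, unless mangle_dupe_cols=True is set (a parameter documented in older pandas versions and no longer available in recent releases).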
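The interaction between header and names can be checked without a file on disk. The sketch below feeds read_csv from io.StringIO; the CSV text, the variable names kept, replaced, and multi, and the small second sample are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one header line followed by two data rows.
csv_text = "120,4,setosa,versicolor,virginica\n6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n"
names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']

# names only: behaves like header=None, so the first line is kept as a data row.
kept = pd.read_csv(io.StringIO(csv_text), names=names)

# names plus header=0: the first line is consumed as the header and replaced by names.
replaced = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

# A list of rows builds multi-level column labels; row 2 lies between the listed rows and is skipped.
multi_text = "a,b\nx,y\nskipped,skipped\nu,v\n1,2\n"
multi = pd.read_csv(io.StringIO(multi_text), header=[0, 1, 3])

print(len(kept), len(replaced), multi.columns.nlevels)
```

Comparing len(kept) with len(replaced) makes the difference visible: the first keeps the original header line as data, the second drops it.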