Pandas Tutorial Series (3)
For anyone looking to get started in data science, Python is a great choice: besides its simple syntax, the Python ecosystem provides many excellent libraries for numerical computing. Pandas cannot go unmentioned among them — it is a very powerful tool for working with data sets and is often used together with numpy, matplotlib, and other libraries. I have just started learning Pandas myself, and along the way I translated the official Pandas tutorial. The notebooks were written in Jupyter; since this blog does not support editing HTML directly, they had to be converted to markdown. If you would like the HTML version, click the link under each lesson. This post is for learning and discussion only — feedback and corrections are welcome!
Summary
- Create the data: this data set describes customer counts for four (US) markets from 2009 to 2012. We use to_excel() to write an Excel file and date_range() to build a list of dates at a given frequency (both are commonly used).
- Get the data: read the data set back from the Excel file with pandas.read_excel().
- Prepare the data (make it clean): make sure the data set is correct, mainly by checking types; here we keep only rows whose Status is 1 and also filter the markets.
- Analyze the data: this lesson makes heavy use of df.groupby(), sum(), transform(), and lambda expressions to group, sort, sum, merge, add, and delete data, and uses percentiles to drop outliers so the plots come out smoother.
- Visualize: use matplotlib to display the data intuitively.
Click here for the HTML version
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np  # note: the tutorial aliases numpy.random (not numpy) as np
import sys
import matplotlib
%matplotlib inline

print('Python version ' + sys.version)
print('Pandas version: ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)
Python version 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
Pandas version: 0.23.0
Matplotlib version 2.2.2
np.seed(111)

def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # weekly dates (every Monday) from 2009 through 2012
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        # random customer counts between 25 and 999
        data = np.randint(low=25, high=1000, size=len(rng))
        # a random status for each date
        status = [1, 2, 3]
        random_status = [status[np.randint(0, len(status))] for i in range(len(rng))]
        # a random state for each date ('fl' is a deliberate lowercase entry)
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']
        random_states = [states[np.randint(0, len(states))] for i in range(len(rng))]
        Output.extend(zip(random_states, random_status, data, rng))
    return Output

dataset = CreateDataSet()
df = pd.DataFrame(data=dataset, columns=['State', 'Status', 'CustomerCount', 'StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 4 columns):
State 209 non-null object
Status 209 non-null int64
CustomerCount 209 non-null int64
StatusDate 209 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 6.6+ KB
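The 209 rows come straight from date_range(): with freq='W-MON' it emits every Monday between the endpoints. A quick standalone check:

```python
import pandas as pd

# every Monday from 2009-01-05 (the first Monday of 2009) through 2012-12-31
rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
print(len(rng))       # 209 weekly dates, matching the 209 rows above
print(rng[0].date())  # 2009-01-05
```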
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State', 'Status', 'CustomerCount', 'StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836 entries, 0 to 835
Data columns (total 4 columns):
State 836 non-null object
Status 836 non-null int64
CustomerCount 836 non-null int64
StatusDate 836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 26.2+ KB
df.head()
  State  Status  CustomerCount StatusDate
0    fl       3            122 2009-01-05
1    fl       3            955 2009-01-12
2    NY       1            522 2009-01-19
3    fl       3            436 2009-01-26
4    TX       2            398 2009-02-02
df.to_excel('Lesson3.xlsx', index=False)
print('Done!')
Done!
Location = 'Lesson3.xlsx'
df = pd.read_excel(Location, sheet_name=0, index_col='StatusDate')
df.dtypes
State object
Status int64
CustomerCount int64
dtype: object
df.index
DatetimeIndex(['2009-01-05', '2009-01-12', '2009-01-19', '2009-01-26',
'2009-02-02', '2009-02-09', '2009-02-16', '2009-02-23',
'2009-03-02', '2009-03-09',
...
'2012-10-29', '2012-11-05', '2012-11-12', '2012-11-19',
'2012-11-26', '2012-12-03', '2012-12-10', '2012-12-17',
'2012-12-24', '2012-12-31'],
dtype='datetime64[ns]', name='StatusDate', length=836, freq=None)
df.head()
           State  Status  CustomerCount
StatusDate
2009-01-05    fl       3            122
2009-01-12    fl       3            955
2009-01-19    NY       1            522
2009-01-26    fl       3            436
2009-02-02    TX       2            398
"""
检查数据的正确性
- state列的字母都要大写
- 只选择status列为1的数据
- 将state列的NJ与NY合并为NY
- 删除一些很大或者很小的特殊值(customerCount列)
"""
df['State'].unique()
array(['fl', 'NY', 'TX', 'GA', 'NJ', 'FL'], dtype=object)
df['State'] = df.State.apply(lambda x: x.upper())
df['State'].unique()
array(['FL', 'NY', 'TX', 'GA', 'NJ'], dtype=object)
mask = df['Status'] == 1
df = df[mask]
df.head()
           State  Status  CustomerCount
StatusDate
2009-01-19    NY       1            522
2009-03-16    NY       1            355
2009-03-23    GA       1            972
2009-04-20    NY       1            820
2009-04-27    NY       1            447
mask = df.State == 'NJ'
df.loc[mask, 'State'] = 'NY'  # .loc avoids the chained-assignment (SettingWithCopy) warning
df.head(10)
           State  Status  CustomerCount
StatusDate
2009-01-19    NY       1            522
2009-03-16    NY       1            355
2009-03-23    GA       1            972
2009-04-20    NY       1            820
2009-04-27    NY       1            447
2009-05-11    NY       1            987
2009-05-18    TX       1            377
2009-05-25    NY       1            378
2009-06-08    TX       1             45
2009-06-22    TX       1            896
df['State'].unique()
array(['NY', 'GA', 'TX', 'FL'], dtype=object)
df['CustomerCount'].plot(figsize=(15, 5))
<matplotlib.axes._subplots.AxesSubplot at 0x7f33f95be160>
sortdf = df[df['State'] == 'NY'].sort_index(axis=0)
sortdf.head(10)
           State  Status  CustomerCount
StatusDate
2009-01-19    NY       1            522
2009-02-23    NY       1            710
2009-03-16    NY       1            355
2009-03-23    NY       1            728
2009-03-30    NY       1            863
2009-04-13    NY       1            520
2009-04-20    NY       1            820
2009-04-20    NY       1            937
2009-04-27    NY       1            447
2009-05-04    NY       1            497
"""
接下来我们将用groupby(),sum()等函数得到顾客数/天/状态
因为我们需要将statusDate与state作为分组依据传入groupby函数,而groupby接受的
参数类型是columns,所以我们需要先用reset_index()把statusDate从index转回column
"""
Daily = df.reset_index().groupby(['State', 'StatusDate']).sum()
Daily.head()
                  Status  CustomerCount
State StatusDate
FL    2009-01-12       1            319
      2009-02-02       1            653
      2009-03-23       1            752
      2009-04-06       1            795
      2009-04-20       1            788
"""
从上面的结果中我们可以看到State和StatusDate列自动变成了
index列,我们可以把index理解为数据库里的表的主键,只是数据帧里的
index列不要求唯一,我们可以通过这个index来进行选择,绘图,计算操作
"""
del Daily['Status']
Daily.head()
                  CustomerCount
State StatusDate
FL    2009-01-12            319
      2009-02-02            653
      2009-03-23            752
      2009-04-06            795
      2009-04-20            788
Daily.index
MultiIndex(levels=[['FL', 'GA', 'NY', 'TX'], [2009-01-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00, 2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00, 2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00, 2009-12-07 00:00:00, 2009-12-14 00:00:00, 2009-12-21 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00, 2010-03-15 00:00:00, 2010-03-29 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-19 00:00:00, 2010-04-26 00:00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-21 00:00:00, 2010-06-28 00:00:00, 2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00, 2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00, 2011-01-17 00:00:00, 2011-01-31 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00, 
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-13 00:00:00, 2011-06-20 00:00:00, 2011-06-27 00:00:00, 2011-07-04 00:00:00, 2011-07-11 00:00:00, 2011-07-25 00:00:00, 2011-08-01 00:00:00, 2011-08-08 00:00:00, 2011-08-15 00:00:00, 2011-08-29 00:00:00, 2011-09-05 00:00:00, 2011-09-12 00:00:00, 2011-09-19 00:00:00, 2011-09-26 00:00:00, 2011-10-03 00:00:00, 2011-10-24 00:00:00, 2011-10-31 00:00:00, 2011-11-07 00:00:00, 2011-11-14 00:00:00, 2011-11-28 00:00:00, 2011-12-05 00:00:00, 2011-12-19 00:00:00, 2011-12-26 00:00:00, 2012-01-02 00:00:00, 2012-01-16 00:00:00, 2012-01-23 00:00:00, 2012-01-30 00:00:00, 2012-02-06 00:00:00, 2012-02-13 00:00:00, 2012-02-20 00:00:00, 2012-02-27 00:00:00, 2012-03-05 00:00:00, 2012-03-12 00:00:00, 2012-03-19 00:00:00, 2012-03-26 00:00:00, 2012-04-02 00:00:00, 2012-04-09 00:00:00, 2012-04-23 00:00:00, 2012-04-30 00:00:00, 2012-05-07 00:00:00, 2012-05-14 00:00:00, 2012-05-28 00:00:00, 2012-06-04 00:00:00, 2012-06-18 00:00:00, 2012-07-09 00:00:00, 2012-07-16 00:00:00, 2012-07-30 00:00:00, 2012-08-06 00:00:00, 2012-08-20 00:00:00, 2012-08-27 00:00:00, 2012-09-03 00:00:00, 2012-09-10 00:00:00, 2012-09-17 00:00:00, 2012-09-24 00:00:00, 2012-10-01 00:00:00, 2012-10-08 00:00:00, 2012-10-15 00:00:00, 2012-10-22 00:00:00, 2012-10-29 00:00:00, 2012-11-05 00:00:00, 2012-11-12 00:00:00, 2012-11-19 00:00:00, 2012-11-26 00:00:00, 2012-12-10 00:00:00, 2012-12-17 00:00:00]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [0, 2, 5, 7, 9, 15, 16, 17, 18, 21, 23, 25, 26, 28, 29, 33, 36, 40, 42, 43, 44, 45, 47, 48, 50, 53, 55, 57, 58, 61, 62, 64, 67, 69, 70, 71, 72, 73, 77, 78, 79, 80, 87, 89, 90, 92, 93, 96, 97, 100, 103, 104, 105, 108, 109, 110, 113, 115, 118, 119, 124, 126, 127, 128, 129, 132, 137, 138, 139, 142, 147, 151, 152, 153, 154, 159, 163, 5, 6, 20, 21, 22, 25, 26, 32, 33, 35, 41, 46, 49, 50, 56, 59, 67, 68, 70, 72, 74, 75, 76, 80, 83, 85, 87, 91, 93, 95, 106, 109, 110, 123, 128, 133, 135, 141, 160, 161, 162, 1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 17, 19, 22, 24, 26, 27, 29, 30, 31, 34, 37, 38, 39, 40, 41, 51, 52, 55, 56, 63, 65, 71, 73, 74, 76, 79, 81, 82, 83, 84, 88, 91, 92, 94, 98, 101, 102, 103, 105, 107, 108, 111, 112, 113, 114, 117, 122, 123, 125, 126, 130, 131, 132, 134, 140, 141, 143, 144, 145, 147, 149, 150, 156, 158, 160, 13, 15, 16, 46, 50, 54, 57, 58, 60, 66, 69, 77, 86, 90, 92, 98, 99, 107, 113, 116, 120, 121, 131, 136, 137, 138, 142, 144, 146, 147, 148, 155, 157]],
names=['State', 'StatusDate'])
Daily.index.levels[0]
Index(['FL', 'GA', 'NY', 'TX'], dtype='object', name='State')
Daily.index.levels[1]
DatetimeIndex(['2009-01-12', '2009-01-19', '2009-02-02', '2009-02-23',
'2009-03-16', '2009-03-23', '2009-03-30', '2009-04-06',
'2009-04-13', '2009-04-20',
...
'2012-10-08', '2012-10-15', '2012-10-22', '2012-10-29',
'2012-11-05', '2012-11-12', '2012-11-19', '2012-11-26',
'2012-12-10', '2012-12-17'],
dtype='datetime64[ns]', name='StatusDate', length=164, freq=None)
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();

Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();
"""
通过上面的图像,我们可以寻找异常值(超过特定范围)然后删除异常值,确保我们最终
得到平滑的图像。
StateYearMonth -这里我们将依照state,year,month(statusDate的属性)来分组
Daily['Outlier'] - 异常值
注意我们这里用的是transfrom而不是apply函数,因为transform 能保证我们数据帧的行
数与列数不变。此外,从上面的图像中可看到,数据不是高斯分布的,所以我们在指定正常
范围的时候不能用stddev与mean等标准,这里我们用的是百分数
"""
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0),
                                Daily.index.get_level_values(1).year,
                                Daily.index.get_level_values(1).month])
# percentile-based fences per (state, year, month) group; note the parentheses:
# this is the tutorial's formula, not the textbook 1.5 * (Q3 - Q1) Tukey fence
Daily['Lower'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.25) - (1.5 * x.quantile(q=.75) - x.quantile(q=.25)))
Daily['Upper'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.75) + (1.5 * x.quantile(q=.75) - x.quantile(q=.25)))
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# keep only the non-outlier rows
Daily = Daily[Daily['Outlier'] == False]
Daily.head()
                  CustomerCount    Lower     Upper  Outlier
State StatusDate
FL    2009-01-12            319  159.500   478.500    False
      2009-02-02            653  326.500   979.500    False
      2009-03-23            752  376.000  1128.000    False
      2009-04-06            795  389.625  1193.375    False
      2009-04-20            788  389.625  1193.375    False
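As a sanity check on the bounds: the FL / January 2009 group contains the single value 319, so both quantiles equal 319 and the formula as written (1.5 applied only to the .75 quantile) reproduces the Lower/Upper values 159.5 and 478.5 shown above:

```python
# a one-element group: both the .25 and .75 quantiles equal the value itself
q25 = q75 = 319.0
lower = q25 - (1.5 * q75 - q25)
upper = q75 + (1.5 * q75 - q25)
print(lower, upper)  # 159.5 478.5 — matching the first row of the output
```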
# total customer count per day, summed across all states
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount']

# group by year and month and record each month's maximum
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()
            CustomerCount   Max
StatusDate
2009-01-12            319   522
2009-01-19            522   522
2009-02-02            653   710
2009-02-23            710   710
2009-03-16            355  2452
"""
我们给客流量设上目标
- 12/31/2011 - 1,000
- 12/31/2012 - 2,000
- 12/31/2013 - 3,000
"""
data = [1000, 2000, 3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')  # 'A' = year-end frequency
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG
            BHAG
2011-12-31  1000
2012-12-31  2000
2013-12-31  3000
# sort=True keeps the sorted column order and silences the pandas 0.23
# FutureWarning about sorting along the non-concatenation axis
combined = pd.concat([ALL, BHAG], axis=0, sort=True)
combined = combined.sort_index(axis=0)
combined.tail()
              BHAG  CustomerCount     Max
2012-11-26     NaN          367.0  1042.0
2012-12-10     NaN          769.0   817.0
2012-12-17     NaN          817.0   817.0
2012-12-31  2000.0            NaN     NaN
2013-12-31  3000.0            NaN     NaN
fig, axes = plt.subplots(figsize=(12, 7))
combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');
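fillna(method='pad') forward-fills each gap with the last seen value, which is what stretches the BHAG step line across the plot; newer pandas spells the same operation .ffill(). A minimal sketch:

```python
import pandas as pd

# the two missing values take on the last non-missing value before them
s = pd.Series([1000.0, None, None, 2000.0])
print(s.ffill().tolist())  # [1000.0, 1000.0, 1000.0, 2000.0]
```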
Year = combined.groupby(lambda x: x.year).max()
Year
        BHAG  CustomerCount     Max
2009     NaN         2545.0  2545.0
2010     NaN         1819.0  1819.0
2011  1000.0         1957.0  1957.0
2012  2000.0         1689.0  1689.0
2013  3000.0            NaN     NaN
Year['YR_PCT_CHANGE'] = Year['Max'].pct_change(periods=1)
Year
        BHAG  CustomerCount     Max  YR_PCT_CHANGE
2009     NaN         2545.0  2545.0            NaN
2010     NaN         1819.0  1819.0      -0.285265
2011  1000.0         1957.0  1957.0       0.075866
2012  2000.0         1689.0  1689.0      -0.136944
2013  3000.0            NaN     NaN       0.000000
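pct_change(periods=1) computes (current - previous) / previous down the column; checking the 2010 figure against the yearly Max values from the table (a standalone sketch):

```python
import pandas as pd

yearly_max = pd.Series([2545.0, 1819.0, 1957.0, 1689.0], index=[2009, 2010, 2011, 2012])
change = yearly_max.pct_change(periods=1)
# (1819 - 2545) / 2545 = -0.285265..., as shown for 2010 above
print(round(change.loc[2010], 6))  # -0.285265
```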
# estimate next year's customer count: this year's max times (1 + % change)
# (.loc replaces the deprecated .ix indexer)
(1 + Year.loc[2012, 'YR_PCT_CHANGE']) * Year.loc[2012, 'Max']
1457.7010730710272
ALL['Max'].plot(figsize=(10, 5)); plt.title('ALL Markets')
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0)
Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0, 0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0, 1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1, 0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1, 1])
axes[0, 0].set_title('Florida')
axes[0, 1].set_title('Georgia')
axes[1, 0].set_title('Texas')
axes[1, 1].set_title('North East');