作为习惯,一般导入下面包命名
In [1]: import pandas as pd In [2]: import numpy as np数据框架对象df可以通过df.head()显示前面5条记录,类似SAS代码如下:
proc print data=df(obs=5); run;数据框架和序列数据,类似SAS数据集和只含一个变量的数据集,数据框架也支持标签,通过列模式组织数据框架数据
数据框架和序列可以通过索引来访问数据行,通过对行打lable实现,在SAS中类似对于_N_,pandas中如果没有打label,可以通过索引下标0-N访问
数据输入/输出
data df; input x y; datalines; 1 2 3 4 5 6 ; run;
In [3]: df = pd.DataFrame({ ...: 'x': [1, 3, 5], ...: 'y': [2, 4, 6]}) ...: In [4]: df Out[4]: x y 0 1 2 1 3 4 2 5 6
proc import datafile='tips.csv' dbms=csv out=tips replace; getnames=yes; run;
In [5]: url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv' In [6]: tips = pd.read_csv(url)
tips = pd.read_csv('tips.csv', sep='\t', header=None) # alternatively, read_table is an alias to read_csv with tab delimiter tips = pd.read_table('tips.csv', header=None)
导出数据
proc export data=tips outfile='tips2.csv' dbms=csv; run;
tips.to_csv('tips2.csv')数据操作
data tips; set tips; total_bill = total_bill - 2; new_bill = total_bill / 2; run;
In [8]: tips['total_bill'] = tips['total_bill'] - 2 In [9]: tips['new_bill'] = tips['total_bill'] / 2.0过滤
data tips; set tips; if total_bill > 10; run; data tips; set tips; where total_bill > 10;
通过返回的布尔索引用于过滤数据
In [11]: tips[tips['total_bill'] > 10].head()
判断逻辑
data tips; set tips; format bucket $4.; if total_bill < 10 then bucket = 'low'; else bucket = 'high'; run;
In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
日期函数
data tips; set tips; format date1 date2 date1_plusmonth mmddyy10.; date1 = mdy(1, 15, 2013); date2 = mdy(2, 15, 2015); date1_year = year(date1); date2_month = month(date2); * shift date to beginning of next interval; date1_next = intnx('MONTH', date1, 1); * count intervals between dates; months_between = intck('MONTH', date1, date2); run;
In [14]: tips['date1'] = pd.Timestamp('2013-01-15') In [15]: tips['date2'] = pd.Timestamp('2015-02-15') In [16]: tips['date1_year'] = tips['date1'].dt.year In [17]: tips['date2_month'] = tips['date2'].dt.month In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin() In [19]: tips['months_between'] = (tips['date2'].dt.to_period('M') - ....: tips['date1'].dt.to_period('M')) ....: In [20]: tips[['date1','date2','date1_year','date2_month', ....: 'date1_next','months_between']].head()选择变量
data tips; set tips; keep sex total_bill tip; run; data tips; set tips; drop sex; run; data tips; set tips; rename total_bill=total_bill_2; run;
In [21]: tips[['sex', 'total_bill', 'tip']].head()
In [22]: tips.drop('sex', axis=1).head()
In [23]: tips.rename(columns={'total_bill':'total_bill_2'}).head()
排序变量
proc sort data=tips; by sex total_bill; run;
tips = tips.sort_values(['sex', 'total_bill'])合并数据集
proc sort data=df1; by key; run; proc sort data=df2; by key; run; data left_join inner_join right_join outer_join; merge df1(in=a) df2(in=b); if a and b then output inner_join; if a then output left_join; if b then output right_join; if a or b then output outer_join; run;pandas合并不需要先排序
In [26]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], ....: 'value': np.random.randn(4)}) ....:
In [28]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], ....: 'value': np.random.randn(4)})
In [30]: inner_join = df1.merge(df2, on=['key'], how='inner')
In [32]: left_join = df1.merge(df2, on=['key'], how='left')
In [34]: right_join = df1.merge(df2, on=['key'], how='right')
In [36]: outer_join = df1.merge(df2, on=['key'], how='outer')缺失数据处理
pandas和SAS一样,缺失数据+任何数据返回都是缺省数据,聚合函数会自动忽略缺失数据
outer_join['value_x'] + outer_join['value_y']
outer_join['value_x'].sum()在pandas中,不能对缺失数据进行直接过滤比较,通过函数判断返回布尔值进行过滤 pd.isnull pd.notnull
data outer_join_nulls; set outer_join; if value_x = .; run;
In [41]: outer_join[pd.isnull(outer_join['value_x'])]
In [42]: outer_join[pd.notnull(outer_join['value_x'])]pandas提供很多方法用于处理空值,如去除含空值的行,填充空值为特定值(如统计值等)
In [43]: outer_join.dropna()
In [44]: outer_join.fillna(method='ffill')
In [45]: outer_join['value_x'].fillna(outer_join['value_x'].mean())分组聚合处理
proc summary data=tips nway; class sex smoker; var total_bill tip; output out=tips_summed sum=; run;
In [46]: tips_summed = tips.groupby(['sex', 'smoker'])['total_bill', 'tip'].sum()
In [48]: gb = tips.groupby('smoker')['total_bill']
In [49]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
proc sort data=tips; by sex smoker; run; data tips_first; set tips; by sex smoker; if FIRST.sex or FIRST.smoker then output; run;
In [51]: tips.groupby(['sex','smoker']).first()
其他比较
pandas处理数据都在内存,而SAS数据位于磁盘上,因此后者处理数据容量更大,但前者更快,如果想用磁盘可以考虑使用正在开发中的库【
dask.dataframe】来实现部分pandas功能
pandas可以读去SAS的XPORT格式数据,未来计划支持原生二进制文件
libname xportout xport 'transport-file.xpt'; data xportout.tips; set tips(rename=(total_bill=tbill)); * xport variable names limited to 6 characters; run;
df = pd.read_sas('transport-file.xpt')
# version 0.17, 10M rows In [8]: %time df = pd.read_sas('big.xpt') Wall time: 14.6 s In [9]: %time df = pd.read_csv('big.csv') Wall time: 4.86 s