A comparison of pandas and SAS coding idioms

By convention, the packages are imported under the following names:
In [1]: import pandas as pd

In [2]: import numpy as np
A DataFrame df can show its first five rows with df.head(); the closest SAS equivalent is:
proc print data=df(obs=5);
run;
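For comparison, the pandas call mentioned above, shown on a generic DataFrame df (head accepts an optional row count):

df.head()     # first 5 rows
df.head(10)   # first 10 rows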
A DataFrame is analogous to a SAS data set, and a Series to a data set with a single variable; a DataFrame organizes its data column by column, and the columns carry labels.
DataFrames and Series also access rows through an index of row labels, loosely comparable to the implicit _N_ in SAS; if no labels are supplied, pandas assigns a default integer index running from 0 to N-1.
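A minimal sketch of the index behaviour described above (the values are purely illustrative):

s = pd.Series([10, 20, 30])                           # default integer index 0..N-1
s[0]                                                  # access by the default label 0

s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])   # explicit row labels
s2['b']                                               # access by label
s2.iloc[1]                                            # access by integer position, ignoring labels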

Data input/output

data df;
    input x y;
    datalines;
    1 2
    3 4
    5 6
    ;
run;

In [3]: df = pd.DataFrame({
   ...:           'x': [1, 3, 5],
   ...:           'y': [2, 4, 6]})
   ...: 

In [4]: df
Out[4]: 
   x  y
0  1  2
1  3  4
2  5  6

proc import datafile='tips.csv' dbms=csv out=tips replace;
    getnames=yes;
run;

In [5]: url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'

In [6]: tips = pd.read_csv(url)

read_csv accepts many options; if the file were instead tab-delimited and had no header row, the calls would look like:
tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table('tips.csv', header=None)

Exporting data

proc export data=tips outfile='tips2.csv' dbms=csv;
run;
tips.to_csv('tips2.csv')
Data manipulation

data tips;
    set tips;
    total_bill = total_bill - 2;
    new_bill = total_bill / 2;
run;
In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2.0
Filtering

data tips;
    set tips;
    if total_bill > 10;
run;

data tips;
    set tips;
    where total_bill > 10;
A boolean mask built from the comparison selects the matching rows:
In [11]: tips[tips['total_bill'] > 10].head()
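Going a little beyond the SAS example, conditions can be combined with & and | (each comparison needs its own parentheses); a short sketch using columns from the same tips data:

tips[(tips['total_bill'] > 10) & (tips['sex'] == 'Female')].head()   # both conditions
tips[(tips['total_bill'] > 10) | (tips['tip'] > 3)].head()           # either condition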

If/then logic
data tips;
    set tips;
    format bucket $4.;

    if total_bill < 10 then bucket = 'low';
    else bucket = 'high';
run;

In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
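For more than two categories, one option beyond the original example is pd.cut, which bins a numeric column against explicit edges (the bin edges and the bucket3 column name below are illustrative):

tips['bucket3'] = pd.cut(tips['total_bill'],
                         bins=[0, 10, 20, np.inf],       # right-inclusive bin edges
                         labels=['low', 'mid', 'high'])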

Date functionality
data tips;
    set tips;
    format date1 date2 date1_next mmddyy10.;
    date1 = mdy(1, 15, 2013);
    date2 = mdy(2, 15, 2015);
    date1_year = year(date1);
    date2_month = month(date2);
    * shift date to beginning of next interval;
    date1_next = intnx('MONTH', date1, 1);
    * count intervals between dates;
    months_between = intck('MONTH', date1, date2);
run;
In [14]: tips['date1'] = pd.Timestamp('2013-01-15')

In [15]: tips['date2'] = pd.Timestamp('2015-02-15')

In [16]: tips['date1_year'] = tips['date1'].dt.year

In [17]: tips['date2_month'] = tips['date2'].dt.month

In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [19]: tips['months_between'] = (tips['date2'].dt.to_period('M') -
   ....:                           tips['date1'].dt.to_period('M'))
   ....: 

In [20]: tips[['date1','date2','date1_year','date2_month',
   ....:       'date1_next','months_between']].head()
Selecting columns
data tips;
    set tips;
    keep sex total_bill tip;
run;

data tips;
    set tips;
    drop sex;
run;

data tips;
    set tips;
    rename total_bill=total_bill_2;
run;
In [21]: tips[['sex', 'total_bill', 'tip']].head()
In [22]: tips.drop('sex', axis=1).head()
In [23]: tips.rename(columns={'total_bill':'total_bill_2'}).head()
Sorting
proc sort data=tips;
    by sex total_bill;
run;
tips = tips.sort_values(['sex', 'total_bill'])
Merging data sets
proc sort data=df1;
    by key;
run;

proc sort data=df2;
    by key;
run;

data left_join inner_join right_join outer_join;
    merge df1(in=a) df2(in=b);

    if a and b then output inner_join;
    if a then output left_join;
    if b then output right_join;
    if a or b then output outer_join;
run;
pandas merges do not require the inputs to be sorted first:
In [26]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....: 
In [28]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                      'value': np.random.randn(4)})
In [30]: inner_join = df1.merge(df2, on=['key'], how='inner')
In [32]: left_join = df1.merge(df2, on=['key'], how='left')
In [34]: right_join = df1.merge(df2, on=['key'], how='right')
In [36]: outer_join = df1.merge(df2, on=['key'], how='outer')
Missing data
Like SAS, pandas propagates missing values: a missing value plus anything is still missing, while aggregation functions skip missing values automatically.
outer_join['value_x'] + outer_join['value_y']
outer_join['value_x'].sum() 
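If the aggregate should also propagate missing values instead of skipping them, most reducers accept a skipna flag (an aside, not part of the original comparison):

outer_join['value_x'].sum(skipna=False)   # returns NaN if any value is missing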
In pandas, missing values cannot be filtered out with a direct comparison; build a boolean mask with pd.isnull / pd.notnull instead:
data outer_join_nulls;
    set outer_join;
    if value_x = .;
run;
In [41]: outer_join[pd.isnull(outer_join['value_x'])]
In [42]: outer_join[pd.notnull(outer_join['value_x'])]
pandas offers several ways to handle missing data, such as dropping rows that contain missing values or filling them with a specific value (for example, a summary statistic):
In [43]: outer_join.dropna()
In [44]: outer_join.fillna(method='ffill')
In [45]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
GroupBy and aggregation
proc summary data=tips nway;
    class sex smoker;
    var total_bill tip;
    output out=tips_summed sum=;
run;
In [46]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()
In [48]: gb = tips.groupby('smoker')['total_bill']
In [49]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
proc sort data=tips;
   by sex smoker;
run;

data tips_first;
    set tips;
    by sex smoker;
    if FIRST.sex or FIRST.smoker then output;
run;
In [51]: tips.groupby(['sex','smoker']).first()

Other comparisons

pandas works entirely in memory, while SAS processes data sets on disk, so SAS can handle larger data sets while pandas is generally faster. For out-of-core work, the dask.dataframe library (still under development at the time of writing) implements part of the pandas API on disk-backed data.
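A minimal dask.dataframe sketch, assuming dask is installed and the data is split across several CSV files (the file pattern below is hypothetical); the API mirrors pandas but stays lazy until .compute() is called:

import dask.dataframe as dd

ddf = dd.read_csv('tips_part_*.csv')                         # reads all matching files lazily
result = ddf.groupby('sex')['total_bill'].mean().compute()   # .compute() runs the work and returns a pandas object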
pandas can read SAS data saved in the XPORT (.xpt) transport format; support for the native binary (sas7bdat) format is planned.
libname xportout xport 'transport-file.xpt';
data xportout.tips;
    set tips(rename=(total_bill=tbill));
    * xport variable names limited to 6 characters;
run;
df = pd.read_sas('transport-file.xpt')
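For transport files that do not fit in memory, read_sas can also return chunks via its chunksize argument (the chunk size and the per-chunk handling below are placeholders):

for chunk in pd.read_sas('transport-file.xpt', chunksize=100000):
    handle(chunk)   # hypothetical per-chunk processing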
Reading from a CSV file is still faster than reading the XPT format, however:
# version 0.17, 10M rows

In [8]: %time df = pd.read_sas('big.xpt')
Wall time: 14.6 s

In [9]: %time df = pd.read_csv('big.csv')
Wall time: 4.86 s