http://blog.csdn.net/pipisorry/article/details/39506169
A large share of the programming work in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and reshaping. Sometimes the data stored in files or databases is simply not in the form the data-processing application needs.
pandas, together with the Python standard library, provides a set of high-level, flexible, and efficient core functions and algorithms that make it easy to wrangle data into the right shape.
Data normalization
Min-max scaling (applied per column)
df_norm = (df - df.min()) / (df.max() - df.min())
or equivalently: df_norm2 = df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
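A minimal runnable sketch of the column-wise min-max formula above (the frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame to illustrate column-wise min-max scaling.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

# Each column is mapped linearly onto [0, 1]:
# 'a' becomes [0, 0.5, 1], 'b' becomes [0, 1/3, 1].
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```

Because min and max are computed per column, each column ends up spanning exactly [0, 1], regardless of its original scale.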
Z-score standardization (applied per column)
1. pandas DataFrame implementation
df_norm = (df - df.mean()) / df.std()
# df_norm.fillna(df.min(), inplace=True)
# df_norm.fillna(args.nan_sub, inplace=True)
Note: df_norm preserves NaN values.
2. NumPy implementation
df_norm2 = df.apply(lambda x: (x - np.mean(x)) / np.std(x))
Note that df.std() and np.std() above give different values: np.std() computes the population standard deviation (ddof=0), while df.std() computes the sample standard deviation (ddof=1), which is the unbiased estimate.
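The ddof difference can be verified on a small series (values are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas defaults to the sample standard deviation (ddof=1) ...
sample_std = s.std()                  # sqrt(5/3) ~ 1.291
# ... while NumPy defaults to the population standard deviation (ddof=0).
population_std = np.std(s.values)     # sqrt(5/4) ~ 1.118

# They agree once ddof is made explicit.
assert abs(s.std(ddof=0) - population_std) < 1e-12
assert abs(np.std(s.values, ddof=1) - sample_std) < 1e-12
```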
3. sklearn implementation
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(dense_features_df)
dense_index = dense_features_df.index
scaled = scaler.transform(dense_features_df)  # returns a NumPy array
dense_features_df = pd.DataFrame(scaled, index=dense_index, columns=dense_cols)
Comparing methods 1 and 3: method 3 is recommended (in one comparison the measured runtime of 3 was over 100 times that of 1). Method 1 leaves NaN values untouched, but after normalizing you usually need to fill them, e.g. with the column minimum; method 3 raises an error if NaN values are present, so NaN data must be handled beforehand.
Example:
df = pd.DataFrame([[1,2,3], [4,5,np.nan], [6,np.nan, 7]])
df
Out[27]:
0 1 2
0 1 2.0 3.0
1 4 5.0 NaN
2 6 NaN 7.0
df.std()
Out[28]:
0 2.516611
1 2.121320
2 2.828427
dtype: float64
(df - df.mean())/df.std()
Out[29]:
0 1 2
0 -1.059626 -0.707107 -0.707107
1 0.132453 0.707107 NaN
2 0.927173 NaN 0.707107
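A minimal sketch of the fill-then-scale workflow for method 3, reusing the frame above (the fill-with-column-min strategy is the one suggested earlier; older scikit-learn versions reject NaN input outright):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [6, np.nan, 7]], dtype=float)

# Fill NaN first (here with each column's min) before handing to the scaler.
filled = df.fillna(df.min())

scaler = StandardScaler().fit(filled)
scaled = pd.DataFrame(scaler.transform(filled),
                      index=filled.index, columns=filled.columns)

# StandardScaler uses the population std (ddof=0), so each column now has
# mean 0 and population std 1.
print(scaled.mean().round(6))
```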
Subtracting the row mean from each row of a DataFrame
Use DataFrame's sub method and specify that the subtraction should happen row-wise (axis=0), as opposed to the default column-wise behaviour:
df.sub(df.mean(axis=1), axis=0)
which is equivalent to
norm_ui_array = df.values
row_means = np.nanmean(norm_ui_array, axis=1)  # per-row mean, ignoring NaN
for i in range(len(norm_ui_array)):
    norm_ui_array[i] -= row_means[i]  # NaN entries remain NaN after subtraction
[Pandas: Subtract row mean from each element in row]
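A tiny runnable example of the sub(..., axis=0) idiom (the frame is hypothetical):

```python
import pandas as pd

# Hypothetical ratings frame; subtract each row's mean from that row.
df = pd.DataFrame({'a': [1.0, 4.0], 'b': [3.0, 8.0]})

centered = df.sub(df.mean(axis=1), axis=0)
# row 0: mean 2 -> [-1, 1]; row 1: mean 6 -> [-2, 2]
print(centered)
```

After centering, every row's mean is exactly zero.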
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Note: grouping produces a GroupBy object; taking an attribute of it or aggregating it yields a DataFrame, and selecting a single column then yields a Series.
The index after grouping
After grouping, the index is still the original one; it is only after aggregating the groups that the index changes.
Changing the index produced by grouping
After a group-then-aggregate, the group names become the new index: the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option.
To change this default behaviour, use df.groupby('A', as_index=False).sum() or df.groupby(['A', 'B']).sum().reset_index().
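The two ways of keeping group keys as ordinary columns can be sketched on a toy frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': ['p', 'q', 'p'],
                   'v': [1, 2, 3]})

# Default: the group key becomes the index.
g1 = df.groupby('A')['v'].sum()                 # index: 'x', 'y'

# as_index=False keeps the key as an ordinary column.
g2 = df.groupby('A', as_index=False)['v'].sum()

# Multiple keys produce a MultiIndex; reset_index flattens it back to columns.
g3 = df.groupby(['A', 'B'])['v'].sum().reset_index()
```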
Grouping example
In [86]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
....: 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
....: 'C' : np.random.randn(8),
....: 'D' : np.random.randn(8)})
In [87]: df
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
Grouping and then applying the sum function to the resulting groups:
In [88]: df.groupby('A').sum()
Out[88]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
Grouping by multiple columns forms a hierarchical index, to which we then apply the function.
In [89]: df.groupby(['A','B']).sum()
Out[89]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
Nested indexes
Each sid maps to some timestamps, and each timestamp maps to some uids; the goal is to count the uids.
shop_ids = user_pay_df['sid'].unique()
s_group = user_pay_df.groupby(by=['sid', 'time']).count() print(s_group) for sid in shop_ids: print(s_group.ix[sid])
When indexing, you can first select on the outermost level (with .loc; older pandas used .ix), which lets you apply operations to each sid separately.
uid
sid time
90 2015-07-06 3
2015-07-07 10
2015-07-08 7
2015-07-09 4
uid
time
2015-07-02 4
2015-07-03 1
2015-07-04 3
Nested indexing can also be achieved in the following way
group_process = lambda g: g.groupby(by=['time']).count()
s_group = user_pay_df.groupby(by=['sid']).apply(group_process)
print(s_group.loc[90])
An extra sid column appears, however: count() counts every non-grouped column, and inside apply the sid column is still part of each sub-frame.
uid sid
time
2015-07-06 3 3
2015-07-07 10 10
2015-07-08 7 7
2015-07-09 4 4
2015-07-10 23 23
[Need for speed: Slow nested groupbys and applys in Pandas]
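The shop/time counting above can be reproduced on a small hypothetical frame (the column names follow user_pay_df; the values are made up):

```python
import pandas as pd

user_pay_df = pd.DataFrame({
    'uid':  [1, 2, 3, 4, 5],
    'sid':  [90, 90, 90, 91, 91],
    'time': ['2015-07-06', '2015-07-06', '2015-07-07',
             '2015-07-06', '2015-07-06'],
})

# Count uids per (sid, time); selecting on the outer level with .loc
# yields the per-shop sub-series.
s_group = user_pay_df.groupby(['sid', 'time'])['uid'].count()
print(s_group.loc[90])

# size() counts rows directly and avoids the duplicated column that
# count() produces in the groupby-apply variant.
s_sizes = user_pay_df.groupby(['sid', 'time']).size()
```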
GroupBy object attributes
Note: these attributes are essentially the aggregation methods available after grouping.
The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
gb.<TAB>
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height
gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist
gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices
gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
[GroupBy]
In [19]: df.groupby('A').groups
Out[19]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
In [21]: gb = df.groupby(['A', 'B'])
In [22]: gb.groups
Out[22]:
{('bar', 'one'): [1],
('bar', 'three'): [3], ...
('foo', 'two'): [2, 4]}
In [23]: len(gb)
Out[23]: 6
Grouping without a reduce/aggregation step
If you only want to split into groups, without aggregating or reducing, you can rebuild the index with set_index (iterating over the groups, described below, also avoids aggregation).
ut_index = lines.set_index(['user', 'location_id'])
user check-in_time location_id
0 0 201
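A small runnable sketch of the set_index approach (the frame is hypothetical, shaped like the lines table above):

```python
import pandas as pd

lines = pd.DataFrame({
    'user': [0, 0, 1],
    'check-in_time': ['t1', 't2', 't3'],
    'location_id': [201, 202, 303],
})

# set_index groups the rows under a MultiIndex without aggregating anything.
ut_index = lines.set_index(['user', 'location_id'])

# All check-ins for user 0, still one row per record.
print(ut_index.loc[0])
```

Unlike a groupby-aggregate, every original row survives; the MultiIndex merely makes per-user selection cheap.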