数据规整化:合并、重塑、转换
-纵向连接表
-横向连接表
-对数据分组和汇总
-数据重塑
-变量转换
首先载入需要使用的包
in
# -*- coding:utf-8 -*-
from pandas import Series,DataFrame
import pandas as pd
import sys
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
reload(sys)
sys.setdefaultencoding(default_encoding)
out
1.纵向连接表
注意通过pandas.concat()函数实现
in
trans_apr9 = pd.read_csv('trans_apr9.csv')
trans_apr9
trans_apr10 = pd.read_csv('trans_apr10.csv')
trans_apr10
out
默认的是在axis=0上工作,可以传入axis=1:
in
combine = pd.concat([trans_apr9,trans_apr10],axis=0)
combine
out
移除重复数据
.dorp_duplicates方法返回一个移除了重复行的DataFrame,
.drop_duplicates方法默认保留第一个出现的值组合,传入take_last=True则保留最后一个:
in
combine.drop_duplicates(['trans_id',])
out
2.横向连接表
1.1数据库风格的DataFrame合并
数据集的合并或连接运算是通过一个或多个键将行连接起来的。这种运算是关系型数据库的核心。pandas.merge()函数是对数据应用这些算法的主要切入点。
in
Order_fact = pd.read_csv('Order_fact.csv')
Order_fact.head(2)
customer = pd.read_excel('customer.xls')
customer.head()
out
这是一种多对一的合并。
in
merge = pd.merge(Order_fact,customer)
merge.head(3)
out
注意上面并没有指明要用哪个列进行连接。如果没有指定,pandas.merge()就会将重叠的列名当作键,不过,最好显示指定一下:
in
merge = pd.merge(Order_fact,customer,on='Customer_ID')
merge.head(3)
out
如果两个对象的列名不一样,也可以分别进行指定:
in
merge = pd.merge(Order_fact,customer,left_on='Customer_ID',right_on='Customer_ID')
merge.head(3)
out
默认情况下,pandas.merge()做的是inner连接,其他方式还有left,right以及outer。
in
table1 = pd.read_csv('table1.csv')
table2 = pd.read_csv('table2.csv')
table1
table2
left = pd.merge(table1,table2,how='left')
left
out
3.对数据分组和汇总
in
merge = pd.merge(Order_fact,customer)
merge.head(3)
grouped = merge.groupby(merge['Gender'])
grouped.mean()
out
4.数据重塑(reshape)
首先运行如下语句:
in
employee_phones = pd.read_csv('employee_phones.csv')
employee_phones.head(5)
out
很多时候,关系型数据库中的数据经常都是如上那样存储的。我们更常用的是DataFrame,不同的item分别形成一列,data列中的时间值则用作索引。需要如下操作:
in
pivoted = employee_phones.pivot('Employee_ID','Phone_Type','Phone_Number')
pivoted.head()
out
5.数据转换
5.1替换值
in
data = Series([1.,-999.,2.,-999.,-1000.,3.])
data
out
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
使用.replace方法可以对数据中的值进行替换:
in
import numpy as np
data.replace(-999,np.nan)
out
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
替换多个值的情形:
in
data.replace([-999,-1000],[np.nan,0])
data.replace({-999: np.nan, -1000: 0})
out
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
5.2离散化
in
age = [20,22,25,27,21,23,37,31,61,45,41,32]
out
上面是一个年龄的序列,我们可能需要将之离散化,分为”18到25”、”25到35”、”35到60”以及”60以上”几个区段。需要使用pandas.cut()函数:
in
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
out
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats对象有两个重要属性:
in
cats.labels
out
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
可以在离散化时设点每个区段的名字:
in
group_names = ['Young','YoungAdult','MiddleAged','Senior']
pd.cut(ages,bins,labels=group_names)
out
[Young, Young, Young, YoungAdult, Young, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Young < YoungAdult < MiddleAged < Senior]
与cut()函数相似的qcut()函数,按照分位数进行划分:
in
data = np.random.randn(1000) #正态分布
cats = pd.qcut(data,4) #按四分位数划分
cats
pd.value_counts(cats)
pd.qcut(data,[0, 0.1, 0.5, 0.9, 1.])
out
[(-1.353, -0.0173], (-3.418, -1.353], (-1.353, -0.0173], (-3.418, -1.353], (-0.0173, 1.245], ..., (-0.0173, 1.245], (-3.418, -1.353], (-1.353, -0.0173], (-1.353, -0.0173], (-1.353, -0.0173]]
Length: 1000
Categories (4, interval[float64]): [(-3.418, -1.353] < (-1.353, -0.0173] < (-0.0173, 1.245] < (1.245, 3.848]]
5.3计算指标/哑变量
统计建模或机器学习经常需要将分类变量转换为”哑变量矩阵”或”指标矩阵”,可以使用.get_dummies()方法
in
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
pd.get_dummies(df['key'])
df
pd.merge
df.join(pd.get_dummies(df['key']))
out