行列操纵乃是ETL的基本功,传统的提取数据并进行转化的工作是由SQL完成的,但是R与Python在解决这方面问题也是游刃有余,让我们看看两者在处理数据基本操作的时候有什么区别。
Python
#载入模块
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
#载入数据
edu = pd.read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv',
na_values=':', usecols=['TIME', 'GEO', 'Value']) #na_values是把“:”符号认为是缺失值的意思
行列选择
#选择列,只显示前5行
edu['Value'].head()
0 NaN
1 NaN
2 5.00
3 5.03
4 4.95
Name: Value, dtype: float64
#选择行
edu[10:14]
TIME GEO Value 10 2010 European Union (28 countries) 5.41 11 2011 European Union (28 countries) 5.25 12 2000 European Union (27 countries) 4.91 13 2001 European Union (27 countries) 4.99
过滤数据
#条件过滤,取Value大于6.5的行,只显示后5行
edu[edu['Value'] > 6.5].tail()
TIME GEO Value 286 2010 Malta 6.74 287 2011 Malta 7.96 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76
#筛选具有缺失值的数据,只显示前5行
edu[edu['Value'].isnull()].head()
TIME GEO Value 0 2000 European Union (28 countries) NaN 1 2001 European Union (28 countries) NaN 36 2000 Euro area (18 countries) NaN 37 2001 Euro area (18 countries) NaN 48 2000 Euro area (17 countries) NaN
汇总数据
#求所有列的最大值
edu.max(axis=0)
TIME 2011
GEO Spain
Value 8.81
dtype: object
#求Value列的最大值,自动忽略缺失值
edu['Value'].max()
8.81
#把Value列统一除以100
s = edu['Value'] / 100
s.head()
0 NaN
1 NaN
2 0.0500
3 0.0503
4 0.0495
Name: Value, dtype: float64
#构造新一列,名为ValueNorm,把Value列统一除以自身最大值
edu['ValueNorm'] = edu['Value'] / edu['Value'].max()
edu.tail()
TIME GEO Value ValueNorm 379 2007 Finland 5.90 0.669694 380 2008 Finland 6.10 0.692395 381 2009 Finland 6.81 0.772985 382 2010 Finland 6.85 0.777526 383 2011 Finland 6.76 0.767310
#取Vaule列的开方
s = edu['Value'].apply(np.sqrt)
s.head()
0 NaN
1 NaN
2 2.236068
3 2.242766
4 2.224860
Name: Value, dtype: float64
#取Value列的平方
s = edu['Value'].apply(lambda d: d**2)
s.head()
0 NaN
1 NaN
2 25.0000
3 25.3009
4 24.5025
Name: Value, dtype: float64
#删除ValueNorm列,inplace=True参数的设置可以使删除操作直接对原始数据进行,也就是说现在的edu已经改变了
#没有对inplace进行设置的话,返回的是一个拷贝,也就是edu本身仍然没有发生变化
edu.drop('ValueNorm', axis=1, inplace=True)
edu.head()
TIME GEO Value 0 2000 European Union (28 countries) NaN 1 2001 European Union (28 countries) NaN 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95
#在最后添加一行
edu = edu.append({'TIME': 2000, 'Value': 5.00, 'GEO': 'a'}, ignore_index=True)
edu.tail()
TIME GEO Value 380 2008 Finland 6.10 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76 384 2000 a 5.00
#删除最后一行
edu.drop(max(edu.index), axis=0, inplace=True)
edu.tail()
TIME GEO Value 379 2007 Finland 5.90 380 2008 Finland 6.10 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76
#删除Value为缺失值的行
eduDrop = edu.dropna(how='any', subset=['Value'], axis=0)
eduDrop.head()
TIME GEO Value 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95 5 2005 European Union (28 countries) 4.92 6 2006 European Union (28 countries) 4.91
#给缺失的Value值进行填充,填充值为0
eduFilled = edu.fillna(value={'Value': 0})
eduFilled.head()
TIME GEO Value 0 2000 European Union (28 countries) 0.00 1 2001 European Union (28 countries) 0.00 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95
R
在R中,上述操作要如何进行呢?我们看一下。
#加载包与数据读取
library(tidyverse)
edu = read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv', na=":") %>%
select(TIME,GEO,Value)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.0.0 √ purrr 0.2.5
√ tibble 1.4.2 √ dplyr 0.7.6
√ tidyr 0.8.1 √ stringr 1.3.1
√ readr 1.1.1 √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Parsed with column specification:
cols(
TIME = col_integer(),
GEO = col_character(),
INDIC_ED = col_character(),
Value = col_double(),
`Flag and Footnotes` = col_character()
)
行列选择
edu %>% select(Value) %>% head
Value NA NA5.005.034.954.92
edu %>% slice(10:13)
TIMEGEOValue2009 European Union (28 countries)5.38 2010 European Union (28 countries)5.41 2011 European Union (28 countries)5.25 2000 European Union (27 countries)4.91
edu %>% filter(Value > 6.5) %>% tail
TIMEGEOValue2005 Malta 6.58 2010 Malta 6.74 2011 Malta 7.96 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76
edu %>% filter(is.na(Value)) %>% head
TIMEGEOValue2000 European Union (28 countries)NA 2001 European Union (28 countries)NA 2000 Euro area (18 countries) NA 2001 Euro area (18 countries) NA 2000 Euro area (17 countries) NA 2001 Euro area (17 countries) NA
edu %>% summarize_all(funs(max(.,na.rm=T)))
TIMEGEOValue2011 Spain8.81
edu %>% transmute(s = Value/100) %>% head
s NA NA0.05000.05030.04950.0492
edu %>% mutate(ValueNorm = Value/max(Value,na.rm=T)) -> edu
edu %>% head
TIMEGEOValueValueNorm2000 European Union (28 countries) NA NA 2001 European Union (28 countries) NA NA 2002 European Union (28 countries)5.00 0.5675369 2003 European Union (28 countries)5.03 0.5709421 2004 European Union (28 countries)4.95 0.5618615 2005 European Union (28 countries)4.92 0.5584563
edu %>% transmute(sqrt(Value)) %>% head
sqrt(Value) NA NA2.2360682.2427662.2248602.218107
edu %>% transmute(Value**2) %>% head
Value^2 NA NA25.000025.300924.502524.2064
edu %>% select(-ValueNorm) -> edu
edu %>% head
TIMEGEOValue2000 European Union (28 countries) NA 2001 European Union (28 countries) NA 2002 European Union (28 countries)5.00 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92
edu %>%
add_row(TIME=2000,GEO=5,Value='a') -> edu
edu %>%
tail
TIMEGEOValue2007 Finland5.9 2008 Finland6.1 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76 2000 5 a
edu %>% head(-1) -> edu
edu %>% tail
TIMEGEOValue2006 Finland6.18 2007 Finland5.9 2008 Finland6.1 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76
edu %>% drop_na(Value) %>% head
TIMEGEOValue2002 European Union (28 countries)5 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92 2006 European Union (28 countries)4.91 2007 European Union (28 countries)4.92
edu %>% replace_na(replace=list(Value=0)) %>% head
TIMEGEOValue2000 European Union (28 countries)0 2001 European Union (28 countries)0 2002 European Union (28 countries)5 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92
分析
大家还是要记得,python的行是从0开始计算的,所以在python里面edu[10:14]其实取的是10-13行。另外,python中习惯用axis=0处理行问题,axis=1处理列问题;dplyr中,则用filter处理行,select处理列。python的inplace参数可以直接在原始数据中更改,有利有弊,能够方便一点节省时间,但是如果操作不当的时候则不是很安全,很容易无法恢复原始数据。