python里怎么转职行列_R与Python手牵手:数据的行列操纵

行列操纵乃是ETL的基本功,传统的提取数据并进行转化的工作是由SQL完成的,但是R与Python在解决这方面问题也是游刃有余,让我们看看两者在处理数据基本操作的时候有什么区别。

Python

#载入模块

import pandas as pd

import numpy as np

import matplotlib.pylab as plt

#载入数据

edu = pd.read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv',

na_values=':', usecols=['TIME', 'GEO', 'Value']) #na_values是把“:”符号认为是缺失值的意思

行列选择

#选择列,只显示前5行

edu['Value'].head()

0 NaN

1 NaN

2 5.00

3 5.03

4 4.95

Name: Value, dtype: float64

#选择行

edu[10:14]

TIME GEO Value 10 2010 European Union (28 countries) 5.41 11 2011 European Union (28 countries) 5.25 12 2000 European Union (27 countries) 4.91 13 2001 European Union (27 countries) 4.99

过滤数据

#条件过滤,取Value大于6.5的行,只显示后5行

edu[edu['Value'] > 6.5].tail()

TIME GEO Value 286 2010 Malta 6.74 287 2011 Malta 7.96 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76

#筛选具有缺失值的数据,只显示前5行

edu[edu['Value'].isnull()].head()

TIME GEO Value 0 2000 European Union (28 countries) NaN 1 2001 European Union (28 countries) NaN 36 2000 Euro area (18 countries) NaN 37 2001 Euro area (18 countries) NaN 48 2000 Euro area (17 countries) NaN

汇总数据

#求所有列的最大值

edu.max(axis=0)

TIME 2011

GEO Spain

Value 8.81

dtype: object

#求Value列的最大值,自动忽略缺失值

edu['Value'].max()

8.81

#把Value列统一除以100

s = edu['Value'] / 100

s.head()

0 NaN

1 NaN

2 0.0500

3 0.0503

4 0.0495

Name: Value, dtype: float64

#构造新一列,名为ValueNorm,把Value列统一除以自身最大值

edu['ValueNorm'] = edu['Value'] / edu['Value'].max()

edu.tail()

TIME GEO Value ValueNorm 379 2007 Finland 5.90 0.669694 380 2008 Finland 6.10 0.692395 381 2009 Finland 6.81 0.772985 382 2010 Finland 6.85 0.777526 383 2011 Finland 6.76 0.767310

#取Vaule列的开方

s = edu['Value'].apply(np.sqrt)

s.head()

0 NaN

1 NaN

2 2.236068

3 2.242766

4 2.224860

Name: Value, dtype: float64

#取Value列的平方

s = edu['Value'].apply(lambda d: d**2)

s.head()

0 NaN

1 NaN

2 25.0000

3 25.3009

4 24.5025

Name: Value, dtype: float64

#删除ValueNorm列,inplace=True参数的设置可以使删除操作直接对原始数据进行,也就是说现在的edu已经改变了

#没有对inplace进行设置的话,返回的是一个拷贝,也就是edu本身仍然没有发生变化

edu.drop('ValueNorm', axis=1, inplace=True)

edu.head()

TIME GEO Value 0 2000 European Union (28 countries) NaN 1 2001 European Union (28 countries) NaN 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95

#在最后添加一行

edu = edu.append({'TIME': 2000, 'Value': 5.00, 'GEO': 'a'}, ignore_index=True)

edu.tail()

TIME GEO Value 380 2008 Finland 6.10 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76 384 2000 a 5.00

#删除最后一行

edu.drop(max(edu.index), axis=0, inplace=True)

edu.tail()

TIME GEO Value 379 2007 Finland 5.90 380 2008 Finland 6.10 381 2009 Finland 6.81 382 2010 Finland 6.85 383 2011 Finland 6.76

#删除Value为缺失值的行

eduDrop = edu.dropna(how='any', subset=['Value'], axis=0)

eduDrop.head()

TIME GEO Value 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95 5 2005 European Union (28 countries) 4.92 6 2006 European Union (28 countries) 4.91

#给缺失的Value值进行填充,填充值为0

eduFilled = edu.fillna(value={'Value': 0})

eduFilled.head()

TIME GEO Value 0 2000 European Union (28 countries) 0.00 1 2001 European Union (28 countries) 0.00 2 2002 European Union (28 countries) 5.00 3 2003 European Union (28 countries) 5.03 4 2004 European Union (28 countries) 4.95

R

在R中,上述操作要如何进行呢?我们看一下。

#加载包与数据读取

library(tidyverse)

edu = read_csv('G:/Py/introduction-datascience-python-book-master/files/ch02/educ_figdp_1_Data.csv', na=":") %>%

select(TIME,GEO,Value)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --

√ ggplot2 3.0.0 √ purrr 0.2.5

√ tibble 1.4.2 √ dplyr 0.7.6

√ tidyr 0.8.1 √ stringr 1.3.1

√ readr 1.1.1 √ forcats 0.3.0

-- Conflicts ------------------------------------------ tidyverse_conflicts() --

x dplyr::filter() masks stats::filter()

x dplyr::lag() masks stats::lag()

Parsed with column specification:

cols(

TIME = col_integer(),

GEO = col_character(),

INDIC_ED = col_character(),

Value = col_double(),

`Flag and Footnotes` = col_character()

)

行列选择

edu %>% select(Value) %>% head

Value NA NA5.005.034.954.92

edu %>% slice(10:13)

TIMEGEOValue2009 European Union (28 countries)5.38 2010 European Union (28 countries)5.41 2011 European Union (28 countries)5.25 2000 European Union (27 countries)4.91

edu %>% filter(Value > 6.5) %>% tail

TIMEGEOValue2005 Malta 6.58 2010 Malta 6.74 2011 Malta 7.96 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76

edu %>% filter(is.na(Value)) %>% head

TIMEGEOValue2000 European Union (28 countries)NA 2001 European Union (28 countries)NA 2000 Euro area (18 countries) NA 2001 Euro area (18 countries) NA 2000 Euro area (17 countries) NA 2001 Euro area (17 countries) NA

edu %>% summarize_all(funs(max(.,na.rm=T)))

TIMEGEOValue2011 Spain8.81

edu %>% transmute(s = Value/100) %>% head

s NA NA0.05000.05030.04950.0492

edu %>% mutate(ValueNorm = Value/max(Value,na.rm=T)) -> edu

edu %>% head

TIMEGEOValueValueNorm2000 European Union (28 countries) NA NA 2001 European Union (28 countries) NA NA 2002 European Union (28 countries)5.00 0.5675369 2003 European Union (28 countries)5.03 0.5709421 2004 European Union (28 countries)4.95 0.5618615 2005 European Union (28 countries)4.92 0.5584563

edu %>% transmute(sqrt(Value)) %>% head

sqrt(Value) NA NA2.2360682.2427662.2248602.218107

edu %>% transmute(Value**2) %>% head

Value^2 NA NA25.000025.300924.502524.2064

edu %>% select(-ValueNorm) -> edu

edu %>% head

TIMEGEOValue2000 European Union (28 countries) NA 2001 European Union (28 countries) NA 2002 European Union (28 countries)5.00 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92

edu %>%

add_row(TIME=2000,GEO=5,Value='a') -> edu

edu %>%

tail

TIMEGEOValue2007 Finland5.9 2008 Finland6.1 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76 2000 5 a

edu %>% head(-1) -> edu

edu %>% tail

TIMEGEOValue2006 Finland6.18 2007 Finland5.9 2008 Finland6.1 2009 Finland6.81 2010 Finland6.85 2011 Finland6.76

edu %>% drop_na(Value) %>% head

TIMEGEOValue2002 European Union (28 countries)5 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92 2006 European Union (28 countries)4.91 2007 European Union (28 countries)4.92

edu %>% replace_na(replace=list(Value=0)) %>% head

TIMEGEOValue2000 European Union (28 countries)0 2001 European Union (28 countries)0 2002 European Union (28 countries)5 2003 European Union (28 countries)5.03 2004 European Union (28 countries)4.95 2005 European Union (28 countries)4.92

分析

大家还是要记得,python的行是从0开始计算的,所以在python里面edu[10:14]其实取的是10-13行。另外,python中习惯用axis=0处理行问题,axis=1处理列问题;dplyr中,则用filter处理行,select处理列。python的inplace参数可以直接在原始数据中更改,有利有弊,能够方便一点节省时间,但是如果操作不当的时候则不是很安全,很容易无法恢复原始数据。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值