Preface
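dplyr is one of the must-learn packages in R. It covers the everyday data frame operations: working with rows and columns, sampling, grouping, adding new columns, sorting, and filtering. Combined with the pipe operator and tibble-class data frames, dplyr code is concise and easy to read, and its verbs are often faster than the equivalent operations on a plain data.frame.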
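As a rough illustration of how these verbs chain together with the pipe, here is a minimal sketch. The `toy` tibble and its values are made up for this example and are not the transactions data used in the rest of this post; the column names are only borrowed from it.

```r
library(dplyr)

# Hypothetical toy data (assumption: NOT the real transactions file)
toy <- tibble(
  card_id         = c("A", "A", "B", "B", "C"),
  installments    = c(0L, 1L, 0L, 2L, 0L),
  purchase_amount = c(-0.66, -0.45, -0.71, -0.18, -0.73)
)

result <- toy %>%
  filter(installments <= 1) %>%                    # filter rows by a condition
  mutate(abs_amount = abs(purchase_amount)) %>%    # add a new column
  group_by(card_id) %>%                            # group by card
  summarise(n = n(), total = sum(abs_amount)) %>%  # one summary row per group
  arrange(desc(total))                             # sort, largest total first

print(result)
```

Each step takes a data frame and returns a data frame, which is why the verbs compose so naturally with `%>%`.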
Data preparation
> pacman::p_load(data.table, tidyverse) # or load them directly with library()
> data <- fread("D://contest//transactions.csv")
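> # Convert the data.table to a tibble, the data frame class dplyr is designed around
> data <- as_tibble(data) # tbl_df() is deprecated as of dplyr 1.0.0; as_tibble() replaces it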
> # Inspect the column names and the dimensions
> names(data)
[1] "authorized_flag" "card_id" "city_id" "category_1" "installments" "category_3"
[7] "merchant_category_id" "merchant_id" "month_lag" "purchase_amount" "purchase_date" "category_2"
[13] "state_id" "subsector_id"
> dim(data)
[1] 185129 14
# 185129 rows, 14 columns
> data
# A tibble: 185,129 x 14
   authorized_flag card_id     city_id category_1 installments category_3 merchant_category~ merchant_id   month_lag purchase_amount purchase_date    category_2 state_id subsector_id
   <chr>           <chr>         <int> <chr>             <int> <chr>                   <int> <chr>             <int>           <dbl> <chr>                 <int>    <int>        <int>
 1 Y               C_ID_cb34e~      20 N                     0 A                         422 M_ID_f162748~         1          -0.660 2018-03-07 10:5~          3       19           27
 2 Y               C_ID_8a118~     251 N                     0 A                         278 M_ID_00c57ea~         2          -0.624 2017-07-21 12:1~          3        8           37
 3 Y               C_ID_69ed7~     117 N                     0 A                         367 M_ID_1da88fb~         2          -0.714 2017-10-27 20:2~          4       13           16
 4 Y               C_ID_ae6f8~      69 N                     1 B                         383 M_ID_ebbdb42~         2          -0.446 2018-01-26 08:4~          1        9            2
 5 Y               C_ID_07b21~     277 N                     0 A                         278 M_ID_7291e8a~         1          -0.736 2018-03-10 21:4~          4       13           37
 6 Y               C_ID_34521~     291 N                     0 A                         422 M_ID_0649e6e~         2          -0.702 2018-03-11 19:3~          1        9           27
 7 Y               C_ID_5d09f~      69 N                     0 A                         511 M_ID_b794b9d~         2          -0.183 2018-04-21 13:1~          1        9            7
 8 Y               C_ID_a12f5~     331 N                     0 A                         273 M_ID_73d19e5~         1          -0.720 2018-03-21 06:5~          3        3           20
 9 Y               C_ID_88fb2~      69 N                     0 A                          19 M_ID_a79f97e~         1          -0.726 2018-03-01 16:5~          1        9           36
10 Y               C_ID_8ef5e~     261 N                     1 B                         823 M_ID_2059928~         2          -0.509 2018-03-04 03:5~          1        9           25
# ... with 185,119 more rows
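With 14 columns, the default tibble print truncates long values and even some column names (for example `merchant_category~`). One option is `glimpse()`, which prints one column per line. The sketch below uses a small made-up data frame `df` as a stand-in for the real table, so the values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the transactions table (values are illustrative)
df <- data.frame(
  authorized_flag = c("Y", "Y", "N"),
  city_id         = c(20L, 251L, 117L),
  purchase_amount = c(-0.660, -0.624, -0.714),
  stringsAsFactors = FALSE
)

tb <- as_tibble(df)  # convert a plain data.frame to a tibble

class(tb)            # a tibble is still a data.frame, with extra classes on top
glimpse(tb)          # transposed view: one line per column, with its type
```

Because a tibble keeps the `data.frame` class, code written for plain data frames generally continues to work on it.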