添加串联校正系统的伯德图_SparkR:通过串联其他列将新列添加到数据框

添加串联校正系统的伯德图

继续使用SparkR探索土地注册处的开放数据集,我想看看在过去20年中英国哪条道路的房地产销售量最大。

回顾一下,这就是数据框的样子:

./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0
 
> sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")
 
> head(sales)
                                      C0     C1               C2       C3 C4 C5
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N
  C6  C7 C8              C9         C10         C11
1  F 106       READING ROAD    NORTHOLT    NORTHOLT
2  F  58       ADAMS MEADOW   ILMINSTER   ILMINSTER
3  L  58      WHELLOCK ROAD                  LONDON
4  F  17           WESTGATE     MORPETH     MORPETH
5  F   4      MASON GARDENS  WEST WINCH KING'S LYNN
6  F   5    WILD FLOWER WAY DITCHINGHAM      BUNGAY
                           C12            C13 C14
1                       EALING GREATER LONDON   A
2               SOUTH SOMERSET       SOMERSET   A
3                       EALING GREATER LONDON   A
4               CASTLE MORPETH NORTHUMBERLAND   A
5 KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6                SOUTH NORFOLK        NORFOLK   A

本文档说明了存储在每个字段中的数据,对于此特定查询,我们对字段C9-C12感兴趣。 计划是按那些字段对数据帧进行分组,然后按频率降序排序。

当按多个字段分组时,往往最容易创建一个将它们全部连接在一起的新字段,然后按该字段分组。

我从以下内容开始:

> sales$address = paste(sales$C9, sales$C10, sales$C11, sales$C12, sep=", ")
Error in as.character.default(<S4 object of class "Column">) :
  no method for coercing this S4 class to a vector

不太成功! 接下来,我变得更加原始:

> sales$address = sales$C9 + ", " + sales$C10 + ", " + sales$C11 + ", " + sales$C12
 
> head(sales)
                                      C0     C1               C2       C3 C4 C5
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N
  C6  C7 C8              C9         C10         C11
1  F 106       READING ROAD    NORTHOLT    NORTHOLT
2  F  58       ADAMS MEADOW   ILMINSTER   ILMINSTER
3  L  58      WHELLOCK ROAD                  LONDON
4  F  17           WESTGATE     MORPETH     MORPETH
5  F   4      MASON GARDENS  WEST WINCH KING'S LYNN
6  F   5    WILD FLOWER WAY DITCHINGHAM      BUNGAY
                           C12            C13 C14 address
1                       EALING GREATER LONDON   A      NA
2               SOUTH SOMERSET       SOMERSET   A      NA
3                       EALING GREATER LONDON   A      NA
4               CASTLE MORPETH NORTHUMBERLAND   A      NA
5 KING'S LYNN AND WEST NORFOLK        NORFOLK   A      NA
6                SOUTH NORFOLK        NORFOLK   A      NA

至少经过编译,但是所有地址都是“ NA”,这不是我们想要的。 经过一番搜索,我意识到有一个concat函数可用于此任务:

> sales$address = concat_ws(sep=", ", sales$C9, sales$C10, sales$C11, sales$C12)
 
> head(sales)
                                      C0     C1               C2       C3 C4 C5
1 {0C7ADEF5-878D-4066-B785-0000003ED74A} 163000 2003-02-21 00:00  UB5 4PJ  T  N
2 {35F67271-ABD4-40DA-AB09-00000085B9D3} 247500 2005-07-15 00:00 TA19 9DD  D  N
3 {B20B1C74-E8E1-4137-AB3E-0000011DF342} 320000 2010-09-10 00:00   W4 1DZ  F  N
4 {7D6B0915-C56B-4275-AF9B-00000156BCE7} 104000 1997-08-27 00:00 NE61 2BH  D  N
5 {47B60101-B64C-413D-8F60-000002F1692D} 147995 2003-05-02 00:00 PE33 0RU  D  N
6 {51F797CA-7BEB-4958-821F-000003E464AE} 110000 2013-03-22 00:00 NR35 2SF  T  N
  C6  C7 C8              C9         C10         C11
1  F 106       READING ROAD    NORTHOLT    NORTHOLT
2  F  58       ADAMS MEADOW   ILMINSTER   ILMINSTER
3  L  58      WHELLOCK ROAD                  LONDON
4  F  17           WESTGATE     MORPETH     MORPETH
5  F   4      MASON GARDENS  WEST WINCH KING'S LYNN
6  F   5    WILD FLOWER WAY DITCHINGHAM      BUNGAY
                           C12            C13 C14
1                       EALING GREATER LONDON   A
2               SOUTH SOMERSET       SOMERSET   A
3                       EALING GREATER LONDON   A
4               CASTLE MORPETH NORTHUMBERLAND   A
5 KING'S LYNN AND WEST NORFOLK        NORFOLK   A
6                SOUTH NORFOLK        NORFOLK   A
                                                               address
1                             READING ROAD, NORTHOLT, NORTHOLT, EALING
2                   ADAMS MEADOW, ILMINSTER, ILMINSTER, SOUTH SOMERSET
3                                      WHELLOCK ROAD, , LONDON, EALING
4                           WESTGATE, MORPETH, MORPETH, CASTLE MORPETH
5 MASON GARDENS, WEST WINCH, KING'S LYNN, KING'S LYNN AND WEST NORFOLK
6                  WILD FLOWER WAY, DITCHINGHAM, BUNGAY, SOUTH NORFOLK

这还差不多! 现在,让我们看看哪些街道的房屋销售最多:

> byAddress = summarize(groupBy(sales, sales$address), count = n(sales$address))
> head(arrange(byAddress, desc(byAddress$count)), 10)
 
                                                            address count
1                          BARBICAN, LONDON, LONDON, CITY OF LONDON  1398
2          CHRISTCHURCH ROAD, BOURNEMOUTH, BOURNEMOUTH, BOURNEMOUTH  1313
3                   MAIDA VALE, LONDON, LONDON, CITY OF WESTMINSTER  1305
4                     ROTHERHITHE STREET, LONDON, LONDON, SOUTHWARK  1253
5             SLOANE AVENUE, LONDON, LONDON, KENSINGTON AND CHELSEA  1219
6  THE STRAND, BRIGHTON MARINA VILLAGE, BRIGHTON, BRIGHTON AND HOVE  1218
7                     FAIRFIELD ROAD, LONDON, LONDON, TOWER HAMLETS  1217
8                             QUEENSTOWN ROAD, , LONDON, WANDSWORTH  1153
9                   UPPER RICHMOND ROAD, LONDON, LONDON, WANDSWORTH  1123
10                      QUEENSTOWN ROAD, LONDON, LONDON, WANDSWORTH  1079

接下来,我们将进一步研究数据,但这是另一篇文章。

翻译自: https://www.javacodegeeks.com/2015/09/sparkr-add-new-column-to-data-frame-by-concatenating-other-columns.html

添加串联校正系统的伯德图

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值