Gather And Spread: Using the tidyverse to go between wide and long
The problem
Frequently data you receive or create will be in formats that are unsuitable for analysis as they stand and require modification in some way. This process is known as “tidying”, or perhaps “munging”. The goal is generally to get the data into a “tidy” format: one row per case, one column per field, and one cell per value. (You can read more about “tidy” data here.)
One of the most common examples of this issue encountered is data where things that should be in the rows are in the columns or vice versa. It’s generally not a transpose, either; some of the content is right where you want it, but not all of it. In these cases, we need to rearrange the data we’ve received into a form that is easier to deal with. The two commands we want to look at today are gather
(move columns into rows) and spread
(move rows into columns). We’re going to use the functions in tidyr
, part of the tidyverse
packages; though, as with many tasks in R, there are a number of other packages with similar functionality you could use instead.
The example
To start with, we’ll need some untidy data. For this, we’ll use some data showing the general reasons people were admitted to hospital by financial year from July 1993 to June 1998.
library(tidyverse)
seps <- read_csv("http://www.mm-c.me/mdsi/hospitals93to98.csv")
head(seps)
## # A tibble: 6 × 8
## IcdChapter Field FY1993 FY1994 FY1995
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 0. Not Reported PatientDays 257965 55582 128507
## 2 0. Not Reported Separations 37178 6146 3832
## 3 1. Infectious and Parasitic Diseases PatientDays 311221 313386 324693
## 4 1. Infectious and Parasitic Diseases Separations 75857 78323 84631
## 5 2. Neoplasms PatientDays 1686919 1707437 1795751
## 6 2. Neoplasms Separations 301928 336447 348905
## # ... with 3 more variables: FY1996 <dbl>, FY1997 <dbl>, FY1998 <dbl>
As you can see, the data isn’t in a great shape for analysis. The years are in columns, which is fine for side-by-side eyeballing, but terrible for charting or similar; and the bed days and separations counts are muddled together. That’s going to make it hard to work with.
The solution
Wide to long
Let’s start by fixing the year issue. To push data that is currently in columns into rows, we need to use the gather()
command:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
data
: The dataset to be modified (in our case,seps
)key
: the name of the new “naming” variable (year
)value
: the name of the new “result” variable (value
)na.rm
: whether missing values are removed (this dataset doesn’t have any, so it isn’t a problem)convert
: convert anything that seems like it should be in another format to that other format, e.g. numeric to numeric (since we usedread_csv
we don’t need this one either)
So, to move the years into a column, we’ll run this:
inprogress<-gather(seps,year,value,FY1993:FY1998)
head(inprogress)
## # A tibble: 6 × 4
## IcdChapter Field year value
## <chr> <chr> <chr> <dbl>
## 1 0. Not Reported PatientDays FY1993 257965
## 2 0. Not Reported Separations FY1993 37178
## 3 1. Infectious and Parasitic Diseases PatientDays FY1993 311221
## 4 1. Infectious and Parasitic Diseases Separations FY1993 75857
## 5 2. Neoplasms PatientDays FY1993 1686919
## 6 2. Neoplasms Separations FY1993 301928
Okay, this is starting to look at little more like it should. We still have the bed days and separations merged, but at least we’ve pulled together the years.
Long to wide
Next, let’s pull the values into their own columns. For this, we need spread()
:
spread(data, key, value, fill = NA, convert = FALSE)
The format of this one is similar to gather()
:
data
: The data to be reformatted (inprogress
)key
: The column you want to split apart (Field
)value
: The column you want to use to populate the new columns (thevalue
column we just created in the spread step)fill
: what to substitute if there are combinations that don’t exist (not a problem here)convert
: whether to fix incorrect data types as it goes (not a problem here)
So, the resulting command would look like this:
rearranged <- spread(inprogress,Field,value)
head(rearranged)
## # A tibble: 6 × 4
## IcdChapter year PatientDays Separations
## <chr> <chr> <dbl> <dbl>
## 1 0. Not Reported FY1993 257965 37178
## 2 0. Not Reported FY1994 55582 6146
## 3 0. Not Reported FY1995 128507 3832
## 4 0. Not Reported FY1996 182226 4861
## 5 0. Not Reported FY1997 61599 1558
## 6 0. Not Reported FY1998 685879 53575
That looks much better, doesn’t it?
In One Step
Because this is using tidyverse
packages, we can also use the magrittr
package’s pipe command (%>%
) to do this all in one step if preferred:
seps %>%
gather(year,value,FY1993:FY1998) %>%
spread(Field,value)
## # A tibble: 114 × 4
## IcdChapter year PatientDays Separations
## * <chr> <chr> <dbl> <dbl>
## 1 0. Not Reported FY1993 257965 37178
## 2 0. Not Reported FY1994 55582 6146
## 3 0. Not Reported FY1995 128507 3832
## 4 0. Not Reported FY1996 182226 4861
## 5 0. Not Reported FY1997 61599 1558
## 6 0. Not Reported FY1998 685879 53575
## 7 1. Infectious and Parasitic Diseases FY1993 311221 75857
## 8 1. Infectious and Parasitic Diseases FY1994 313386 78323
## 9 1. Infectious and Parasitic Diseases FY1995 324693 84631
## 10 1. Infectious and Parasitic Diseases FY1996 311560 80864
## # ... with 104 more rows