Getting and Cleaning Data
Raw and Processed Data
“Data are values of qualitative or quantitative variables, belonging to a set of items”
set of items: Sometimes called the population; the set of objects you are interested in
Qualitative: Country of origin, sex, treatment
Quantitative: Height, weight, blood pressure
Raw Data
- The original source of the data
- Often hard to use for data analysis
- Data analysis includes processing
- Raw data may only need to be processed once
Processed Data
- Data that is ready for analysis
- Processing can include merging, subsetting, transforming, etc.
- There may be standards for processing
- All steps should be recorded
Components of Tidy Data
The four things you should have
- The raw data
- A tidy data set
- A code book describing each variable and its values in the tidy data set
- An explicit and exact recipe you used to go from 1 -> 2,3
The raw data
- The strange binary file your measurement machine spits out
- The unformatted Excel file with 10 worksheets the company you contracted with sent you
- The complicated JSON data you got from scraping the Twitter API
- The hand-entered numbers you collected looking through a microscope
You know the raw data is in the right format if you:
- Ran no software on the data
- Did not manipulate any of the numbers in the data
- Did not remove any data from the data set
- Did not summarize the data in any way
The tidy data
- Each variable you measure should be in one column
- Each different observation of that variable should be in a different row
- There should be one table for each “kind” of variable
- If you have multiple tables, they should include a column in the table that allows them to be linked
Some other important tips:
- Include a row at the top of each file with variable names
- Make variable names human readable: AgeAtDiagnosis instead of AgeDx
- In general data should be saved in one file per table
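As a small illustration of these rules, here is a sketch in R with invented values: one table of patient-level variables and a second table of visit-level variables, linked by a shared id column.

```r
# A tiny tidy data set (values invented for illustration):
# each column is one variable, each row is one observation.
patients <- data.frame(
  PatientId      = c(1, 2, 3),
  AgeAtDiagnosis = c(54, 61, 47),
  Treatment      = c("drug", "placebo", "drug")
)

# A separate table for a different "kind" of variable (visits),
# linked back to patients through the shared PatientId column.
visits <- data.frame(
  PatientId = c(1, 1, 2),
  VisitDate = c("2014-01-10", "2014-02-11", "2014-01-15")
)

# The linking column lets the tables be joined when needed
merge(patients, visits, by = "PatientId")
```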
The code book
- Information about the variables (including units!) in the data set not contained in the tidy data
- Information about the summary choices you made
- Information about the experimental study design you used
Some other important tips:
- A common format for this document is a Word/text file
- There should be a section called “Study design” that has a thorough description of how you collected the data
- There must be a section called “Code book” that describes each variable and its units
The instruction list
- Ideally a computer script (in R), but I suppose Python is ok too
- The input for the script is the raw data
- The output is the processed, tidy data
- There are no parameters to the script
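A minimal sketch of such a script (the file name and cleaning steps here are invented; the point is the shape: raw data in, tidy data out, no parameters):

```r
# Hypothetical run_analysis.R: fabricate a small "raw" file first so
# the example is self-contained, then run the processing steps.
write.csv(data.frame(ID = 1:3, Value = c(10, NA, 30)),
          "raw_data.csv", row.names = FALSE)

raw  <- read.csv("raw_data.csv")           # input: the raw data
tidy <- subset(raw, !is.na(Value))         # drop incomplete records
names(tidy) <- tolower(names(tidy))        # consistent variable names
write.csv(tidy, "tidy_data.csv", row.names = FALSE)  # output: tidy data
```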
In some cases it will not be possible to script every step. In that case you should provide instructions like:
- Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
- Step 2 - run the software separately for each example
- Step 3 - take column three of output file.out for each sample and that is the corresponding row in the output data set
Downloading Files
Get/set your working directory
- A basic component of working with data is knowing your working directory
- The two main commands are getwd() and setwd()
- Be aware of relative versus absolute paths
- Relative - setwd("./data"), setwd("../")
- Absolute - setwd("/Users/jtleek/data/")
- Important difference in Windows: setwd("C:\\Users\\Andrew\\Downloads") - backslashes must be doubled (or use forward slashes)
Checking for and creating directories
- file.exists("directoryName") will check to see if the directory exists
- dir.create("directoryName") will create a directory if it doesn’t exist
- Here is an example checking for a “data” directory and creating it if it doesn’t exist
if (!file.exists("data")) { dir.create("data") }
Getting data from the internet - download.file()
- Downloads a file from the internet
- Even if you could do this by hand, scripting the download helps with reproducibility
- Important parameters are url, destfile, method
- Useful for downloading tab-delimited, csv, and other files
Download a file from the web
fileUrl <-"http://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl") ## curl for Mac system
list.files("./data")
## [1] "cameras.csv"
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Jan 12 21:37:44 2014"
Some notes about download.file()
- If the url starts with http you can use download.file()
- If the url starts with https on Windows you may be ok
- If the url starts with https on Mac you may need to set method = "curl"
- If the file is big, this might take a while
- Be sure to record when you downloaded
Reading Local Files
Loading flat files - read.table()
- This is the main function for reading data into R
- Flexible and robust but requires more parameters
- Reads the data into RAM - big data can cause problems
- Important parameters: file, header, sep, row.names, nrows
- Related: read.csv(), read.csv2()
Baltimore example
cameraData <- read.table("./data/cameras.csv")
## Error: line 1 did not have 13 elements
head(cameraData)
## Error: object 'cameraData' not found
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE)
Some more important parameters
- quote - you can tell R whether there are any quoted values; quote="" means no quotes
- na.strings - set the character that represents a missing value
- nrows - how many rows of the file to read (e.g. nrows=10 reads 10 lines)
- skip - number of lines to skip before starting to read
In my experience, the biggest trouble with reading flat files is quotation marks ' or " placed in data values; setting quote="" often resolves these problems.
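A sketch showing these parameters together on a small fabricated file (the file contents are invented for illustration):

```r
# Fabricate a small messy file: a leading comment line, a missing
# value coded as ".", and a stray apostrophe inside a data value.
writeLines(c("# comment line to skip",
             "id,score,note",
             "1,10,it's fine",
             "2,.,broken gauge",
             "3,30,ok"), "example.csv")

ex <- read.table("example.csv", sep = ",", header = TRUE,
                 skip = 1,          # skip the comment line
                 na.strings = ".",  # "." marks a missing value
                 nrows = 3,         # read at most 3 data rows
                 quote = "")        # treat the apostrophe as literal text
ex
```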
Reading Excel Files
read.xlsx(), read.xlsx2() {xlsx package}
library(xlsx)
cameraData <- read.xlsx("./data/filename.xlsx", sheetIndex=1, header=TRUE)
Reading specific rows and columns
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/filename.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex)
Further notes
- The write.xlsx function will write out an Excel file with similar arguments
- read.xlsx2 is much faster than read.xlsx but for reading subsets of rows may be slightly unstable
- The XLConnect package has more options for writing and manipulating Excel files
- The XLConnect vignette is a good place to start for that package
- In general it is advised to store your data in either a database or in comma separated files (.csv) or tab separated files (.tab/.txt) as they are easier to distribute
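Writing those distribution formats from R is one line each with base functions (the example data frame is invented):

```r
# Invented example data frame
df <- data.frame(id = 1:2, value = c("a", "b"))

write.csv(df, "data.csv", row.names = FALSE)                # comma separated
write.table(df, "data.txt", sep = "\t", row.names = FALSE)  # tab separated
```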
Reading XML
XML
- Extensible markup language
- Frequently used to store structured data
- Particularly widely used in internet applications
- Extracting XML is the basis for most web scraping
- Components
- Markup - labels that give the text structure
- Content - the actual text of the document
Tags, elements and attributes
- Tags correspond to general labels
- Start tags: <section>
- End tags: </section>
- Empty tags: <line-break />
- Elements are specific examples of tags: <Greeting> Hello, world </Greeting>
- Attributes are components of the label: <img src="jeff.jpg" alt="instructor"/>, <step number="3"> Connect A to B. </step>
Read the file into R
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
## [1] "breakfast_menu"
names(rootNode)
## food food food food
## "food" "food" "food" "food"
Directly access parts of the XML document
rootNode[[1]]
rootNode[[1]][[1]]
Programatically extract parts of the file
xmlSApply(rootNode, xmlValue)
XPath
- /node Top level node
- //node Node at any level
- node[@attr-name] Node with an attribute name
- node[@attr-name='bob'] Node with attribute attr-name='bob'
Get the items on the menu and prices
xpathSApply(rootNode, "//name", xmlValue)
xpathSApply(rootNode, "//price", xmlValue)
Extract content by attributes
library(XML)
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternal=TRUE)
scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
scores
teams
Reading JSON
JSON
- Javascript Object Notation
- Lightweight data storage
- Common format for data from application programming interfaces (APIs)
- Similar structure to XML but different syntax/format
- Data stored as
- Numbers (double)
- Strings (double quote)
- Boolean (true or false)
- Array (ordered, comma separated enclosed in square brackets [] )
- Object (unordered, comma separated collection of key: value pairs in curly brackets {} )
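A small invented JSON string exercising each of these storage types, parsed with jsonlite's fromJSON (the package used in the examples below):

```r
library(jsonlite)

# Invented example covering number, string, boolean, array, and object
txt <- '{
  "name": "jtleek",
  "followers": 100,
  "active": true,
  "languages": ["R", "C"],
  "address": {"city": "Baltimore", "state": "MD"}
}'

parsed <- fromJSON(txt)
parsed$languages       # arrays become R vectors
parsed$address$city    # nested objects become nested lists
```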
Reading data from JSON {jsonlite package}
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
Nested objects in JSON
names(jsonData$owner)
jsonData$owner$login
Writing data frames to JSON
myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)
Convert back to a data frame
iris2 <- fromJSON(myjson)
head(iris2)
The data.table Package
data.table
- inherits from data.frame
- All functions that accept data.frame work on data.table
- Written in C so it is much faster
- Much, much faster at subsetting, grouping, and updating
Creating data tables just like data frames
library(data.table)
DF = data.frame(x=rnorm(9), y=rep(c("a","b","c"),each=3), z=rnorm(9))
head(DF,3)
DT = data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
head(DT,3)
See all the data tables in memory
tables()
Subsetting rows
DF[2,]
DT[DT$y=="a",]
DT[c(2,3)]
Subsetting columns
DT[,c(2,3)]
Column subsetting in data.table
- The subsetting function is modified for data.table
- The argument you pass after the comma is called an “expression”
- In R an expression is a collection of statements enclosed in curly brackets
{
x = 1
y = 2
}
k = {print(10);5}
print(k)
Calculating values for variables with expressions
DT[,list(mean(x), sum(x))]
DT[,table(y)]
Adding new columns
DT[,w:=z^2]
DT2 <- DT
DT[, y:= 2]
Multiple operations
DT[, m:= {tmp <- (x+z); log2(tmp+5)}]
plyr like operations
DT[, a:=x>0]
DT[, b:=mean(x+w), by=a]
Special variables
.N An integer, length 1, containing the number of rows in the current group
set.seed(123);
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE))
DT[, .N, by=x]
Keys
DT <- data.table(x=rep(c("a","b","c"), each=100), y=rnorm(300))
setkey(DT,x)
DT['a']
Joins
DT1 <- data.table(x=c('a','b','c','dt1'), y=1:4)
DT2 <- data.table(x=c('a','b','dt2'),z=5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)
Fast reading
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))
system.time(read.table(file, header=TRUE, sep="\t"))