Introduction
Currently, one of the most common ways of storing and sharing data for analysis is through electronic spreadsheets. A spreadsheet stores data in rows and columns. It is basically a file version of a data frame.
When a spreadsheet is stored as a text file, like the ones created with a simple text editor, a new row is defined with a return (line break) and columns are separated with some predefined special character. The most common characters are comma (,), semicolon (;), space ( ), and tab (a preset number of spaces or \t).
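As a minimal sketch of how the separator matters, the following writes a tiny semicolon-delimited file and reads it back with base R. The file contents are made up for illustration, and read.table is used here only because it takes the separator as an explicit sep argument.

```r
# Write a tiny semicolon-delimited text file to a temporary location.
# The contents are made-up values, used only for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a;1", "b;2"), tmp)

# sep tells read.table which character separates the columns.
# No header is present, so R names the columns V1 and V2.
toy <- read.table(tmp, sep = ";")
toy

# With the wrong separator, each row is read as a single column.
wrong <- read.table(tmp, sep = ",")
ncol(wrong)
```

Reading the same file with sep = "," does not fail; it simply returns one column per row, which is why it pays to know the separator before importing.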
In many of these files, the first row contains column names rather than data. We call this a header, and when we read in data from a spreadsheet it is important to know whether the file has a header or not. Most reading functions assume there is a header. To know if the file has a header, it helps to look at the file before trying to read it. This can be done with a text editor or with RStudio. In RStudio, we can do this by either opening the file in the editor or navigating to the file location, double clicking on the file, and hitting View File.
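The file can also be inspected from R itself. The sketch below, which uses a made-up temporary file rather than any real dataset, prints the first lines with readLines and then shows what happens when the header setting does not match the file.

```r
# Made-up file for illustration: the first line is a header, the rest is data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), tmp)

# Look at the first two lines of the raw text before importing.
first_lines <- readLines(tmp, n = 2)
first_lines

# Read the file with and without treating the first row as a header.
with_header <- read.csv(tmp, header = TRUE)
no_header <- read.csv(tmp, header = FALSE)

nrow(with_header)  # the header row is used for column names
nrow(no_header)    # the header row is counted as data
```

When header = FALSE is used on a file that does have a header, the column names end up as the first data row, so the result has one extra row and every column is read as text.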
Not all spreadsheets are stored as text files. Google Sheets and Microsoft Excel files, for example, can't be viewed with a text editor.
Paths and the working directory
A spreadsheet containing the US murders data is included as part of the dslabs package. Finding this file is not straightforward, but the following lines of code copy the file to the folder in which R looks by default. We explain how these lines work below.
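filename <- "murders.csv" # file name
dir <- system.file("extdata", package = "dslabs") # path to the extdata folder inside the dslabs package
fullpath <- file.path(dir, filename) # full path to the file
file.copy(fullpath, "murders.csv") # copy the file into the current working directory under the same name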
This code does not read the data into R; it just copies a file. But once the file is copied, we can import the data with a simple line of code. Here we use the read_csv function from the readr package, which is part of the tidyverse.
library(tidyverse)
dat <- read_csv(filename)
-- Column specification ------
cols(
state = col_character(),
abb = col_character(),
region = col_character(),
population = col_double(),
total = col_double()
)
The data is imported and stored in dat. The rest of this section defines some important concepts and provides an overview of how we write code that tells R how to find the files we want to import.
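One detail of the copying step is worth seeing in isolation: file.copy returns TRUE or FALSE rather than raising an error, and by default it does not overwrite an existing file. The sketch below uses temporary files so it does not depend on dslabs; the destination name copy-demo.csv is hypothetical.

```r
# Create a small source file in a temporary location.
src <- tempfile(fileext = ".csv")
writeLines("x,y", src)

# Hypothetical destination, removed first so the example starts clean.
dest <- file.path(tempdir(), "copy-demo.csv")
if (file.exists(dest)) file.remove(dest)

first <- file.copy(src, dest)                    # destination is new
second <- file.copy(src, dest)                   # destination already exists
third <- file.copy(src, dest, overwrite = TRUE)  # overwriting is requested

c(first, second, third)
file.exists(dest)
```

This means that running the copying code a second time returns FALSE simply because murders.csv is already in place, which is not a sign that anything went wrong.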
The filesystem
You can think of your computer’s filesystem as a series of nested folders, each containing other folders and files. Data scientists refer to folders as directories. We refer to the folder that contains all other folders as the root directory. We refer to the directory in which we are currently located as the working directory. The working directory therefore changes as you move through folders: think of it as your current location.
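In R, the working directory can be inspected with getwd. The short sketch below shows it together with list.files, and builds the location at which a bare filename such as murders.csv would be looked up.

```r
# Report the current working directory as a character string.
wd <- getwd()
wd

# List the files and folders inside the working directory.
list.files()

# A bare filename is looked up inside the working directory,
# so this is the location R checks for "murders.csv".
candidate <- file.path(wd, "murders.csv")
candidate
file.exists("murders.csv")
```

The last line returns TRUE only if a file with that name has already been copied into the working directory.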
Relative and full paths
The path of a file