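Data types in R
1. character: "treatment", "22", "A"
2. numeric: 23.44, 120, NaN
3. integer: 4L, 1123L
4. factor: factor("HELLO")
5. logical: FALSE, TRUE, NA
class() shows the type of a variable; for example, class("") returns "character".
R's type conversion functions are named as.<new type>():
as.character(2016) converts 2016 to character, giving "2016"
as.factor("something") gives a factor with Levels: something
as.logical(0) gives FALSE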
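A minimal sketch of the as.*() conversions listed above, using only base R. The variable names are illustrative and not part of the original notes.

```r
# Coerce values between basic types with the as.*() family
x_chr <- as.character(2016)       # number to character
x_num <- as.numeric("23.44")      # character to numeric
x_int <- as.integer("4")          # character to integer
x_fct <- as.factor("something")   # character to factor with one level
x_lgl <- as.logical(0)            # 0 becomes FALSE, non-zero numbers become TRUE

# Check the resulting classes
sapply(list(x_chr, x_num, x_int, x_fct, x_lgl), class)

# A conversion that cannot succeed yields NA (with a warning)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)
```

Checking for NA after a conversion is a quick way to spot values that did not match the expected type.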
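**Package: lubridate converts strings to dates
library(lubridate)
ymd("2015-08-23") parses the string as a date (older lubridate versions print it with a UTC time zone)
ymd("2015 August 25") and
mdy("August 25, 2015") both return the date 2015-08-25 (printed as "2015-08-25 UTC" in older lubridate versions)
hms("13:33:09") returns "13H 33M 9S"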
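A short sketch of the lubridate parsers noted above, assuming the package is installed and the session uses an English locale (month names are locale dependent). The variable names are illustrative.

```r
library(lubridate)

# Different input layouts, same calendar date
d1 <- ymd("2015 August 25")
d2 <- mdy("August 25, 2015")
d3 <- dmy("25-08-2015")
d1 == d2
d1 == d3

# Once parsed, components can be extracted
year(d1)
month(d1)
day(d1)

# hms() parses a clock time into a Period object
t1 <- hms("13:33:09")
hour(t1)
minute(t1)
second(t1)
```

The function name spells out the order of the components in the input (y, m, d), so choosing the parser is mostly a matter of reading the raw string.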
For example, the type checks below:
# Make this evaluate to character
class("true")
# Make this evaluate to numeric
class(8484.00)  # no double quotes around the number
# Make this evaluate to integer
class(99L)  # the L suffix makes the value an integer
# Make this evaluate to factor
class(factor("factor"))
# Make this evaluate to logical
class(FALSE)  # TRUE and FALSE are written without double quotes
# Preview students2 with str()
str(students2)
# Load the lubridate package
library(lubridate)
# Parse as date
dmy("17 Sep 2015")
# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")
# Coerce dob to a date (with no time)
students2$dob <- ymd(students2$dob)
# Coerce nurse_visit to a date and time
students2$nurse_visit <- ymd_hms(students2$nurse_visit)
# Look at students2 once more with str()
str(students2)
**Package: stringr
str_trim(" this is a test ") returns "this is a test"
# Pad string with zeros
str_pad("24493", width=7, side="left", pad="0") pads the string to 7 characters with leading zeros, returning "0024493"
friends <- c("Sarah", "Tom", "Alice")
str_detect(friends, "Alice"): detect a pattern
returns FALSE FALSE TRUE
str_replace(): find and replace a pattern
tolower(): make all lowercase (base R)
toupper(): make all uppercase (base R)
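The pattern functions above can be tried on the friends vector as follows. This is a small sketch that assumes stringr is installed; the replacement name "David" is an arbitrary choice for illustration.

```r
library(stringr)

friends <- c("Sarah", "Tom", "Alice")

# Which elements contain the pattern?
found <- str_detect(friends, "Alice")
found

# Replace the first match of the pattern in each element
replaced <- str_replace(friends, "Alice", "David")
replaced

# Case conversion with base R
lower <- tolower(friends)
upper <- toupper(friends)
lower
upper
```

str_replace() changes only the first match in each string; str_replace_all() is the variant that replaces every match.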
# Load the stringr package
library(stringr)
# Trim all leading and trailing whitespace
c(" Filip ", "Nick ", " Jonathan")
str_trim(c(" Filip ", "Nick ", " Jonathan"))
# Pad these strings with leading zeros
c("23485W", "8823453Q", "994Z")
str_pad(c("23485W", "8823453Q", "994Z"),width=9,side="left",pad="0")
Missing values
- May be random, but dangerous to assume
- Sometimes associated with variable/outcome of interest
- In R, represented as NA
- May appear in other forms
  - #N/A (Excel)
  - Single dot (SPSS, SAS)
  - Empty string
- Inf - "Infinite value" (indicative of outliers?)
  - 1/0
  - 1/0 + 1/0
  - 33333^33333
- NaN - "Not a number" (rethink a variable?)
  - 0/0
  - 1/0 - 1/0
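A base R sketch of how these special values can be found and standardized. The data frame df and its values are made up for illustration; the list of disguised missing codes is an assumption that should be adapted to the data at hand.

```r
# Hypothetical data with missing values in several disguises
df <- data.frame(
  id    = 1:4,
  score = c("12", "#N/A", ".", ""),
  stringsAsFactors = FALSE
)

# Recode the disguised forms as NA, then convert to numeric
df$score[df$score %in% c("#N/A", ".", "")] <- NA
df$score <- as.numeric(df$score)

missing_flags <- is.na(df$score)      # which values are missing
complete_rows <- complete.cases(df)   # which rows have no NA at all

# Inf and NaN arise from arithmetic, not from missing input
x <- c(1/0, 0/0, 1/0 - 1/0, 5)
inf_flags <- is.infinite(x)   # TRUE only for Inf
nan_flags <- is.nan(x)        # TRUE for 0/0 and Inf - Inf
na_flags  <- is.na(x)         # NaN also counts as NA, Inf does not
```

na.omit(df) would then drop the incomplete rows, although removing them is only safe if the values are missing at random.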
Dealing with outliers and obvious errors
When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.
We have loaded a dataset called students3, which is another slight variation of the original students dataset. Two variables appear to have suspicious values: age and absences. Let's explore these values further.
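A first look at suspicious values usually starts with summary() and a histogram. The sketch below uses students_demo, a made-up stand-in for students3 (the real dataset is not reproduced in these notes), so the numbers are purely illustrative.

```r
# Hypothetical stand-in for students3; values are invented for illustration
students_demo <- data.frame(
  age      = c(15, 16, 17, 18, 22, 38),
  absences = c(0, 4, -1, 6, 2, 75)
)

# The Min. and Max. columns expose values such as -1 and 75 at a glance
summary(students_demo)

# Range of each column as a compact check
ranges <- sapply(students_demo, range)
ranges

# Histograms show how isolated the extreme values are
hist(students_demo$age)
hist(students_demo$absences)
```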
Another look at strange values
Another useful way of looking at strange values is with boxplots. Simply put, boxplots draw a box around the middle 50% of values for a given variable, with a bold horizontal line drawn at the median. Values that fall far from the bulk of the data points (i.e. outliers) are denoted by open circles. (If you're curious about the exact formula for determining what is "far", check out ?boxplot.)
In this situation, we are concerned about three things:
- Since this dataset is about students and the only student above the age of 22 is 38 years old, we must wonder whether this is an error in the data or just an older student (perhaps returning to school after working for several years).
- There are four values of -1 for the absences variable, which is either a mistake or an intentional coding meant to say, for example, "this value is missing".
- There are several extreme values of absences in the positive direction, with a maximum value of 75 (which is over 18 times the median value of 4).
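One possible way to act on these concerns is sketched below in base R. The absences vector is made up (it is not the students3 data), and treating -1 as a missing-value code is an assumption that would need to be confirmed with whoever produced the data.

```r
# Made-up absences values; not the actual students3 data
absences <- c(0, 2, 4, 4, 6, 3, -1, 5, 8, 75, -1, 4)

# Assumption: -1 is a code for "missing", so recode it as NA
absences_clean <- replace(absences, absences == -1, NA)
n_missing <- sum(is.na(absences_clean))

# boxplot.stats() returns the points that boxplot() would draw as outliers
# (by default, values more than 1.5 box lengths beyond the box)
flagged <- boxplot.stats(absences_clean)$out
flagged   # 75 for this vector

# The same points appear as open circles in the plot
boxplot(absences_clean)
```

Flagged values are candidates for review rather than automatic deletions; a value like 75 may be extreme but still plausible.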