[Data Science] R Language Study Notes: Inspecting the Data

weixin_33704234

Original link: https://my.oschina.net/skyler/blog/714702

Getting the data from the web

if(!file.exists("./db")){
    dir.create("./db")
}

fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/restaurants.csv", method = "auto")
restData <- read.csv("./db/restaurants.csv")

Looking at a bit of the data

head(restData, n=3)
tail(restData, n=3)

Make a summary

summary(restData)

More in-depth information

str(restData)

Quantiles of quantitative variables

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

> quantile(restData$councilDistrict, na.rm = TRUE)
  0%  25%  50%  75% 100%
   1    2    9   11   14
> quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
50% 75% 90%
  9  11  12

Arguments of quantile, as described on its help page (?quantile):

x - numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see 'Details' in ?quantile). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
probs - numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
na.rm - logical; if TRUE, any NA and NaN values are removed from x before the quantiles are computed.
names - logical; if TRUE, the result has a names attribute. Set to FALSE for a speedup when there are many probs.
type - an integer between 1 and 9 selecting one of the nine quantile algorithms described in ?quantile.
... - further arguments passed to or from other methods.
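
To make the arguments above concrete, here is a minimal sketch on made-up vectors (x and y are assumptions for illustration and are not part of the restaurant data). It shows na.rm, names, and how two of the type algorithms can disagree.

```r
# Toy vector with one missing value (illustrative only, not from restData)
x <- c(1, 2, 3, 4, 5, NA)

# Without na.rm = TRUE, quantile() stops with an error on NA input
q_default <- quantile(x, na.rm = TRUE)   # type = 7 is the default algorithm
q_unnamed <- quantile(x, probs = 0.5, na.rm = TRUE, names = FALSE)

# Algorithms can disagree when a quantile falls between two observations
y <- 1:10
q_type1 <- quantile(y, probs = 0.5, type = 1, names = FALSE)  # always a data value
q_type7 <- quantile(y, probs = 0.5, type = 7, names = FALSE)  # linear interpolation

print(q_default)
print(c(type1 = q_type1, type7 = q_type7))
```

Type 1 always returns one of the observed values, while the default type 7 interpolates between neighbours, which is why the two medians of 1:10 differ.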

Make a table

> table(restData$zipCode, useNA = "ifany")

-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220
     1    136    201     27     30      4      1      8     23     41     28     31     17     54     10     32     69      1

> table(restData$councilDistrict, restData$zipCode)

     -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213 21214 21215 21216 21217 21218 21220 21222 21223
  1       0     0    37     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     7     0
  2       0     0     0     3    27     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  3       0     0     0     0     0     0     0     0     0     0     0     2    17     0     0     0     3     0     0     0
  4       0     0     0     0     0     0     0     0     0     0    27     0     0     0     0     0     0     0     0     0
  5       0     0     0     0     0     3     0     6     0     0     0     0     0    31     0     0     0     0     0     0
  6       0     0     0     0     0     0     0     1    19     0     0     0     0    15     1     0     0     0     0     0

Check for missing values

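sum(is.na(restData$councilDistrict))   ## number of missing values
any(is.na(restData$councilDistrict))   ## TRUE if at least one value is missing
all(restData$zipCode > 0)              ## sanity check: every zip code should be positive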

Row and column sums

colSums(is.na(restData))               ## number of missing values per column
all(colSums(is.na(restData)) == 0)     ## TRUE only if no column contains missing values
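
The checks above can be tried on a small example. This sketch uses a hypothetical data frame named toy (its values are invented, not taken from the Baltimore data) and also shows rowSums and complete.cases, two base R functions that the notes do not use.

```r
# Toy data frame (hypothetical values, not the Baltimore data)
toy <- data.frame(
  zipCode         = c(21201, 21202, -21226, NA),
  councilDistrict = c(1, NA, 3, NA)
)

na_per_column <- colSums(is.na(toy))         # number of NAs in each column
na_per_row    <- rowSums(is.na(toy))         # number of NAs in each row
complete_rows <- toy[complete.cases(toy), ]  # rows with no NA at all

# A comparison with NA yields NA, so drop NAs explicitly when validating
all_positive <- all(toy$zipCode > 0, na.rm = TRUE)

print(na_per_column)
print(na_per_row)
print(all_positive)
```

Here all_positive is FALSE because of the negative zip code, the same kind of data-entry error that shows up as -21226 in the table of zip codes.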

Values with specific characteristics

> table(restData$zipCode %in% c("21212"))

FALSE  TRUE
 1299    28

> table(restData$zipCode %in% c("21212", "21213"))

FALSE  TRUE
 1268    59

> restData[restData$zipCode %in% c("21212", "21213"), ]
                                     name zipCode                neighborhood councilDistrict policeDistrict
29                      BAY ATLANTIC CLUB   21212                    Downtown              11        CENTRAL
39                            BERMUDA BAR   21213               Broadway East              12        EASTERN
92                              ATWATER'S   21212   Chinquapin Park-Belvedere               4       NORTHERN
111            BALTIMORE ESTONIAN SOCIETY   21213          South Clifton Park              12        EASTERN
187                              CAFE ZEN   21212                    Rosebank               4       NORTHERN
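
One reason to prefer %in% over == for this kind of filtering is how each treats missing values. A small sketch, assuming a made-up vector zips that is not taken from restData:

```r
# Made-up zip codes with one missing value (illustrative only)
zips <- c(21212, 21213, NA, 21201, 21212)

in_set  <- zips %in% c(21212, 21213)  # NA counts as "no match", so the result has no NA
eq_test <- zips == 21212              # NA propagates into the result

print(table(in_set))
print(sum(in_set))            # number of matches
print(zips[in_set])           # subsetting with %in% never yields NA entries
print(zips[which(eq_test)])   # with ==, which() is needed to drop the NA
```

Indexing directly with eq_test would return an NA element for the missing zip code, which is easy to overlook in a large data frame.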

Cross tabs

data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
DF
summary(DF)

xt <- xtabs(Freq ~ Gender + Admit, data = DF)   ## Freq must be a numeric (or integer) column, because its values are summed within each cell
xt
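
To show what the left-hand side of the formula does, here is a minimal sketch on a hypothetical data frame toy (all values invented). With Freq on the left, xtabs sums it within each cell; with an empty left-hand side, it counts rows instead.

```r
# Hypothetical long-format data (not UCBAdmissions)
toy <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Male"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected", "Admitted"),
  Freq   = c(10, 5, 7, 3, 2),
  stringsAsFactors = TRUE
)

xt_sum   <- xtabs(Freq ~ Gender + Admit, data = toy)  # sums Freq within each cell
xt_count <- xtabs(~ Gender + Admit, data = toy)       # no left-hand side: counts rows

by_gender <- margin.table(xt_sum, 1)                  # row totals of the summed table

print(xt_sum)
print(xt_count)
print(by_gender)
```

The two Male/Admitted rows contribute 10 + 2 = 12 to xt_sum but count as 2 in xt_count.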

Flat tables

> warpbreaks$replicate <- rep(1:9, len = 54)
> xt = xtabs(breaks ~., data = warpbreaks)        ## equivalent to xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)
> xt
, , replicate = 1

    tension
wool  L  M  H
   A 26 18 36
   B 27 42 20

, , replicate = 2

    tension
wool  L  M  H
   A 30 21 21
   B 14 26 21

, , replicate = 3

    tension
wool  L  M  H
   A 54 29 24
   B 29 19 24


> ftable(xt)
             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28
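
The layout of the flat table can be changed with the row.vars argument of ftable. The sketch below works on a copy of the built-in warpbreaks data (the name wb is an assumption, used so the original data set is left unmodified) and compares the default layout with one that keeps only wool in the rows.

```r
# Work on a copy so the built-in data set stays untouched
wb <- warpbreaks
wb$replicate <- rep(1:9, length.out = 54)

xt <- xtabs(breaks ~ ., data = wb)          # 3-way table: wool x tension x replicate

flat_default <- ftable(xt)                  # wool and tension in rows, replicate in columns
flat_by_wool <- ftable(xt, row.vars = "wool")  # only wool in rows, the rest in columns

print(dim(xt))
print(dim(flat_default))
print(flat_by_wool)
```

Both flat tables hold the same 54 cell values; only the split between row and column variables differs.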

Size of a data set

> fakeData = rnorm(1e5)
> object.size(fakeData)
800040 bytes
> print(object.size(fakeData), units = "Mb")
0.8 Mb
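
As a rough check on that number, a double takes 8 bytes, so 1e5 values need about 800000 bytes plus a small vector header whose exact size depends on the R version. The sketch below, a minimal illustration with an added integer vector for comparison (intData is an assumption, not in the original notes), also uses format to get the size as a string.

```r
set.seed(1)                       # only to make the example reproducible
fakeData <- rnorm(1e5)            # 1e5 doubles, 8 bytes each
intData  <- integer(1e5)          # 1e5 integers, 4 bytes each

sz_double <- object.size(fakeData)
sz_int    <- object.size(intData)

# format() returns the size as a character string in the requested unit
print(format(sz_double, units = "Kb"))
print(format(sz_int, units = "Kb"))
print(as.numeric(sz_double) / as.numeric(sz_int))  # close to 2
```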

Reposted from: https://my.oschina.net/skyler/blog/714702