Getting and Cleaning Data-Week 4 Quiz

 

Contents

Q1:

Q2:

Q3:

Q4:

Q5:



Q1:

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123 element of the resulting list?

ANS:

#load the data:
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(fileUrl,destfile="./data/Dataset.csv")
data1<-read.csv("./data/Dataset.csv")

#see what's in the data in general:
head(data1)

> q1<-strsplit(names(data1),"wgtp")
> q1[123]
[[1]]
[1] ""   "15"
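To see why the result has an empty first element, here is a minimal sketch on a made-up vector of column names (the vector `example_names` is hypothetical, not taken from the survey file). strsplit() drops the matched pattern and returns whatever sits on either side of it, so a name that starts with "wgtp" yields an empty string followed by the remainder.

```r
# Hypothetical column names, chosen only to illustrate the split.
example_names <- c("SERIALNO", "wgtp15", "wgtp7")

# Split every name on the literal characters "wgtp".
parts <- strsplit(example_names, "wgtp")

# A name without the pattern comes back unchanged.
parts[[1]]   # "SERIALNO"

# A name starting with the pattern gives "" plus the rest.
parts[[2]]   # ""  "15"
```

Using `[[ ]]` returns the character vector itself, while `[ ]` (as in the answer above) returns a one-element list containing it.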


Q2:

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

ANS:

#load the data
> if(!file.exists("./data")){dir.create("./data")}
> fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv"
> download.file(fileUrl,destfile="./data/DATA2.csv")
#read the data
> data2<-read.csv("./data/DATA2.csv",skip = 4, nrows = 190, stringsAsFactors = FALSE)[,c(1, 2, 4, 5)]
#check the detail
> dim(data2)
[1] 190   4
> str(data2)
'data.frame':	190 obs. of  4 variables:
 $ X  : chr  "USA" "CHN" "JPN" "DEU" ...
 $ X.1: int  1 2 3 4 5 6 7 8 9 10 ...
 $ X.3: chr  "United States" "China" "Japan" "Germany" ...
 $ X.4: chr  " 16,244,600 " " 8,227,103 " " 5,959,718 " " 3,428,131 " ...
> num<-gsub(",","",data2$X.4)
> num1<-as.numeric(num)
> mean(num1)
[1] 377652.4
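The comma-stripping step can be checked on a small scale. The sketch below uses the first three GDP strings visible in the str() output above; the variable names `gdp_raw` and `gdp_num` are made up for illustration. as.numeric() tolerates the surrounding spaces but not the commas, which is why gsub() is needed first.

```r
# First three GDP strings as printed by str(data2) above.
gdp_raw <- c(" 16,244,600 ", " 8,227,103 ", " 5,959,718 ")

# Remove every comma, then convert to numeric.
# Leading and trailing spaces are accepted by as.numeric().
gdp_num <- as.numeric(gsub(",", "", gdp_raw))

# Average of these three values only, not of all 190 countries.
mean(gdp_num)   # 10143807
```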


Q3:

In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with "United"? Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

  • grep("*United",countryNames), 2
  • grep("^United",countryNames), 4
  • grep("United$",countryNames), 3
  • grep("^United",countryNames), 3 (correct answer)
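The role of the `^` anchor can be illustrated with a short sketch. The vector below is hypothetical and much shorter than the real country list, so its count differs from the quiz answer. `^United` matches only names that begin with "United", whereas an unanchored pattern would also match names that merely contain it.

```r
# Hypothetical country names, for illustration only.
countryNames <- c("United States", "United Kingdom",
                  "Tanzania, United Republic of", "France")

# Anchored: the name must start with "United".
n_start <- sum(grepl("^United", countryNames))

# Unanchored: "United" may appear anywhere in the name.
n_anywhere <- sum(grepl("United", countryNames))

n_start      # 2
n_anywhere   # 3
```

length(grep(...)) gives the same counts; grepl() with sum() is just one way to count matches.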

Q4:

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table

http://data.worldbank.org/data-catalog/ed-stats

ANS:

#load the GDP data and give data3 descriptive column names
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv"
download.file(fileUrl,destfile="./data/DATA3.csv")
data3<-read.csv("./data/DATA3.csv",skip = 4, nrows = 190, stringsAsFactors = FALSE)[,c(1,2,4,5)]
colnames(data3) <- c("CountryCode", "Ranking", "Economy", "GDP")

#load data4
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv"
download.file(fileUrl1,destfile="./data/DATA4.csv")
data4<-read.csv("./data/DATA4.csv")

#merge data together
mergedata<-merge(data3,data4,by="CountryCode")

indexFiscal <- grep("fiscal", tolower(mergedata$Special.Notes))
sum(grepl("june", tolower(mergedata$Special.Notes[indexFiscal])))

13
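The merge-then-filter logic above can be sketched on two tiny data frames. Everything in this example is invented for illustration: the country codes, the note texts, and the names `gdp`, `edu`, and `merged`. merge() keeps only the codes present in both tables, and the count then looks for notes that mention both a fiscal year and June.

```r
# Hypothetical GDP table with three made-up country codes.
gdp <- data.frame(CountryCode = c("AAA", "BBB", "CCC"),
                  Ranking = 1:3,
                  stringsAsFactors = FALSE)

# Hypothetical education table; "DDD" has no GDP row, "CCC" has no notes row.
edu <- data.frame(CountryCode = c("AAA", "BBB", "DDD"),
                  Special.Notes = c("Fiscal year end: June 30.",
                                    "Fiscal year end: March 31.",
                                    "Fiscal year end: June 30."),
                  stringsAsFactors = FALSE)

# Inner join on the shared country code: only AAA and BBB survive.
merged <- merge(gdp, edu, by = "CountryCode")

# Keep notes that mention a fiscal year, then count those mentioning June.
notes <- tolower(merged$Special.Notes)
fiscal_notes <- notes[grepl("fiscal", notes)]
june_count <- sum(grepl("june", fiscal_notes))

june_count   # 1
```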


Q5:

You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon's stock price and get the times the data was sampled.

library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

How many values were collected in 2012? How many values were collected on Mondays in 2012?

ANS:

> length(grep("2012",sampleTimes))
[1] 250
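The second part of the question (values collected on Mondays in 2012) can be approached the same way. The sketch below is one possible approach, shown on a few hand-picked dates instead of the downloaded `sampleTimes`; the vector `example_times` is hypothetical. It uses format() with "%Y" for the year and "%u" for the ISO weekday number (1 = Monday), which avoids depending on the locale's weekday names.

```r
# Hypothetical sample dates; 2011-12-26, 2012-01-02 and 2012-01-09 are Mondays.
example_times <- as.Date(c("2011-12-26", "2012-01-02",
                           "2012-01-03", "2012-01-09"))

# TRUE for dates that fall in 2012.
in_2012 <- format(example_times, "%Y") == "2012"

# TRUE for Mondays ("%u" gives the weekday as 1-7, Monday being 1).
is_monday <- format(example_times, "%u") == "1"

sum(in_2012)               # 3
sum(in_2012 & is_monday)   # 2
```

Applying the same two conditions to `sampleTimes` from the question would give the counts for the Amazon data.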

 
