2021-09-17

Week 2 Exercises Exercise 2A

  1. Using the data.frame command, create a new dataframe in R with the data from this dataset.
    data1A=data.frame(
    Name=c("Luke","Maul","Vader","JarJar","Han","Windu","Yoda"),
    Human=c("Yes","No","Yes","No","Yes","Yes","No"),
    Sabre=c("Blue","Red","Red","Nil","Nil","Purple","Green"),
    Height=c(171,183,190,202,181,184,66),
    Force=c(4,3,5,0,2,4,5),
    Affil=c("Light","Dark","Dark","Light","Light","Light","Light")
    )

  2. Check the type of each variable.
    str(data1A)

  3. What, if anything, would you change?
    Open answer. Suggested: change Human and Affil to Factor; maybe Force to int. (Note: R versions before 4.0 converted character columns to factors automatically via stringsAsFactors=TRUE; from R 4.0, data.frame() leaves them as character.)

  4. Change the existing Human and Affil to Factor variables.
    data1A$Human=as.factor(data1A$Human)
    data1A$Affil=as.factor(data1A$Affil)
    str(data1A)
    (If Human and Affil already show as Factors in str(), your R version (before 4.0) converted them automatically; in R 4.0 and later you have to do this conversion yourself.)

  5. Draw a boxplot of Height
    boxplot(data1A$Height)

  6. Do you see any potential outliers? Is it a mistake or a genuine outlier?
    Yes, it’s a genuine outlier. It’s Yoda!
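    A hedged way to confirm which row it is (a minimal sketch using the data1A dataframe from (1)):
    data1A[which.min(data1A$Height),] #pulls out the row with the smallest Height: Yoda, 66cm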

7. (a) Copy this data table into Excel.
(b) Save it as a tab-delimited text (.txt) file on your computer.
(c) Use read.table() to read the .txt file into R.
data1A_v2=read.table("C:/Users/ianch/Desktop/data1A.txt",header=T) #for my computer; R paths need forward slashes (or doubled backslashes)

8. (a) Check the type of each variable.
str(data1A_v2)
(b) Any different from (2)?
Height and Force are stored as int in data1A_v2 but num in data1A.
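A minimal check of this difference (using the two dataframes from above):
class(data1A$Height) #"numeric": c(171,183,...) creates doubles
class(data1A_v2$Height) #"integer": read.table() infers integer for whole numbers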

Exercise 2B

  1. Install the following packages: ggplot2, GGally, grid & gridExtra.
    install.packages("ggplot2")
    library(ggplot2)
    install.packages("GGally")
    library(GGally)
    install.packages("grid")
    library(grid)
    install.packages("gridExtra")
    library(gridExtra)

2. (a) Make a duplicate of the midwest dataset. Name it anything you want.
data2B=midwest
(b) In this new dataset, convert state, inmetro and category to Factor variables.
data2B$state=as.factor(data2B$state)
data2B$inmetro=as.factor(data2B$inmetro)
data2B$category=as.factor(data2B$category)
(c) Check that they have been converted.
str(data2B)

3. (a) What states are in the census?
levels(data2B$state)
"IL" "IN" "MI" "OH" "WI"
(b) Plot a barplot of the state variable.
ggplot(data2B,aes(state)) + geom_bar()
(c) Do you see any potential problems?
Open answer. Maybe: the states are not evenly represented.
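To quantify the imbalance (a small sketch using the data2B dataframe from (2)):
table(data2B$state) #raw counts of counties per state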

4. (a) Plot a boxplot of the percwhite variable (using Base R or ggplot2) to check for outliers.
ggplot(data2B,aes(y=percwhite)) + geom_boxplot()
boxplot(data2B$percwhite)
(b) One potential outlier (the one far separated from everything else) could be a data entry mistake. Try to find out which observation this is.
which(data2B$percwhite<20)
(c) View the corresponding row of the dataset.
data2B[405,]
(d) What county and state is this? (Hint: you may have to call up specific cells if the names are truncated)
data2B[405,c(“county”,“state”)]
MENOMINEE, WI
(e) Do a google search for this county and state. Do you think this was a data entry mistake?
Open answer. Suggested: No, the county was a previous Amerindian reservation and very few Caucasians lived there.

5. (a) Plot a boxplot of the percamerindan variable.
ggplot(data2B,aes(y=percamerindan)) + geom_boxplot()
boxplot(data2B$percamerindan)
(b) Identify that outlier using the same procedure as in (4).
which(data2B$percamerindan>80)
data2B[405,]
MENOMINEE, WI
(c) How does this inform your conclusions from (4)?
Open answer. Suggested: It confirms my conclusion.

6. (a) Plot a histogram of the percasian variable.
ggplot(data2B,aes(percasian)) + geom_histogram()
(b) Adjust the bin width to 0.5.
ggplot(data2B,aes(percasian)) + geom_histogram(binwidth=0.5)
(c) Do you see any potential problems?
Open answer. Suggested: No obvious outliers or other problems.

  7. Plot and inspect the correlation matrix of the dataset.
    ggcorr(data2B)

  8. Do a pairs plot of columns 3,4,6,12:16,19,25,27 of the dataset.
    ggpairs(data2B[,c(3:4,6,12:16,19,25,27)])

  9. From the pairs plot, we see a few potentially interesting things that we want to take a closer look at:
    (a)(i) Let’s check the proportions of inmetro (categorical) in each state (categorical): create two contingency tables of percentages of state vs. inmetro, one where rows sum to 100 and one where columns sum to 100.
    addmargins(prop.table(table(data2B$state,data2B$inmetro),1)*100,2)
    addmargins(prop.table(table(data2B$state,data2B$inmetro),2)*100,1)
    (a)(ii) Do you see any potential problems?
    Open answer. Suggested: Maybe, the inmetro 1 vs. 0 categories look quite different across different states.
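    If you want a more formal check than eyeballing the percentages (not asked for here), a chi-squared test of independence is one option:
    chisq.test(table(data2B$state,data2B$inmetro)) #a small p-value would suggest the proportions differ across states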

(b)(i) Let’s check how popdensity (continuous) varies across states (categorical): plot a boxplot using ggplot2 of the two variables.
ggplot(data2B,aes(x=state,y=popdensity)) + geom_boxplot()
(b)(ii) Do you see any clear differences across states?
Open answer. Suggested: No (aside from the one potential outlier in IL).

(c)(i) Let's check how percwhite (continuous) varies with percblack (continuous): plot a scatterplot using ggplot2 of the two variables.
ggplot(data2B,aes(x=percwhite,y=percblack)) + geom_point()
(c)(ii) There is a very clear pattern from (c)(i). Why do you think this is so?
Open answer. Suggested: These are the two biggest ethnic groups and would covary inversely.
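A quick numerical companion to the scatterplot (a sketch; cor() defaults to Pearson):
cor(data2B$percwhite,data2B$percblack) #expect a strong negative value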
(c)(iii) What county do you think is that one point separate from the rest at percwhite < 25?
MENOMINEE, WI
(c)(iv) Try adding a linear trend line to the plot from (c)(i).
ggplot(data2B,aes(x=percwhite,y=percblack)) + geom_point() + geom_smooth(method=lm)
(c)(v) From the dataset in (2), create another dataset without MENOMINEE, WI (row 405).
data2B_v2=data2B[-405,]
(c)(vi) Using the dim() function, check that your number of rows has decreased by 1.
dim(data2B)
dim(data2B_v2)
(c)(vii) Using this new dataset, do another scatterplot of percwhite and percblack, with a linear trend line.
ggplot(data2B_v2,aes(x=percwhite,y=percblack)) + geom_point() + geom_smooth(method=lm)
(c)(viii) Which trend line better represents the data as a whole? Would you remove MENOMINEE, WI from the dataset? Why or why not?
Open answer. Suggested: The second trend line is a better representation. Yes, I would if I’m looking at the relationship between these two variables.

(d)(i) Let’s check how inmetro may affect percollege. What are their data types and what plot would you use?
inmetro is Categorical, percollege is Continuous. I would use a boxplot.
(d)(ii) Plot a boxplot of percollege for different levels of inmetro.
ggplot(data2B,aes(x=inmetro,y=percollege)) + geom_boxplot()
(d)(iii) Do there look to be any differences between inmetro=0 and inmetro=1 with respect to percollege values?
Open answer. Suggested: Yes, it looks like inmetro=1 has generally higher percollege values.

(e)(i) Let’s check how inmetro may affect percwhite. What are their data types and what plot would you use?
inmetro is Categorical, percwhite is Continuous. I would use a boxplot.
(e)(ii) Plot a boxplot of percwhite for different levels of inmetro.
ggplot(data2B,aes(x=inmetro,y=percwhite)) + geom_boxplot()
(e)(iii) Do there look to be any differences between inmetro=0 and inmetro=1 with respect to percwhite values?
Open answer. Suggested: Maybe, it looks like there may be lower percwhite in inmetro=1, but the difference is not as clear.

(f)(i) Let's check how percollege varies with percasian. What are their data types and what plot would you use?
Both are continuous. I would use a scatterplot.
(f)(ii) Plot a scatterplot of percollege and percasian (from the original dataset in (2)), and add a linear trend line.
ggplot(data=data2B,aes(x=percasian,y=percollege)) + geom_point() + geom_smooth(method=lm)


Week 3 Exercises Exercise 3A


  1. Let’s import the “Data_Biodiversity.xlsx” dataset and prepare it for analysis.
    (a) In your computer, save the “Data_Biodiversity.xlsx” file as a csv file.

(b) Import the csv file into R as a dataframe called “dBio”.
dBio=read.csv("C:/Users/HP/Documents/OneDrive - National University of Singapore/20210809to1204 - NUS Modules AY2021-22 Sem1/BL5233/Datasets/Data_Biodiversity.csv")

(c) Change specType to a Factor and inspect the data types to make sure.
str(dBio)
dBio$specType=as.factor(dBio$specType)

  2. Let's check each variable.
    (a)(i) Check all the continuous variables for outliers using boxplots.
    boxplot(dBio$dVal,dBio$weight,dBio$height,dBio$lifeEx)
    (a)(ii) Now try plotting them using ggplot.
    require(ggplot2)
    ggplot(dBio,aes(y=dVal))+geom_boxplot()
    ggplot(dBio,aes(y=weight))+geom_boxplot()
    ggplot(dBio,aes(y=height))+geom_boxplot()
    ggplot(dBio,aes(y=lifeEx))+geom_boxplot()
    (a)(iii) Which method was easier?
    Open answer.

(b)(i) Check the counts of each category of the categorical variables specType and trapped individually.
table(dBio$specType)
table(dBio$trapped)
(b)(ii) Create a 2-way contingency table of the two variables.
table(dBio$trapped,dBio$specType)
(b)(iii) Do you see any pattern here?
Yes, specType A seems to be trapped more often than B.

(c)(i) Do a pairs plot for all the variables using Base R.
pairs(dBio)
(c)(ii) Do you see any potential patterns?
Open answer.

  • Weight and height look positively correlated (expected).
  • Probable relationship between specType and height/weight.
  • Possible relationship between height and trapped.
  • The pattern between specType and trapped from (2b) is hard to see here.
  3. Let's examine the relationship between specType and trapped we identified in (2b).
    (a) The explanatory variable (the factor) is specType. What type is it?
    Categorical

(b) The response variable is trapped. What type is it?
Categorical

(c) What test should we use to check whether the 2 variables are associated?
Chi-squared test, or Fisher's exact test if any expected count is < 5.
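To check whether any expected count is actually < 5 (a small sketch; chisq.test() stores the expected counts in $expected):
chisq.test(table(dBio$specType,dBio$trapped))$expected #if any cell < 5, use fisher.test() instead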

(d) Perform the test in (c).
#Either of these will work
chisq.test(x=dBio$specType,y=dBio$trapped)
chisq.test(table(dBio$specType,dBio$trapped))

(e) What is the p-value? Are the 2 variables associated?
p-value = 0.0281. Yes, there is evidence to suggest that the variables are associated.

  4. Let's examine the relationship between weight and height.
    (a) What type of variables are they both?
    Continuous.

(b) Using Base R, plot a scatterplot of weight (y) against height (x) and add a trendline.
plot(dBio$weight~dBio$height)
abline(lm(dBio$weight~dBio$height))

(c) What test should we use to test whether the 2 variables are correlated?
Pearson or Spearman correlation, depending on whether the variables are normally distributed.

(d) Perform the test in (c).
shapiro.test(dBio$weight) #not normal
shapiro.test(dBio$height) #normal
#Since weight is not normal, do Spearman
cor.test(dBio$height,dBio$weight,method="spearman")

(e)(i) What is the p-value? Are they correlated?
p-value = 0.0024. Yes, there is evidence to suggest they are correlated.
(e)(ii) Does this relationship make sense? What else do you notice?
Yes, an animal which is taller would tend to be heavier. However, it looks like there are two separate groups of points.

(f)(i) Do the same plot in (b) using ggplot, but add colour to the points according to specType.
ggplot(dBio,aes(x=height,y=weight))+geom_point(aes(col=specType))+geom_smooth(method=lm)
(f)(ii) What does this graph tell you?
specType A is taller and heavier than specType B.

  5. Let's compare the height and weight of the two specType categories.
    (a)(i) Extract the heights of specType A only and store them in a new vector.
    aHt=dBio$height[dBio$specType=="A"]
    (a)(ii) Do the same for the heights of specType B.
    bHt=dBio$height[dBio$specType=="B"]
    (a)(iii) Test whether they are both normally distributed.
    shapiro.test(aHt)
    shapiro.test(bHt)
    (a)(iv) Are they normally distributed? What else do we need to check before we can decide on a statistical test?
    Yes. We need to check for equal variance.
    (a)(v) Test whether the variances of the two vectors are equal.
    var.test(aHt,bHt)
    (a)(vi) What test should we do to compare the means of the 2 groups?
    Their variances are equal. We can use the Student’s t-test.
    (a)(vii) Perform the test identified in (a)(vi). What is your conclusion?
    #Either one will work; note that var.equal=T gives the Student's t-test (t.test() defaults to Welch's)
    t.test(aHt,bHt,var.equal=T)
    t.test(height~specType,data=dBio,var.equal=T)
    #p-value < 0.001. The mean height of the two groups is clearly different.
    (a)(viii) Do a one-tailed alternative testing whether the heights of specType A are larger than that of B.
    #Either one will work
    t.test(aHt,bHt,alternative="greater",var.equal=T)
    t.test(height~specType,data=dBio,alternative="greater",var.equal=T)

(b) Compare the weights of specType A and B using a similar procedure to the above.
#Create the vectors
aWt=dBio$weight[dBio$specType=="A"]
bWt=dBio$weight[dBio$specType=="B"]
#Test for normality
shapiro.test(aWt) #not normal, therefore do Mann-Whitney U test
shapiro.test(bWt)
#Mann-Whitney U test (two-tailed)
wilcox.test(aWt,bWt)
#Mann-Whitney U test (one-tailed)
wilcox.test(aWt,bWt,alternative="greater")
#Conclusion: specType A has larger weight than B (consistent with the scatterplot in (4)(f)).

  6. Let's analyse the "Data_PairedWeight.xlsx" dataset now.
    (a) Save it as a tab delimited .txt file and import it into R.
    dPW=read.table("C:/Users/HP/Documents/OneDrive - National University of Singapore/20210809to1204 - NUS Modules AY2021-22 Sem1/BL5233/Datasets/Data_PairedWeight.txt",header=T)

(b) Inspect the data types. Are there any changes you would make? If there are, make them and check the dataset again.
#Open answer. Good idea to change year to a factor.
str(dPW)
dPW$year=as.factor(dPW$year)
str(dPW)

(c) We want to compare the weight and bPull of the same individuals measured at 2 different years. What type of tests should we use?
Paired t-test if the data are normally distributed. Wilcoxon signed-rank if not.

(d) Similar to in (5), extract the weight and bPull of the different years into separate vectors.
wt00=dPW$weight[dPW$year=="2000"]
wt06=dPW$weight[dPW$year=="2006"]
bP00=dPW$bPull[dPW$year=="2000"]
bP06=dPW$bPull[dPW$year=="2006"]

(e) Test whether each vector is normally distributed.
shapiro.test(wt00)
shapiro.test(wt06)
shapiro.test(bP00) #very close, may be safer to use wilcoxon
shapiro.test(bP06)
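Strictly, the paired t-test assumes the paired differences are normal, so it is arguably safer to test the differences themselves (a sketch, assuming the vectors are in matching individual order):
shapiro.test(wt06-wt00) #normality of the paired weight differences
shapiro.test(bP06-bP00) #normality of the paired bPull differences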

(f)(i) Use a paired t-test to compare the weight of the two groups (2000 vs. 2006).
t.test(wt00,wt06,paired=T)
(f)(ii) Visualise the difference using a suitable plot.
boxplot(wt00,wt06)
(f)(iii) What is your conclusion?
p-value < 0.001. Weight in 2006 is greater than in 2000.
(g)(i) Use a Wilcoxon signed-rank test to compare the bPull of the two groups (2000 vs. 2006).
wilcox.test(bP00,bP06,paired=T)

(g)(ii) Visualise the difference using a suitable plot.
boxplot(bP00,bP06)
(g)(iii) What is your conclusion?
p-value = 0.002. There is evidence to suggest bPull is significantly higher in 2006. However, it would be good to check the outliers in 2000.

Week 4 Exercises Exercise 4A

  1. Import the “Data_EnvImpact.xlsx” dataset and prepare it for analysis.
    (a) In your computer, save the “Data_EnvImpact.xlsx” file as a .csv or a tab-delimited .txt file.
    (b) Import the file into R as a dataframe called “dEI”.
    dEI=read.csv("Data_EnvImpact.csv")
    dEI=read.table("Data_EnvImpact.txt",header=T)
    (c) Check your variable types. Does anything need to be changed?
    str(dEI)
    #Nothing needs to be changed.

  2. Let’s do a power analysis to see whether we have a sufficiently large sample size. We want to fit a linear model using “popn” to explain 50% of the variation in “impact”.
    (a) What is your u-value?
    #We have 2 coefficients, so u=1
    (b) What is your f2-value?
    #We aim to explain 50% variation, so f2=0.5/(1-0.5)=1
    (c) Perform the power analysis. What is your minimum sample size, n?
    require(pwr)
    pwr.f2.test(u=1,f2=1, sig.level = 0.05, power=0.8)
    #n=u+1+v=1+1+8.17=10.17, so minimum sample size = 11
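    To get n programmatically instead of by hand (a sketch; pwr.f2.test() returns v, the denominator degrees of freedom):
    res=pwr.f2.test(u=1,f2=1,sig.level=0.05,power=0.8)
    ceiling(res$u+res$v+1) #n = u + v + 1, rounded up = 11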
    (d) Based on this, do you have enough observations in the dEI dataset?
    str(dEI)
    #93 observations > the minimum of 11, so yes we have enough observations

  3. Perform the linear regression.
    (a) Fit the model using “popn” to explain “impact” and view the results. Does “popn” explain “impact”?
    mod1=lm(impact~popn,data=dEI)
    summary(mod1)
    #Yes, p-value < 0.05.
    (b) Check the model assumptions by plotting diagnostic plots. Do you see any potential problems?
    par(mfrow=c(2,2))
    plot(mod1)
    #It looks OK, although there may be some heteroscedasticity (Scale-Location plot).
    (c) Check for heteroscedasticity using bptest() and normal residuals using shapiro.test().
    require(lmtest)
    bptest(mod1)
    shapiro.test(resid(mod1))
    #The BP test suggests that there may be heteroscedasticity, but the plot looks OK. However, I would not worry too much about it in this case. It looks like a quirk of the datapoints in this particular dataset—based on the plot, I suspect that if more datapoints were collected, the heteroscedasticity would be solved.
    (d) Plot “impact” against “popn” in a scatterplot.
    #Base R
    plot(impact~popn,data=dEI,pch=16)
    #Ggplot
    require(ggplot2)
    ggplot(data=dEI,aes(x=popn,y=impact))+geom_point()
    (e) Add the model trendline to the scatterplot in (d).
    #Base R
    abline(lm(impact~popn,data=dEI),col="red")
    #Ggplot
    ggplot(data=dEI,aes(x=popn,y=impact))+geom_point()+geom_smooth(method="lm")
    (f) Use the predict() function to predict the value of "impact" if "popn"=200.
    predict(mod1,list(popn=200))
    #Predicted impact = 1010.17
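    To attach uncertainty to that prediction (a sketch; predict.lm() supports the interval= argument):
    predict(mod1,list(popn=200),interval="confidence") #95% CI for the mean impact at popn=200
    predict(mod1,list(popn=200),interval="prediction") #wider 95% interval for a single new observation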

Exercise 4B

  1. Use the same dataset from Exercise 4A:
    (a) Do a pairs plot of all the variables.
    #Base R
    pairs(dEI,panel=panel.smooth)
    #Ggplot
    require(GGally)
    ggpairs(dEI)
    (b) Do you see any interesting relationships between “impact” and any other variables?
    #Open answer. Definite positive correlation to “popn”. Possible negative correlation to “alt”.
    (c) Do a coplot() with "impact" vs "area" as the main variables, and "alt" split into different levels on the horizontal axis.
    coplot(impact~area|alt,data=dEI,row=1)
    (d) What variables/interactions would you include in a Multiple Regression model?
    #Open answer that should be based on (b) and (c). I would at least include "popn", "alt" and their interaction. There may be an interaction between "alt" and "area", so we can also include "area" to see if it has an effect. It doesn't look like there are any non-linear relationships.

  2. Let’s fit a maximal model and do stepwise simplification.
    (a) Fit a model of “area”, “alt” and “popn” and all their interactions (including the 3-way interaction).
    mod1=lm(impact~area*alt*popn,data=dEI)
    (b) View the results, what term would you remove first?
    summary(mod1)
    #I would remove area:alt:popn
    (c) Use step() to help you remove the clearly non-significant terms and view the results.
    mod2=step(mod1)
    summary(mod2)
    (d) This looks a little fast, let’s try adding “popn:alt” back into the model using update() and “+” instead of “-”.
    mod3=update(mod2,~.+popn:alt)
    summary(mod3)
    (e) Should we keep “popn:alt”?
    #No, it is non-significant.
    (f) Is the model from (c) the minimum adequate model?
    #Yes, we cannot remove “area” because the “area:alt” interaction is significant.
    (g) Check the minimum adequate model using diagnostic plots. Do you see any problems?
    par(mfrow=c(2,2))
    plot(mod2)
    #Everything looks fine!
    (h) Compare the models from (c) and (d) using AIC and an F-test. Which is better? Is it significantly better?
    AIC(mod2,mod3)
    anova(mod2,mod3)
    #mod2 is better, but not significantly so.
    (i) Assuming the model from (c) is your final model, explain the results in your own words.
    summary(mod2)
    #“impact” has a large and significant positive correlation with “popn”
    #“impact” has a small and significant positive correlation with “alt”, and marginally significant positive correlation with “area”
    #The interaction of “area:alt” seems to have a significant negative effect on “impact”
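    For reporting, confidence intervals on the coefficients are often more informative than p-values alone (a small sketch):
    confint(mod2) #95% CIs for each term in the minimum adequate model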

Week 5 Exercises Exercise 5A

  1. Let’s read in a dataset on looks and explore it.
    (a) Read in the GoodLooking.xlsx dataset as GL (remember to save to either csv or txt first).
    setwd("C:/Users/HP/Documents/OneDrive - National University of Singapore/20210809to1204 - NUS Modules AY2021-22 Sem1/BL5233/Datasets/Week 05")
    GL=read.csv("GoodLooking.csv")
    head(GL)
    (b) Check whether sex, nation and occup are factors. Convert them if they are not.
    str(GL)
    GL$sex=as.factor(GL$sex)
    GL$nation=as.factor(GL$nation)
    GL$occup=as.factor(GL$occup)
    (c) Draw separate boxplots to explore the looks variable separated by:
    (c)(i) Sex.
    boxplot(looks~sex,data=GL)
    (c)(ii) Nationality.
    boxplot(looks~nation,data=GL)
    (c)(iii) Occupation.
    boxplot(looks~occup,data=GL)
    (d) Which of the above look like there are interesting effects to explore?
    #Nationality and occupation.

  2. Let’s run an ANOVA to examine the effects of occupation first.
    (a) Fit an ANOVA using lm() to explain looks with occup.
    mod1=lm(looks~occup,data=GL)
    (b) Plot diagnostic plots. Does the model look OK?
    par(mfrow=c(2,2))
    plot(mod1)
    #Looks very good! No problems with heteroscedasticity or non-normal residuals.
    (c)(i) Call a summary to look at the effects of each level.
    summary(mod1)
    (c)(ii) What level is the model comparing all the levels to? Hint: use str() or levels() to check.
    str(GL)
    levels(GL$occup) #It is comparing all the levels to (B)ankers, the first level alphabetically
    (c)(iii) Do pairwise t-tests for all levels of occup (use a BH correction).
    pairwise.t.test(GL$looks,GL$occup,p.adjust.method="BH")
    #(L)ecturers are the best looking! They are significantly better-looking than (B)ankers and (P)ilots, and very nearly significantly better-looking than (F)iremen.
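    An alternative (not required by the exercise) is Tukey's HSD on an aov() fit, which also adjusts for all pairwise comparisons:
    TukeyHSD(aov(looks~occup,data=GL)) #pairwise level differences with family-wise adjusted p-values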

  3. Let’s run an ANOVA to examine the effects of nationality.
    (a) Fit an ANOVA to explain looks with nation.
    mod2=lm(looks~nation,data=GL)
    (b) Plot diagnostic plots for this model and comment on whether the model is OK.
    par(mfrow=c(2,2))
    plot(mod2)
    #Looks good!
    (c)(i) Assume the normality assumption is violated. What test would you perform instead?
    #A Kruskal-Wallis test
    (c)(ii) Perform a Kruskal-Wallis test on the same 2 variables in (3)(a) and view the results.
    mod2k=kruskal.test(looks~nation,data=GL)
    mod2k
    (c)(iii) Do pairwise Wilcoxon tests for all levels of nation (try a Bonferroni correction this time).
    pairwise.wilcox.test(GL$looks,GL$nation,p.adjust.method="bonferroni")
    (d)(i) Assume the homogeneity of variance assumption is violated. What analysis would you perform instead?
    #Welch’s One-way ANOVA
    (d)(ii) Perform the Welch’s One-way ANOVA on the same 2 variables in (3)(a).
    require(rstatix)
    welch_anova_test(looks~nation,data=GL)
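    Base R has the same test built in, if you would rather avoid the extra package:
    oneway.test(looks~nation,data=GL) #Welch's one-way ANOVA; var.equal=FALSE is the default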

Exercise 5B

  1. Let’s use the mtcars dataset shown in the lecture.
    (a) Save the dataset as d and convert cyl, am, gear and vs to factors (if they are not).
    d=mtcars
    d$cyl=as.factor(d$cyl)
    d$am=as.factor(d$am)
    d$gear=as.factor(d$gear)
    d$vs=as.factor(d$vs)
    (b) Fit a 3-way ANCOVA to explain qsec using am, cyl and vs, with vs interacting with am and vs interacting with cyl, plus 2 non-interacting covariates hp and disp.
    mod1=lm(qsec~hp+disp+am+cyl+vs+vs:am+vs:cyl,data=d)
    (c) Simplify the model using stepwise deletion.
    #Simplifying manually
    summary(mod1) #vs:am is non-significant and has the higher p-value of the 2 interactions, we remove vs:am first
    mod2=update(mod1,~.-vs:am)
    summary(mod2) #now we remove vs:cyl
    mod3=update(mod2,~.-vs:cyl)
    summary(mod3) #no more interactions, so let’s look at the individual variables; disp has the highest p-value, so we remove disp
    mod4=update(mod3,~.-disp)
    summary(mod4) #now cyl is no longer significant on both levels, so we remove cyl
    mod5=update(mod4,~.-cyl)
    summary(mod5)
    #Simplifying using step()
    mod5s=step(mod1)
    summary(mod5s)
    (d)(i) Plot diagnostic plots for the model. Do you see any potential problems?
    par(mfrow=c(2,2))
    plot(mod5)
    #There may be departure from normality and an influential datapoint (Merc 230)
    (d)(ii) Do hypothesis testing to check for normality and homogeneity of variance. Is there a problem?
    shapiro.test(resid(mod5))
    require(lawstat)
    gp2=paste(d$vs,d$am,sep="-")
    levene.test(d$qsec,gp2)
    #The residuals are not normally distributed
    (d)(iii) What shall we do to solve the problem(s) identified in (d)(i) and (d)(ii)?
    #Either remove Merc 230 and try again or log transform qsec and try again.
    #This is where the Art of data analysis comes in, do we remove the datapoint (and lose the information it gives) or transform the data (to try to keep the information)?
    #Let’s assume we know the dataset and we want to try to keep Merc 230 if we can.
    (e) Fit another ANCOVA with the same variables as in (d)(i) but log transform qsec.
    mod6=lm(log(qsec)~hp+vs+am,data=d)
    (f) Check the assumptions graphically and using hypothesis tests.
    par(mfrow=c(2,2))
    plot(mod6)
    shapiro.test(resid(mod6))
    #The residuals are now normally distributed. No need to do the Levene’s test again.
    (g) Is this your minimum adequate model?
    summary(mod6)
    #Yes! All the variables are significant.
    (h) Try to report the results in your own words, as if you are writing a report.
    #Open answer.
    #"Horsepower, cylinder arrangement and transmission type all have a significant effect on the speed of a car (all p-values < 0.001). Because qsec was log-transformed, the coefficients are proportional effects: each additional horsepower decreases the time taken to complete a quarter mile by roughly 0.07%, a v-shaped (vs. straight) cylinder arrangement decreases the same time by roughly 9%, and a manual (vs. automatic) transmission decreases it by roughly 9%."

Exercise 5C

  1. Let’s try to plot our final model from Exercise 5B: qsec~hp+vs+am.
    (a) qsec and hp are both continuous variables. What is our main plot type?
    #A scatter plot
    (b) Plot a scatterplot of qsec against hp using ggplot2. Note: our dataset in Exercise 5B was named d.
    require(ggplot2)
    p=ggplot(d,aes(x=hp,y=qsec))
    p+geom_point()
    (c) Add colour to the plot in (b) to differentiate between the levels of vs.
    p+geom_point(aes(col=vs))
    (d) Add different shapes to the plot in (c) to differentiate between the levels of am.
    p+geom_point(aes(col=vs,shape=am))
    (e) Try using colour for am and shape for vs instead.
    p+geom_point(aes(col=am,shape=vs))
    (f) Try using colour for am and size for vs.
    p+geom_point(aes(col=am,size=vs))
    (g) Which of the 3 plots in (d), (e) and (f) do you prefer?
    #Open answer. I prefer (e) because it is easier to interpret than (d), and triangles are more aesthetically pleasing than the big circles in (f).
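    To tie the plot back to the final model visually, one option (a sketch, not part of the exercise) is to overlay simple per-transmission trend lines:
    p+geom_point(aes(col=am,shape=vs))+geom_smooth(aes(col=am),method=lm,se=FALSE)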