Course walkthrough: lesson3.rmd
Lesson 3
========================================================
***
### What to Do First?
Notes:
***
### Pseudo-Facebook User Data
Notes:
```{r Pseudo-Facebook User Data}
getwd()
list.files()
pf <- read.csv('pseudo_facebook.tsv',sep='\t')
names(pf)
```
***
### Histogram of Users' Birthdays
Notes:
```{r Histogram of Users\' Birthdays}
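# install.packages(c('ggplot2', 'ggthemes', 'gridExtra'), dependencies = TRUE)  # run once, not on every knit
library(ggplot2)
library(ggthemes)
library(gridExtra)  # provides grid.arrange(), used in later chunks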
theme_set(theme_minimal(24))
names(pf)
ggplot(aes(x=dob_day),data=pf) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=1:31)
```
***
#### What are some things that you notice about this histogram?
Response: Day 1 has a large spike, likely because 1 is the default birthday value.
***
### Moira's Investigation
Notes: After making many plots, Moira found that many people underestimate the size of their audience (1/4).
***
### Estimating Your Audience Size
Notes:
```{r faceting}
ggplot(aes(x=dob_day),data=pf) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=1:31)+facet_wrap(~dob_month)
```
***
#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:
#### How many of your friends do you think saw that post?
Response:
#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:
***
### Perceived Audience Size
Notes:
***
### Faceting
Notes:
```{r Faceting}
ggplot(data=pf,aes(x=dob_day)) + geom_histogram(binwidth=1) + scale_x_continuous(breaks=1:31) + facet_wrap(~dob_month,ncol=4)
```
#### Let's take another look at our plot. What stands out to you here?
Response:
***
### Be Skeptical - Outliers and Anomalies
Notes:
***
### Moira's Outlier
Notes:
#### Which case do you think applies to Moira's outlier?
Response:
***
### Friend Count
Notes:
#### What code would you enter to create a histogram of friend counts?
```{r Friend Count}
ggplot(aes(x=friend_count),data=pf) + geom_histogram()
```
#### How is this plot similar to Moira's first plot?
Response:
***
### Limiting the Axes
Notes:
```{r Limiting the Axes}
ggplot(aes(x=friend_count),data=pf) + geom_histogram() + scale_x_continuous(limits = c(0, 1000))  # same 0-1000 range used in the later chunks
```
### Exploring with Bin Width
Notes:
***
### Adjusting the Bin Width
Notes:
### Faceting Friend Count
```{r Faceting Friend Count}
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))
ggplot(aes(x=friend_count),data=pf) + geom_histogram(binwidth=25) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + facet_wrap(~gender)
```
***
### Omitting NA Values
Notes:
```{r Omitting NA Values}
ggplot(aes(x=friend_count),data=subset(pf, !is.na(gender))) + geom_histogram(binwidth=25) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + facet_wrap(~gender)
```
***
### Statistics 'by' Gender
Notes:
```{r Statistics \'by\' Gender}
table(pf$gender)
by(pf$friend_count,pf$gender,summary)
```
#### Who on average has more friends: men or women?
Response:
#### What's the difference between the median friend count for women and men?
Response:
#### Why would the median be a better measure than the mean?
Response:
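A toy illustration of why the median can be the more robust summary for long-tailed data such as friend counts. The vector `toy_friend_counts` is made up for this sketch and is not taken from `pf`.

```{r Median vs Mean Sketch}
# Hypothetical friend counts: mostly small, with one very large value.
toy_friend_counts <- c(10, 20, 30, 40, 5000)
toy_mean <- mean(toy_friend_counts)      # pulled upward by the single large value
toy_median <- median(toy_friend_counts)  # middle value, unaffected by the large value
c(mean = toy_mean, median = toy_median)
```

The same comparison on the real data is what `by(pf$friend_count, pf$gender, summary)` reports per gender.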
***
### Tenure
Notes:
```{r Tenure}
ggplot(aes(x=tenure),data=pf) + geom_histogram(binwidth=30,color='black',fill='#099DD9')
```
***
#### How would you create a histogram of tenure by year?
```{r Tenure Histogram by Year}
ggplot(aes(x=tenure/365),data=pf) + geom_histogram(binwidth=.25,color='black',fill='#F79420')
```
***
### Labeling Plots
Notes:
```{r Labeling Plots}
ggplot(aes(x=tenure/365),data=pf) + geom_histogram(binwidth=.25,color='black',fill='#F79420') +
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7)) + xlab('Number of years using Facebook') + ylab('Number of users in sample')
```
***
### User Ages
Notes:
```{r User Ages}
ggplot(aes(x=age),data=pf) + geom_histogram(binwidth=1,color='black',fill='#5760AB') + scale_x_continuous(breaks=seq(0,113,5)) +
xlab('Age of Facebook users') + ylab('Number of users in sample')
```
#### What do you notice?
Response:
***
### The Spread of Memes
Notes:
***
### Lada's Money Bag Meme
Notes:
***
### Transforming Data
Notes:
```{r Transforming Data}
ggplot(aes(x=friend_count),data=subset(pf, !is.na(gender))) + geom_histogram(binwidth=20) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))
summary(pf$friend_count)
summary(log10(pf$friend_count +1))
summary(sqrt(pf$friend_count))
p1 = ggplot(aes(x=friend_count),data=pf) + geom_histogram(binwidth=10,color='black',fill='#099D99') + scale_x_continuous(limits=c(0,500),breaks=seq(0,500,50))
p2 = ggplot(aes(x=log(friend_count +1 )),data=pf) + geom_histogram(binwidth=1,color='black',fill='#F79420') + scale_x_continuous(limits=c(0,10),breaks=seq(0,10,1))
p3 = ggplot(aes(x=sqrt(friend_count)),data=pf) + geom_histogram(binwidth=1,color='black',fill='#5760AB') + scale_x_continuous(limits=c(0,40),breaks=seq(0,40,1))
grid.arrange(p1,p2,p3,ncol=1)
```
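A small sketch of why the summary above adds 1 before taking `log10`: a friend count of zero has no finite logarithm. The vector `toy_counts_zero` is invented for illustration.

```{r Log of Zero Sketch}
# Hypothetical counts that include a zero.
toy_counts_zero <- c(0, 9, 99)
raw_log <- log10(toy_counts_zero)          # first element is -Inf
shifted_log <- log10(toy_counts_zero + 1)  # all values finite
raw_log
shifted_log
```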
***
### Add a Scaling Layer
Notes:
```{r Add a Scaling Layer}
# way one:
logscale <- ggplot(aes(x=log10(friend_count)),data = pf) + geom_histogram(color='black',fill='#099D99')
countscale <- ggplot(aes(x=friend_count),data = pf) + geom_histogram( color='black',fill='#F79420') + scale_x_log10()
grid.arrange(logscale,countscale,ncol=1)
```
***
### Frequency Polygons
```{r Frequency Polygons}
p1 = ggplot(aes(x=friend_count,y= ..count.. / sum(..count..)),data=subset(pf,!is.na(gender))) +
  geom_freqpoly(aes(color=gender),binwidth=10) +
  scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) +
  xlab('number of friends') +
  ylab('Proportion of users with that friend count')
p2 = ggplot(aes(x=friend_count,y= ..count.. / sum(..count..)),data=subset(pf,!is.na(gender))) +
  geom_freqpoly(aes(color=gender),binwidth=10) +
  scale_x_log10() +  # a second x scale replaces scale_x_continuous(), so only the log10 scale is kept
  xlab('number of friends') +
  ylab('after log10')
p3 = ggplot(aes(x=www_likes),data=subset(pf,!is.na(gender))) +
  geom_freqpoly(aes(color=gender)) +
  scale_x_log10()
# scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) +
# xlab('www_likes') +
# ylab('Percentage of users with that friend count')
grid.arrange(p1,p2,p3,ncol=1)  # requires library(gridExtra)
```
***
### Likes on the Web
Notes:
```{r Likes on the Web}
by(pf$www_likes,pf$gender,sum)
```
***
### Box Plots
Notes:
```{r Box Plots}
p1 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
geom_boxplot()
p2 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
geom_boxplot() +
scale_y_continuous(limits = c(0,1000))
p3 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
geom_boxplot() +
coord_cartesian(ylim=c(0,1000))
grid.arrange(p1,p2,p3,ncol=1)
```
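The second and third box plots above restrict the y axis in different ways. A minimal sketch, assuming the usual ggplot2 behavior: `scale_y_continuous(limits = ...)` discards out-of-range observations before the box plot statistics are computed, while `coord_cartesian(ylim = ...)` only zooms the view and keeps all observations. The base R code below mimics that difference on a made-up vector, not on `pf`.

```{r Limits vs Zoom Sketch}
# Hypothetical friend counts with one value above 1000.
toy_counts <- c(100, 200, 300, 400, 2000)
median_all <- median(toy_counts)                          # all data kept, as with coord_cartesian()
median_trimmed <- median(toy_counts[toy_counts <= 1000])  # large value dropped first, as with scale limits
c(all = median_all, trimmed = median_trimmed)
```

Because the quartiles shift when data are dropped, the two plots can show different boxes for the same range.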
#### Adjust the code to focus on users who have friend counts between 0 and 1000.
```{r}
by(pf$friend_count,pf$gender,summary)
by(pf$friend_count,pf$gender,sum)
by(pf$friend_count,pf$gender,mean)
```
***
### Box Plots, Quartiles, and Friendships
Notes:
```{r Box Plots, Quartiles, and Friendships}
```
#### On average, who initiated more friendships in our sample: men or women?
Response:
#### Write about some ways that you can verify your answer.
Response:
```{r Friend Requests by Gender}
```
Response:
***
### Getting Logical
Notes:
```{r Getting Logical}
summary(pf$mobile_likes)
summary(pf$mobile_likes>0)
pf$mobile_check_in <- NA  # create the column on pf rather than a stray global variable
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
```
Response:
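One possible way to turn a check-in indicator into a share of users, sketched on a made-up vector (`toy_mobile_likes` is hypothetical and stands in for `pf$mobile_likes`):

```{r Check-in Share Sketch}
# Hypothetical mobile like counts for five users.
toy_mobile_likes <- c(0, 3, 0, 12, 7)
toy_check_in <- factor(ifelse(toy_mobile_likes > 0, 1, 0))
# Share of users with at least one mobile like.
share_mobile <- sum(toy_check_in == 1) / length(toy_check_in)
share_mobile
```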
***
### Analyzing One Variable
Reflection:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!
Problem set: practice_lesson3.R
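library(ggplot2)
data(diamonds)  # loads the diamonds data set that ships with ggplot2
diamondsinfo <- diamonds  # assumption: diamondsinfo was meant to be this data set
summary(diamondsinfo)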
ggplot(aes(x=price),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420')
ggsave('primicial_price.png')
summary(diamondsinfo$price)
a1 = summary(diamondsinfo$price < 500)
a2 = summary(diamondsinfo$price < 250)
a3 = summary(diamondsinfo$price >= 15000)
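# find the largest peak in the price histogram you created earlier
# note: if diamondsinfo is ggplot2's diamonds, the lowest price is 326, so limits of c(0, 300) leave this plot empty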
ggplot(aes(x=price),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420') + scale_x_continuous(limits = c(0,300))
ggsave('primicial_price_300.png')
# break out the histogram of diamond prices by cut
ggplot(aes(x=cut,y=price),data=diamondsinfo) + geom_bar(stat="identity",color='red',fill='#F79420') +
xlab('the class of cut') + ylab('the price ')
by(diamondsinfo$price,diamondsinfo$cut,summary)
# looked at the histogram as a reminder
qplot(x=price,data=diamondsinfo) + facet_wrap(~cut)
diamondsinfo$cut
# create a histogram of price per carat and facet it by cut
ggplot(aes(x=price/carat),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420') + facet_wrap(~cut)  # ggplot() instead of qplot() avoids drawing two histogram layers
#the price of diamonds using boxplot
ggplot(aes(x=price),data=diamondsinfo) + geom_boxplot() + facet_wrap(~cut)
by(diamondsinfo$price,diamondsinfo$color,summary)
# Calculate IQR
IQR(subset(diamondsinfo,color == 'J')$price)  # '==' filters rows; 'color=J' was silently ignored
IQR(subset(diamondsinfo,price<1000)$price)
IQR(subset(diamondsinfo,color == 'D')$price)
# the price per carat of diamonds across the different colors of diamonds using boxplots
ggplot(aes(x=color,y=price/carat),data=diamondsinfo) +
geom_boxplot()
# investigate the weight of the diamonds (carat) using frequency polygons, with and without a log10 x axis
ggplot(aes(x=carat),data=diamondsinfo) +
geom_freqpoly(aes(color=color)) +
scale_x_log10()
ggplot(aes(x=carat),data=diamondsinfo) +
geom_freqpoly(aes(color=cut))