Data Science, Chapter 3, Lesson 3: Exploring One Variable (R)

Article link: https://blog.csdn.net/jasminexjf/article/details/90311018

lesson3.rmd from the course walkthrough

Lesson 3
========================================================

***

### What to Do First?
Notes:

***

### Pseudo-Facebook User Data
Notes:

```{r Pseudo-Facebook User Data}
getwd()
list.files()
pf <- read.csv('pseudo_facebook.tsv',sep='\t')
names(pf)
```

***

### Histogram of Users' Birthdays
Notes:

```{r Histogram of Users\' Birthdays}
# Install once from the console; reinstalling on every knit is slow.
# install.packages('ggplot2')
library(ggplot2)

# install.packages('ggthemes', dependencies = TRUE)
library(ggthemes)
theme_set(theme_minimal(24))

names(pf)
ggplot(aes(x=dob_day),data=pf) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=1:31)
```

***

#### What are some things that you notice about this histogram?
Response: Day 1 stands out; 1 appears to be the default birthday.

***

### Moira's Investigation
Notes: After plotting many plots, Moira found that many people underestimate the size of their audience (1/4).

***

### Estimating Your Audience Size
Notes:
```{r faceting}
ggplot(aes(x = dob_day), data = pf) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month)

```

***

#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:

#### How many of your friends do you think saw that post?
Response:

#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:

***

### Perceived Audience Size
Notes:

***
### Faceting
Notes:

```{r Faceting}
  ggplot(data=pf,aes(x=dob_day)) + geom_histogram(binwidth=1) + scale_x_continuous(breaks=1:31) + facet_wrap(~dob_month,ncol=4)
```

#### Let's take another look at our plot. What stands out to you here?
Response:

***

### Be Skeptical - Outliers and Anomalies
Notes:

***

### Moira's Outlier
Notes:
#### Which case do you think applies to Moira's outlier?
Response:

***

### Friend Count
Notes:

#### What code would you enter to create a histogram of friend counts?

```{r Friend Count}
ggplot(aes(x=friend_count),data=pf) + geom_histogram()

```

#### How is this plot similar to Moira's first plot?
Response:

***

### Limiting the Axes
Notes:

```{r Limiting the Axes}
    # limits = c() is NULL and limits nothing; 0 to 1000 matches the later chunks
    ggplot(aes(x=friend_count),data=pf) + geom_histogram() + scale_x_continuous(limits = c(0, 1000))
```

### Exploring with Bin Width
Notes:

***

### Adjusting the Bin Width
Notes:

### Faceting Friend Count
```{r Faceting Friend Count}
# What code would you add to create a facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))

ggplot(aes(x=friend_count),data=pf) + geom_histogram(binwidth=25) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + facet_wrap(~gender)
```

***

### Omitting NA Values
Notes:

```{r Omitting NA Values}
  ggplot(aes(x=friend_count),data=subset(pf, !is.na(gender))) + geom_histogram(binwidth=25) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + facet_wrap(~gender)
```

***

### Statistics 'by' Gender
Notes:

```{r Statistics \'by\' Gender}
    table(pf$gender)
by(pf$friend_count,pf$gender,summary)
```

#### Who on average has more friends: men or women?
Response:

#### What's the difference between the median friend count for women and men?
Response:

#### Why would the median be a better measure than the mean?
Response:
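
One way to reason about this question: when a distribution has a long right tail, a few very large values pull the mean upward while the median stays at the middle observation. A minimal sketch with made-up numbers (the vector `friend_counts` is hypothetical, not taken from `pf`):

```r
# Hypothetical friend counts: mostly small, with one very large value.
friend_counts <- c(10, 20, 30, 40, 5000)

mean_count <- mean(friend_counts)      # pulled upward by the single 5000
median_count <- median(friend_counts)  # the middle value, unaffected by the 5000

print(mean_count)
print(median_count)
```

Here the mean is 1020 and the median is 30, so the median is closer to what a typical user in this toy sample looks like.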

***

### Tenure
Notes:

```{r Tenure}
  ggplot(aes(x=tenure),data=pf) + geom_histogram(binwidth=30,color='black',fill='#099DD9')


```

***

#### How would you create a histogram of tenure by year?

```{r Tenure Histogram by Year}
    ggplot(aes(x=tenure/365),data=pf) + geom_histogram(binwidth=.25,color='black',fill='#F79420')

```

***

### Labeling Plots
Notes:

```{r Labeling Plots}
    ggplot(aes(x=tenure/365),data=pf) + geom_histogram(binwidth=.25,color='black',fill='#F79420') + 
    scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7)) + xlab('Number of years using Facebook') + ylab('Number of users in sample')
```

***

### User Ages
Notes:

```{r User Ages}
  ggplot(aes(x=age),data=pf) + geom_histogram(binwidth=1,color='black',fill='#5760AB') + scale_x_continuous(breaks=seq(0,113,5)) + 
xlab('Age of Facebook users') + ylab('Number of users in sample')
```

#### What do you notice?
Response:

***

### The Spread of Memes
Notes:

***

### Lada's Money Bag Meme
Notes:

***

### Transforming Data
Notes:
```{r Transforming Data}
   ggplot(aes(x=friend_count),data=subset(pf, !is.na(gender))) + geom_histogram(binwidth=20) + scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))

  summary(pf$friend_count)
  
  summary(log10(pf$friend_count +1))
  
  summary(sqrt(pf$friend_count))
  
  # grid.arrange() comes from gridExtra, which must be loaded first.
  library(gridExtra)

  # Breaks every 50 friends; a break at every integer would be unreadable.
  p1 = ggplot(aes(x=friend_count),data=pf) + geom_histogram(binwidth=10,color='black',fill='#099D99') + scale_x_continuous(limits=c(0,500),breaks=seq(0,500,50))
  
  # Note: log() is the natural log here, while the summary above uses log10().
  p2 = ggplot(aes(x=log(friend_count +1 )),data=pf) + geom_histogram(binwidth=1,color='black',fill='#F79420') +  scale_x_continuous(limits=c(0,10),breaks=seq(0,10,1))
  
   p3 = ggplot(aes(x=sqrt(friend_count)),data=pf) + geom_histogram(binwidth=1,color='black',fill='#5760AB') + scale_x_continuous(limits=c(0,40),breaks=seq(0,40,1))
   
   grid.arrange(p1,p2,p3,ncol=1)


```
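
The `+ 1` inside `log10(pf$friend_count + 1)` matters because some users have zero friends, and `log10(0)` is `-Inf`. The following is a minimal sketch with made-up counts (the vector `friend_counts` is hypothetical, not taken from `pf`) that shows how each transform compresses a long right tail:

```r
# Hypothetical friend counts spanning several orders of magnitude.
friend_counts <- c(0, 9, 99, 999)

# Without the offset, a zero count becomes -Inf.
raw_log_of_zero <- log10(0)

# With the offset, every value stays finite.
log_counts <- log10(friend_counts + 1)

# The square root also compresses large values, but less aggressively.
sqrt_counts <- sqrt(friend_counts)

print(raw_log_of_zero)
print(log_counts)
print(sqrt_counts)
```

Both transforms pull the large values closer to the rest, which is why the transformed histograms look less skewed than the raw one.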

***

### Add a Scaling Layer
Notes:

```{r Add a Scaling Layer}
#  way one:
  logscale <- ggplot(aes(x=log10(friend_count)),data = pf) + geom_histogram(color='black',fill='#099D99')

  countscale <- ggplot(aes(x=friend_count),data = pf) + geom_histogram( color='black',fill='#F79420') + scale_x_log10()

    grid.arrange(logscale,countscale,ncol=1)
```

***


### Frequency Polygons

```{r Frequency Polygons}
p1 = ggplot(aes(x=friend_count,y= ..count.. / sum(..count..)),data=subset(pf,!is.na(gender))) +   
    geom_freqpoly(aes(color=gender),binwidth=10) + 
    scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + 
    xlab('number of friends') + 
    ylab('Proportion of users')  # Proportion of users with that friend count

# Only one x scale can apply to a plot, so p2 uses scale_x_log10() alone.
# Under a log10 scale the binwidth is measured in log10 units, hence 0.1.
p2 = ggplot(aes(x=friend_count,y= ..count.. / sum(..count..)),data=subset(pf,!is.na(gender))) +   
    geom_freqpoly(aes(color=gender),binwidth=0.1) + 
    scale_x_log10() + 
    xlab('number of friends (log10 scale)') + 
    ylab('Proportion of users')

p3 = ggplot(aes(x=www_likes),data=subset(pf,!is.na(gender))) +   
    geom_freqpoly(aes(color=gender)) + 
    scale_x_log10()
    # scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50)) + 
    # xlab('www_likes') + 
    # ylab('Percentage of users with that friend count')

grid.arrange(p1,p2,p3,ncol=1)

```

***

### Likes on the Web
Notes:

```{r Likes on the Web}
by(pf$www_likes,pf$gender,sum)
```


***

### Box Plots
Notes:

```{r Box Plots}
p1 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
  geom_boxplot()

p2 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0,1000))

p3 = ggplot(aes(x=gender,y=friend_count),data=subset(pf,!is.na(gender))) +
  geom_boxplot() +
  coord_cartesian(ylim=c(0,1000))

grid.arrange(p1,p2,p3,ncol=1)
```
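
The second and third plots above differ in an important way: `scale_y_continuous(limits = ...)` removes observations outside the limits before the box plot statistics are computed, whereas `coord_cartesian(ylim = ...)` only zooms the view and leaves the statistics untouched. A minimal base R sketch of that effect, using made-up counts (the vector `friend_counts` is hypothetical):

```r
# Hypothetical friend counts, including two values above 1000.
friend_counts <- c(50, 100, 200, 400, 800, 1500, 3000)

# Statistics on the full data, which is what coord_cartesian() preserves.
median_all <- median(friend_counts)

# Statistics after dropping values above 1000, which is what scale limits do.
median_trimmed <- median(friend_counts[friend_counts <= 1000])

print(median_all)
print(median_trimmed)
```

The median drops from 400 to 200 once the large values are removed, so the box drawn with scale limits describes a different subset of users than the zoomed one.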

#### Adjust the code to focus on users who have friend counts between 0 and 1000.

```{r}
by(pf$friend_count,pf$gender,summary)
by(pf$friend_count,pf$gender,sum)
by(pf$friend_count,pf$gender,mean)
```

***

### Box Plots, Quartiles, and Friendships
Notes:

```{r Box Plots, Quartiles, and Friendships}

```

#### On average, who initiated more friendships in our sample: men or women?
Response:
#### Write about some ways that you can verify your answer.
Response:
```{r Friend Requests by Gender}
    
```

Response:

***

### Getting Logical
Notes:

```{r Getting Logical}
summary(pf$mobile_likes)

summary(pf$mobile_likes>0)

pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes>0,1,0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
```
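
A logical vector can also be summarised directly, because `TRUE` counts as 1 and `FALSE` as 0. The sketch below uses a made-up vector (`mobile_likes` here is hypothetical, standing in for `pf$mobile_likes`) to compute the share of users with at least one mobile like:

```r
# Hypothetical mobile like counts for eight users.
mobile_likes <- c(0, 3, 0, 12, 7, 0, 1, 0)

# TRUE where the user has at least one mobile like.
checked_in <- mobile_likes > 0

# Number and share of users with a mobile check-in.
n_checked_in <- sum(checked_in)
share_checked_in <- mean(checked_in)

print(n_checked_in)
print(share_checked_in)
```

With these numbers, 4 of the 8 users have a mobile like, so the share is 0.5. The same two calls should work on the real column, assuming it contains no missing values.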

Response:

***

### Analyzing One Variable
Reflection:

***

Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!

Problem set: practice_lesson3.R

library(ggplot2)
# Assumption: diamondsinfo is the diamonds dataset that ships with ggplot2.
data(diamonds)
diamondsinfo <- diamonds
summary(diamondsinfo)

ggplot(aes(x=price),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420') 
ggsave('primicial_price.png')
summary(diamondsinfo$price)
a1 = summary(diamondsinfo$price < 500)
a2 = summary(diamondsinfo$price < 250)
a3 = summary(diamondsinfo$price >= 15000)

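# find the largest peak in the price histogram you created earlier
# (prices in diamonds start at 326, so limits of c(0, 300) would drop every row; 1500 is an assumed upper bound)
ggplot(aes(x=price),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420') + scale_x_continuous(limits = c(300,1500)) 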
ggsave('primicial_price_300.png')

# break out the histogram of diamond prices by cut
ggplot(aes(x=cut,y=price),data=diamondsinfo) + geom_bar(stat="identity",color='red',fill='#F79420') +
xlab('the class of cut') + ylab('the price ')
by(diamondsinfo$price,diamondsinfo$cut,summary)

# looked at the histogram as a reminder
qplot(x=price,data=diamondsinfo) + facet_wrap(~cut)
  
diamondsinfo$cut
# create a histogram of price per carat and facet it by cut
# qplot() already draws a histogram, so use ggplot() to avoid a duplicate layer
ggplot(aes(x=price/carat),data=diamondsinfo) + geom_histogram(binwidth = 30,color='red',fill='#F79420') + facet_wrap(~cut)

#the price of diamonds using boxplot
ggplot(aes(x=price),data=diamondsinfo) + geom_boxplot()  + facet_wrap(~cut)

by(diamondsinfo$price,diamondsinfo$color,summary)
# Calculate IQR
# subset() needs a logical condition; color=J would silently keep every row
IQR(subset(diamondsinfo,color=='J')$price)
IQR(subset(diamondsinfo,price<1000)$price)
IQR(subset(diamondsinfo,color=='D')$price)


# the price per carat of diamonds across the different colors of diamonds using boxplots
ggplot(aes(x=color,y=price/carat),data=diamondsinfo) +
  geom_boxplot()

# investigate the weight of the diamonds (carat) using frequency polygons, first on a log10 x axis
ggplot(aes(x=carat),data=diamondsinfo) +   
  geom_freqpoly(aes(color=color)) + 
  scale_x_log10()


ggplot(aes(x=carat),data=diamondsinfo) +   
  geom_freqpoly(aes(color=cut))