文章目录
- 1. Joining Tables
- 1.1 The inner_join verb (video)
- 1.2 What columns would you join on?
- 1.3 Joining parts and part categories
- 1.4 Joining with a one-to-many relationship (video)
- 1.5 Joining parts and inventories
- 1.6 Joining in either direction
- 1.7 Joining three or more tables (video)
- 1.8 Joining three tables
- 1.9 What's the most common color?
- 2. Left and Right Joins
- 2.1 The left_join verb (video)
- 2.2 Left joining two sets by part and color
- 2.3 Left joining two sets by color
- 2.4 Finding an observation that doesn't have a match
- 2.5 The right-join verb (video)
- 2.6 Which joins is best?
- 2.7 Counting part colors
- 2.8 Cleaning up your count
- 2.9 Joining tables to themselves (video)
- 2.10 Joining themes to their children
- 2.11 Joining themes to their grandchildren
- 2.12 Left-joining a table to itself
- 3. Full, Semi, and Anti Joins
- 3.1 The full_join verb (video)
- 3.2 Differences between batman and star wars
- 3.3 Aggregating each theme
- 3.4 Full-joining batman and star wars LEGO parts
- 3.5 Comparing batman and star wars LEGO parts
- 3.6 The semi- and anti-join verbs (video)
- 3.7 Select the join
- 3.8 Something within one set but not another
- 3.9 What colors are included in at least one set?
- 3.10 Which sets is missing version 1?
- 3.11 Visualizing set differences (video)
- 3.12 Aggregating sets to look at their differences
- 3.13 Combining sets
- 3.14 Visualizing the difference: batman and star wars
- 4. Case Study: Joins and Stack Overflow Data
- 4.1 Stack overflow questions (video)
- 4.2 Left-joining questions and tags
- 4.3 Comparing scores across tags
- 4.4 What tags never appear on R questions?
- 4.5 Joining questions and answers (video)
- 4.6 Finding gaps between questions and answers
- 4.7 Joining question and answer counts
- 4.8 Joining questions, answers and tags
- 4.9 Average answers by question
- 4.10 The bind_rows verb (video)
- 4.11 Joining questions and answers with tags
- 4.12 Binding and counting posts with tags
- 4.13 Visualizing questions and answers in tags
- 4.14 Congratulations!
1. Joining Tables
1.1 The inner_join verb (video)
1.2 What columns would you join on?
1.3 Joining parts and part categories
The inner_join
is the key to bring tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.
In this exercise, you’ll join a list of LEGO parts, available as parts
, with these parts’ corresponding categories, available as part_categories
. For example, the part Sticker Sheet 1 for Set 1650-1
is from the Stickers
part category. You can join these tables to see all parts’ categories!
Instruction 1:
- Add the correct joining verb, the name of the second table, and the joining column for the second table.
# Add the correct verb, table, and joining column
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
Instruction 2:
- Now, use the
suffix
argument to add"_part"
and"_category"
suffixes to replace thename.x
andname.y
fields.
# Use the suffix argument to replace .x and .y suffixes
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))
1.4 Joining with a one-to-many relationship (video)
1.5 Joining parts and inventories
The LEGO data has many tables that can be joined together. Often times, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts
is a list of all LEGO parts, and a new table, inventory_parts
, has some additional information about those parts, such as the color_id
of each part you would find in a specific LEGO kit.
Let’s join these two tables together to observe how joining parts
with inventory_parts
increases the size of your table because of the one-to-many relationship that exists between these two tables.
Instruction:
- Connect the
parts
andinventory_parts
tables by their part numbers using an inner join.
# Combine the parts and inventory_parts tables
parts %>%
inner_join(inventory_parts, by = "part_num")
1.6 Joining in either direction
An inner_join
works the same way with either table in either position. The table that is specified first is arbitrary, since you will end up with the same information in the resulting table either way.
Let’s prove this by joining the same two tables from the last exercise in the opposite order!
Instruction:
- Connect the
inventory_parts
table with theparts
tables.
# Combine the parts and inventory_parts tables
inventory_parts %>%
inner_join(parts, by = "part_num" )
1.7 Joining three or more tables (video)
1.8 Joining three tables
You can string together multiple joins with inner_join
and the pipe (%>%
), both with which you are already very familiar!
We’ll now connect sets
, a table that tells us about each LEGO kit, with inventories
, a table that tells us the specific version of a given set, and finally to inventory_parts
, a table which tells us how many of each part is available in each LEGO kit.
So if you were building a Batman LEGO set, sets
would tell you the name of the set, inventories
would give you IDs for each of the versions of the set, and inventory_parts
would tell you how many of each part would be in each version.
Instruction:
- Combine the
inventories
table with thesets
table. - Next, join the
inventory_parts
table to the table you created in the previous join by the inventory IDs.
sets %>%
# Add inventories using an inner join
inner_join(inventories,by = "set_num") %>%
# Add inventory_parts using an inner join
inner_join(inventory_parts, by = c( "id" = "inventory_id" ))
1.9 What’s the most common color?
Now let’s join an additional table, colors
, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”
Instruction 1:
- Inner join the
colors
table using thecolor_id
column from the previous join and theid
column fromcolors
; use the suffixes"_set"
and"_color"
.
# Add an inner join for the colors table
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color"))
Instruction 2:
- Count the
name_color
column and sort the results so the most prominent colors appear first.
# Count the number of colors and sort
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
count(name_color)%>%
arrange(desc(n))
2. Left and Right Joins
2.1 The left_join verb (video)
2.2 Left joining two sets by part and color
In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets. We’ve created these for you and they have been preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
Instruction:
- Combine the
star_destroyer
andmillennium_falcon
tables with the suffixes_falcon
and_star_destroyer
.
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
2.3 Left joining two sets by color
In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone? As with the last exercise, the Millennium Falcon and Star Destroyer sets have been created and preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
Instruction 1:
- Sum the
quantity
column bycolor_id
in the Millennium Falcon dataset.
# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
Instruction 2:
Now, sum the quantity
column by color_id
in the Star Destroyer dataset.
# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
Instruction 3:
- Left join the two datasets,
millennium_falcon_colors
andstar_destroyer_colors
, using thecolor_id
column and the_falcon
and_star_destroyer
suffixes.
# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
left_join(star_destroyer_colors, by = c("color_id"), suffix = c("_falcon", "_star_destroyer"))
2.4 Finding an observation that doesn’t have a match
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.
For example, the inventories
table has a version column, for when a LEGO kit gets some kind of change or upgrade. It would be fair to assume that all sets
(which joins well with inventories
) would have at least a version 1. But let’s test this assumption out in the following exercise.
Instruction:
- Use a
left_join
to join togethersets
andinventory_version_1
using their common column. filter
for where theversion
column isNA
usingis.na
.
inventory_version_1 <- inventories %>%
filter(version == 1)
# Join versions to sets
sets %>%
left_join(inventory_version_1, by = "set_num" ) %>%
# Filter for where version is na
filter(is.na(version))
2.5 The right-join verb (video)
2.6 Which joins is best?
2.7 Counting part colors
Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the the second (right) table’s rows instead. In this case, a right join is for you.
In the example below, we’ll count the part_cat_id
from parts
, before using a right_join
to join with part_categories
. The reason we do this is because we don’t only want to know the count of part_cat_id
in parts
, but we also want to know if there are any part_cat_ids
not present in part_categories
.
Instruction 1:
- Use the
count
verb to count eachpart_cat_id
in the parts table. - Use a
right_join
to joinpart_categories
. You’ll need to use thepart_cat_id
from the count and theid
column frompart_categories
.
parts %>%
# Count the part_cat_id
count(part_cat_id)%>%
# Right join part_categories
right_join(part_categories, by = c("part_cat_id" = "id"))
Instruction 2:
filter
for where the columnn
is NA.
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Filter for NA
filter(is.na(n))
2.8 Cleaning up your count
In both left and right joins, there is the opportunity for there to be NA values in the resulting table. Fortunately, the replace_na
function can turn those NAs into meaningful values.
In the last exercise, we saw that the n
column had NAs after the right_join
. Let’s use the replace_na
column, which takes a list
of column names and the values with which NAs should be replaced, to clean up our table.
Instruction:
- Use
replace_na
to replace NAs in then
column with the value 0.
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Use replace_na to replace missing values in the n column
replace_na(list(n = 0))
2.9 Joining tables to themselves (video)
2.10 Joining themes to their children
Tables can be joined to themselves!
In the themes
table, which is available for you to inspect in the console, you’ll notice there is both an id
column and a parent_id
column. Keeping that in mind, you can join the themes
table to itself to determine the parent-child relationships that exist for different themes.
In the videos, you saw themes joined to their own parents. In this exercise, you’ll try a similar approach of joining themes to their own children, which is similar but reversed. Let’s try this out to discover what children the theme "Harry Potter"
has.
Instruction:
- Inner join
themes
to their own children, resulting in the suffixes"_parent"
and"_child"
, respectively. - Filter this table to find the children of the “Harry Potter” theme.
themes %>%
# Inner join the themes table
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent","_child"))%>%
# Filter for the "Harry Potter" parent name
filter(name_parent == "Harry Potter")
2.11 Joining themes to their grandchildren
We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.
Here, we can inner join themes
to a filtered version of itself again to establish a between our last join’s children and their children.
Instruction:
Use another inner join to combine themes
again with itself.
- Be sure to use the suffixes
"_parent"
and"_grandchild"
so the columns in the resulting table are clear. - Update the
by
argument to specify the correct columns to join on.
# Join themes to itself again to find the grandchild relationships
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
2.12 Left-joining a table to itself
So far, you’ve been inner joining a table to itself in order to find the children of themes like "Harry Potter
" or "The Lord of the Rings"
.
But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join
and a filter()
.
Instruction:
- Left join the themes table to its own children, with the suffixes
_parent
and_child
respectively. - Filter the result of the join to find themes that have no children.
themes %>%
# Left join the themes table to its own children
left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for themes that have no child themes
filter(is.na(id_child))
3. Full, Semi, and Anti Joins
3.1 The full_join verb (video)
3.2 Differences between batman and star wars
In the video, you compared two sets. Now, you’ll compare two themes, each of which is made up of many sets.
First, you’ll need to join in the themes
. Recall that doing so requires going through the sets
first. You’ll use the inventory_parts_joined
table from the video, which is already available to you in the console.
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version)
Instruction:
- In order to join in the themes, you’ll first need to combine the
sets
andinventory_parts_joined
tables. - Then, combine the
themes
table with your first join, using the suffix argument to clarify which table eachname
came from ("_set"
or"_theme"
).
inventory_parts_joined %>%
# Combine the sets table with inventory_parts_joined
inner_join(sets, by = "set_num")%>%
# Combine the themes table with your first join
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
3.3 Aggregating each theme
Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.
The table you created previously has been preloaded for you as inventory_sets_themes
. It was filtered for each theme, and the objects have been saved as batman
and star_wars
.
inventory_sets_themes <- inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
batman <- inventory_sets_themes %>%
filter(name_theme == "Batman")
star_wars <- inventory_sets_themes %>%
filter(name_theme == "Star Wars")
Instruction:
- Count the part number and color id for the parts in Batman and Star Wars, weighted by quantity.
# Count the part number and color id, weight by quantity
batman %>%
count(part_num, color_id, wt = quantity)
star_wars %>%
count(part_num, color_id, wt = quantity)
3.4 Full-joining batman and star wars LEGO parts
Now that you’ve got separate tables for the pieces in the batman
and star_wars
themes, you’ll want to be able to combine them to see any similarities or differences between the two themes. The aggregating from the last exercise has been saved as batman_parts
and star_wars_parts
, and is preloaded for you.
batman_parts <- batman %>%
count(part_num, color_id, wt = quantity)
star_wars_parts <- star_wars %>%
count(part_num, color_id, wt = quantity)
Instruction:
- Combine the
star_wars_parts
table with thebatman_parts
table; use the suffix argument to include the"_batman"
and"_star_wars"
suffixes. - Replace all the NA values in the
n_batman
andn_star_wars
columns with 0s.
batman_parts %>%
# Combine the star_wars_parts table
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars"))%>%
# Replace NAs with 0s in the n_batman and n_star_wars columns
replace_na(list(n_batman = 0, n_star_wars = 0))
3.5 Comparing batman and star wars LEGO parts
The table you created in the last exercise includes the part number of each piece, the color id, and the number of each piece in the Star Wars and Batman themes. However, we have more information about each of these parts that we can gain by combining this table with some of the information we have in other tables. Before we compare the themes, let’s ensure that we have enough information to make our findings more interpretable. The table from the last exercise has been saved as parts_joined
and is preloaded for you.
parts_joined <- batman_parts %>%
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
replace_na(list(n_batman = 0, n_star_wars = 0))
Instruction:
- Sort the number of star wars pieces in the
parts_joined
table in descending order. - Join the
colors
table to theparts_joined
table. - Combine the
parts
table to the previous join; add"_color"
and"_part"
suffixes to specify whether or not the information came from thecolors
table or theparts
table.
parts_joined %>%
# Sort the number of star wars pieces in descending order
arrange(desc(n_star_wars))%>%
# Join the colors table to the parts_joined table
inner_join(colors, by = c("color_id" = "id"))%>%
# Join the parts table to the previous join
inner_join(parts, by = "part_num", suffix = c("_color", "_part"))
3.6 The semi- and anti-join verbs (video)
3.7 Select the join
3.8 Something within one set but not another
In the videos, you learned how to filter using the semi- and anti-join verbs to answer questions you have about your data. Let’s focus on the batwing
dataset, and use our skills to determine which parts are in both the batwing
and batmobile
sets, and which sets are in one, but not the other. While answering these questions, we’ll also be determining whether or not the parts we’re looking at in both sets also have the same color in common.
The batmobile
and batwing
datasets have been preloaded for you.
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
Instruction:
- Filter the batwing set for parts that are also in the batmobile, whether or not they have the same color.
- Filter the batwing set for parts that aren’t also in the batmobile, whether or not they have the same color.
# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
semi_join(batmobile, by = "part_num")
# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
anti_join(batmobile, by = "part_num")
3.9 What colors are included in at least one set?
Besides comparing two sets directly, you could also use a filtering join like semi_join
to find out which colors ever appear in any inventory part. Some of the colors could be optional, meaning they aren’t included in any sets.
The inventory_parts
and colors
tables have been preloaded for you.
Instruction:
Use the inventory_parts
table to find the colors that are included in at least one set.
# Use inventory_parts to find colors included in at least one set
colors %>%
semi_join(inventory_parts, by = c("id" = "color_id"))
3.10 Which sets is missing version 1?
Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.
Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.
Instruction:
- Use
filter()
to extractversion
1
from theinventories
table; save the filter toversion_1_inventories
. - Use
anti_join
to combineversion_1_inventories
with sets to determine whichset
is missing a version 1.
# Use filter() to extract version 1
version_1_inventories <- inventories %>%
filter(version == 1)
# Use anti_join() to find which set is missing a version 1
sets %>%
anti_join(version_1_inventories, by = "set_num")
3.11 Visualizing set differences (video)
3.12 Aggregating sets to look at their differences
To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, as we saw in the video, we’ll want to add a column so that we can understand the percentages of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.
The inventory_parts_themes
table has been preloaded for you.
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
Instruction:
- Add a filter for the Batman set to create the
batman_colors
object. - Add a percent column to
batman_colors
that displays the total divided by the sum of the total. - Filter and aggregate the Star Wars set data to create the
star_wars_colors
object; add a percent column to the object to display the percent of the total.
batman_colors <- inventory_parts_themes %>%
# Filter the inventory_parts_themes table for the Batman theme
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
# Add a percent column of the total divided by the sum of the total
mutate(percent = total / sum(total))
# Filter and aggregate the Star Wars set data; add a percent column
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
3.13 Combining sets
The data you aggregated in the last exercise has been preloaded for you as batman_colors
and star_wars_colors
. Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.
batman_colors <- inventory_parts_themes %>%
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(percent = total / sum(total))
Instruction 1:
- Join the
batman_colors
andstar_wars_colors
tables; be sure to include all observations from both tables. - Replace the NAs in the
total_batman
andtotal_star_wars
columns.
batman_colors %>%
# Join the Batman and Star Wars colors
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
# Replace NAs in the total_batman and total_star_wars columns
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id"))
Instruction 2:
- Add a
difference
column that calculates the difference betweenpercent_batman
andpercent_star_wars
, and atotal
column, which is the sum oftotal_batman
andtotal_star_wars
. - Add a
filter
to select observations wheretotal
is at least 200.
batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
# Create the difference and total columns
mutate(difference = percent_batman - percent_star_wars,
total = total_batman + total_star_wars) %>%
# Filter for totals greater than 200
filter(total > 200)
3.14 Visualizing the difference: batman and star wars
In the last exercise, you created colors_joined
. Now you’ll create a bar plot with one bar for each color (name
), showing the difference in percentages.
Because factors and visualization are beyond the scope of this course, we’ve done some processing for you: here is the code that created the colors_joined
table that will be used in the video.
colors_joined <- batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
mutate(difference = percent_batman - percent_star_wars,
total = total_batman + total_star_wars) %>%
filter(total >= 200) %>%
mutate(name = fct_reorder(name, difference))
Instruction:
Create a bar plot using the colors_joined
table to display the most prominent colors in the Batman and Star Wars themes, with the bars colored by their name
.
# Create a bar plot using colors_joined and the name and difference columns
ggplot(colors_joined, aes(name, difference, fill = name)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = color_palette, guide = FALSE) +
labs(y = "Difference: Batman - Star Wars")
4. Case Study: Joins and Stack Overflow Data
4.1 Stack overflow questions (video)
4.2 Left-joining questions and tags
Three of the Stack Overflow survey datasets are questions
, question_tags
, and tags
:
questions
: an ID and the score, or how many times the question has been upvoted; the data only includes R-based questionsquestion_tags
: a tag ID for each question and the question’s idtags
: a tag id and the tag’s name, which can be used to identify the subject of each question, such as ggplot2 or dplyr
In this exercise, we’ll be stitching together these datasets and replacing NAs in important fields.
Note that we’ll be using left_joins
in this exercise to ensure we keep all questions, even those without a corresponding tag. However, since we know the questions
data is all R data, we’ll want to manually tag these as R questions with replace_na
.
Instruction 1:
- Join together
questions
andquestion_tags
using theid
andquestion_id
columns, respectively.
# Join the questions and question_tags tables
questions %>%
left_join(question_tags, by = c("id" = "question_id"))
Instruction 2:
- Use another join to add in the
tags
table.
# Join in the tags table
questions %>%
left_join(question_tags, by = c("id" = "question_id")) %>%
left_join(tags, by = c("tag_id" = "id"))
Instruction 3:
- Use
replace_na
to change theNA
s in thetag_name
column to"only-r"
.
# Replace the NAs in the tag_name column
questions %>%
left_join(question_tags, by = c("id" = "question_id")) %>%
left_join(tags, by = c("tag_id" = "id")) %>%
replace_na(list(tag_name = "only-r"))
4.3 Comparing scores across tags
The complete dataset you created in the last exercise is available to you as questions_with_tags
. Let’s do a quick bit of analysis on it! You’ll use familiar dplyr verbs like group_by
, summarize
, arrange
, and n
to find out the average score of the most asked questions.
Instruction:
- Aggregate by the
tag_name
. - Summarize to get the total number of questions,
num_questions
, as well as the mean score for each question,score
. - Arrange
num_questions
in descending order to sort the answers by the most asked questions.
questions_with_tags %>%
# Group by tag_name
group_by(tag_name) %>%
# Get mean score and num_questions
summarize(score = mean(score),
num_questions = n()) %>%
# Sort num_questions in descending order
arrange(desc(num_questions))
4.4 What tags never appear on R questions?
The tags
table includes all Stack Overflow tags, but some have nothing to do with R
. How could you filter for just the tags that never appear on an R question? The tags
and question_tags
tables have been preloaded for you.
Instruction:
- Use a join to determine which tags never appear on an R question.
# Using a join, filter for tags that are never on an R question
tags %>%
anti_join(question_tags, by = c("id" = "tag_id"))
4.5 Joining questions and answers (video)
4.6 Finding gaps between questions and answers
Now we’ll join together questions
with answers
so we can measure the time between questions and answers.
Instruction:
- Use an inner join to combine the
questions
andanswers
tables using the suffixes"_question"
and"_answer"
, respectively. - Subtract
creation_date_question
fromcreation_date_answer
within theas.integer()
function to create thegap
column.
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(gap = as.integer(creation_date_answer - creation_date_question))
4.7 Joining question and answer counts
We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions
table.
Instruction:
- Count and sort the
question_id
column in theanswers
table to create theanswer_counts
table. - Join the
questions
table with theanswer_counts
table. - Replace the NA values in the
n
column with 0s.
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
group_by(question_id) %>%
count(question_id)
# Combine the answer_counts and questions tables
questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
# Replace the NAs in the n column
replace_na(list(n = 0))
4.8 Joining questions, answers and tags
Let’s build on the last exercise by adding the tags
table to our previous joins. This will allow us to do a better job of identifying which R topics get the most traction on Stack Overflow. The tables you created in the last exercise have been preloaded for you as answer_counts
and question_answer_counts
.
answer_counts <- answers %>%
count(question_id, sort = TRUE)
question_answer_counts <- questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
replace_na(list(n = 0))
Instruction:
- Combine the
question_tags
table withquestion_answer_counts
using aninner_join
. - Now, use another
inner_join
to add the tags table.
question_answer_counts %>%
# Join the question_tags tables
inner_join(question_tags, by = c("id" = "question_id")) %>%
# Join the tags table
inner_join(tags, by = c("tag_id" = "id"))
4.9 Average answers by question
The table you created in the last exercise has been preloaded for you as tagged_answers
. You can use this table to determine, on average, how many answers each questions gets.
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
Some of the important variables from this table include: n
, the number of answers for each question, and tag_name
, the name of each tag associated with each question.
Let’s use some of our favorite dplyr verbs to find out how many answers each question gets on average.
Instruction:
- Aggregate the
tagged_answers
table bytag_name
. - Summarize
tagged_answers
to get the count ofquestions
and theaverage_answers
. - Sort the resulting
questions
column in descending order.
tagged_answers %>%
# Aggregate by tag_name
group_by(tag_name) %>%
# Summarize questions and average_answers
summarize(questions = n(),
average_answers = mean(n)) %>%
# Sort the questions in descending order
arrange(desc(questions))
4.10 The bind_rows verb (video)
4.11 Joining questions and answers with tags
To learn more about the questions
and answers
table, you’ll want to use the question_tags
table to understand the tags associated with each question that was asked, and each answer that was provided. You’ll be able to combine these tables using two inner joins on both the questions
table and the answers
table.
Instruction:
- Use two inner joins to combine the
question_tags
andtags
tables with thequestions
table. - Now, use two inner joins to combine the
question_tags
andtags
tables with theanswers
table.
# Inner join the question_tags and tags tables with the questions table
questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
# Inner join the question_tags and tags tables with the answers table
answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
4.12 Binding and counting posts with tags
The tables you created in the previous exercise have been preloaded as questions_with_tags
and answers_with_tags
. First, you’ll want to combine these tables into a single table called posts_with_tags
. Once the information is consolidated into a single table, you can add more information by creating a date variable using the lubridate
package, which has been preloaded for you.
questions_with_tags <- questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
answers_with_tags <- answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
Instruction:
- Combine the
questions_with_tags
andanswers_with_tags
tables intoposts_with_tags
. - Add a
year
column to theposts_with_tags
table, then aggregate bytype
,year
, andtag_name
.
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
answers_with_tags %>% mutate(type = "answer"))
# Add a year column, then aggregate by type, year, and tag_name
posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
4.13 Visualizing questions and answers in tags
In the last exercise, you modified the posts_with_tags
table to add a year
column, and aggregated by type
, year
, and tag_name
. The modified table has been preloaded for you as by_type_year_tag
, and has one observation for each type (question/answer), year, and tag. Let’s create a plot to examine the information that the table contains about questions and answers for the dplyr
and ggplot2
tags. The ggplot2
package has been preloaded for you.
by_type_year_tag <- posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
Instruction:
- Filter the
by_type_year_tag
table for the dplyr and ggplot2 tags. - Create a line plot with that filtered table that plots the frequency (
n
) over time, colored by question/answer and faceted by tag.
# Filter for the dplyr and ggplot2 tag names
by_type_year_tag_filtered <- by_type_year_tag %>%
filter(tag_name == "dplyr" | tag_name == "ggplot2")
# Create a line plot faceted by the tag name
ggplot(by_type_year_tag_filtered, aes(year, n, color = type)) +
geom_line() +
facet_wrap(~ tag_name)