R | Visualizing possible locations for Amazon's new headquarters

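Not long ago, Amazon announced that it was looking for a city in which to build a second headquarters.
Their criteria are a city of more than 1 million people that also has a deep pool of talent.
On a news site, I found a list of possible cities. Each has more than 1 million people and a sizable talent pool.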
https://www.cbsnews.com/news/amazon-hq2-cities-location-choices-new-second-headquarters/

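The news page already visualizes the data, but here I want to build the visualization in R. It is good practice for web scraping, data wrangling, and visualization.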

With that in mind, we first scrape the data, then wrangle it (using dplyr and a few other tools), and then draw the map with ggplot2.

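I should point out that this analysis is not perfect: we do not know the full list of cities, and we do not know the final selection criteria. Even if we did, a complete analysis would go well beyond what a simple blog post like this can cover.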

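As described above, the first thing we need to do is get the data and preprocess it.

If you are a data scientist, this post gives you an outline of how to approach such an analysis with R and related tools. If you work in marketing or operations, you can easily use it for a quick ad hoc analysis, as a template, or as a starting point for your own work.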

First, we load the packages we need.

#==============
# LOAD PACKAGES
#==============

library(rvest)
library(tidyverse)
library(stringr)
library(ggmap)

We use a few functions from the rvest package to scrape the data and convert it to a data frame.

html.amz_cities <- read_html("https://www.cbsnews.com/news/amazons-hq2-cities-second-headquarters-these-cities-are-contenders/")

df.amz_cities <- html.amz_cities %>%
  html_nodes("table") %>%
  .[[1]] %>% 
  html_table()
# inspect
df.amz_cities %>% head()
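
To see what this pipeline does without hitting the network, here is a minimal sketch that runs the same rvest calls on an inline HTML string. The table contents are made up for illustration, and I am assuming the real page marks its header cells with `<td>` rather than `<th>`, which would explain why the header text ends up as data.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: the header cells use <td>,
# not <th>, so html_table() cannot tell them apart from data cells.
html.demo <- read_html(
  "<table>
     <tr><td>Metro area</td><td>State</td><td>Population</td></tr>
     <tr><td>Alpha-Beta</td><td>AA</td><td>1,500,000</td></tr>
     <tr><td>Gamma</td><td>BB</td><td>2,250,000</td></tr>
   </table>"
)

df.demo <- html.demo %>%
  html_nodes("table") %>%   # all <table> nodes on the page
  .[[1]] %>%                # keep the first one
  html_table()              # convert it to a data frame

# the header text shows up as the first row of data
df.demo[1, ]
```

If the page had more than one table, the `.[[1]]` index is the place to change.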

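Next we change the column names. When we scraped the data, the column names were not read in from the page as headers, so we have to assign them ourselves.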

#====================
# CHANGE COLUMN NAMES
#====================

# inspect initial column names
colnames(df.amz_cities)

# assign new column names
colnames(df.amz_cities) <- c("metro_area", 'state', 'population_tot', 'bachelors_degree_pct')

# inspect
df.amz_cities %>% head()

As expected, the column names shown on the original web page ended up in the first row of our new data frame. That row is not data, so we remove it.

#==============================================
# REMOVE FIRST ROW
# - when we scraped the data, the column names
#   on the table were read in as the first row
#   of data.
# - Therefore, we need to remove the first row
#==============================================

df.amz_cities <- df.amz_cities %>% filter(row_number() != 1)
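
The rename-then-drop step can be tried on a toy data frame. This is a sketch with made-up rows; `slice(-1)` is shown only as an equivalent spelling, not as what the original code uses.

```r
library(dplyr)

# Made-up rows for illustration; the first row mimics a scraped header
df.raw <- data.frame(
  X1 = c("Metro area", "Alpha-Beta", "Gamma"),
  X2 = c("State", "AA", "BB"),
  stringsAsFactors = FALSE
)

colnames(df.raw) <- c("metro_area", "state")

# row_number() indexes rows 1..n, so this keeps everything but row 1
df.filtered <- df.raw %>% filter(row_number() != 1)

# slice(-1) drops the first row by position and gives the same rows
df.sliced <- df.raw %>% slice(-1)

nrow(df.filtered)
```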

Now we need to modify two variables, bachelors_degree_pct and population_tot. Both are currently character type, but we need them as numbers, so we have to parse or coerce them.

#===================================================================================
# MODIFY VARIABLES
# - both bachelors_degree_pct and population_tot were scraped as character variables
#    but we need them in numeric format
# - we will use techniques to parse/coerce these variable from char to numeric
#===================================================================================

#--------------------------------
# PARSE AS NUMBER: population_tot
#--------------------------------

df.amz_cities <- mutate(df.amz_cities, population_tot = parse_number(population_tot))

# check
typeof(df.amz_cities$population_tot)

# inspect
df.amz_cities %>% head()

#-----------------------------
# COERCE: bachelors_degree_pct
#-----------------------------

df.amz_cities <- mutate(df.amz_cities, bachelors_degree_pct = as.numeric(bachelors_degree_pct))
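
The reason for using two different conversions can be seen on a couple of strings. This sketch uses made-up values: I am assuming the population column contains grouping commas and the percentage column contains plain decimals, which is what the code above implies.

```r
library(readr)

pop.chr <- c("20,153,634", "9,512,999")   # made-up comma-formatted counts
pct.chr <- c("37.5", "41.2")              # made-up plain decimals

pop.num <- parse_number(pop.chr)          # drops the grouping commas
pct.num <- as.numeric(pct.chr)            # plain decimals convert directly

# as.numeric() cannot read the comma-formatted strings: it yields NA
pop.bad <- suppressWarnings(as.numeric(pop.chr))

typeof(pop.num)   # "double"
```

If the percentage column carried a trailing "%" sign, `as.numeric()` would return NA there too, and `parse_number()` would be the safer choice for both columns.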

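Now we need to create a variable that contains the city name. The data has a variable called metro_area, with values such as "New York-Newark-Jersey City". metro_area may be useful, but geocoding it could go wrong because it describes too broad an area. We therefore need one specific city name to geocode.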

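To that end, we create a new variable, city, that holds one specific city name taken from the metro area name. We use stringr::str_extract() together with a regular expression to extract it.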

#=============================================================
# CREATE VARIABLE: city
# - here, we're using the stringr function str_extract() to
#   extract the primary city name from the metro_area variable
# - to do this, we're using a regex to pull out the city name
#   prior to the first '-' character
#=============================================================

df.amz_cities <- df.amz_cities %>% mutate(city = str_extract(metro_area, "^[^-]*"))
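
Here is the regex applied to two strings. The first is the example from the text; the second is a name I added to show what happens when there is no hyphen.

```r
library(stringr)

metro.demo <- c("New York-Newark-Jersey City", "Denver")

# ^      anchor at the start of the string
# [^-]*  then take every character that is not a hyphen
city.demo <- str_extract(metro.demo, "^[^-]*")

city.demo   # "New York" "Denver"
```

One caveat of this pattern: a city whose own name contains a hyphen would be cut at that hyphen, so the result is worth inspecting before geocoding.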

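Now that we have specific city names, we geocode each city with geocode() to get its latitude and longitude. We then use cbind() to attach the geocoded data to the data frame.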


#=========================================
# GEOCODE
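# - here, we're getting the lat/long data
# - note: recent ggmap releases need a Google API key,
#   registered with register_google(), before geocode() runs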
#=========================================

data.geo <- geocode(df.amz_cities$city)

#inspect

data.geo %>% head()
data.geo

#========================================
# RECOMBINE: merge geo data to data frame
#========================================

df.amz_cities <- cbind(df.amz_cities, data.geo)
df.amz_cities
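
Note that cbind() matches rows purely by position, so this step relies on geocode() returning one row per city, in the same order. A minimal sketch of that assumption, using a hand-typed stand-in for the geocoder output (approximate coordinates, not real geocode() results):

```r
# two cities, in a fixed order
df.cities.demo <- data.frame(
  city = c("New York", "Chicago"),
  stringsAsFactors = FALSE
)

# stand-in for geocode() output: approximate coordinates typed by hand
data.geo.demo <- data.frame(
  lon = c(-74.0, -87.6),
  lat = c(40.7, 41.9)
)

# cbind() pairs row i with row i, so the row counts must agree
stopifnot(nrow(df.cities.demo) == nrow(data.geo.demo))

df.combined.demo <- cbind(df.cities.demo, data.geo.demo)
names(df.combined.demo)   # "city" "lon" "lat"
```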

Now we use dplyr::rename() to rename the data frame's lon column to long.

#==============================================================
# RENAME VARIABLE: lon -> long
# - we'll rename lon to long, just because 'long' is consistent
#   with the name for longitude in other data sources
#   that we will use
#==============================================================

df.amz_cities <- rename(df.amz_cities, long = lon)

# get column names
df.amz_cities %>% names()

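To make the data easier to read, we reorder the columns: city, state, and metro_area first, then the geographic coordinates, and finally population and the bachelor's degree percentage.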

#==========================================
# REORDER COLUMN NAMES
# - here, we're just doing it manually ...
#==========================================

df.amz_cities <- select(df.amz_cities, city, state, metro_area, long, lat, population_tot, bachelors_degree_pct)


# inspect

df.amz_cities %>% head()
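
The rename and reorder steps can be checked on a one-row toy data frame. This is only a sketch: the row is hand-typed, and `everything()` is shown as an alternative to listing every column, not as what the original code does.

```r
library(dplyr)

# hand-typed one-row example with the columns in an awkward order
df.demo <- data.frame(
  lon = -74.0, lat = 40.7, city = "New York",
  stringsAsFactors = FALSE
)

# new_name = old_name
df.renamed <- df.demo %>% rename(long = lon)

# explicit ordering, as in the code above
df.manual <- df.renamed %>% select(city, long, lat)

# alternative: name the leading column, keep the rest in place
df.auto <- df.renamed %>% select(city, everything())

names(df.manual)   # "city" "long" "lat"
```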

We want to plot the data on a map of the United States, which is where the map_data() function comes in.

#================================================
# GET USA MAP
# - this is the map of the USA states, upon which
#   we will plot our city data points
#================================================

map.states <- map_data("state")
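
It helps to look at what map_data() returns before plotting, since that explains the `group = group` aesthetic used below. A short sketch, assuming the maps package is installed (map_data() depends on it):

```r
library(ggplot2)   # map_data() also needs the 'maps' package installed

map.states <- map_data("state")

# one row per polygon vertex; long/lat are the vertex coordinates
names(map.states)

# each 'group' value is one closed polygon; a state made of several
# pieces spans several groups, so there are at least as many groups
# as regions
n.groups  <- length(unique(map.states$group))
n.regions <- length(unique(map.states$region))
n.groups >= n.regions
```

Without `group = group`, geom_polygon() would join all vertices into a single path and draw lines across the map.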

#====================================
# PLOT
# - here, we're actually creating the 
#   data visualizations with ggplot()
#====================================

Finally, we start plotting. As usual, we do a first iteration to check that everything is correct.
#------------------------------------------------
# FIRST ITERATION
# - this is just a 'first pass' to check that
#   everything looks good before we take the time
#   to format it
#------------------------------------------------
ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct))


Judging by the color gradient, everything looks fine. The points are in the right places, and overall the plot looks reasonable.

Keep in mind that, compared to the finalized version below, the 'first iteration' is much simpler to build. This is a great example of the 80/20 rule in data analysis: in this visualization, you can get 80% of the way with only 20% of the total ggplot() code.

Now that we have a first version, we polish it by adding titles, formatting the theme elements, and adjusting the rest of the plot.

#--------------------------------------------------
# FINALIZED VERSION (FORMATTED)
# - this is the 'finalized' version with all of the
#   detailed formatting
#--------------------------------------------------

ggplot() +
  geom_polygon(data = map.states, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), alpha = .5) +
  geom_point(data = df.amz_cities, aes(x = long, y = lat, size = population_tot, color = bachelors_degree_pct*.01), shape = 1) +
  coord_map(projection = "albers", lat0 = 30, lat1 = 40, xlim = c(-121,-73), ylim = c(25,51)) +
  scale_color_gradient2(low = "red", mid = "yellow", high = "green", midpoint = .41, labels = scales::percent_format()) +
  scale_size_continuous(range = c(.9, 11),  breaks = c(2000000, 10000000, 20000000),labels = scales::comma_format()) +
  guides(color = guide_legend(reverse = TRUE, override.aes = list(alpha = 1, size = 4))) +
  labs(color = "Bachelor's Degree\nPercent"
       ,size = "Total Population\n(metro area)"
       ,title = "Possible cities for new Amazon Headquarters"
       ,subtitle = "Based on population & percent of people with college degrees") +
  theme(text = element_text(colour = "#444444", family = "Gill Sans")
        ,panel.background = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,axis.text = element_blank()
        ,plot.title = element_text(size = 28)
        ,plot.subtitle = element_text(size = 12)
        ,legend.key = element_rect(fill = "white")
        )
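
The label helpers from scales explain why the final version multiplies bachelors_degree_pct by .01: percent_format() expects proportions, not whole percentages. A small sketch of the two formatters used above, applied to the midpoint and one of the size breaks from the plot code:

```r
library(scales)

# percent_format() returns a function that multiplies by 100 and adds "%"
pct.label <- percent_format()(0.41)
pct.label   # "41%"

# comma_format() returns a function that inserts grouping commas
pop.label <- comma_format()(2000000)
pop.label   # "2,000,000"
```

This assumes the scraped percentages are whole numbers such as 41 rather than proportions such as 0.41. If they were already proportions, the `*.01` would need to be dropped.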