Web Scraping in R

Objective: To programmatically fetch the latest product prices hosted on a competitor’s website.

For the purpose of demonstration, let’s look at the websites of WeWork and Regus, two leading players in the co-working industry who compete with each other to offer hot desks, dedicated desks, and private offices across the globe. Let’s try to scrape their websites to programmatically retrieve the latest product price listings for California.

There were four milestones to accomplish the objective:

  1. Web scraped Regus sites using httr/rvest packages.
  2. Cleaned the dataset and incorporated geospatial coordinates.
  3. Repeated steps 1 & 2 for WeWork websites.
  4. Embedded R script in Power BI and visualized the final output.

Phase 1: Web scraped Regus sites using httr/rvest packages

  • Step 1.1. Imported Libraries: Imported all the relevant libraries upfront.
library(tidyverse)
library(rvest)
library(revgeo)
library(opencage)
library(dplyr)
library(sqldf)
library(formattable)
library(stringr)
library(ngram)
library(httr)
library(rlist)
library(jsonlite)
library(lubridate)
library(splitstackshape)
  • Step 1.2. Regus Location API: Extracted the co-working locations in California from the Regus location API using the httr package.
options(warn = -1)
time <- Sys.time()
location_data_final <- data.frame()

# Call the Regus location API and parse the JSON response
url <- "https://www.regus.com/api/search/centres"
resp <- GET(url)
http_type(resp)
jsonRespText <- content(resp, as = "text")
jsonRespParsed <- content(resp, as = "parsed")

regus_locations <- fromJSON(jsonRespText)
regus_locations <- regus_locations %>% select(CenterNumber
                                              ,CenterName
                                              ,Latitude, Longitude
                                              ,FormattedAddress
                                              ,CenterCity)

Phase 2: Cleaned the dataset and incorporated geospatial coordinates

  • Step 2.1. Reverse Geocoding: Used the revgeo package to identify the street, city, state, and country of the lat/lng coordinates.
# Reverse geocode each centre's coordinates into a street/city/state/country frame
reverse_geocode <- revgeo(longitude = regus_locations$Longitude
                          ,latitude = regus_locations$Latitude
                          ,output = 'frame')

reverse_geocode <- cbind(regus_locations, reverse_geocode)

regus_locations_full <- regus_locations %>% inner_join(
  reverse_geocode, by = c("CenterNumber"))

# Keep only the Californian centres (revgeo returns lowercase column names)
regus_california <- regus_locations_full %>% filter(
  country == "United States of America"
  ,state == "California")
  • Step 2.2. Regus Product Pricing API: Extracted the Regus co-working product pricing for each location in California, then wrapped the whole data-extraction process in a for loop.
# Loop over every Californian centre and pull its dedicated-desk price
dedicated_desk <- data.frame()

for (i in 1:as.integer(regus_california %>% summarise(n()))){

  url <- paste("https://www.regus.com/ecommercevo/price?productType=platinum&daysLength=30&monthLength=1&startDate=", Sys.Date(), "&countryCode=US&centreNumber=", regus_california$CenterNumber[i], sep = "")

  resp <- GET(url)

  # Status code 500 path: retry against the fallback pricing endpoint
  if(resp$status_code == 500){
    url <- paste("https://www.regus.com/ecommercevo/price?productType=platinum&monthLength=1&startDate=", Sys.Date(), "&countryCode=US&centreNumber=", "&_=", as.integer(as.numeric(Sys.time())), 924, sep = "")

    resp <- GET(url)
    jsonRespText <- content(resp, as = "text")
    vo_list <- fromJSON(jsonRespText)

    if(!is.null(vo_list$Price)){
      dedicated_desk_temp <- as_data_frame(vo_list$Price) %>%
        rename(dedicated_desk_price = value)
      dedicated_desk_temp$center_number <-
        as.integer(regus_california$CenterNumber[i])}

    # No published price: record it as $0
    if(is.null(vo_list$Price)){
      dedicated_desk_temp <- as_data_frame("$0") %>%
        rename(dedicated_desk_price = value)
      dedicated_desk_temp$center_number <-
        as.integer(regus_california$CenterNumber[i])}

    dedicated_desk <- rbind(dedicated_desk, dedicated_desk_temp)}

  else if (resp$status_code == 200){
    # Status code 200 path: parse the price straight from the response
    jsonRespText <- content(resp, as = "text")
    vo_list <- fromJSON(jsonRespText)

    if(!is.null(vo_list$Price)){
      dedicated_desk_temp <- as_data_frame(vo_list$Price) %>%
        rename(dedicated_desk_price = value)
      dedicated_desk_temp$center_number <-
        as.integer(regus_california$CenterNumber[i])}

    if(is.null(vo_list$Price)){
      dedicated_desk_temp <- as_data_frame("$0") %>%
        rename(dedicated_desk_price = value)
      dedicated_desk_temp$center_number <-
        as.integer(regus_california$CenterNumber[i])
    }
    dedicated_desk <- rbind(dedicated_desk, dedicated_desk_temp)
  }
}

# Attach the scraped prices back to the Californian centre list
regus_california <- regus_california %>% left_join(
  dedicated_desk, by = c("CenterNumber" = "center_number"))

Phase 3: Web scraped & cleaned WeWork datasets

  • Step 3.1. WeWork Product Pricing: Scraped the WeWork co-working product pricing for each location in California from the listing page with rvest, then wrapped the whole data-extraction process in a for loop.
url <- c("https://www.wework.com/l/sf-bay-area--CA")

location_data_final <- data.frame()

# Scrape the site names and pricing-card text from each listing page
for(i in url){
  webpage <- read_html(i)
  location_data <-
    html_nodes(webpage, ".mb0, .ray-card__content") %>%
    html_text() %>%
    enframe()
  location_data$urlid <- i
  location_data_final <- rbind(location_data_final
                               ,location_data)}

# Even-numbered rows hold the short site names...
wework_1.a <- location_data_final %>% select(value) %>%
  mutate(row_num = row_number(), filter_check = row_num %% 2 == 0) %>%
  filter(filter_check == "TRUE") %>% select(-filter_check) %>%
  select(-row_num) %>% rename(site_name_short = value)

# ...and odd-numbered rows hold the full card text
wework_1.b <- location_data_final %>% select(value) %>%
  mutate(row_num = row_number(), filter_check = row_num %% 2 != 0) %>%
  filter(filter_check == "TRUE") %>% select(-filter_check) %>%
  select(-row_num)

wework_1 <- cbind(wework_1.a, wework_1.b)
  • Step 3.2. Cleaned WeWork dataset: Cleaned the raw dataset by extracting the required information out of the WeWork dataset.
# Collapse whitespace, then drop cards without a published price
wework_1 <- wework_1 %>% rename(full_info = value) %>% mutate(
  full_info = str_squish(full_info))

wework_1 <- sqldf("select * from wework_1
  where full_info not like '%Pricing for this location is not yet available%'")

wework_1 <- sqldf("select * from wework_1
  where full_info not like '%Move in ahead of the curve with special pre-opening rates%'")

# Split the per-product prices out of the card text
wework_1.1 <- str_split_fixed(wework_1$full_info,
  "Starting prices", 2) %>% as.data.frame()

wework_1.2 <- as.data.frame(wework_1.1$V2) %>%
  rename(value = `wework_1.1$V2`)

wework_1.2 <- separate(wework_1.2, value,
  c("Private Office", "Price"),
  sep = "Private Office")

wework_1.2 <- separate(wework_1.2, Price,
  c("private_office_price", "Price"),
  sep = "Dedicated Desk")

wework_1.2 <- separate(wework_1.2, Price,
  c("dedicated_desk_price", "hot_desk_price"),
  sep = "Hot Desk")

wework_interim <- cbind(wework_1, wework_1.1, wework_1.2) %>%
  select(-full_info, -V2) %>%
  rename(site_name = V1)

# Count the words in each short site name so the building name can be isolated
wordcount_final <- data.frame()

for(i in 1:nrow(wework_interim)){
  wordcount_temp <- enframe(wordcount(wework_interim$site_name_short[i])) %>%
    select(value)
  wordcount_final <- rbind(wordcount_final, wordcount_temp)}

wework_pricing <- cbind(wework_interim, wordcount_final) %>%
  rename(word_count = value) %>% select(-`Private Office`) %>%
  mutate(building_name = word(site_name, word_count + 1, -1)) %>%
  select(-word_count) %>%
  mutate(site_name = building_name
         ,date_time_Stamp = format(Sys.time(), "%d-%m-20%y")
         ,country = "United States"
         ,company = "wework", currency_name = "US Dollar"
         ,currency_short = "USD"
         ,web_url = "https://www.wework.com/l/united-states")
  • Step 3.3. Reverse Geocoding: Forward geocoded the site names with the opencage package, then used the revgeo package to identify the street, city, state, and country of the lat/lng coordinates.
# Forward geocode each WeWork site name to lat/lng via the OpenCage API
output_final_lat <- data.frame()
output_final_lng <- data.frame()
output_final_state <- data.frame()

for (i in 1:length(wework_pricing$site_name)) {
  output_temp <- opencage_forward(placename = wework_pricing$site_name[i]
                                  , key = "d3f30e7282414d52ba36461e84613c34")

  output_final_lat <- bind_rows(output_final_lat,
    enframe(output_temp$results$geometry.lat[[1]]))

  output_final_lng <- bind_rows(output_final_lng,
    enframe(output_temp$results$geometry.lng[[1]]))

  output_final_state <- bind_rows(output_final_state,
    enframe(output_temp$results$components.state[[1]]))}

wework_pricing$lat <- output_final_lat$value
wework_pricing$lng <- output_final_lng$value
wework_pricing$state <- output_final_state$value

# Reverse geocode the coordinates to recover city and postcode
reverse_geocode <- revgeo(longitude = wework_pricing$lng,
                          latitude = wework_pricing$lat, output = 'frame')

wework_pricing <- wework_pricing %>% mutate(
  street_name = word(site_name, 2, 3),
  city = reverse_geocode$city,
  postcode = reverse_geocode$zip)

Phase 4: Embedded R script in Power BI and visualized the final output
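
With both data frames in hand, the last step was to load them into Power BI through its R script connector (Get Data > R script) so the price listings could be plotted on a map. Below is a minimal sketch of the kind of script that connector could run; the harmonised output columns (company, site, city, product, price, lat, lng) and the use of readr::parse_number to strip the "$" sign are illustrative assumptions, not the exact fields of the published report. Note that the inner join in Phase 2 duplicates the Regus location columns, hence the ".x" suffixes.

# A minimal sketch (assumed column names) for Power BI's R script connector:
# stack the Regus and WeWork prices into one long table for the map visual.
library(dplyr)

regus_long <- regus_california %>%
  transmute(company = "Regus",
            site = CenterName.x,       # ".x" suffix from the Phase 2 inner join
            city = CenterCity.x,
            product = "Dedicated Desk",
            price = readr::parse_number(as.character(dedicated_desk_price)),
            lat = Latitude.x,
            lng = Longitude.x)

wework_long <- wework_pricing %>%
  transmute(company = "wework",
            site = site_name,
            city = city,
            product = "Dedicated Desk",
            price = readr::parse_number(as.character(dedicated_desk_price)),
            lat = lat,
            lng = lng)

# Power BI lists every data frame left in the environment as a loadable table
competitor_prices <- bind_rows(regus_long, wework_long)

The lat/lng pair then feeds the map visual, while price and company drive the tooltip and colour encoding.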

Geospatial Competitor Analysis 1 (Image by Author)
Geospatial Competitor Analysis 2 (Image by Author)
Geospatial Competitor Analysis 3 (Image by Author)

Conclusion

So, in a nutshell, we have successfully extracted the price listings per product per location for two competitors using web scraping techniques in R. Finally, we visualized the output on a geospatial scale to build the data story.

What’s Next?

This is just the starting point of a journey to endless possibilities in competitor intelligence. We could do much more with the extracted data, such as the following:

  • Expand the web harvesting technique to build and analyze the trend globally.
  • Automate the whole process of data extraction to build a real-time product pricing model.
  • Feed the daily product price listings into cloud storage (S3/Google Cloud) to monitor the competitor's daily price fluctuations in a BI platform.
  • Aggregate the price listings per product per location over a period of time to reap insights from week-on-week, month-on-month, and year-on-year analysis.
  • Auto-compute the average product price per city to see if we are underpriced or overpriced compared to the competitor's price listings at any point in time (a small sketch of this follows the list).
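
As a sketch of that last idea, assume the competitor_prices table built in the Power BI step and a hypothetical our_prices data frame holding our own listings with matching city, product, and price columns; the per-city comparison could then look like this:

# A sketch only: our_prices is a hypothetical data frame of our own listings
# with city/product/price columns; competitor_prices comes from Phase 4.
library(dplyr)

price_position <- competitor_prices %>%
  group_by(city, product) %>%
  summarise(competitor_avg_price = mean(price, na.rm = TRUE),
            .groups = "drop") %>%
  left_join(our_prices %>%
              group_by(city, product) %>%
              summarise(our_avg_price = mean(price, na.rm = TRUE),
                        .groups = "drop"),
            by = c("city", "product")) %>%
  mutate(position = if_else(our_avg_price < competitor_avg_price,
                            "underpriced", "overpriced"))

Re-running this after every scrape would show, city by city, whether our prices sit above or below the competitor average at that point in time.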

GitHub Repository

I have learned (and continue to learn) from many folks on GitHub. Hence I am sharing my entire R scripts and Power BI file in a public GitHub repository, in case they benefit anyone online. Also, feel free to reach out to me if you need any help understanding the fundamentals of web scraping in R. Happy to share what I know :) Hope this helps!

About the Author

Translated from: https://medium.com/swlh/web-scraping-in-r-978dbcb79b
