Douban Movie Scraping and Data Analysis with R

Article link: https://blog.csdn.net/weixin_51463905/article/details/118340497

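This article shows how to scrape, process, and analyze data in R. It covers scraping with the rvest package, string handling with stringr, aggregation with dplyr, visualization with ggplot2, and word clouds with wordcloud2, along with regular expressions and the sapply function.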


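Main topics:
1. Web scraping in R with the rvest package.
2. String processing in R with the stringr package.
3. Aggregation in R with the dplyr package.
4. Visualization in R with the ggplot2 package.
5. Word clouds in R with the wordcloud2 package.
6. Regular-expression matching with str_match.
7. Using sapply.
8. Splitting strings with str_split.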

**Code snippet 1 (string splitting and regular-expression matching)**

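```r
> library(stringr)
> (a <- "2017-12-25")
[1] "2017-12-25"
> # Split on "-"; str_split() returns a list with one character vector per input string
> (b <- str_split(a, "-"))
[[1]]
[1] "2017" "12"   "25"  

> # Lazily capture the text between the first two hyphens; column 2 holds the capture group
> (c <- str_match(a, "-(.*?)-")[, 2])
[1] "12"
```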
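
Both functions are vectorized, so they also work on a whole vector of dates at once. The following is a minimal sketch, assuming a made-up vector `dates` that is not part of the original post. It uses `simplify = TRUE` so that `str_split` returns a matrix, and a pattern with three capture groups for `str_match`.

```r
library(stringr)

# Hypothetical input; these dates are made up for illustration.
dates <- c("2017-12-25", "2018-01-05")

# simplify = TRUE returns a character matrix (one row per string)
# instead of a list of character vectors.
parts <- str_split(dates, "-", simplify = TRUE)

# str_match() returns a matrix: column 1 is the full match,
# columns 2 to 4 are the capture groups (year, month, day).
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
years  <- m[, 2]
months <- m[, 3]
days   <- m[, 4]
```

Indexing by column, as in `m[, 3]`, pulls one capture group for every input string, which is the same idea as the `[,2]` in the snippet above.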

**Code snippet 2 (sapply: applies a user-defined function to every element, much like map in Scala)**

```r
(d <- c(1, 2, 3, 4, 5, 6, 7, 8, 9))
# Multiply each element by 2
(e <- sapply(d, function(x) x * 2))
```
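
To put `sapply` in context, here is a small base-R sketch comparing it with a vectorized operator and with `vapply`. The variable names `e_vectorized`, `e_checked`, and `parity` are illustrative and do not come from the original post.

```r
d <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# sapply() with an anonymous function, as in the snippet above.
e <- sapply(d, function(x) x * 2)

# For simple arithmetic, R's vectorized operators give the same result.
e_vectorized <- d * 2

# vapply() is a stricter relative: the third argument declares that
# each call must return exactly one numeric value.
e_checked <- vapply(d, function(x) x * 2, numeric(1))

# sapply() is more useful when the per-element logic is not vectorized,
# for example labelling each number with if/else.
parity <- sapply(d, function(x) if (x %% 2 == 0) "even" else "odd")
```

For plain arithmetic the vectorized form is usually simpler; `sapply` earns its place when each element needs its own branching or a custom function.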

**Code snippet 3 (rvest scraping, parsed with the `%>%` pipe)**

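```r
library(rvest)

# `url` must already hold the address of the page to scrape.
# Note: html_session() is deprecated since rvest 1.0.0 in favor of session().
# Read the page content
page <- html_session(url)
# Get the movie links
movie_url <- html_nodes(page, 'p>a') %>% html_attr("href")

# Get the movie titles
movie_name <- html_nodes(page, 'p>a') %>% html_text()
```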
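
The selector logic can be tried offline before pointing it at a live site. The sketch below is an illustration under stated assumptions: the HTML string and the example.com links are invented, and `read_html` parses that literal string rather than fetching a page. The `'p>a'` selector and the `html_attr` / `html_text` calls are the same as in the snippet above.

```r
library(rvest)

# Hypothetical HTML standing in for a scraped page; the titles and
# URLs are placeholders, not real Douban data.
html <- '<html><body>
  <p><a href="https://example.com/movie/1">Movie A</a></p>
  <p><a href="https://example.com/movie/2">Movie B</a></p>
  <div><a href="https://example.com/other">Not a movie</a></div>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(html)

# 'p>a' selects only <a> elements that are direct children of <p>.
links <- page %>% html_nodes("p>a")

movie_url  <- html_attr(links, "href")
movie_name <- html_text(links)

# Combine both vectors into one data frame for later analysis.
movies <- data.frame(movie_name, movie_url, stringsAsFactors = FALSE)
```

Selecting the nodes once and reusing `links` avoids running the same CSS query twice.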

**Code snippet 4 (dplyr group_by and summarise: grouped sums)**

```r
library(dplyr)

# Aggregate: group by country, then sum Freq within each group
groupby_countrys <- group_by(df, countries)
df <- summarise(groupby_countrys, Freq = sum(Freq))
```
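
To see what the aggregation does, it helps to run it on a tiny data frame. The data below is made up for illustration; only the column names `countries` and `Freq` follow the snippet above. The same two steps are written as a single pipeline.

```r
library(dplyr)

# Hypothetical data; the real df in the post comes from the scraped pages.
df <- data.frame(
  countries = c("USA", "China", "USA", "Japan", "China"),
  Freq = c(3, 2, 4, 1, 5),
  stringsAsFactors = FALSE
)

# Same aggregation written as a pipeline: one row per country,
# with Freq summed within each group.
totals <- df %>%
  group_by(countries) %>%
  summarise(Freq = sum(Freq))
```

Here `totals` has three rows, one per distinct country, and the two rows for each repeated country are collapsed into their sum.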

**Code snippet 5 (sorting with arrange)**

```r
# Sort by Freq in descending order
df <- arrange(df, desc(Freq))
```
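
`arrange` accepts several sort keys, which is handy when counts are tied. A short sketch, assuming an invented data frame with the same column names as above:

```r
library(dplyr)

# Hypothetical aggregated counts, made up for illustration.
df <- data.frame(
  countries = c("Japan", "USA", "China", "France"),
  Freq = c(1, 7, 7, 3),
  stringsAsFactors = FALSE
)

# Descending by Freq; ties (USA and China) are then ordered by name.
sorted <- arrange(df, desc(Freq), countries)

# Keep only the first two rows, e.g. for a "top N" chart.
top2 <- head(sorted, 2)
```

Sorting first and then taking `head()` is the same pattern the plotting code uses when it picks the top 10 movies.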

**Code snippet 6 (bar chart with ggplot2)**

```code
    # 1. Top 10 movies by number of raters
    # Set up the data for the plot
    p <- ggplot(data = arrange(raw_data, desc(evalue_users))[1:10,],
                mapping = aes(x = reorder(movie_name,-evalue_users),
                              y = evalue_users)) +
      # Limit the displayed range of the y-axis
      coord_cartesian(ylim = c(500000, 750000)) +
      # Format the y-axis labels; 'W' abbreviates 万 (wan), i.e. units of 10,000
      scale_y_continuous(breaks = seq(500000, 750000, 100000),
                         labels = paste0(round(seq(500000, 750000, 100000)/10000, 2), 'W')) +
      # Draw the bars
      geom_bar(stat = 'identity', fill = 'steelblue') +
      # Add axis labels and a title
      labs(x = NULL, y = 'Number of raters', title = 'Top 10 movies by number of raters') +
      # Tilt the x-axis labels by 60 degrees
      theme(axis.text.x = element_text(angle = 60, vjust = 0.5),
            plot.title = element_text(hjust &