R: Searching Douban for Movies

Tracy数据

Search Douban for a movie by name and return its rating. Let's go.

library(RCurl)
library(XML)
movieScore <- function(x) {
    stopifnot(is.character(x))
    # Submit the Douban search form
    searchres <- getForm("http://movie.douban.com/subject_search", search_text = x)
    searchweb <- htmlParse(searchres)
    # Parse the search results page and collect the result links
    resnodes <- getNodeSet(searchweb, "//div[@id='wrapper']//table[1]//a")
    # getNodeSet() returns an empty list (not NULL) when nothing matches
    if (length(resnodes) == 0) 
        return(NULL)
    resurl <- xmlGetAttr(resnodes[[1]], name = "href")
    # Fetch the film page of the first result and parse it
    resweb <- getURL(resurl, .encoding = "UTF-8")
    content <- htmlParse(resweb, encoding = "UTF-8")
    scorenodes <- getNodeSet(content, "//div[@id='interest_sectl']//p[@class='rating_self clearfix']//strong")
    namenodes <- getNodeSet(content, "//div[@id='content']//h1//span")
    if (length(scorenodes) == 0 || length(namenodes) == 0) 
        return(NULL)
    # Extract the film title and rating
    score <- xmlValue(scorenodes[[1]])
    name <- xmlValue(namenodes[[1]])
    return(list(name = name, score = score))
}
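The XPath step can be tried offline on a small inline snippet. The sketch below is only an illustration: the markup is a made-up stand-in for a film page (the real layout may differ), and `firstValue` is a hypothetical helper, not part of the XML package. It also shows why a "no match" check should test the length of the node set rather than `is.null()`.

```r
library(XML)

# Hypothetical markup standing in for a film page; the real page may differ.
html <- paste0(
    "<html><body>",
    "<div id='content'><h1><span>Example Film</span></h1></div>",
    "<div id='interest_sectl'>",
    "<p class='rating_self clearfix'><strong>2.9</strong></p>",
    "</div></body></html>"
)
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Hypothetical helper: text of the first node matching an XPath, or NA if none.
firstValue <- function(doc, xpath) {
    nodes <- getNodeSet(doc, xpath)
    # An unmatched XPath gives a zero-length node set, not NULL
    if (length(nodes) == 0) NA_character_ else xmlValue(nodes[[1]])
}

name <- firstValue(doc, "//div[@id='content']//h1//span")
score <- firstValue(doc, "//div[@id='interest_sectl']//strong")
absent <- firstValue(doc, "//div[@id='no_such_id']//a")
```

Here `name` and `score` hold the text of the matched nodes, while `absent` is `NA` because the last XPath matches nothing.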

Let's see what the notorious flop 天机 scores.

movieScore("天机")

## $name
## [1] "天机·富春山居图"
## 
## $score
## [1] "2.9"

Scraping pages is slow. Douban helpfully provides an API, so we can fetch ratings through it instead, and the function is also fairly simple.

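library(RCurl)
library(RJSONIO)
movieScoreapi <- function(x) {
    stopifnot(is.character(x))
    api <- "https://api.douban.com/v2/movie/search?q="
    # URL-encode the search term (the literal { } around it are dropped)
    url <- paste0(api, curlEscape(x))
    res <- getURL(url, .encoding = "UTF-8")
    reslist <- fromJSON(res)
    # No matching film: nothing to report
    if (length(reslist$subjects) == 0) 
        return(NULL)
    # Take the title and average rating of the first match
    name <- reslist$subjects[[1]]$title
    score <- reslist$subjects[[1]]$rating$average
    return(list(name = name, score = score))
}
movieScoreapi("僵尸世界大战")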

## $name
## [1] "僵尸世界大战"
## 
## $score
## [1] 7.5
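A search term that contains spaces, `&`, or non-ASCII characters should be percent-encoded before it goes into the query string. As a minimal sketch using only base R (`buildSearchUrl` is a hypothetical helper name, and the endpoint is simply the one used above):

```r
# Hypothetical helper: build the search URL with a percent-encoded term.
buildSearchUrl <- function(title,
                           base = "https://api.douban.com/v2/movie/search") {
    stopifnot(is.character(title), length(title) == 1)
    # reserved = TRUE also encodes characters such as & and = inside the term
    paste0(base, "?q=", utils::URLencode(enc2utf8(title), reserved = TRUE))
}

url1 <- buildSearchUrl("World War Z")
url2 <- buildSearchUrl("Tom & Jerry")
```

The resulting string can be passed to `getURL()` in place of the hand-pasted URL.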

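With this lookup function we can query movie ratings in bulk from R. Douban does limit frequent access, though: unauthenticated API use is capped at 10 requests per minute, and going over that gets your IP temporarily blocked. On web scraping, 肖楠 (Xiao Nan) gave an excellent talk at the 6th R Conference; if you're interested, you can find it on 统计之都.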
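One way to stay under the 10-requests-per-minute limit is to pause between lookups. The sketch below is only an illustration: `batchScores` is a hypothetical helper, and `lookup` can be any function that takes a title and returns a list with `name` and `score`, such as the two functions above. The default delay of 6 seconds is simply 60 seconds divided by 10 requests; a slightly larger value leaves more headroom.

```r
# Hypothetical helper: look up several titles with a pause between requests.
batchScores <- function(titles, lookup, delay = 6) {
    stopifnot(is.character(titles), is.function(lookup))
    results <- vector("list", length(titles))
    names(results) <- titles
    for (i in seq_along(titles)) {
        # Wait before every request except the first
        if (i > 1) Sys.sleep(delay)
        res <- tryCatch(lookup(titles[i]), error = function(e) NULL)
        # A failed or empty lookup becomes a placeholder so positions stay aligned
        if (is.null(res)) res <- list(name = titles[i], score = NA)
        results[[i]] <- res
    }
    results
}

# Offline stand-in for a real lookup, used only to exercise the helper
fakeLookup <- function(x) {
    if (x == "bad") stop("lookup failed")
    list(name = x, score = 7.5)
}
out <- batchScores(c("a", "bad", "c"), fakeLookup, delay = 0)
```

With a real lookup the call would look like `batchScores(c("天机", "僵尸世界大战"), movieScoreapi)`.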
