How to build a Hacker News Frontpage scraper with just 7 lines of R code

by AMR

Web scraping used to be a difficult task requiring expertise in XML tree parsing and HTTP requests. But with new-age scraping libraries like BeautifulSoup (for Python) and rvest (for R), web scraping has become a toy for any beginner to play with.


This post aims to show how simple it is to use R, a very nice programming language, to perform Data Analysis and Data Visualization. The task ahead is simple: build a web scraper that scrapes the content of one of the most popular pages on the Internet (at least among coders), the Hacker News front page.


Package Installation and Loading

The R package we are going to use is rvest. It can be installed from CRAN and loaded into R as shown below:


install.packages('rvest')  # install once from CRAN
library(rvest)

The read_html() function of rvest extracts the HTML content of the URL given as its argument:


content <- read_html('https://news.ycombinator.com/')

For read_html() to work without any concern, make sure you are not behind an organization firewall. If you are, configure RStudio with a proxy to bypass the firewall; otherwise you may face a connection timed out error.

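If you do need to go through a proxy, one minimal approach is to set the standard proxy environment variables from within R before calling read_html(). The host and port below are placeholders, not real values:

# Hypothetical proxy address; replace host and port with your organization's values
Sys.setenv(http_proxy  = 'http://proxy.example.com:8080',
           https_proxy = 'http://proxy.example.com:8080')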

Below is the screenshot of HN front page layout (with key elements highlighted):


Now, with the HTML content of the Hacker News front page loaded into the R object content, let us extract the data that we need — starting with the Title.


There is one particularly important aspect of making any web scraping assignment successful: identifying the right CSS selector or XPath for the HTML elements whose values are to be scraped. The easiest way to find the right value is the inspect tool in any browser's Developer Tools.


Here’s a screenshot of the CSS selector value, highlighted by the Chrome inspect tool when hovering over the title of a link on the Hacker News front page.


title <- content %>% html_nodes('a.storylink') %>% html_text()
title
 [1] "Magic Leap One"
 [2] "Show HN: Terminal – native micro-GUIs for shell scripts and command line apps"
 [3] "Tokio internals: Understanding Rust's async I/O framework"
 [4] "Funding Yourself as a Free Software Developer"
 [5] "US Federal Ban on Making Lethal Viruses Is Lifted"
 [6] "Pass-Thru Income Deduction"
 [7] "Orson Welles' first attempt at movie-making"
 [8] "D’s Newfangled Name Mangling"
 [9] "Apple Plans Combined iPhone, iPad, and Mac Apps to Create One User Experience"
[10] "LiteDB – A .NET NoSQL Document Store in a Single Data File"
[11] "Taking a break from Adblock Plus development"
[12] "SpaceX’s Falcon Heavy rocket sets up at Cape Canaveral ahead of launch"
[13] "This is not a new year’s resolution"
[14] "Artists and writers whose works enter the public domain in 2018"
[15] "Open Beta of Texpad 1.8, macOS LaTeX editor with integrated real-time typesetting"
[16] "The triumph and near-tragedy of the first Moon landing"
[17] "Retrotechnology – PC desktop screenshots from 1983-2005"
[18] "Google Maps' Moat"
[19] "Regex Parser in C Using Continuation Passing"
[20] "AT&T giving $1000 bonus to all its employees because of tax reform"
[21] "How a PR Agency Stole Our Kickstarter Money"
[22] "Google Hangouts now on Firefox without plugins via WebRTC"
[23] "Ubuntu 17.10 corrupting BIOS of many Lenovo laptop models"
[24] "I Know What You Download on BitTorrent"
[25] "Carrie Fisher’s Private Philosophy Coach"
[26] "Show HN: Library of API collections for Postman"
[27] "Uber is officially a cab firm, says European court"
[28] "The end of the Iceweasel Age (2016)"
[29] "Google will turn on native ad-blocking in Chrome on February 15"
[30] "Bitcoin Cash deals frozen as insider trading is probed"

The rvest package supports the pipe (%>%) operator. The R object containing the HTML page content (read with read_html) can be piped into html_nodes(), which takes a CSS selector or XPath as its argument and extracts the matching HTML nodes, whose text values can then be extracted with the html_text() function.

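The pipe is just a readable way to chain these calls; the same title extraction can be written as plain nested function calls. A minimal equivalent sketch:

# Equivalent to the piped version above: the innermost call runs first
title <- html_text(html_nodes(content, 'a.storylink'))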

The beauty of rvest is that it abstracts the entire XML parsing operation under the hood of functions like html_nodes() and html_text(), making it easier for us to achieve our scraping goal with minimal code.


As with the title, the CSS selector values of the other required elements of the page can be identified with the Chrome inspect tool, passed as arguments to the html_nodes() function, and the respective values extracted and stored in R objects.


link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
score <- content %>% html_nodes('span.score') %>% html_text()
age <- content %>% html_nodes('span.age') %>% html_text()

All the essential pieces of information have been extracted from the page. Now an R data frame can be built from these elements to put the data into a structured format.


df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)

Below is the screenshot of the final dataframe in RStudio viewer:

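One optional refinement (not part of the original 7 lines): score and age come back as display strings such as "123 points", so a numeric score can be recovered by stripping the suffix. A sketch, assuming that format:

# Assumes scores look like '123 points' (or '1 point'); strip the suffix and convert
df$score <- as.numeric(gsub(' points?$', '', df$score))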

Thus, in just 7 lines of code, we have successfully built a Hacker News Frontpage Scraper in R.

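For reference, here are those 7 lines collected into a single script; this is a direct consolidation of the snippets above, with no new logic:

library(rvest)
content <- read_html('https://news.ycombinator.com/')
title <- content %>% html_nodes('a.storylink') %>% html_text()
link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
score <- content %>% html_nodes('span.score') %>% html_text()
age <- content %>% html_nodes('span.age') %>% html_text()
df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)

One caveat: data.frame() expects all four vectors to be the same length, so if a front-page item lacks one of the elements (job postings, for example, carry no score), the final line will error until the vectors are aligned.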

R is a wonderful language to perform Data Analysis and Data Visualization. The code used here is available on my GitHub.


Translated from: https://www.freecodecamp.org/news/how-to-build-a-hacker-news-frontpage-scraper-with-just-7-lines-of-r-code-221af6acb98/
