Building Your First Web Scraper, Part 1

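Rubyland has two gems that have held the web scraping spotlight for the past few years: Nokogiri and Mechanize. We spend an article on each of them before putting them into action with a practical example.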

Topics

  • Web Scraping?
  • Permission
  • The Problem
  • Nokogiri
  • Extraction?
  • Pages
  • API
  • Node Navigation

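Web Scraping?

There are fancier terms around than web or screen scraping. Web harvesting and web data extraction pretty much tell you right away what is going on. We can automate the extraction of data from web pages, and it is not that complicated either.

In a way, these tools let you imitate and automate human web browsing. You write a program that extracts only the kind of data that interests you. Targeting specific data is almost as easy as using CSS selectors.

A few years ago, I subscribed to an online video course that had something like a million short videos but no option to download them in bulk. I had to go through every link myself and do the dreaded "Save As" by hand. That was human web scraping, something we often end up doing when we lack the knowledge to automate this kind of work. The course itself was fine, but I stopped using the service afterwards. It was just too tedious.

Today, I wouldn't care much about such mind-melting UX. A scraper that does the downloading for me would take only a couple of minutes to throw together. No biggie!

Before we start, let me break it down quickly. The whole process boils down to a few steps. First, we fetch a web page that contains the data we need. Then we search through that page and identify the information we want to extract.

The final step is to target these bits, slice them up if necessary, and decide how and where to store them. Well-written HTML is often the key to making this process easy and enjoyable. For more involved extractions, poorly structured markup can be a real pain.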

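What about APIs? Very good question. If you have access to a service through an API, there is usually little need to write your own scraper. Scraping is mostly for websites that don't offer that kind of convenience. Without an API, it is often the only way to automate the extraction of information from a website.

You might ask, how does this scraping thing actually work? Without jumping into the deep end, the short answer is: by traversing tree data structures. Nokogiri builds these data structures from the documents you feed it and lets you target the bits of interest for extraction. CSS selectors, for example, are a language for searching tree data structures, and we can make use of them for data extraction.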

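There are lots of approaches and solutions out there to play with. Rubyland has two gems that have been in the spotlight for a number of years now. Many people still rely on Nokogiri and Mechanize for their HTML scraping needs. Both are well tested and have proven themselves easy to use while being highly capable. We will look at both of them. Before we do, though, I'd like to take a moment to describe the problem we are going to solve at the end of this short introductory series.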

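Permission

Before you start scraping away, make sure you have permission from the site you are trying to extract data from. If the site has an API or an RSS feed, for example, that might not only be the easier way to get the content you want, it might also be the legitimate option.

Not everybody will appreciate it if you do massive scraping on their site, understandably so. Educate yourself about the particular site you are interested in, and don't get yourself into trouble. The chances that you will do serious damage are low, but risking trouble unknowingly is not the way to go.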

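The Problem

I needed to build a new podcast site. The design was not where I wanted it to be, and I hated the way I had to publish new posts. Damn WYSIWYGs! A little bit of context: about two years ago, I built the first version of my podcast site. The idea was to play with Sinatra and build something super lightweight. Since I tailor-made pretty much everything, I ran into a couple of unexpected issues.

Coming from Rails, it was definitely an educational journey for me, but I quickly regretted not having used a static site that I could simply deploy through GitHub Pages. Deploying and maintaining new episodes lacked the simplicity I was after. For a while, I decided that I had bigger fish to fry and focused on producing new podcast material instead.

This past summer I got serious and started working on a Middleman site hosted via GitHub Pages. For season two of the show, I wanted something fresh: a new, simplified design, Markdown for publishing new episodes, and no fistfights with Heroku. Heaven! The thing was, I had 139 episodes lying around that needed to be imported and converted before they would work with Middleman.

For posts, Middleman uses .markdown files with so-called front matter for data, which basically replaces my database. Doing this transfer by hand is not an option for 139 episodes. That's what computation is for. I needed to figure out a way to parse the HTML of my old website, scrape the relevant content, and transfer it into the blog posts that I use for publishing new podcast episodes on Middleman.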
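As a rough idea of where scraped data ends up, the sketch below builds the text of a post with YAML front matter using only Ruby's standard library. It is not the solution presented later in this series; the field names `title` and `topics` and the body text are assumptions chosen for illustration.

```ruby
require "yaml"

# Hypothetical scraped data. The keys are assumptions, not the author's schema.
episode = {
  "title"  => "David Heinemeier Hansson",
  "topics" => "Rails community | Tone | Technical disagreements"
}

# Hash#to_yaml already emits the opening "---" line,
# so only the closing delimiter has to be appended.
front_matter = episode.to_yaml + "---\n\n"
post         = front_matter + "Episode notes would go here.\n"

puts post
```

Writing `post` to a .markdown file per episode would then be a matter of a `File.write` call inside a loop.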

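Therefore, over the next three articles, I am going to introduce you to the tools commonly used in Rubyland for such tasks. At the end, we will go over my solution to show you something practical as well.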

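Nokogiri

Even if you are completely new to Ruby/Rails, chances are good that you have already heard of this little gem. The name gets dropped often and sticks with you easily. I'm not sure many people know that nokogiri is Japanese for "saw".

It is a fitting name once you understand what the tool does. The creator of this gem is the lovely Tenderlove, Aaron Patterson. Nokogiri converts XML and HTML documents into a data structure, a tree data structure to be more precise. The tool is fast and offers a nice interface as well. Overall, it is a very potent library that takes care of a multitude of your HTML scraping needs.

You are not limited to parsing HTML with Nokogiri; XML is fair game as well. It gives you the option of either the XML Path Language or CSS selectors for traversing the documents you load. The XML Path Language, or XPath for short, is a query language.

It allows us to select nodes from XML documents. CSS selectors are most likely more familiar to beginners. As with the styles you write, CSS selectors make it fantastically easy to target specific sections of a page for extraction. You just need to tell Nokogiri what you are after by pointing it at a particular destination.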

Pages

We first need to fetch the actual page we are interested in. We specify which kind of Nokogiri document we want to parse, XML or HTML for example:

Nokogiri::XML

Nokogiri::HTML
some_scraper.rb
require "nokogiri"
require "open-uri"

# Parse a local XML file
page = Nokogiri::XML(File.open("some.xml"))

# Parse a local HTML file
page = Nokogiri::HTML(File.open("some.html"))

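Both Nokogiri::XML and Nokogiri::HTML accept either an IO object or a String object. What happens above is straightforward: each call opens the designated document (a local file here; for a page on the web, open-uri does the fetching) and loads its structure, XML or HTML, into a new Nokogiri document. XML is not something beginners have to deal with very often.

Therefore, I suggest we focus on HTML parsing for now. Why open-uri? This module from the Ruby Standard Library lets us fetch a site without much fuss. Since IO objects are fair game, we can easily make use of open-uri. On Ruby 3.0 and later, call URI.open(url) for this; a bare open(url) no longer fetches URLs.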

API

Let's put this into practice with a mini example:

at_css

some_podcast_scraper.rb
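require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

# URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs
page = Nokogiri::HTML(URI.open(url))

# at_css returns only the first node that matches the selector
header = page.at_css("h2.post-title")
title = header.text

puts "This is the raw title of the latest episode: #{header}"
puts "This is the title of the latest episode: #{title}"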

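What we did here represents all the steps that web scraping usually involves, just at a micro level. We decide which URL we need and which site to fetch, and we load it into a new Nokogiri document. Then we search that page and target a specific section.

Here I only wanted to know the title of the latest episode. Using the at_css method and the CSS selector h2.post-title, I only needed to target the extraction point. With this method we scrape only this single element, though. It also gives us the whole node, markup included, which most of the time is not what we need. Therefore we extract only the inner text of this node via the text method. For comparison, you can check the output for both the raw header and its text below.

Output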
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>

This is the title of the latest episode: David Heinemeier Hansson

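Although this example has very limited applications, it has all the ingredients and all the steps you need to understand. I think it's cool how simple this is. Because it might not be obvious from this example, I want to point out how powerful this tool can be. Let's see what else we can do with a Nokogiri script.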

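Attention!

If you are a beginner and unsure how to target the HTML you need, I recommend searching online for how to inspect the contents of websites in your browser. Basically, all major browsers make this process really easy these days.

In Chrome, you just need to right-click an element on the page and choose the Inspect option. This opens a small window at the bottom of your browser that shows you something like an X-ray of the site's DOM. It has many more options, and I recommend spending some time on Google to educate yourself. That is time spent wisely!

css

The css method gives us not just a single element of choice but every element on the page that matches the search criteria. Pretty neat and straightforward!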

some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

headers = page.css("h2.post-title")

headers.each do |header|
  puts "This is the raw title of the latest episode: #{header}"
end

headers.each do |header|
  puts "This is the title of the latest episode: #{header.text}"
end
Output
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw title of the latest episode: <h2 class="post-title"><a href="episodes/136/">James Edward Gray II</a></h2>

This is the title of the latest episode: David Heinemeier Hansson
This is the title of the latest episode: Zach Holman
This is the title of the latest episode: Joel Glovier
This is the title of the latest episode: João Ferreira
This is the title of the latest episode: Corwin Harrell
This is the title of the latest episode: Roberto Machado
This is the title of the latest episode: James Edward Gray II

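The only small difference in this example is that I iterate over the raw headers first. I then extract their inner text with the text method as well. Nokogiri stops at the end of the page and does not attempt to follow any pagination on its own.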

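Let's say we want a bit more information, such as the date and the subtitle of each episode. We can simply expand on the example above. It is a good idea to take this step by step anyway: get a small piece working, then add more complexity along the way.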

some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

articles = page.css("article.index-article")

articles.each do |article|
  header     = article.at_css("h2.post-title")
  date       = article.at_css(".post-date")
  subtitle   = article.at_css(".topic-list")

  puts "This is the raw header:    #{header}"
  puts "This is the raw date:      #{date}"
  puts "This is the raw subtitle:  #{subtitle}\n\n"
 
  puts "This is the text header:   #{header.text}"
  puts "This is the text date:     #{date.text}"
  puts "This is the text subtitle: #{subtitle.text}\n\n"
end
Output
This is the raw header: <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw date: <span class="post-date">Oct 18 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>

This is the text header: David Heinemeier Hansson
This is the text date: Oct 18 | 2016
This is the text subtitle: Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk

This is the raw header: <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw date: <span class="post-date">Oct 12 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment &amp; Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>

This is the text header: Zach Holman
This is the text date: Oct 12 | 2016
This is the text subtitle: Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication

This is the raw header: <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw date: <span class="post-date">Oct 10 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Digital Product Design | Product Design @ GitHub | Loving Design | Order &amp; Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers &amp; Open Source</h3>

This is the text header: Joel Glovier
This is the text date: Oct 10 | 2016
This is the text subtitle: Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source

This is the raw header: <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw date: <span class="post-date">Aug 26 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>

This is the text header: João Ferreira
This is the text date: Aug 26 | 2015
This is the text subtitle: Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values

This is the raw header: <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw date: <span class="post-date">Aug 06 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Q&amp;A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration &amp; pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>

This is the text header: Corwin Harrell
This is the text date: Aug 06 | 2015
This is the text subtitle: Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |

This is the raw header: <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw date: <span class="post-date">Aug 03 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD &amp; BDD | Startup mistakes | Culture of learning | Young entrepreneurs</h3>

This is the text header: Roberto Machado
This is the text date: Aug 03 | 2015
This is the text subtitle: CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs

This is the raw header: <h2 class="post-title"><a href="episodes/136/">James Edward Gray II</a></h2>
This is the raw date: <span class="post-date">Jul 30 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Screencasting | Less Code | Reading code | Getting unstuck | Rails’s codebase | CodeNewbie | Small examples | Future plans | PeepCode | Frequency &amp; pricing</h3>

This is the text header: James Edward Gray II
This is the text date: Jul 30 | 2015
This is the text subtitle: Screencasting | Less Code | Reading code | Getting unstuck | Rails’s codebase | CodeNewbie | Small examples | Future plans | PeepCode | Frequency & pricing

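At this point, we already have some data to play with. We can structure or butcher it any way we like. The output above is only meant to show what we have in a readable fashion. Of course, we can dig deeper into each of these values by running regular expressions over the result of the text method.

We will look into this in a lot more detail when we get to solving the actual podcast problem. It won't be a class on regexp, but you will see a bit more of it in action. No worries, though: not so much that it makes your brain bleed.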
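As a small taste, here is a sketch of what working on the extracted text could look like, using only the standard library. The two strings are copied from the output above (the subtitle is shortened); the regular expressions and the decision to turn the date into a Date object are assumptions for illustration, not the approach taken later in the series.

```ruby
require "date"

date_text     = "Oct 18 | 2016"
subtitle_text = "Rails community | Tone | Technical disagreements"

# Named captures pull the three parts out of the "Mon DD | YYYY" format.
match = date_text.match(/\A(?<month>\w{3}) (?<day>\d{2}) \| (?<year>\d{4})\z/)
date  = Date.strptime("#{match[:month]} #{match[:day]} #{match[:year]}", "%b %d %Y")

# Split the subtitle on the pipe separators, ignoring surrounding spaces.
topics = subtitle_text.split(/\s*\|\s*/)

puts date.iso8601
puts topics.inspect
```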

Attributes

What might also be handy at this stage is extracting the href of the individual episodes. It couldn't be any simpler.

some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

articles = page.css("article.index-article")

articles.each do |article|
  header      = article.at_css("h2.post-title")
  date        = article.at_css(".post-date")
  subtitle    = article.at_css(".topic-list")
  link        = article.at_css("h2.post-title a")
  podcast_url = "http://betweenscreens.fm/"

  puts "This is the raw header:    #{header}"
  puts "This is the raw date:      #{date}"
  puts "This is the raw subtitle:  #{subtitle}"
  puts "This is the raw link:      #{link}\n\n"

  puts "This is the text header:   #{header.text}"
  puts "This is the text date:     #{date.text}"
  puts "This is the text subtitle: #{subtitle.text}"
  puts "This is the href:          #{podcast_url}#{link[:href]}\n\n"
end

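The most important bits to pay attention to here are [:href] and podcast_url. If you tag [:attribute_name] onto a node, you can extract that attribute from the node you targeted. I abstracted things a little further in the script, but you can see more clearly how it works below.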

...

href = article.at_css("h2.post-title a")[:href]

...

To get a complete and useful URL, I saved the root domain in a variable and constructed the full URL for each episode.

...

podcast_url = "http://betweenscreens.fm/"

puts "This is the href: #{podcast_url}#{link[:href]}\n\n"

...
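String interpolation works fine here because the root URL ends with a slash and the href is relative. As an alternative sketch, the standard library's URI.join resolves the href against the base URL and copes with a leading slash as well. The second href below, with its leading slash, is a hypothetical variation and does not appear on the site.

```ruby
require "uri"

podcast_url = "http://betweenscreens.fm/"
href        = "episodes/143/"

# URI.join resolves a relative reference against the base URL.
full_url = URI.join(podcast_url, href).to_s

# Hypothetical variation: an href that starts with a slash resolves to the same URL.
rooted_url = URI.join(podcast_url, "/episodes/143/").to_s

puts full_url
puts rooted_url
```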

Let's take a quick look at the output:

Output
This is the raw header:   <h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
This is the raw date:     <span class="post-date">Oct 25 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
This is the raw link:     <a href="episodes/143/">Jason Long</a>

This is the text header: Jason Long
This is the text date:   Oct 25 | 2016
This is the text subtitle: Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas
This is the href:     http://betweenscreens.fm/episodes/143/

This is the raw header:   <h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
This is the raw date:     <span class="post-date">Oct 18 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>
This is the raw link:     <a href="episodes/142/">David Heinemeier Hansson</a>

This is the text header: David Heinemeier Hansson
This is the text date:   Oct 18 | 2016
This is the text subtitle: Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk
This is the href:     http://betweenscreens.fm/episodes/142/

This is the raw header:   <h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
This is the raw date:     <span class="post-date">Oct 12 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment &amp; Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>
This is the raw link:     <a href="episodes/141/">Zach Holman</a>

This is the text header: Zach Holman
This is the text date:   Oct 12 | 2016
This is the text subtitle: Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication
This is the href:     http://betweenscreens.fm/episodes/141/

This is the raw header:   <h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
This is the raw date:     <span class="post-date">Oct 10 | 2016</span>
This is the raw subtitle: <h3 class="topic-list">Digital Product Design | Product Design @ GitHub | Loving Design | Order &amp; Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers &amp; Open Source</h3>
This is the raw link:     <a href="episodes/140/">Joel Glovier</a>

This is the text header: Joel Glovier
This is the text date:   Oct 10 | 2016
This is the text subtitle: Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source
This is the href:     http://betweenscreens.fm/episodes/140/

This is the raw header:   <h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
This is the raw date:     <span class="post-date">Aug 26 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>
This is the raw link:     <a href="episodes/139/">João Ferreira</a>

This is the text header: João Ferreira
This is the text date:   Aug 26 | 2015
This is the text subtitle: Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values
This is the href:     http://betweenscreens.fm/episodes/139/

This is the raw header:   <h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
This is the raw date:     <span class="post-date">Aug 06 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">Q&amp;A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration &amp; pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>
This is the raw link:     <a href="episodes/138/">Corwin Harrell</a>

This is the text header: Corwin Harrell
This is the text date:   Aug 06 | 2015
This is the text subtitle: Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |
This is the href:     http://betweenscreens.fm/episodes/138/

This is the raw header:   <h2 class="post-title"><a href="episodes/137/">Roberto Machado</a></h2>
This is the raw date:     <span class="post-date">Aug 03 | 2015</span>
This is the raw subtitle: <h3 class="topic-list">CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD &amp; BDD | Startup mistakes | Culture of learning | Young entrepreneurs</h3>
This is the raw link:     <a href="episodes/137/">Roberto Machado</a>

This is the text header: Roberto Machado
This is the text date:   Aug 03 | 2015
This is the text subtitle: CEO @ Subvisual | RubyConf Portugal | Creators School | Consultancy | Company role models | Group Buddies | Portuguese startup | Rebranding | Technologies used | JS frameworks | TDD & BDD | Startup mistakes | Culture of learning | Young entrepreneurs
This is the href:     http://betweenscreens.fm/episodes/137/

Neat, isn't it? You can do the same thing to extract the [:class] of a node:

require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

body_classes = page.at_css("body")[:class]

If the node has more than one class, you get all of them back in a single space-separated string.

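Node Navigation

  • parent
  • children
  • previous_sibling
  • next_sibling

We are used to dealing with tree structures in CSS or even jQuery. It would be a pain if Nokogiri did not offer a handy API for moving around in such trees.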

some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

header = page.at_css("h2.post-title")
header_children = page.at_css("h2.post-title").children
header_parent = page.at_css("h2.post-title").parent
header_prev_sibling = page.at_css("h2.post-title").previous_sibling

puts "#{header}\n\n"
puts "#{header_children}\n\n"
puts "#{header_parent}\n\n"
puts "#{header_prev_sibling}\n\n"
Output
#header
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>

#header_children
<a href="episodes/143/">Jason Long</a>

#header_parent
<article class="index-article">
  <span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
    <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
    <div class="soundcloud-player-small">  
    </div>
</article>

#header_previous_sibling
<span class="post-date">Oct 25 | 2016</span>

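As you can see for yourself, this is some pretty powerful stuff, especially when you look at what .parent was able to collect in one go. Instead of defining a bunch of nodes by hand, you can collect them wholesale.

You can even chain these methods for more involved traversals. You can make this as complicated as you like, of course, but I would caution you to keep things simple. It can quickly get a little unwieldy and hard to understand. Remember: "Keep it simple, stupid!"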

...

header_parent_parent = page.at_css("h2.post-title").parent.parent
header_prev_sibling_parent_children = page.at_css("h2.post-title").previous_sibling.parent.children

...
some_scraper.rb
require 'nokogiri'
require "open-uri"

url = 'http://betweenscreens.fm/'

page = Nokogiri::HTML(URI.open(url))

header = page.at_css("h2.post-title")
header_prev_sibling_children = page.at_css("h2.post-title").previous_sibling.children
header_parent_parent = page.at_css("h2.post-title").parent.parent
header_prev_sibling_parent = page.at_css("h2.post-title").previous_sibling.parent
header_prev_sibling_parent_children = page.at_css("h2.post-title").previous_sibling.parent.children

puts "#{header}\n\n"
puts "#{header_prev_sibling_children}\n\n"
puts "#{header_parent_parent}\n\n"
puts "#{header_prev_sibling_parent}\n\n"
puts "#{header_prev_sibling_parent_children}\n\n"
Output
#header
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>

#header_previous_sibling_children
Oct 25 | 2016

#header_parent_parent
<li>
  <article class="index-article">
  <span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
    <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
    <div class="soundcloud-player-small">  
    </div>
  </article>
</li>

#header_previous_sibling_parent
<article class="index-article">
  <span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
    <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
    <div class="soundcloud-player-small">  
    </div>
</article>

#header_previous_sibling_parent_children
  <span class="post-date">Oct 25 | 2016</span><h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
    <h3 class="topic-list">Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
    <div class="soundcloud-player-small">  
    </div>

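Final Thoughts

Nokogiri is not a huge library, but it has a lot to offer. I recommend playing with what you have learned so far and expanding your knowledge through its documentation when you hit a wall. But don't get yourself into trouble!

This little intro should give you a good idea of what you can do and how it works. I hope you will explore a little more on your own and have some fun with it. As you will find out for yourself, it is a rich tool that keeps on giving.

Translated from: https://code.tutsplus.com/articles/building-your-first-web-scraper-01--cms-27559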
