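Building Your First Web Scraper, Part 2

In this tutorial, you will learn how to use Mechanize to click links, fill out forms, and upload files. You will also learn how to slice Mechanize page objects and how to automate a Google search and save its results.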

Topics

  • Single Page vs. Pagination
  • Mechanize
  • Agent
  • Nokogiri Methods
  • Links
  • Click
  • Forms

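Single Page vs. Pagination

So far, we have spent some time figuring out how to scrape a single page with Nokogiri. That was a good basis for taking things one step further and learning how to extract content from multiple pages.

After all, the problem we are trying to solve involves getting content from more than 140 episodes, which is more than reasonably fits on a single web page. We have to deal with pagination and need to figure out how to follow the rabbit hole.

This is where Nokogiri stops and another useful gem called Mechanize comes into play.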

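Mechanize

Mechanize is another powerful tool with a lot of goodies to offer. Essentially, it lets you automate your interactions with the websites you need to extract content from. In that sense, it reminds me a bit of what you may know from testing with Capybara.

Don't get me wrong, playing with Nokogiri on a single page is great in itself, but for spicier data extraction jobs we need a bit more horsepower. Essentially, we can crawl through as many pages as we need and interact with their elements, imitating and automating human behavior. Pretty powerful stuff!

With this gem, you can follow links, fill out form fields, and submit that data. It even handles cookies for you. This means you can also imitate a user logging in to a private session and fetch content from sites that only you have access to.

You fill in the login form with your credentials and tell Mechanize how to follow up. Since you can click links and submit forms, there is very little you can't do with this tool. It has a close relationship with Nokogiri and also depends on it. Aaron Patterson is again one of the authors of this lovely gem.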

Instantiating a Mechanize Agent

Before we can start mechanizing things, we need to instantiate a Mechanize agent.

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

The agent will be used to fetch pages, similar to what we did with Nokogiri.

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

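What happens here is that the Mechanize agent fetched the podcast page along with its cookies.

Extracting Page Content

Now we have a page that is ready for extraction. Before we do that, I recommend taking a look under the hood using the inspect method.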

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

puts page.inspect

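The output is quite substantial. Take a look for yourself and see what a Mechanize::Page object consists of. Here you can see all of the page's attributes.

To me, this is a very handy object for slicing out the data you want to extract.

Output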

#<Mechanize::Page
 {url #<URI::HTTP http://betweenscreens.fm/>}
 {meta_refresh}
 {title "Between | Screens "}
 {iframes
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290328784&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290126141&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/289018386&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287425105&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287105342&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/221003494&color=ff0000&auto...>
  #<Mechanize::Page::Frame
   nil
   "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/218101809&color=ff0000&auto...>}
 {frames}
 {links
  #<Mechanize::Page::Link "Logo cube" "/">
  #<Mechanize::Page::Link "fork!" "https://github.com/vis-kid/betweenscreens">
  #<Mechanize::Page::Link "about" "pages/about/">
  #<Mechanize::Page::Link "design" "design/">
  #<Mechanize::Page::Link "code" "code/">
  #<Mechanize::Page::Link "Randy J. Hunt" "episodes/144/">
  #<Mechanize::Page::Link "Jason Long" "episodes/143/">
  #<Mechanize::Page::Link "David Heinemeier Hansson" "episodes/142/">
  #<Mechanize::Page::Link "Zach Holman" "episodes/141/">
  #<Mechanize::Page::Link "Joel Glovier" "episodes/140/">
  #<Mechanize::Page::Link "João Ferreira" "episodes/139/">
  #<Mechanize::Page::Link "Corwin Harrell" "episodes/138/">
  #<Mechanize::Page::Link "Older Stuff »" "page/2/">
  #<Mechanize::Page::Link "Exercise" "/tags/exercise/">
  #<Mechanize::Page::Link "Company benefits" "/tags/company-benefits/">
  #<Mechanize::Page::Link "Tmux" "/tags/tmux/">
  #<Mechanize::Page::Link "FileTask" "/tags/filetask/">
  #<Mechanize::Page::Link "Decision making" "/tags/decision-making/">
  #<Mechanize::Page::Link "Favorite feature" "/tags/favorite-feature/">
  #<Mechanize::Page::Link "Working out" "/tags/working-out/">
  #<Mechanize::Page::Link "Scott Savarie" "/tags/scott-savarie/">
  #<Mechanize::Page::Link "Titles" "/tags/titles/">
  #<Mechanize::Page::Link "Erik Spiekermann" "/tags/erik-spiekermann/">
  #<Mechanize::Page::Link "Newbie mistakes" "/tags/newbie-mistakes/">
  #<Mechanize::Page::Link "Playbook" "/tags/playbook/">
  #<Mechanize::Page::Link "Delegation" "/tags/delegation/">
  #<Mechanize::Page::Link "Heat maps" "/tags/heat-maps/">
  #<Mechanize::Page::Link "Europe" "/tags/europe/">
  #<Mechanize::Page::Link "Sizing type" "/tags/sizing-type/">
  #<Mechanize::Page::Link "Focus" "/tags/focus/">
  #<Mechanize::Page::Link "Virtual assistants" "/tags/virtual-assistants/">
  #<Mechanize::Page::Link "Writing" "/tags/writing/">
  #<Mechanize::Page::Link "Hacking" "/tags/hacking/">
  #<Mechanize::Page::Link "Joel Glovier" "/tags/joel-glovier/">
  #<Mechanize::Page::Link "Corwin Harrell" "/tags/corwin-harrell/">
  #<Mechanize::Page::Link "Mario C. Delgado" "/tags/mario-c-delgado/">
  #<Mechanize::Page::Link "Tom Dale" "/tags/tom-dale/">
  #<Mechanize::Page::Link "Obie Fernandez" "/tags/obie-fernandez/">
  #<Mechanize::Page::Link "Chad Pytel" "/tags/chad-pytel/">
  #<Mechanize::Page::Link "Zach Holman" "/tags/zach-holman/">
  #<Mechanize::Page::Link "Max Luster" "/tags/max-luster/">
  #<Mechanize::Page::Link "Kyle Fiedler" "/tags/kyle-fiedler/">
  #<Mechanize::Page::Link "Roberto Machado" "/tags/roberto-machado/">}
 {forms}>

If you want to look at the HTML page itself, you can tack on the body or content method.

some_scraper.rb

...

print page.body

...

Output

<!doctype html>

<html>
  <head>
    <meta charset="utf-8" />
    <meta http-equiv='X-UA-Compatible' content='IE=edge;chrome=1' />
    <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
    <meta name="viewport" content="initial-scale=1">
    <title>Between | Screens </title>
    <link rel="alternate" type="application/atom+xml" title="Atom Feed" href="/feed.xml" />
    <link href="stylesheets/all-11b45acc.css" rel="stylesheet" />
    <script src="javascripts/all-4c20da82.js"></script>
  </head>

  <body>
    <header>
      <div id="logo">
        <a href="/"><img src="images/Between_Screens_Logo_Cube_Up-539d6997.svg" alt="Logo cube" /></a>
      </div>
      <nav class="navigation">
        <ul class="nav-list"> 
          <li><a href="https://github.com/vis-kid/betweenscreens">fork!</a></li>
          <li><a href="pages/about/">about</a></li>
          <li><a href="design/">design</a></li>
          <li><a href="code/">code</a></li>
        </ul>
      </nav>
    </header>

    <div id="main" role="main">
      <div class='posts'>
        <ul>
          <li>
            <article class="index-article">
              <span class='post-date'>Oct 27 | 2016</span><h2 class='post-title'><a href="episodes/144/">Randy J. Hunt</a></h2>
              <h3 class='topic-list'>Organizing teams | Diversity | Desires | Pizza rule | Effective over clever | Novel solutions | Straightforwardness | Research | Coffeeshop test | Small changes | Reducing errors | Granular diffs</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290328784&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 25 | 2016</span><h2 class='post-title'><a href="episodes/143/">Jason Long</a></h2>
              <h3 class='topic-list'>Open source | Empathy | Lower barriers | Learning tool | Design contributions | Git website | Branding | GitHub | Neovim | Tmux | Design love | Knowing audiences | Showing work | Dribbble | Progressions | Ideas</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290126141&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 18 | 2016</span><h2 class='post-title'><a href="episodes/142/">David Heinemeier Hansson</a></h2>
              <h3 class='topic-list'>Rails community | Tone | Technical disagreements | Community policing | Ungratefulness | No assholes allowed | Basecamp | Open source persona | Aspirations | Guarding motivations | Dealing with audiences | Pressure | Honesty | Diverse opinions | Small talk</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/289018386&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 12 | 2016</span><h2 class='post-title'><a href="episodes/141/">Zach Holman</a></h2>
              <h3 class='topic-list'>Getting Fired | Taboo | Transparency | Different Perspectives | Timing | Growth Stages | Employment & Dating | Managers | At-will Employment | Tech Industry | Europe | Low hanging Fruits | Performance Improvement Plans | Meeting Goals | Surprise Firings | Firing Fast | Mistakes | Company Culture | Communication</h3>
              <div class='soundcloud-player-small'>  
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287425105&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Oct 10 | 2016</span><h2 class='post-title'><a href="episodes/140/">Joel Glovier</a></h2>
              <h3 class='topic-list'>Digital Product Design | Product Design @ GitHub | Loving Design | Order & Chaos | Drawing | Web Design | HospitalRun | Diversity | Startup Culture | Improving Lives | CURE International | Ember | Offline First | Hospital Information System | Designers & Open Source</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                  height="166"
                  scrolling="no"
                  frameborder="no"
                  src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287105342&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Aug 26 | 2015</span><h2 class='post-title'><a href="episodes/139/">João Ferreira</a></h2>
              <h3 class='topic-list'>Masters @ Work | Subvisual | Deadlines | Design personality | Design problems | Team | Pushing envelopes | Delightful experiences | Perfecting details | Company values</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/221003494&color=ff0000&...>
              </div>
            </article>
          </li>

          <li>
            <article class="index-article">
              <span class='post-date'>Aug 06 | 2015</span><h2 class='post-title'><a href="episodes/138/">Corwin Harrell</a></h2>
              <h3 class='topic-list'>Q&A | 01 | University | Graphic design | Design setup | Sublime | Atom | thoughtbot | Working location | Collaboration & pairing | Vim advocates | Daily routine | Standups | Clients | Coffee walks | Investment Fridays |</h3>
              <div class='soundcloud-player-small'>
                <iframe width="100%"
                height="166"
                scrolling="no"
                frameborder="no"
                src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/218101809&color=ff0000&...>
              </div>
            </article>
          </li>
        </ul>
      </div>

      <section>
        <div class='pagination-link'><a href="page/2/">Older Stuff »</a></div>
      </section>
    </div>

    <footer>
      <div class='footer-tags'>
        <h3>Random Tags</h3>
        <ul class='random-tag-list'>
          <li><a href="/tags/exercise/">Exercise</a></li>
          <li><a href="/tags/company-benefits/">Company benefits</a></li>
          <li><a href="/tags/tmux/">Tmux</a></li>
          <li><a href="/tags/filetask/">FileTask</a></li>
          <li><a href="/tags/decision-making/">Decision making</a></li>
          <li><a href="/tags/favorite-feature/">Favorite feature</a></li>
          <li><a href="/tags/working-out/">Working out</a></li>
          <li><a href="/tags/scott-savarie/">Scott Savarie</a></li>
          <li><a href="/tags/titles/">Titles</a></li>
          <li><a href="/tags/erik-spiekermann/">Erik Spiekermann</a></li>
          <li><a href="/tags/newbie-mistakes/">Newbie mistakes</a></li>
          <li><a href="/tags/playbook/">Playbook</a></li>
          <li><a href="/tags/delegation/">Delegation</a></li>
          <li><a href="/tags/heat-maps/">Heat maps</a></li>
          <li><a href="/tags/europe/">Europe</a></li>
          <li><a href="/tags/sizing-type/">Sizing type</a></li>
          <li><a href="/tags/focus/">Focus</a></li>
          <li><a href="/tags/virtual-assistants/">Virtual assistants</a></li>
          <li><a href="/tags/writing/">Writing</a></li>
          <li><a href="/tags/hacking/">Hacking</a></li>
        </ul>
      </div>

      <div class='recent-posts'>
        <h3>Random Interviewees</h3>
        <ul>
          <li><a href="/tags/joel-glovier/">Joel Glovier</a></li>
          <li><a href="/tags/corwin-harrell/">Corwin Harrell</a></li>
          <li><a href="/tags/mario-c-delgado/">Mario C. Delgado</a></li>
          <li><a href="/tags/tom-dale/">Tom Dale</a></li>
          <li><a href="/tags/obie-fernandez/">Obie Fernandez</a></li>
          <li><a href="/tags/chad-pytel/">Chad Pytel</a></li>
          <li><a href="/tags/zach-holman/">Zach Holman</a></li>
          <li><a href="/tags/max-luster/">Max Luster</a></li>
          <li><a href="/tags/kyle-fiedler/">Kyle Fiedler</a></li>
          <li><a href="/tags/roberto-machado/">Roberto Machado</a></li>
        </ul>
      </div>
    </footer>
  </body>
</html>

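Since this podcast page has only a handful of different elements, here is the Mechanize::Page returned from github.com. It has a bit more to look at. I think it is important to get a feel for this.

Output for github.com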

#<Mechanize::Page
 {url #<URI::HTTPS https://github.com/>}
 {meta_refresh}
 {title "How people build software · GitHub"}
 {iframes}
 {frames}
 {links
  #<Mechanize::Page::Link "Skip to content" "#start-of-content">
  #https://github.com/">
  #<Mechanize::Page::Link "\n          Personal\n" "/personal">
  #<Mechanize::Page::Link "\n          Open source\n" "/open-source">
  #<Mechanize::Page::Link "\n          Business\n" "/business">
  #<Mechanize::Page::Link "\n          Explore\n" "/explore">
  #<Mechanize::Page::Link "Sign up" "/join?source=header-home">
  #<Mechanize::Page::Link "Sign in" "/login">
  #<Mechanize::Page::Link "Pricing" "/pricing">
  #<Mechanize::Page::Link "Blog" "/blog">
  #https://help.github.com">
  #https://github.com/search">
  #https://help.github.com/terms">
  #https://help.github.com/privacy">
  #<Mechanize::Page::Link "Sign up for GitHub" "/join?source=button-home">
  #<Mechanize::Page::Link
   "\n      \n        \n      \n      \n        A whole new Universe\n        \n          Learn about the exciting features and announcements revealed at this year's GitHub Universe conference.\n        \n      \n    "
   "/universe-2016">
  #<Mechanize::Page::Link "Individuals " "/personal">
  #<Mechanize::Page::Link "Communities " "/open-source">
  #<Mechanize::Page::Link "Businesses " "/business">
  #<Mechanize::Page::Link "NASA" "//github.com/nasa">
  #<Mechanize::Page::Link "Sign up for GitHub" "/join?source=button-home">
  #https://github.com/contact">
  #https://developer.github.com">
  #https://training.github.com">
  #https://shop.github.com">
  #https://github.com/blog">
  #https://github.com/about">
  #https://github.com">
  #https://github.com/site/terms">
  #https://github.com/site/privacy">
  #https://github.com/security">
  #https://status.github.com/">
  #https://help.github.com">
  #<Mechanize::Page::Link "Reload" "">
  #<Mechanize::Page::Link "Reload" "">}
 {forms
  #<Mechanize::Form
   {name nil}
   {method "GET"}
   {action "/search"}
   {fields
    [hidden:0x3feb90f8297c type: hidden name: utf8 value: ✓]
    [text:0x3feb90f827d8 type: text name: q value: ]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons}>
  #<Mechanize::Form
   {name nil}
   {method "POST"}
   {action "/join"}
   {fields
    [hidden:0x3feb90f7be38 type: hidden name: utf8 value: ✓]
    [hidden:0x3feb90f7bbb8 type: hidden name: authenticity_token value: vjRATKj7smXreq6Lt02r+MzW+ewWoi+fRzQXPedFAlOZgwzxQ0dZnChirhDfd7vyWZZZBO+ZFydLNedjIEDsrQ==]
    [text:0x3feb90f7b9d8 type: text name: user[login] value: ]
    [text:0x3feb90f7b7f8 type: text name: user[email] value: ]
    [field:0x3feb90f7b654 type: password name: user[password] value: ]
    [hidden:0x3feb90f7b474 type: hidden name: source value: form-home]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons [button:0x3feb90f7a038 type: submit name:  value: ]}>}>

Back to the podcast: you can also look at things like the encodings, the HTTP response code, the URI, or the response headers.

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

puts 'Encodings'
puts page.encodings
puts 'Response Headers'
puts page.response
puts 'HTTP response code'
puts page.code
puts 'URI'
puts page.uri

Output

Encodings
EUC-JP
utf-8
utf-8

Response Headers
{"server"=>"GitHub.com", "date"=>"Sat, 29 Oct 2016 17:56:00 GMT", "content-type"=>"text/html; charset=utf-8", "transfer-encoding"=>"chunked", "last-modified"=>"Fri, 28 Oct 2016 01:48:56 GMT", "access-control-allow-origin"=>"*", "expires"=>"Sat, 29 Oct 2016 18:06:00 GMT", "cache-control"=>"max-age=600", "content-encoding"=>"gzip", "x-github-request-id"=>"501C936D:C723:1631523C:5814E2B0"}

HTTP response code
200

URI
http://betweenscreens.fm/
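
A quick note before moving on: if you want to act on these values in your own script, their shapes matter. The sketch below works on plain Ruby stand-ins copied from the output above rather than on a live page. It assumes, without this article confirming it, that the status code arrives as a String and that the content-type value looks like the one printed; parse_content_type and success? are helper names made up for this example.

```ruby
# Split a Content-Type value such as "text/html; charset=utf-8"
# into its media type and its charset (nil when none is given).
def parse_content_type(value)
  media_type, *params = value.split(";").map(&:strip)
  charset = params.map { |param| param[/\Acharset=(.+)\z/i, 1] }.compact.first
  [media_type, charset]
end

# True for any 2xx status, whether the code is a String or an Integer.
def success?(code)
  (200..299).cover?(code.to_i)
end

# Stand-ins copied from the output above (assumed shapes, see the note).
code    = "200"
headers = { "content-type" => "text/html; charset=utf-8" }

media_type, charset = parse_content_type(headers.fetch("content-type"))
puts "ok: #{success?(code)}, type: #{media_type}, charset: #{charset}"
```

Converting with to_i sidesteps the classic mistake of comparing "200" with 200, which are never equal in Ruby.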

There is a lot more if you want to dig deeper. I'll leave it at that.

Nokogiri Methods

  • at
  • search

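Mechanize uses Nokogiri to scrape data from pages. You can apply what you learned about Nokogiri in the first article and use it on Mechanize pages as well. That means you generally use Mechanize to navigate pages and Nokogiri methods for your scraping needs.

For example, if you want to search for a single object, you can use at, while search returns all objects that match a selector on a particular page. In other words, these methods work on both Nokogiri document objects and Mechanize page objects.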

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

first_title = page.at('h2.post-title')

all_titles = page.search('h2.post-title')

all_titles.each do |title|
  puts title
end

puts " * "*33

puts first_title

Output

<h2 class="post-title"><a href="episodes/144/">Randy J. Hunt</a></h2>
<h2 class="post-title"><a href="episodes/143/">Jason Long</a></h2>
<h2 class="post-title"><a href="episodes/142/">David Heinemeier Hansson</a></h2>
<h2 class="post-title"><a href="episodes/141/">Zach Holman</a></h2>
<h2 class="post-title"><a href="episodes/140/">Joel Glovier</a></h2>
<h2 class="post-title"><a href="episodes/139/">João Ferreira</a></h2>
<h2 class="post-title"><a href="episodes/138/">Corwin Harrell</a></h2>
 *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  * 
<h2 class="post-title"><a href="episodes/144/">Randy J. Hunt</a></h2>

Links

  • links
  • link_with
  • links_with

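We can also navigate the whole site to our liking. Probably the most important part of Mechanize is its ability to let you play with links. Otherwise, you could pretty much stick with Nokogiri on its own. Let's take a look at what we get back if we ask a page for its links.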

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

puts "#{page.links}"

Output

[#<Mechanize::Page::Link "Logo cube" "/">
, #<Mechanize::Page::Link "fork!" "https://github.com/vis-kid/betweenscreens">
, #<Mechanize::Page::Link "about" "pages/about/">
, #<Mechanize::Page::Link "design" "design/">
, #<Mechanize::Page::Link "code" "code/">
, #<Mechanize::Page::Link "Randy J. Hunt" "episodes/144/">
, #<Mechanize::Page::Link "Jason Long" "episodes/143/">
, #<Mechanize::Page::Link "David Heinemeier Hansson" "episodes/142/">
, #<Mechanize::Page::Link "Zach Holman" "episodes/141/">
, #<Mechanize::Page::Link "Joel Glovier" "episodes/140/">
, #<Mechanize::Page::Link "João Ferreira" "episodes/139/">
, #<Mechanize::Page::Link "Corwin Harrell" "episodes/138/">
, #<Mechanize::Page::Link "Older Stuff »" "page/2/">
, #<Mechanize::Page::Link "Exercise" "/tags/exercise/">
, #<Mechanize::Page::Link "Company benefits" "/tags/company-benefits/">
, #<Mechanize::Page::Link "Tmux" "/tags/tmux/">
, #<Mechanize::Page::Link "FileTask" "/tags/filetask/">
, #<Mechanize::Page::Link "Decision making" "/tags/decision-making/">
, #<Mechanize::Page::Link "Favorite feature" "/tags/favorite-feature/">
, #<Mechanize::Page::Link "Working out" "/tags/working-out/">
, #<Mechanize::Page::Link "Scott Savarie" "/tags/scott-savarie/">
, #<Mechanize::Page::Link "Titles" "/tags/titles/">
, #<Mechanize::Page::Link "Erik Spiekermann" "/tags/erik-spiekermann/">
, #<Mechanize::Page::Link "Newbie mistakes" "/tags/newbie-mistakes/">
, #<Mechanize::Page::Link "Playbook" "/tags/playbook/">
, #<Mechanize::Page::Link "Delegation" "/tags/delegation/">
, #<Mechanize::Page::Link "Heat maps" "/tags/heat-maps/">
, #<Mechanize::Page::Link "Europe" "/tags/europe/">
, #<Mechanize::Page::Link "Sizing type" "/tags/sizing-type/">
, #<Mechanize::Page::Link "Focus" "/tags/focus/">
, #<Mechanize::Page::Link "Virtual assistants" "/tags/virtual-assistants/">
, #<Mechanize::Page::Link "Writing" "/tags/writing/">
, #<Mechanize::Page::Link "Hacking" "/tags/hacking/">
, #<Mechanize::Page::Link "Joel Glovier" "/tags/joel-glovier/">
, #<Mechanize::Page::Link "Corwin Harrell" "/tags/corwin-harrell/">
, #<Mechanize::Page::Link "Mario C. Delgado" "/tags/mario-c-delgado/">
, #<Mechanize::Page::Link "Tom Dale" "/tags/tom-dale/">
, #<Mechanize::Page::Link "Obie Fernandez" "/tags/obie-fernandez/">
, #<Mechanize::Page::Link "Chad Pytel" "/tags/chad-pytel/">
, #<Mechanize::Page::Link "Zach Holman" "/tags/zach-holman/">
, #<Mechanize::Page::Link "Max Luster" "/tags/max-luster/">
, #<Mechanize::Page::Link "Kyle Fiedler" "/tags/kyle-fiedler/">
, #<Mechanize::Page::Link "Roberto Machado" "/tags/roberto-machado/">
]

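Holy moly, let's break this down. Since we haven't told Mechanize to look elsewhere, we get an array of links from just the first page. Mechanize goes through the page from top to bottom and returns the links in the order it finds them. I created a little image with green pointers that point to the various links you can see in the output.

By the way, this already shows you the final result of my podcast redesign. I think this version is a little better for demonstration purposes. You also get a glimpse of what the final result looks like and why I needed to scrape my old Sinatra site.

Screenshot: podcast links

As always, we can also extract just the text from them.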

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

page.links.each do |link|
  puts link.text
end

Output

Logo cube
fork!
about
design
code
Randy J. Hunt
Jason Long
David Heinemeier Hansson
Zach Holman
Joel Glovier
João Ferreira
Corwin Harrell
Older Stuff »
Exercise
Company benefits
Tmux
FileTask
Decision making
Favorite feature
Working out
Scott Savarie
Titles
Erik Spiekermann
Newbie mistakes
Playbook
Delegation
Heat maps
Europe
Sizing type
Focus
Virtual assistants
Writing
Hacking
Joel Glovier
Corwin Harrell
Mario C. Delgado
Tom Dale
Obie Fernandez
Chad Pytel
Zach Holman
Max Luster
Kyle Fiedler
Roberto Machado

Grabbing all of these links in bulk can be very useful, or simply tedious. Luckily for us, we have a few tools to fine-tune what we need.

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_link = agent.page.links.find { |link| link.text == 'Focus' }

puts focus_link

Output

Focus

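Boom! Now we are getting somewhere! We can zoom in on specific links like this. We can also use nicer APIs such as links_with and link_with to target links that match specific criteria, such as their text. In addition, if we have multiple Focus links, we can use square brackets [] to zoom in on a particular one on the page.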

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_link = agent.page.links_with(:text => 'Focus')[2]

puts focus_link

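If you are not after the link text but the link itself, you only need to specify a particular href to find that link. Mechanize won't get in your way. Instead of text, you pass href to the method.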

some_scraper.rb
episode_link = agent.page.link_with(href: '/episodes/95/')

episode_links = agent.page.links_with(href: '/episodes/95/')

If you only want to find the first link with the desired text, you can also use this syntax. It is pretty handy and more readable.

some_scraper.rb
focus_link = agent.page.link_with(:text => 'Focus')
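
If the difference between the singular and plural versions feels fuzzy, it may help to think of them in terms of Ruby's own select and first. The following is a rough mental model in plain Ruby, not Mechanize's actual implementation: Link is a simple stand-in struct, links_matching and link_matching are names invented for this example, and the sample links are taken from the output earlier in this article.

```ruby
# A tiny stand-in for a page link: just its text and href.
Link = Struct.new(:text, :href)

# Hypothetical helper: every link whose attributes match all criteria.
# `===` lets a String match exactly and a Regexp match by pattern.
def links_matching(links, criteria)
  links.select do |link|
    criteria.all? { |attribute, expected| expected === link[attribute] }
  end
end

# Hypothetical helper: only the first match, or nil if there is none.
def link_matching(links, criteria)
  links_matching(links, criteria).first
end

# A few links taken from the output earlier in this article.
links = [
  Link.new("Joel Glovier", "episodes/140/"),
  Link.new("Older Stuff »", "page/2/"),
  Link.new("Focus", "/tags/focus/"),
  Link.new("Joel Glovier", "/tags/joel-glovier/")
]

puts links_matching(links, text: "Joel Glovier").map(&:href).inspect
puts link_matching(links, text: "Joel Glovier").href
puts links_matching(links, href: %r{\Aepisodes/\d+/\z}).map(&:text).inspect
```

The first call finds both Joel Glovier links (the episode and the tag), the second returns only the first of them, and the third keeps just the hrefs that look like episode pages.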

How about following that link to see what hides behind this Focus link? Let's click it!

Click

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

focus_links = agent.page.links.find { |link| link.text == 'Focus' }.click.links

puts focus_links

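This gives us a long list of links, just like before. See how easy it is to chain .click.links. Mechanize clicks the link for you and follows it to the new destination. Since we also asked for a list of links, we get all the links Mechanize can find on that new page.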
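
Notice that the hrefs in the link listings earlier come in different shapes: episodes/144/ is relative to the current page, while /tags/focus/ is relative to the site root. Following a link means resolving such an href against the URL of the page it was found on. The sketch below shows that resolution step with Ruby's standard URI library. It illustrates the general URL rule rather than Mechanize's internals, and resolve_href is a helper name made up for this example.

```ruby
require "uri"

# Hypothetical helper: the absolute URL a click on `href` would lead to
# when the link was found on `page_url`.
def resolve_href(page_url, href)
  URI.join(page_url, href).to_s
end

home = "http://betweenscreens.fm/"

puts resolve_href(home, "episodes/144/")  # relative to the current page
puts resolve_href(home, "/tags/focus/")   # relative to the site root

# The same relative href resolves differently from a nested page.
puts resolve_href("http://betweenscreens.fm/page/2/", "episodes/144/")
```

The last line is the one to watch out for when you collect hrefs on one page and try to use them from another: a relative href only makes sense together with the page it came from.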

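Let's say I have two text links for the same interviewee, one linking to a tag and one to a recent episode, and I want to get the links from each of those pages.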

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

podcast_url = "http://betweenscreens.fm/"

page = agent.get(podcast_url)

links = agent.page.links_with(text: "Some interviewee")

links.each do |link|
  puts link.click.links
end

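This gives you a list of links for both pages. You iterate over each link for that interviewee, Mechanize follows each clicked link, and it collects the links it finds on the new pages for you. Below are a few examples of combinations you can compare to get started.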

some_scraper.rb
agent.page.links.find { |l| l.text == 'Focus' }
agent.page.links.find { |l| l.text == 'Focus' }.click
agent.page.link_with(text: 'Focus')
agent.page.links_with(text: 'Focus')[0]
agent.page.links_with(text: 'Focus')[1].click
agent.page.links_with(text: 'Focus')[2].click.links
agent.page.link_with(href: '/some-href')
agent.page.link_with(href: '/some-href').click
agent.page.links_with(href: '/some-href')
agent.page.links_with(href: '/some-href').map(&:click) # links_with returns an array, so click each link
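
Clicking links is also what makes pagination workable: the front page ends in an "Older Stuff »" link, and each following page presumably has one too until the archive runs out. The sketch below shows only the shape of such a loop. To stay runnable without a network connection, it walks a small in-memory stand-in for a site. FAKE_SITE, its page contents, the example.test URLs, and collect_titles are all invented for illustration; in a real scraper, each lookup would be a page fetch followed by something like link_with.

```ruby
require "set"
require "uri"

# Invented stand-in for a site: URL => titles on that page plus the
# href of its "older" link (nil on the last page).
FAKE_SITE = {
  "http://example.test/"        => { titles: ["Episode C"], older: "page/2/" },
  "http://example.test/page/2/" => { titles: ["Episode B"], older: "/page/3/" },
  "http://example.test/page/3/" => { titles: ["Episode A"], older: nil }
}.freeze

# Follow the "older" link from page to page and collect every title.
def collect_titles(start_url, site)
  titles  = []
  visited = Set.new
  url     = start_url

  # Stop when there is no older link or when a page shows up twice.
  while url && visited.add?(url)
    page = site.fetch(url)
    titles.concat(page[:titles])
    url = page[:older] && URI.join(url, page[:older]).to_s
  end

  titles
end

puts collect_titles("http://example.test/", FAKE_SITE).inspect
```

Keeping track of visited URLs is a cheap safeguard: if a page ever links back to one you have already seen, the loop ends instead of running forever.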

Forms

  • submit
  • field_with
  • checkbox_with
  • radiobuttons_with
  • file_uploads

Let's take a look at forms!

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

google_url = "http://google.com/"

page = agent.get(google_url)

forms = page.forms

puts forms.inspect

Output

[#<Mechanize::Form
# Attention!!
 {name "f"}
# Attention!!
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3fea91d2eb08 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3fea91d2e964 type: hidden name: hl value: es]
  [hidden:0x3fea91d2e7e8 type: hidden name: source value: hp]
  [hidden:0x3fea91d2e5f4 type: hidden name: biw value: ]
  [hidden:0x3fea91d2e428 type: hidden name: bih value: ]
# Attention!!
  [text:0x3fea91d2e248 type:  name: q value: ]
# Attention!!
  [hidden:0x3fea91d2bcb4 type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3fea91d2e0f4 type: submit name: btnG value: Buscar con Google]
  [submit:0x3fea91d2be80 type: submit name: btnI value: Voy a tener suerte]}>
]

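Because we used the forms method, we get an array back, even when only a single form is returned. Now that we know the form is named "f", we can use the singular form method to zoom in on it.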

...

{name "f"}

...

some_scraper.rb
require 'mechanize'

agent = Mechanize.new

google_url = "http://google.com/"

page = agent.get(google_url)

search_form = page.form('f')

puts search_form.inspect

With form('f'), we singled out the specific form we want to work with. As a result, we don't get an array back.

Output

#<Mechanize::Form
# Attention!!
 {name "f"}
# Attention!!
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3ffe9ce85ba4 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3ffe9ce859d8 type: hidden name: hl value: es]
  [hidden:0x3ffe9ce857bc type: hidden name: source value: hp]
  [hidden:0x3ffe9ce85618 type: hidden name: biw value: ]
  [hidden:0x3ffe9ce853e8 type: hidden name: bih value: ]
# Attention!!
  [text:0x3ffe9ce851cc type:  name: q value: ]
# Attention!!
  [hidden:0x3ffe9ce84bdc type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3ffe9ce85078 type: submit name: btnG value: Buscar con Google]
  [submit:0x3ffe9ce84e48 type: submit name: btnI value: Voy a tener suerte]}>

We can also identify the name of the text input field (q).

...

[text:0x3ffe9ce851cc type:  name: q value: ]

...

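We can target it by that name and set its value like a Ruby attribute. All we need to do is give it a new value. You can see from the output example above that it is empty by default.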

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "http://google.com/"

page = agent.get(google_url)

search_form = page.form('f')
search_form.q = 'New Google Search'

puts search_form.inspect

Output

#<Mechanize::Form
 {name "f"}
 {method "GET"}
 {action "/search"}
 {fields
  [hidden:0x3fcb85b6a784 type: hidden name: ie value: ISO-8859-1]
  [hidden:0x3fcb85b6a57c type: hidden name: hl value: es]
  [hidden:0x3fcb85b6a3b0 type: hidden name: source value: hp]
  [hidden:0x3fcb85b6a16c type: hidden name: biw value: ]
  [hidden:0x3fcb85b67f20 type: hidden name: bih value: ]
# Attention!!
  [text:0x3fcb85b67d18 type:  name: q value: New Google Search]
# Attention!!
  [hidden:0x3fcb85b67728 type: hidden name: gbv value: 1]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons
  [submit:0x3fcb85b67b9c type: submit name: btnG value: Buscar con Google]
  [submit:0x3fcb85b67994 type: submit name: btnI value: Voy a tener suerte]}>
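
Before submitting anything, it can help to see what a submit boils down to for a form like this one. The inspect output shows a GET form with the action /search, and for a GET form the field names and values travel in the URL's query string. The sketch below rebuilds that URL with Ruby's standard URI library from the field values printed above. It is an approximation for illustration only: get_form_url is an invented helper name, and the real request Mechanize sends may differ in details such as which submit button, if any, is included.

```ruby
require "uri"

# Hypothetical helper: the URL a GET form submission would request, given
# the page URL, the form's action, and its fields as [name, value] pairs.
def get_form_url(page_url, action, fields)
  url = URI.join(page_url, action)
  url.query = URI.encode_www_form(fields)
  url.to_s
end

# Field names and values as printed by search_form.inspect above.
fields = [
  ["ie", "ISO-8859-1"],
  ["hl", "es"],
  ["source", "hp"],
  ["biw", ""],
  ["bih", ""],
  ["q", "New Google Search"],
  ["gbv", "1"]
]

puts get_form_url("http://google.com/", "/search", fields)
```

Seen this way, setting q on the form is just a convenient way of filling in one parameter of the search URL, with the hidden fields coming along for the ride.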

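As the inspect output shows, the value of the text field has changed to New Google Search. Now we only need to submit the form and collect the results from the page Google returns. It couldn't be any simpler. Let's search for something else this time!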

some_scraper.rb

require 'mechanize'

agent = Mechanize.new

google_url = "http://google.com/"
page = agent.get(google_url)

search_form = page.form('f')
search_form.q = 'GitHub TouchFart'

page = agent.submit(search_form)

pp page.search('h3.r').map(&:text)

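Here I identified the search result titles using the CSS selector h3.r, mapped over their text, and pretty-printed the results. Not so hard, is it? Sure, this is a simple example, but think of the endless possibilities you have with this!

Output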

["GitHub - hungtruong/TouchFart: A fart app for the new Macbook ...",
 "TouchFart/TouchFart at master · hungtruong/TouchFart · GitHub",
 "Commits · hungtruong/TouchFart · GitHub",
 "Projects · hungtruong/TouchFart · GitHub",
 "Pull Requests · hungtruong/TouchFart · GitHub",
 "Issues · hungtruong/TouchFart · GitHub",
 "TouchFart/license.txt at master · hungtruong/TouchFart · GitHub",
 "Add autoplay attribute to <audio> tag and touchfart (er ... - GitHub",
 "Find file - File Finder · GitHub",
 "Fart app for the new Macbook Pro's Touch... #3860 on topic touchfart ..."]

Mechanize has different input fields for you to play with. You can even upload files!

  • field_with
  • checkbox_with
  • radiobuttons_with
  • file_uploads

You can also identify radio buttons and checkboxes by their name and check them with (you guessed it) check.

some_scraper.rb
form.radiobuttons_with(:name => 'gender')[3].check

form.checkbox_with(:name => 'coder').check

Option tags let users select one item from a drop-down list. Again, we target them by name and select the option number we need.

some_scraper.rb
form.field_with(:name => 'countries').options[22].select

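File uploads work much like entering text into a form: you set the file like a Ruby attribute. You identify the upload field and then specify the path (file name) of the file you want to transfer. It sounds more complicated than it actually is. Let's have a look!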

some_scraper.rb
form.file_uploads.first.file_name = "some-path/some-image.jpg"

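Final Thoughts

See, no magic after all! You are now well equipped to have some fun on your own. Sure, there is a lot more to learn about Nokogiri and Mechanize, but rather than spending too much time on aspects you may not need, play around with them and consult the documentation when you run into problems that a beginner article can't cover.

I hope you can see how beautifully simple this gem is and how much power it offers. As we all know from pop culture by now, with that comes responsibility. Use it within legal frameworks and when you don't have access to an API. You probably won't need these tools all that often, but they really come in handy when you have some actual scraping needs ahead of you.

As promised, in the next article we will cover a real-world example in which I scrape data from my podcast site. I will extract it from an old Sinatra site and transfer it over to my new Middleman site, which uses .markdown files for each episode. We will extract the dates, episode numbers, interviewee names, headers, subheaders, and so on. See you there!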

Translated from: https://code.tutsplus.com/articles/building-your-first-web-scraper-2--cms-27566
