[Repost] Scraping Yahoo! Search with Web::Scraper

From http://menno.b10m.net/blog/blosxom/perl

Reposter's note: this article is about parsing data out of fetched HTML, and it uses XPath concepts.

Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.

Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post this blog entry in which I'll show how to effectively scrape Yahoo! Search.

First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to fetch the following things:

* title (the linked text)
* url (the actual link)
* description (the text beneath the link)

So let's start our first little script:

[code]
use Data::Dumper;   # used to dump the resulting data structure
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
    process "div.yschabstr", 'description' => "TEXT";
    result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape(URI->new("http://search.yahoo.com/search?p=Perl"));
[/code]

Now what happens here? The important stuff can be found in the process statements. Basically, you may translate those lines to "Fetch an A-element with the CSS class named 'yschttl' and put the text in 'title', and the href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in 'description'."

The result looks something like this:

[code]
$VAR1 = {
    'url' => 'http://www.perl.com/',
    'title' => 'Perl.com',
    'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
};
[/code]

Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a loop! The slides tell you to append '[]' to the key to enable looping.
The process lines then look like this:

[code]
process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
process "div.yschabstr", 'description[]' => "TEXT";
[/code]

And when we run it now, the result looks like this:

[code]
$VAR1 = {
    'url' => [
        'http://www.perl.com/',
        'http://www.perl.org/',
        'http://www.perl.com/download.csp',
        ...
    ],
    'title' => [
        'Perl.com',
        'Perl Mongers',
        'Getting Perl',
        ...
    ],
    'description' => [
        'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.',
        'Nonprofit organization, established to support the Perl community.',
        'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...',
        ...
    ]
};
[/code]

That looks a lot better! We now get all the search results and could loop through the different arrays to match the right title with the right url. But still we shouldn't be satisfied, for we don't want three arrays, we want one array of hashes!

For that we need a little trickery; we need another process line! All the stuff we grab is already located in one big ordered list (the OL-element), so let's find that one first, and for each list element (LI) find our title, url and description. For this we don't use CSS selectors, but go for XPath selectors instead (heck, we can do both, so why not?).

To grab an XPath I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can grab the path within seconds.
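Before restructuring the scraper, here is a rough sketch of the "loop through the different arrays" option mentioned above, in plain Perl with no Web::Scraper involved. The sample data and the names `$res` and `@results` are made up for illustration. The sketch assumes the three arrays always line up index by index, which is exactly the fragile part.

```perl
use strict;
use warnings;

# Hypothetical sample data, shaped like the three-array result shown above.
my $res = {
    url         => [ 'http://www.perl.com/', 'http://www.perl.org/' ],
    title       => [ 'Perl.com', 'Perl Mongers' ],
    description => [
        'Central resource for Perl developers.',
        'Nonprofit organization, established to support the Perl community.',
    ],
};

# Pair the entries up by index. This only works while all three arrays
# have the same length and order; a hit without a description would
# silently shift every later description onto the wrong title.
my @results;
for my $i ( 0 .. $#{ $res->{title} } ) {
    push @results, {
        title       => $res->{title}[$i],
        url         => $res->{url}[$i],
        description => $res->{description}[$i],
    };
}

printf "%s <%s>\n", $_->{title}, $_->{url} for @results;
```

Grouping per list element inside the scraper avoids that alignment assumption, which is why it is the more robust route.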
[code]
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => "TEXT";
        result 'description', 'title', 'url';
    };
    result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]

You see that we switched our title, url and description fields back to the old notation (without '[]'), for we don't want to loop over those fields. We've moved the looping one level up, to the li elements. For each of them we open another scraper, which dumps its hash into the results array (note the '[]' in 'results[]').

The result is exactly what we wanted:

[code]
$VAR1 = [
    {
        'url' => 'http://www.perl.com/',
        'title' => 'Perl.com',
        'description' => 'Central resource for Perl developers. It contains the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by Clay Irving.'
    },
    {
        'url' => 'http://www.perl.org/',
        'title' => 'Perl Mongers',
        'description' => 'Nonprofit organization, established to support the Perl community.'
    },
    {
        'url' => 'http://www.perl.com/download.csp',
        'title' => 'Getting Perl',
        'description' => 'Instructions on downloading a Perl interpreter for your computer platform. ... On CPAN, you will find Perl source in the /src directory. ...'
    },
    ...
];
[/code]

Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!

Update: Tatsuhiko had some wise words on this article:

A couple of things:

You might just skip result() stuff if you're returning the entire hash, which is the default.
(The API is stolen from Ruby's one that needs result() for some reason, but my perl port doesn't require)

Now with less code :)

The use of nested scraper in your example seems pretty good, but using hash reference could be also useful, like:

[code]
my $yahoo = scraper {
    process "a.yschttl", 'results[]', {
        title => 'TEXT',
        url   => '@href',
    };
};
[/code]

This way you'll get title and url from TEXT and @href from a.yschttl, which would be handier if you don't need the description.

TIMTOWTDI :)
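To try the nested-scraper idea without hitting Yahoo!, here is a minimal sketch that feeds scrape() an inline HTML string instead of a URI. The markup in `$html` is made up to mimic the class names used in this article, and the names `$html` and `$offline` are illustrative only. The real result page may well be structured differently. The sketch also uses a shorter XPath (`//ol/li`) rather than the full positional path, and it leaves out the result lines, following Tatsuhiko's remark that they can be skipped.

```perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# Made-up markup that imitates the structure assumed in the article.
my $html = <<'HTML';
<html><body>
<ol>
  <li><a class="yschttl" href="http://www.perl.com/">Perl.com</a>
      <div class="yschabstr">Central resource for Perl developers.</div></li>
  <li><a class="yschttl" href="http://www.perl.org/">Perl Mongers</a>
      <div class="yschabstr">Nonprofit organization.</div></li>
</ol>
</body></html>
HTML

# The outer process line loops over the li elements (XPath selector);
# the nested scraper picks the fields out of each one (CSS selectors).
my $offline = scraper {
    process "//ol/li", 'results[]' => scraper {
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => 'TEXT';
    };
};

# Without a result line, scrape() returns the whole hash,
# so the list of hits sits under the 'results' key.
my $res = $offline->scrape($html);
print Dumper $res->{results};
```

Working from a saved or hand-written snippet like this makes it easier to experiment with selectors, since the output no longer depends on what the live page happens to return.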