抓取页面常用方法是Curl,本文介绍另一个很好用的类。官网:
http://snoopy.sourceforge.net/
本文来自Snoopy官方文档,一起来学习下:
NAME:
Snoopy - the PHP net client v2.0.0
SYNOPSIS(简单示例):
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchtext("http://www.php.net/");
print $snoopy->results;
$snoopy->fetchlinks("http://www.phpbuilder.com/");
print $snoopy->results;
$submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
$submit_vars["q"] = "amiga";
$submit_vars["submit"] = "Search!";
$submit_vars["searchhost"] = "Altavista";
$snoopy->submit($submit_url,$submit_vars);
print $snoopy->results;
$snoopy->maxframes=5;
$snoopy->fetch("http://www.ispi.net/");
echo "<PRE>\n";
echo htmlentities($snoopy->results[0]);
echo htmlentities($snoopy->results[1]);
echo htmlentities($snoopy->results[2]);
echo "</PRE>\n";
$snoopy->fetchform("http://www.altavista.com");
print $snoopy->results;
DESCRIPTION:
What is Snoopy?
Snoopy is a PHP class that simulates a web browser. It automates the
task of retrieving web page content and posting forms, for example.
史努比是一个模拟web浏览器的PHP类。它能实现自动搜索网页内容和提交表单。
Some of Snoopy's features:
* easily fetch the contents of a web page
* easily fetch the text from a web page (strip html tags)
* easily fetch the the links from a web page
* supports proxy hosts
* supports basic user/pass authentication
* supports setting user_agent, referer, cookies and header content
* supports browser redirects, and controlled depth of redirects
* expands fetched links to fully qualified URLs (default)
* easily submit form data and retrieve the results
* supports following html frames (added v0.92)
* supports passing cookies on redirects (added v0.92)
* 轻松抓取网页内容
* 轻松从网页获取文本(过滤 html标签)
* 轻松从网页获取链接
* 支持代理主机
* 支持基本 用户名/密码 认证
* 支持设置user_agent,refer,Cookie和header 内容
* 支持浏览器重定向和受控重定向深度
* 将指向完全授权网址的抓取链接展开(默认)
* 轻松提交表单数据和取回结果
* 支持以下html框架(v0.92添加)
* 支持在重定向上传递Cookie(v0.92添加)
REQUIREMENTS(依赖):
Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
and the OpenSSL extension for fetching HTTPS requests.
史努比抓取HTTPS请求时需要安装PHP的PCRE和OpenSSL扩展。
CLASS METHODS(类方法):
fetch($URI)
-----------
This is the method used for fetching the contents of a web page.
$URI is the fully qualified URL of the page to fetch.
The results of the fetch are stored in $this->results.
If you are fetching frames, then $this->results
contains each frame fetched in an array.
这是用于获取网页内容的方法。
$URI是个完全授权的URL的网页。
获取的结果存储在$this->results中。
如果您正在获取frames,那么$this->results
的结果数组中包含每个frame提取的内容。
fetchtext($URI)
---------------
This behaves exactly like fetch() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.
这方法很像fetch()方法,但是它只返回
从网页的文本,剥离出了HTML标签等
不相关的数据。
fetchform($URI)
---------------
This behaves exactly like fetch() except that it only returns
the form elements from the page, stripping out html tags and other
irrelevant data.
这个方法跟fetch()也很像,但是它只返回符合条件的form元素。
fetchlinks($URI)
----------------
This behaves exactly like fetch() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.
这个方法跟fetch()也很像,但是它只返回符合条件的超链接。
默认情况下,链接会被转换为正常的网址形式。
submit($URI,$formvars)
----------------------
This submits a form to the specified $URI. $formvars is an
array of the form variables to pass.
此方法会向指定url提交form表单,$formvars是一个要传递变量的数组。
submittext($URI,$formvars)
--------------------------
This behaves exactly like submit() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.
类似与submit()方法,但是只返回文本内容。
submitlinks($URI)
----------------
This behaves exactly like submit() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.
类似与submit()方法,但是只返回链接。
CLASS VARIABLES: (default value in parenthesis)
$host the host to connect to
$port the port to connect to
$proxy_host the proxy host to use, if any
$proxy_port the proxy port to use, if any
proxy can only be used for http URLs, but not https
$agent the user agent to masqerade as (Snoopy v0.1)
$referer referer information to pass, if any
$cookies cookies to pass if any
$rawheaders other header info to pass, if any
$maxredirs maximum redirects to allow. 0=none allowed. (5)
$offsiteok whether or not to allow redirects off-site. (true)
$expandlinks whether or not to expand links to fully qualified URLs (true)
$user authentication username, if any
$pass authentication password, if any
$accept http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error where errors are sent, if any
$response_code responde code returned from server
$headers headers returned from server
$maxlength max return data length
$read_timeout timeout on read operations (requires PHP 4 Beta 4+)
set to 0 to disallow timeouts
$timed_out true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes number of frames we will follow
$status http status of fetch
$temp_dir temp directory that the webserver can write to. (/tmp)
$curl_path system path to cURL binary, set to false if none
(this variable is ignored as of Snoopy v1.2.6)
$cafile name of a file with CA certificate(s)
$capath name of a correctly hashed directory with CA certificate(s)
if either $cafile or $capath is set, SSL certificate
verification is enabled
EXAMPLES(案例):
Example: fetch a web page and display the return headers and
the contents of the page (html-escaped):
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->user = "joe";
$snoopy->pass = "bloe";
if($snoopy->fetch("http://www.slashdot.org/"))
{
echo "response code: ".$snoopy->response_code."<br>\n";
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";
echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";
Example: submit a form and print out the result headers
and html-escaped page:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
$submit_vars["q"] = "amiga";
$submit_vars["submit"] = "Search!";
$submit_vars["searchhost"] = "Altavista";
if($snoopy->submit($submit_url,$submit_vars))
{
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";
echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";
Example: showing functionality of all the variables:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->proxy_host = "my.proxy.host";
$snoopy->proxy_port = "8080";
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.microsnot.com/";
$snoopy->cookies["SessionID"] = 238472834723489l;
$snoopy->cookies["favoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;
$snoopy->user = "joe";
$snoopy->pass = "bloe";
if($snoopy->fetchtext("http://www.phpbuilder.com"))
{
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";
echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";
Example: fetched framed content and display the results
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->maxframes = 5;
if($snoopy->fetch("http://www.ispi.net/"))
{
echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";
COPYRIGHT(版权):
Copyright(c) 1999,2000 ispi. All rights reserved.
This software is released under the GNU General Public License.
Please read the disclaimer at the top of the Snoopy.class.php file.
THANKS(鸣谢):
Special Thanks to:
Peter Sorger sorgo@cool.sk help fixing a redirect bug
Andrei Zmievski andrei@ispi.net implementing time out functionality
Patric Sandelin patric@kajen.com help with fetchform debugging
Carmelo carmelo@meltingsoft.com misc bug fixes with frames