A Roundup of Ways to Fetch Web Page Content with PHP

binger819623

To start, here are a few small programs. Their functionality and real-world robustness still leave room for improvement.

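①. Fetching web page content with PHP
http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html
header("Content-type: text/html; charset=utf-8");
1. Using the MSXML2.XMLHTTP COM object
<?php
// The COM class is only available on Windows with the com_dotnet extension enabled
$xhr = new COM("MSXML2.XMLHTTP");
$xhr->open("GET", "http://localhost/xxx.php?id=2", false); // false = synchronous request
$xhr->send();
echo $xhr->responseText;
?>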

2. Using file_get_contents
<?php
$url="http://www.blogjava.net/pts";
echo file_get_contents( $url );
?>
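
A bare file_get_contents() call has no timeout control and only raises a warning when the request fails. The following is a minimal sketch of one way to handle both, assuming allow_url_fopen is enabled. The helper name fetch_page and the User-Agent string are made up for illustration and do not come from the original post.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL with a
// timeout and a User-Agent header, returning null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds, // give up after this many seconds
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)', // assumed value
        ],
    ]);

    // @ suppresses the warning so that failure is reported through the return value
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage (http:// URLs require allow_url_fopen):
// $html = fetch_page("http://www.blogjava.net/pts");
// echo $html ?? "fetch failed";
```

Returning null instead of false keeps the check at the call site to a single ?? or === null test.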

3. Using fopen()
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}

if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
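
To see what the length and offset arguments of stream_get_contents() do without depending on a remote site, the same two calls can be tried on an in-memory stream. This is only an illustrative sketch; the sample string is arbitrary.

```php
<?php
// Uses an in-memory stream so the result does not depend on any remote site.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '0123456789abcdef');

// Read everything from offset 10 to the end (length -1 means "no limit").
$fromOffset = stream_get_contents($stream, -1, 10);

// Go back to the start and read only the first 5 bytes.
rewind($stream);
$firstFive = stream_get_contents($stream, 5);

fclose($stream);

echo $fromOffset, "\n"; // abcdef
echo $firstFive, "\n";  // 01234
```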

②. Fetching web page content with PHP
http://www.blogjava.net/pts/archive/2007/08/26/99188.html
The simple approach:
<?php
$url="http://www.blogjava.net/pts";
echo file_get_contents( $url );
?>
Or:
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}

if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>

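③. PHP source code for fetching site content and saving it as a TXT file
http://blog.chinaunix.net/u1/44325/showart_348444.html
<?php
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
// ereg() was removed in PHP 7, so preg_match() is used throughout
preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook);
$my_book_txt = $myBook[0]; // base URL of the book's directory
$file_handle = fopen($my_book_url, "r"); // open the index page
$handle = fopen("test.txt", "w"); // open the output file once; "w" discards any old copy
if (!$file_handle || !$handle) {
    exit("Could not open the index page or the output file</p>");
}
while (!feof($file_handle)) { // loop until the end of the index page
    $line = (string) fgets($file_handle); // read one line
    // look for a link to one of the book's chapter pages
    if (preg_match('/href="([0-9]+\.html)/', $line, $reg)) {
        $my_book_txt_over_url = $my_book_txt . $reg[1]; // build the chapter URL to fetch
        echo "$my_book_txt_over_url</p>"; // show progress
        $file_handle_txt = fopen($my_book_txt_over_url, "r"); // open the chapter page
        if (!$file_handle_txt) {
            continue; // skip chapters that cannot be opened
        }
        while (!feof($file_handle_txt)) {
            $line_txt = (string) fgets($file_handle_txt);
            // body text is recognized by a leading &nbsp
            if (preg_match('/^&nbsp.+/', $line_txt, $reg_txt)) {
                $my_over_txt = $reg_txt[0];
                // Filter characters. The entity names &nbsp; and &quot; are assumed:
                // they appear to have been decoded when this listing was published.
                $my_over_txt = str_replace("&nbsp;&nbsp;&nbsp;&nbsp;", "    ", $my_over_txt);
                $my_over_txt = str_replace("<br />", "", $my_over_txt);
                $my_over_txt = str_replace('<script language="javascript">', "", $my_over_txt);
                $my_over_txt = str_replace("&quot;", "", $my_over_txt);
                fwrite($handle, "$my_over_txt\n"); // write to the file
            }
        }
        fclose($file_handle_txt); // close each chapter page once it has been read
    }
}
fclose($handle);
fclose($file_handle); // close the index page
echo "Done</p>";
?>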
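
The link-extraction step in the script above works line by line. As a rough alternative sketch, the chapter links can also be collected from the whole page in one pass with preg_match_all(). The function name extract_chapter_urls and the sample HTML are invented for illustration; the real index page's markup may differ.

```php
<?php
// Hypothetical helper: collect links such as href="123.html" from an index
// page and turn them into absolute URLs under $baseUrl.
function extract_chapter_urls(string $html, string $baseUrl): array
{
    // With the default PREG_PATTERN_ORDER, $matches[1] holds every captured file name
    preg_match_all('/href="([0-9]+\.html)"/', $html, $matches);

    $urls = [];
    foreach (array_unique($matches[1]) as $file) {
        $urls[] = rtrim($baseUrl, '/') . '/' . $file;
    }
    return $urls;
}

// Made-up sample markup; only the two numeric links should be returned.
$sample = '<a href="1.html">One</a> <a href="22.html">Two</a> <a href="about.html">About</a>';
print_r(extract_chapter_urls($sample, 'http://book.yunxiaoge.com/files/article/html/4/4550/'));
```

Separating extraction from downloading makes it possible to check the pattern against a saved copy of the page before any chapter is fetched.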

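Next comes a much more heavy-duty approach.
It uses a class called Snoopy.
I first came across it here:
The Snoopy package for fetching web page content in PHP
http://blog.declab.com/read.php/27.htm
Then there is the official Snoopy site:
http://sourceforge.net/projects/snoopy/
Some brief notes can be found here:
Code collection: the Snoopy class and simple usage
http://blog.passport86.com/?p=161
Download: http://sourceforge.net/projects/snoopy/

I only discovered this gem today and downloaded it right away to take a look. It uses parse_url.
Personally, I am still more used to curl.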

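Snoopy is a PHP class that imitates the behavior of a web browser. It can fetch web page content and submit forms.
Here are some of its features:
1. Easily fetches the content of a web page
2. Easily fetches the text of a web page (with the HTML stripped)
3. Easily fetches the links on a web page
4. Supports proxy hosts
5. Supports basic user/password authentication
6. Supports custom user agent, referer, cookies and header content
7. Follows browser redirects, with control over the redirect depth
8. Expands links found in a page into fully qualified URLs (default)
9. Easily submits data and retrieves the response
10. Supports following HTML frames (added in v0.92)
11. Supports passing cookies along on redirects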
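
A few of the features listed above can be sketched as follows. The property names (agent, referer, cookies, rawheaders, maxredir) and the fetchlinks() method are taken from the README bundled with Snoopy and should be checked against the version you download. The helper name configure_client, the example.com URLs and all values are placeholders, and the script assumes Snoopy.class.php sits next to it.

```php
<?php
// Hypothetical helper: apply a few of the listed options to a Snoopy object.
// All values are placeholders.
function configure_client(object $client): object
{
    $client->agent   = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)'; // custom user agent
    $client->referer = 'http://www.example.com/';                      // custom referer
    $client->cookies['session']   = 'placeholder';                     // custom cookie
    $client->rawheaders['Pragma'] = 'no-cache';                        // extra request header
    $client->maxredir = 2;                                             // redirect depth
    return $client;
}

// Only runs when Snoopy.class.php is present next to this script.
if (is_file(__DIR__ . '/Snoopy.class.php')) {
    include __DIR__ . '/Snoopy.class.php';
    $snoopy = configure_client(new Snoopy);
    if ($snoopy->fetchlinks('http://www.example.com/')) {
        print_r($snoopy->results); // the links found on the page
    } else {
        echo 'error: ' . $snoopy->error . "\n";
    }
}
```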

For detailed usage, see the documentation included in the download.

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"] = "normal";
$submit_vars["styleid"] = "1";
$submit_vars["cookietime"] = "315360000";
$submit_vars["loginfield"] = "username";
$submit_vars["username"] = "********"; // your username
$submit_vars["password"] = "*******"; // your password
$submit_vars["questionid"] = "0";
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "提   交"; // value of the form's submit button, sent as-is
$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;