Common PHP Regex Techniques for Web Scraping

@航空母舰

Article link: https://blog.csdn.net/hudeyong926/article/details/99540281

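Regular expressions are a very useful programming skill. Extracting a single item from an HTML page, such as <title>Title</title>, is usually easy. More often, though, we need to scrape a list page made of many repeated <div></div> blocks, the blocks are nested, and we want several pieces of information from each repeated block. Page source also differs from an ordinary string: it contains many carriage returns, line feeds, and tabs, and these frequently cause a match to fail. Beginners often cannot tell which step went wrong, get discouraged by highly compressed regular expressions, and give up on the problem. A practical first step is to strip all line feeds, tabs, and carriage returns, because in HTML source formatted for readability these characters are what keep a pattern from matching: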

$str = preg_replace("/[\t\n\r]+/", "", $str);
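As a rough sketch of the effect, the snippet below strips those characters from a small, made-up list page and then pulls two fields out of each repeated block with preg_match_all. The sample markup and the class names (item, name, price) are invented for illustration and do not come from any real site.

```php
<?php
// Hypothetical list page: the markup and class names are invented for this sketch.
$html = "<div class=\"item\">\n\t<span class=\"name\">Apple</span>\r\n\t<span class=\"price\">3</span>\n</div>\n"
      . "<div class=\"item\">\n\t<span class=\"name\">Pear</span>\r\n\t<span class=\"price\">5</span>\n</div>";

// Strip tabs, line feeds, and carriage returns so the pattern does not have to allow for them.
$flat = preg_replace("/[\t\n\r]+/", "", $html);

// One match per repeated block; PREG_SET_ORDER groups the captures block by block.
preg_match_all(
    '/<div class="item"><span class="name">(.*?)<\/span><span class="price">(.*?)<\/span><\/div>/i',
    $flat,
    $matches,
    PREG_SET_ORDER
);

$items = array();
foreach ($matches as $m) {
    $items[] = array('name' => $m[1], 'price' => $m[2]);
}
print_r($items);
```

Run against the unflattened $html, the same pattern finds nothing, because the line breaks and tabs between the tags are not accounted for in the pattern.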

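When data is hard to select with an XPath expression, or the extracted data still needs processing, handle it with regular expressions: use preg_match_all to extract and preg_replace to substitute.
Use strip_tags() to remove HTML, XML, and PHP tags. Its optional second argument lists the tags to keep, which is useful when cleaning article content:

$str = strip_tags($str, "<p><img><strong>");

The tags listed in the second argument are kept; all others are removed.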
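A small illustration may help here. The input string is made up; the point it shows is that strip_tags() removes the tags themselves but leaves the text between them, including the body of a script element.

```php
<?php
// Made-up input for illustration.
$str = '<div><p>Hello <strong>world</strong></p><script>alert(1)</script><a href="#">link</a></div>';

// Keep only <p>, <img>, and <strong>; every other tag is dropped, but its inner text stays.
$clean = strip_tags($str, "<p><img><strong>");

echo $clean, "\n";
// <p>Hello <strong>world</strong></p>alert(1)link
```

Because the script body survives as plain text, elements whose content should disappear entirely (script, style) need to be removed with a regular expression before strip_tags() is called.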

Some commonly used regular expressions

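$str = preg_replace("/<(\/?body.*?)>/si", "", $str); // remove body tags
$str = preg_replace("/<(\/?form.*?)>/si", "", $str); // remove form tags
$str = preg_replace("/cookie/si", "COOKIE", $str); // upper-case every occurrence of "cookie"
$str = preg_replace("/<(object.*?)>(.*?)<(\/object.*?)>/si", "", $str); // remove object elements and their content
$str = preg_replace("/<(\/?object.*?)>/si", "", $str); // remove any leftover object tags
$str = preg_replace("/<(noframes.*?)>(.*?)<(\/noframes.*?)>/si", "", $str); // remove noframes elements and their content
$str = preg_replace("/<(\/?noframes.*?)>/si", "", $str); // remove any leftover noframes tags
$str = preg_replace("/<(i?frame.*?)>(.*?)<(\/i?frame.*?)>/si", "", $str); // remove frame and iframe elements and their content
$str = preg_replace("/<(\/?i?frame.*?)>/si", "", $str); // remove any leftover frame and iframe tags
$str = preg_replace("/on([a-z]+)\s*=/si", "On\\1=", $str); // rewrite on* event attributes (onclick= becomes Onclick=); HTML attribute names are case-insensitive, so this marks handlers rather than disabling them
$str = preg_replace("/&#/si", "&＃", $str); // break numeric character references (&#...;), which can disguise strings such as javAsCript:alert(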
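If the goal is to drop event-handler attributes rather than re-case them, one possible approach is sketched below. It is not part of the original list: strip_event_attributes is a hypothetical helper, and it assumes that attribute values do not themselves contain a ">" character. It limits the replacement to the inside of tags so that ordinary text such as "one = 5" is left alone.

```php
<?php
// Hypothetical helper (not from the original article): delete on* attributes inside tags.
function strip_event_attributes($html) {
    // Visit each opening tag; assumes no ">" appears inside an attribute value.
    return preg_replace_callback('/<[a-z][^>]*>/i', function ($tag) {
        // Remove onxxx="...", onxxx='...', and unquoted onxxx=value.
        return preg_replace('/\s+on[a-z]+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i', '', $tag[0]);
    }, $html);
}

echo strip_event_attributes('<img src="a.png" onclick="steal()" alt="x"> one = 5'), "\n";
// <img src="a.png" alt="x"> one = 5
```

For untrusted input a dedicated HTML sanitizer is a safer choice than any regular expression; this sketch only illustrates the pattern.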

Functions commonly used in PHP scrapers

<?php
// Convert root-relative href and src paths to absolute URLs
function relative_to_absolute($content, $feed_url) {
    preg_match('/(http|https|ftp):\/\//', $feed_url, $protocol);
    $server_url = preg_replace("/(http|https|ftp|news):\/\//", "", $feed_url);
    $server_url = preg_replace("/\/.*/", "", $server_url);

    if ($server_url == '') {
        return $content;
    }

    if (isset($protocol[0])) {
        // (?!\/) leaves protocol-relative URLs such as href="//cdn.example.com/x" untouched
        $new_content = preg_replace('/href="\/(?!\/)/', 'href="' . $protocol[0] . $server_url . '/', $content);
        $new_content = preg_replace('/src="\/(?!\/)/', 'src="' . $protocol[0] . $server_url . '/', $new_content);
    } else {
        $new_content = $content;
    }
    return $new_content;
}

// Get all links: returns the link texts ('name') and the href values ('url')
function get_all_url($code) {
    preg_match_all('/<a\s+href=["\']?([^>"\' ]+)["\']?\s*[^>]*>([^>]+)<\/a>/i', $code, $arr);
    return array('name' => $arr[2], 'url' => $arr[1]);
}

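/*   Example: get_web_tags('id="nav"', 'ul', 'http://mail.163.com/html/mail_intro/', false, true)
 *   @param string $tag_attr  Tag attribute used to pin down the tag, e.g. id="main", class="p", name="task", border="0px"; may be ''.
 *   @param string $tag       Tag name; any HTML tag such as div, ul, or table.
 *   @param string $url       URL to fetch the HTML from when no source is passed in; this is the more common usage.
 *   @param string $data      HTML source to search directly, e.g. for testing.
 *   @param bool   $first     When true, return only the first match.
 */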
function get_web_tags($tag_attr, $tag = 'div', $url = false, $data = false, $first = false) {
    // By default, fetch the data from the URL
    if ($url !== false) {
        $data = file_get_contents($url);
    }
    // Detect the page encoding and convert to UTF-8
    $charset_pos = stripos($data, 'charset');
    if ($charset_pos) {
        if (stripos($data, 'charset=utf-8', $charset_pos)) {
            $data = iconv('utf-8', 'utf-8', $data);
        } else if (stripos($data, 'charset=gb2312', $charset_pos)) {
            $data = iconv('gb2312', 'utf-8', $data);
        } else if (stripos($data, 'charset=gbk', $charset_pos)) {
            $data = iconv('gbk', 'utf-8', $data);
        }
    }

    // Collect the tags that match $tag and $tag_attr into $hits
    preg_match_all('/<' . $tag . '[^<]*?' . $tag_attr . '/i', $data, $hits, PREG_OFFSET_CAPTURE);
    if (count($hits[0]) === 0) { // no hit: return immediately
        return 'No match found!';
    }

    preg_match_all('/<' . $tag . '/i', $data, $pre_matches, PREG_OFFSET_CAPTURE); // all opening tags
    preg_match_all('/<\/' . $tag . '/i', $data, $suf_matches, PREG_OFFSET_CAPTURE); // all closing tags

    // Paired format (<div></div>)? If so, set the closing tag; otherwise false. Note: img, input, etc. may have no closing tag, in which case $suf_matches[0] is empty.
    if (!empty($suf_matches[0])) $endTag = '</' . $tag . '>';
    else $endTag = false;

    // Merge all tag offsets into one map: 'p' = opening tag (prefix), 's' = closing tag (suffix)
    $htmltags = array();
    if ($endTag !== false) {
        foreach ($pre_matches[0] as $index => $pre_div) {
            $htmltags[(int)$pre_matches[0][$index][1]] = 'p';
            $htmltags[(int)$suf_matches[0][$index][1]] = 's';
        }
    } else {
        foreach ($pre_matches[0] as $index => $pre_div) {
            // Unpaired tag: treat the first ">" after the opening offset as the end of the tag
            $suf_matches[0][$index][1] = stripos($data, '>', $pre_matches[0][$index][1]) + 1;

            $htmltags[(int)$pre_matches[0][$index][1]] = 'p';
            $htmltags[(int)$suf_matches[0][$index][1]] = 's';
        }
    }
    // Sort all tag offsets in ascending order
    $sort = array_keys($htmltags);
    asort($sort);

    // Extract the matched elements
    $hitTagStrings = array();
    foreach ($hits[0] as $hit) {
        $hit = $hit[1]; // offset of this hit

        $count = count($sort); // loop guard: $count-- prevents an infinite loop
        foreach ($pre_matches[0] as $index => $pre_div) {
            // The last $pre_matches[0][$index + 1] would be out of bounds, so set its offset to the total length
            if (!isset($pre_matches[0][$index + 1][1])) $pre_matches[0][$index + 1][1] = strlen($data);

            // The tag at $index is the hit when: its offset <= $hit < offset of the next opening tag
            if (($pre_matches[0][$index][1] <= $hit) && ($hit < $pre_matches[0][$index + 1][1])) {
                $deeper = 0;
                // Discard every tag offset up to and including the hit tag
                while (array_shift($sort) != $pre_matches[0][$index][1] && ($count--)) continue;
                // Walk the remaining tags: an opening tag ('p') goes one level deeper ($deeper + 1),
                // a closing tag comes back up ($deeper - 1); a closing tag at $deeper == 0 ends the hit element and gives its length
                foreach ($sort as $key) {
                    if ($htmltags[$key] == 'p') { // enter a nested element
                        $deeper++;
                    } else if ($deeper == 0) { // reached the matching closing tag
                        $length = $key - $pre_matches[0][$index][1]; // length = closing-tag offset minus opening-tag offset
                        break;
                    } else { // closing tag of a nested element
                        $deeper--;
                    }
                }
                $hitTagStrings[] = substr($data, $pre_matches[0][$index][1], $length) . $endTag;
                break;
            }
        }
        // If only the first match is wanted, stop here
        if ($first && count($hitTagStrings) == 1) break;
    }

    return $hitTagStrings;
}

// Convert each row of an HTML table into a CSV-style string; returns an array of rows
function get_tr_array($table) {
    $table = preg_replace("'<td[^>]*?>'si", '"', $table);
    $table = str_replace("</td>", '",', $table);
    $table = str_replace("</tr>", "{tr}", $table);
    // Strip the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Strip whitespace that follows a tab or line break
    $table = preg_replace("'([\t\r\n])[\s]+'", "", $table);
    //$table = str_replace(" ", "", $table);

    $table = explode(",{tr}", $table);
    $table = str_replace("{tr}", "", $table);
    array_pop($table);
    return $table;
}

// Convert every row and cell of an HTML table into a two-dimensional array (for scraping table data)
function get_td_array($table) {
    $table = preg_replace("'<table[^>]*?>'si", "", $table);
    $table = preg_replace("'<tr[^>]*?>'si", "", $table);
    $table = preg_replace("'<td[^>]*?>'si", "", $table);
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Strip the remaining HTML tags
    $table = preg_replace("'<[\/\!]*?[^<>]*?>'si", "", $table);
    // Strip whitespace that follows a tab or line break
    $table = preg_replace("'([\t\r\n])[\s]+'", "", $table);
    //$table = str_replace(" ", "", $table);

    $table = explode('{tr}', $table);
    array_pop($table);
    $td_array = [];
    foreach ($table as $key => $tr) {
        $td = explode('{td}', $tr);
        array_pop($td);
        $td_array[] = array_map('trim', $td); // trim() takes a string, so apply it to each cell
    }
    return $td_array;
}

// Return all English words in a string, sorted; $distinct = true removes duplicates
function split_en_str($str, $distinct = true) {
    preg_match_all('/([a-zA-Z]+)/', $str, $match);
    if ($distinct == true) {
        $match[1] = array_unique($match[1]);
    }
    sort($match[1]);
    return $match[1];
}

function filter_html_tag($str) {
    $str = preg_replace("/<[ ]+/si", "<", $str); // remove spaces that follow "<"
    $str = preg_replace("/<(title.*?)>(.*?)<(\/title.*?)>/si", "", $str); // remove the title element and its content
    $str = preg_replace("/<(\/?title.*?)>/si", "", $str); // remove any leftover title tags
    $str = preg_replace("/<\!--.*?-->/si", "", $str); // remove comments
    $str = preg_replace("/<(\!.*?)>/si", "", $str); // remove DOCTYPE and other <!...> declarations
    $str = preg_replace("/<(\/?html.*?)>/si", "", $str); // remove html tags
    $str = preg_replace("/<(\/?head.*?)>/si", "", $str); // remove head tags
    $str = preg_replace("/<(\/?meta.*?)>/si", "", $str); // remove meta tags
    $str = preg_replace("/<(\/?link.*?)>/si", "", $str); // remove link tags
    $str = preg_replace("/<(script.*?)>(.*?)<(\/script.*?)>/si", "", $str); // remove script elements and their content
    $str = preg_replace("/<(\/?script.*?)>/si", "", $str); // remove any leftover script tags
    $str = preg_replace("/javascript/si", "Javascript", $str); // re-case "javascript"; URL schemes are case-insensitive, so this only marks the text
    $str = preg_replace("/vbscript/si", "Vbscript", $str); // re-case "vbscript" in the same way
    $str = preg_replace("/<(applet.*?)>(.*?)<(\/applet.*?)>/si", "", $str); // remove applet elements and their content
    $str = preg_replace("/<(\/?applet.*?)>/si", "", $str); // remove any leftover applet tags
    $str = preg_replace("/<(style.*?)>(.*?)<(\/style.*?)>/si", "", $str); // remove style elements and their content
    $str = preg_replace("/<(\/?style.*?)>/si", "", $str); // remove any leftover style tags
    return $str;
}

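// Return the content of the first <$tag>...</$tag> element, or '' if there is none
function get_tag_data($str, $tag = 'title') {
    preg_match("/<($tag.*?)>(.*?)<(\/$tag.*?)>/si", $str, $title);
    return isset($title[2]) ? $title[2] : '';
}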
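The depth-counting idea described in the comments of get_web_tags can also be shown in isolation. The sketch below is not the author's function: extract_balanced is a hypothetical helper that assumes $start is the byte offset of an opening tag and that the markup is reasonably well formed. It walks the opening and closing tags of one name in document order and stops when the nesting depth returns to zero.

```php
<?php
// Hypothetical helper: return the element that starts at byte offset $start,
// including any nested elements with the same tag name.
function extract_balanced($html, $tag, $start) {
    // Find every opening and closing tag of this name from $start on, with offsets.
    $pattern = '/<(\/?)' . preg_quote($tag, '/') . '\b[^>]*>/i';
    preg_match_all($pattern, $html, $m, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, $start);

    $depth = 0;
    foreach ($m as $tagMatch) {
        $isClose = ($tagMatch[1][0] === '/');
        $depth += $isClose ? -1 : 1;
        if ($depth === 0) {
            // Offsets are relative to the start of $html, not to $start.
            $end = $tagMatch[0][1] + strlen($tagMatch[0][0]);
            return substr($html, $start, $end - $start);
        }
    }
    return false; // no matching closing tag found
}

$html  = '<ul><li><div id="a">outer <div>inner</div> tail</div></li></ul>';
$start = strpos($html, '<div id="a"');
echo extract_balanced($html, 'div', $start), "\n";
// <div id="a">outer <div>inner</div> tail</div>
```

For pages with irregular markup, PHP's DOMDocument extension is usually more robust than counting tags by hand.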

Downloading the images referenced in a CSS file with PHP

<?php
function getImagesFromCssFile() {
    // Remove PHP's execution time limit
    set_time_limit(0);

    // Read the stylesheet
    $styleFileContent = file_get_contents('images/style.css');

    // Collect the url(...) references to download
    preg_match_all("/url\(.+?\)/", $styleFileContent, $imagesURLArray);

    // Loop over the references and fetch them one by one
    $imagesURLArray = array_unique($imagesURLArray[0]);
    foreach ($imagesURLArray as $imagesURL) {
        $imagesURL = str_ireplace(array("url(", ")", "'", '"'), '', $imagesURL);  // handles url("..."), url('...'), and url(...)
        if (preg_match('/^http.*/', $imagesURL)) {   // skip remote images
            continue;
        }
        file_put_contents(basename($imagesURL), file_get_contents($imagesURL));
    }
}
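An alternative to cleaning each match with str_ireplace is to capture the path directly. The sketch below is only an illustration: extract_css_urls is a hypothetical helper, and the sample stylesheet is made up.

```php
<?php
// Hypothetical helper: return the distinct paths referenced by url(...) in a stylesheet.
function extract_css_urls($css) {
    // Optional whitespace and quotes around the path; the path itself is captured in group 1.
    preg_match_all('/url\(\s*[\'"]?([^\'")]+?)[\'"]?\s*\)/i', $css, $m);
    return array_values(array_unique($m[1]));
}

// Made-up stylesheet covering single-quoted, double-quoted, and unquoted forms.
$css = "a{background:url('img/a.png')} b{background:url( \"img/b.gif\" )} c{background:url(img/a.png)}";
print_r(extract_css_urls($css));
// Array ( [0] => img/a.png [1] => img/b.gif )
```

Note that paths in a stylesheet are relative to the stylesheet's own location, so they may need the stylesheet's directory prepended before they are fetched.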

<?php
/* Fetch the page and collect its image URLs */
function get_img_url($site_name) {
    $site_fd = fopen($site_name, "r");
    $site_content = "";
    while (!feof($site_fd)) {
        $site_content .= fread($site_fd, 1024);
    }
    /* Use a regular expression to extract the image links */
    $reg_tag = '/<img.*?\"([^\"]*(jpg|bmp|jpeg|gif)).*?>/';
    $ret = preg_match_all($reg_tag, $site_content, $match_result);
    fclose($site_fd);
    return $match_result[1];
}

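/* Turn relative image links into absolute ones */
function revise_site($site_list, $base_site) {
    $return_list = array(); // start empty so an empty input still returns an array
    foreach ($site_list as $site_item) {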
        if (preg_match('/^http/', $site_item)) {
            $return_list[] = $site_item;
        } else {
            $return_list[] = $base_site . "/" . $site_item;
        }
    }
    return $return_list;
}

/* Derive each image's file name and save the image to the given location */
function get_pic_file($pic_url_array, $pos) {
    $reg_tag = '/.*\/(.*?)$/';
    $count = 0;
    foreach ($pic_url_array as $pic_item) {
        $ret = preg_match_all($reg_tag, $pic_item, $t_pic_name);
        $pic_name = $pos . $t_pic_name[1][0];
        $pic_url = $pic_item;
        print("Downloading " . $pic_url . "\n");
        $img_read_fd = fopen($pic_url, "r");
        $img_write_fd = fopen($pic_name, "w");
        $img_content = "";
        while (!feof($img_read_fd)) {
            $img_content .= fread($img_read_fd, 1024);

        }
        fwrite($img_write_fd, $img_content);
        fclose($img_read_fd);
        fclose($img_write_fd);
        print("[OK]\n");
    }
    return 0;
}

function main() {
    /* Address of the page to scrape images from */
    $site_name = "http://image.cn.yahoo.com";
    $img_url = get_img_url($site_name);
    $img_url_revised = revise_site($img_url, $site_name);
    $img_url_unique = array_unique($img_url_revised); //unique array
    get_pic_file($img_url_unique, "./");
}

main();
?>

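This program has one shortcoming. If images sit in different directories on the server but share a file name, they may well be different images, yet when they are saved the later one overwrites the earlier one. For example, if http://example.com/ links to both http://example.com/pic/test1.jpg and http://example.com/pic/new/test1.jpg, only one of the two survives the download and the other is overwritten. The fix is to check, before each save, whether the target directory already contains a file with that name, and if it does, to save the new image under a different name.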
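One way to implement that check is sketched below. unique_path is a hypothetical helper, not part of the program above; it assumes the directory argument ends with a slash, as "./" does in main(), and it appends _1, _2, and so on until the name is free. The demo uses a throwaway directory under the system temp directory.

```php
<?php
// Hypothetical helper: return a path in $dir that does not exist yet,
// adding _1, _2, ... before the extension when $file_name is already taken.
function unique_path($dir, $file_name) {
    $info = pathinfo($file_name);
    $base = $info['filename'];
    $ext  = isset($info['extension']) ? '.' . $info['extension'] : '';

    $candidate = $dir . $base . $ext;
    $n = 1;
    while (file_exists($candidate)) {
        $candidate = $dir . $base . '_' . $n . $ext;
        $n++;
    }
    return $candidate;
}

// Demo in a throwaway directory.
$dir = sys_get_temp_dir() . '/unique_path_demo_' . uniqid() . '/';
mkdir($dir);

$first = unique_path($dir, 'test1.jpg');   // .../test1.jpg
file_put_contents($first, 'first image');
$second = unique_path($dir, 'test1.jpg');  // .../test1_1.jpg
file_put_contents($second, 'second image');

echo basename($first), "\n", basename($second), "\n";
```

In get_pic_file, the line that builds $pic_name could call such a helper instead of concatenating $pos and the file name directly.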

<?php
// Get every image URL in the page

function getimages($str){

    $match_str = '/((http:\/\/)+([^ \r\n()^$!`"\'|\[\]{}<>]*)((\.gif)|(\.jpg)|(\.bmp)|(\.png)|(\.GIF)|(\.JPG)|(\.PNG)|(\.BMP)))/';

    preg_match_all ($match_str,$str,$out,PREG_PATTERN_ORDER);

    return $out;

}?>