Web scraping in PHP with simple_html_dom

var13966498715

Article link: https://blog.csdn.net/var13966498715/article/details/47278311

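When scraping website content with PHP, simple_html_dom shortens the development cycle and saves you from writing a lot of regular expressions. This post shows how to use the class.
1. Download the package from the web and include it in your script.
2. The code: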

<?php
require 'simple_html_dom.php';
class basketballList{
    private $_url = 'http://cnba.cc/'; // URL of the page to scrape
    private $html = ''; // simple_html_dom object for the fetched page
    /**
     * Fetch the page and collect the data
     */
    public function getData(){
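        $this->html = file_get_html($this->_url); // load the page with the library's file_get_html()
        $this->html || die('Could not fetch the page content.');
        // Select the table with id="mytable" inside the element with id="tabcontent1" and take the first match (index 0).
        // simple_html_dom selectors resemble jQuery: "#" selects by id, "." selects by class, and a space steps down to a descendant element.
        $contents = $this->html->find("#tabcontent1 table#mytable",0);
        $contents || die('Could not find the target table.');
        $currentYear = date('Y');
        $html = $contents->find('tr'); // array of every <tr> in the table, looped over below
        $num = count($html)-1;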
        foreach($html as $key=>$val) {
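            // Skip the first and the last row; neither holds data to collect
            if($key > 0 && $key<$num){
                $data = array(); // reset per row so values from the previous row do not carry over
                $gameTime = $val->find('th',0); // first <th> in the row; the index 0 is required, otherwise find() returns an array and innertext below cannot be used
                $gameTime = $gameTime->innertext; // everything inside that <th>
                $data['gameTime'] = $currentYear.'-'.str_replace('月','-',str_replace('日','',$gameTime)); // turn "月" (month) and "日" (day) into a date string strtotime() can parse
                $data['title'] = $data['gameTime']; // match title
                $data['gameTime'] = strtotime($data['gameTime']);
                $videoList[1] = array(); // highlights; may be absent, but the lists are merged below, so declare them now (same for the next line)
                $videoList[2] = array();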
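                foreach($val->find('td') as $k=>$v){
                    if($k < 3){
                        $data['title'] .= ' '.strip_tags(trim($v->innertext)); // the first three cells are appended to the title
                    } elseif($k == 4){
                        $aObj = $v->find('a',0);
                        // Assumption: the published listing never fills $videoList, so every row would be skipped below; storing the link's href is a guess at the intent
                        if($aObj) $videoList[1][] = $aObj->href;
                    } elseif($k == 5){
                        $aObj = $v->find('a',0);
                        if($aObj){ // the cell may not contain a link
                            $href = $aObj->href;
                            $data['title_md5'] = md5($this->_url.$href);
                            $videoList[2][] = $href; // assumption, as above
                        }
                    }
                }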
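                $data['videoList'] = array_merge($videoList[1],$videoList[2]);
                if(empty($data['videoList'])) continue; // skip rows without any video links
                $this->_insertOrUpdateMysql($data); // this method is not shown in the article; it is expected to insert or update the row in MySQL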
            }
        }
        echo 'collect ok';
    }


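    /*
     * Release the parsed document to free memory
     */
    public function __destruct(){
        if(is_object($this->html)) $this->html->clear(); // clear() breaks the DOM's circular references so the memory can be reclaimed
        unset($this->html);
    }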
}
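
The date handling in getData() relies on two nested str_replace() calls, which quietly produce a bad string when the cell holds anything other than a plain "8月5日" value. A minimal sketch of a stricter variant follows. The function name normalizeGameDate() is hypothetical (it is not part of the class above or of simple_html_dom), and it assumes PHP 7.1+ and a UTF-8 source file.

```php
<?php
// Hypothetical helper, not part of the original class: converts a date
// cell such as "8月5日" into "YYYY-MM-DD" for the given year.
function normalizeGameDate(string $cellText, int $year): ?string
{
    // Strip any markup and surrounding whitespace left over from innertext
    $text = trim(strip_tags($cellText));

    // Capture the month and day numbers that precede "月" and "日"
    if (!preg_match('/(\d{1,2})月(\d{1,2})日/u', $text, $m)) {
        return null; // not a recognisable date
    }

    $month = (int) $m[1];
    $day   = (int) $m[2];

    // Reject impossible dates such as 2月30日
    if (!checkdate($month, $day, $year)) {
        return null;
    }

    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}

echo normalizeGameDate('<b>8月5日</b>', 2015), "\n"; // 2015-08-05
var_dump(normalizeGameDate('待定', 2015));           // NULL
```

Inside the loop, the result could be passed to strtotime() as before, and a null return could be used to skip the row instead of storing a wrong timestamp.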