Parsing Weibo HTML content: scraping Weibo hot search terms and working with HTML data using simple_html_dom

Latest recommended article published 2022-10-12 20:15:00

weixin_39882623


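Parsing an HTML document tree in PHP has long been awkward. Simple HTML DOM parser handles the problem well: this PHP class parses an HTML document and lets you work with the elements inside it (requires PHP 5 or later).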

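The parser does more than help validate HTML documents; it can also parse HTML that does not conform to W3C standards. It uses jQuery-like selectors to locate elements by id, class, tag name and so on, and it can also add, remove and modify nodes in the document tree. If you know jQuery, the API will feel familiar.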

The class can load a document in three ways:

Load an HTML document from a URL

Load an HTML document from a string

Load an HTML document from a file

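// Create a DOM instance
$html = new simple_html_dom();

// Load from a URL ($url stands in for the address, which the original example omits)
$html->load_file($url);

// Load from a string (the markup around the text was stripped from the original and is restored here)
$html->load('<html><body>Demo: loading an HTML document from a string</body></html>');

// Load from a file
$html->load_file('path/file/test.html');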

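Finding HTML elements

Use the find function to look up elements in an HTML document. Called with a selector only, it returns an array of element objects; called with a selector and an index, it returns a single element object. These objects are then accessed through the methods and properties of the HTML DOM parser classes. A few examples: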

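// Find all anchor (hyperlink) elements in the document; returns an array
$a = $html->find('a');

// Find the (N)th anchor, zero-based; returns the element object, or null if it is not found
$a = $html->find('a', 0);

// Find the div element whose id is "main"
$main = $html->find('div[id=main]', 0);

// Find all div elements that have an id attribute
$divs = $html->find('div[id]');

// Find all elements that have an id attribute
$elements = $html->find('[id]');

// Find the element whose id is "container"
$ret = $html->find('#container');

// Find all elements with class "foo"
$ret = $html->find('.foo');

// Find several tag types at once
$ret = $html->find('a, img');

// Tag and attribute selectors can be combined
$ret = $html->find('a[title], img[title]');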

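// In the calls below, $e is an element object returned by find()

// Return the parent element
$e->parent();

// Return an array of child elements
$e->children();

// Return the child element at the given index
$e->children(0);

// Return the first child element
$e->first_child();

// Return the last child element
$e->last_child();

// Return the previous sibling element
$e->prev_sibling();

// Return the next sibling element
$e->next_sibling();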
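The lookup and traversal calls above can be combined roughly as follows. This is a minimal sketch, not code from the original article: the sample markup is made up, and it assumes `simple_html_dom.php` sits next to the script and that the library's `str_get_html()` helper (used later in this article) is available.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once 'simple_html_dom.php';

// Hypothetical sample markup, used only for illustration.
$markup = '<div id="main"><ul>'
        . '<li class="item"><a href="/a" title="First">Alpha</a></li>'
        . '<li class="item"><a href="/b">Beta</a></li>'
        . '</ul></div>';

$html = str_get_html($markup);

// Collect href => link text for every anchor inside an li.item element.
$links = array();
foreach ($html->find('li.item a') as $anchor) {
    $links[$anchor->href] = $anchor->plaintext;
}

// Passing an index returns a single element (or null) instead of an array.
$first     = $html->find('a[title]', 0);
$parentTag = $first->parent()->tag;            // tag name of the enclosing element
$nextItem  = $first->parent()->next_sibling(); // the following <li>
$missing   = $html->find('table', 0);          // no table in the sample markup

// Elements can also be modified in place.
$first->innertext = 'Changed';
$updated = $first->outertext;

// Release the parser's internal references once finished.
$html->clear();
```

Checking the result of an indexed `find()` for null before chaining further calls avoids a fatal error when the page layout changes.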

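Attribute selectors

Attribute selectors support a small set of matching operators, in the spirit of simple regular expressions.

[attribute] - selects elements that have the given attribute

[attribute=value] - selects elements whose attribute equals the given value

[attribute!=value] - selects elements whose attribute does not equal the given value

[attribute^=value] - selects elements whose attribute value starts with the given value

[attribute$=value] - selects elements whose attribute value ends with the given value

[attribute*=value] - selects elements whose attribute value contains the given value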
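To make the meaning of each operator concrete, here is a small self-contained sketch in plain PHP. The function `attr_matches()` is a hypothetical helper written for this illustration; it is not part of simple_html_dom and makes no claim about how the library implements matching internally. It assumes an attribute must be present for every operator to match.

```php
<?php
/**
 * Illustrative only: returns whether an attribute value satisfies one of the
 * selector operators listed above. $actual is null when the attribute is absent.
 * An empty $op stands for the bare [attribute] form.
 */
function attr_matches(?string $actual, string $op, string $expected = ''): bool
{
    if ($actual === null) {
        return false;                       // assumption: a missing attribute never matches
    }
    switch ($op) {
        case '':                            // [attribute]
            return true;
        case '=':                           // [attribute=value]
            return $actual === $expected;
        case '!=':                          // [attribute!=value]
            return $actual !== $expected;
        case '^=':                          // [attribute^=value]
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=':                          // [attribute$=value]
            return $expected === ''
                || substr($actual, -strlen($expected)) === $expected;
        case '*=':                          // [attribute*=value]
            return $expected === ''
                || strpos($actual, $expected) !== false;
        default:
            return false;                   // unknown operator
    }
}

// Usage with a made-up attribute value.
$startsWithMain = attr_matches('main-nav', '^=', 'main');
$endsWithNav    = attr_matches('main-nav', '$=', 'nav');
$containsDash   = attr_matches('main-nav', '*=', '-');
$absent         = attr_matches(null, '!=', 'x');
```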

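How to keep the parser from using too much memory

The Simple HTML DOM parser can sometimes consume a lot of memory. If a PHP script uses too much memory, the site may stop responding, among other serious problems. The fix is simple: once the parser has loaded the HTML document and you have finished with it, remember to clear the object.

$html->clear();
unset($html);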
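The need for an explicit clear() is commonly attributed to circular references between parent and child nodes, which plain reference counting cannot free. The following self-contained sketch illustrates that idea only: `TreeNode` and `build_tree()` are hypothetical names, not the library's real classes, and `WeakReference` requires PHP 7.4 or later.

```php
<?php
// Hypothetical node class for illustration; not simple_html_dom's own node type.
class TreeNode
{
    public $parent = null;
    public $children = array();

    public function add(TreeNode $child): void
    {
        $child->parent = $this;      // child points back to its parent ...
        $this->children[] = $child;  // ... and the parent points to the child: a cycle
    }

    // Break every parent/child link so reference counting alone can free the nodes.
    public function clear(): void
    {
        foreach ($this->children as $child) {
            $child->clear();
        }
        $this->children = array();
        $this->parent = null;
    }
}

function build_tree(): TreeNode
{
    $root = new TreeNode();
    $root->add(new TreeNode());
    return $root;
}

// Without clear(): the cycle keeps the tree alive after unset(),
// until PHP's cycle collector eventually runs.
$leaky = build_tree();
$leakyRef = WeakReference::create($leaky);
unset($leaky);
$stillAlive = $leakyRef->get() !== null;

// With clear(): the links are broken first, so unset() frees the tree at once.
$tidy = build_tree();
$tidyRef = WeakReference::create($tidy);
$tidy->clear();
unset($tidy);
$freed = $tidyRef->get() === null;
```

In a long-running script that parses many pages, releasing each document as soon as it is no longer needed keeps peak memory lower than waiting for the collector.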

Below is the source code of an example that scrapes Weibo hot search terms.

<?php
header('Content-Type:text/html;charset=gbk');
include "simple_html_dom.php";

// Thin wrapper around the Memcache extension
class Tmemcache {
    protected $memcache;

    // $cluster['memcached'] is a list of array('host' => ..., 'port' => ...) entries
    function __construct($cluster) {
        $this->memcache = new Memcache;
        foreach ($cluster['memcached'] as $server) {
            $this->memcache->addServer($server['host'], $server['port']);
        }
    }

    function fetch($cache_key) {
        return $this->memcache->get($cache_key);
    }

    function store($cache_key, $val, $expire = 7200) {
        $this->memcache->set($cache_key, $val, MEMCACHE_COMPRESSED, $expire);
    }

    function flush() {
        $this->memcache->flush();
    }

    function delete($cache_key, $timeout = 0) {
        $this->memcache->delete($cache_key, $timeout);
    }
}

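// Decode \uXXXX escape sequences in $name and convert the result to GBK.
// Returns null when $name cannot be decoded.
function unicode_hex_2_gbk($name) {
    // Wrap the text in a JSON object so that json_decode() expands the escapes
    $a = json_decode('{"a":"' . $name . '"}');
    if (isset($a) && is_object($a)) {
        return iconv('UTF-8', 'GBK//IGNORE', $a->a);
    }
    return null;
}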

// Fetch $url with cURL; returns the response body, or false on error.
// $time is the timeout in seconds.
function curl_fetch($url, $time = 3) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, $time);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    $errno = curl_errno($ch);
    if ($errno > 0) {
        $err = "[CURL] url:{$url} ; errno:{$errno} ; info:" . curl_error($ch) . ";";
        echo $err;
        $data = false;
    }
    curl_close($ch);
    return $data;
}

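$cluster["memcached"] = array(
    array("host" => "10.11.1.1", "port" => 11211),
);

// The original listing never assigns $url; set it to the address of the page to fetch.
$url = '';

//$memcache = new Tmemcache($cluster);
$cache_key = md5("weibo" . $url);

//$str = $memcache->fetch($cache_key);
//if (!isset($_GET["nocache"]) && !empty($str)) {
//    echo $str;
//    exit;
//}

$content = curl_fetch($url);
if ($content === false)
    exit;

$html = str_get_html($content);

// The markup of interest is embedded, escaped, in the ninth <script> element of the page
$script = $html->find('script', 8);
if ($script === null)
    exit;

// Test: undo the escaping of quotes and slashes, and drop the literal \n and \t sequences
$a = str_replace(array('\\"', '\\/', "\\n", "\\t"), array('"', '/', "", ""), $script->outertext);
$html->clear();

// Keep everything from the start of the embedded markup onwards.
// The search string (an HTML tag) was lost from the original listing and is left empty here.
$pos = strpos($a, '');
$a = substr($a, $pos);

//echo $a; // debug output

$html = str_get_html($a);
$arr = array();

$table = $html->find('table[id=event]', 0);
if ($table !== null) {
    foreach ($table->find('.rank_content') as $element) {
        $arr[] = unicode_hex_2_gbk($element->find("a", 0)->plaintext);
    }
}
$html->clear();

$str = implode(",", $arr);

//if (!isset($_GET["nocache"]))
//    $memcache->store($cache_key, $str, 3600);

echo $str;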
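The two text-processing steps in the script, unescaping markup embedded in a script block and expanding \uXXXX sequences, can be tried in isolation. The sketch below is self-contained and uses only built-in PHP functions. The names `unescape_embedded_html()` and `decode_unicode_escapes()` and the sample fragment are made up for this illustration, and unlike the script above it leaves the result in UTF-8 rather than converting to GBK.

```php
<?php
// Undo the escaping applied to markup embedded in a script string:
// \" becomes ", \/ becomes /, and the literal two-character sequences \n and \t are dropped.
function unescape_embedded_html(string $raw): string
{
    return str_replace(
        array('\\"', '\\/', '\\n', '\\t'),
        array('"', '/', '', ''),
        $raw
    );
}

// Expand \uXXXX escapes by decoding the text as a JSON string.
// Returns null when the text is not valid inside a JSON string
// (for example when it contains raw control characters).
function decode_unicode_escapes(string $text): ?string
{
    // Escape bare double quotes so the text can be wrapped in quotes;
    // existing sequences such as \u5fae are left untouched.
    $json = '"' . str_replace('"', '\\"', $text) . '"';
    $decoded = json_decode($json);
    return is_string($decoded) ? $decoded : null;
}

// Hypothetical fragment in the shape the script expects.
$raw      = '<a href=\"\/hot\">\u5fae\u535a<\/a>\n';
$fragment = unescape_embedded_html($raw);
$label    = decode_unicode_escapes(strip_tags($fragment));
```

Decoding a bare JSON string avoids building a wrapper object, and the `is_string()` check turns any decoding failure into null instead of a warning later on.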
