Parsing Web Pages with PHP Simple HTML DOM Parser

Gervinho to Arsenal

For those of you who have had the pleasure of following me on Twitter (...), you probably know that I'm a complete soccer (football) fanatic.  I even started a separate Twitter account to voice my footy musings.  If you follow football yourself, you'll know that we've just started the international transfer window and there are a billion rumors about a billion players going to a billion clubs.  It's enough to drive you mad but I simply HAVE TO KNOW who will be in the Arsenal and Liverpool first teams next season.

The problem I run into, besides all of the rubbish reports making waves, is that I don't have time to check every website on the hour.  Twitter is a big help, but there's nothing better during this time than an official report from each club's website.  To keep an eye on those reports, I'm using the power of PHP Simple HTML DOM Parser to write a tiny PHP script that shoots me an email whenever a specific page is updated.

PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is a dream utility for developers who work with both PHP and the DOM, because it lets them find DOM elements from PHP using CSS-style selectors. Here are a few sample uses of PHP Simple HTML DOM Parser:


// Include the library
include('simple_html_dom.php');
 
// Retrieve the DOM from a given URL
$html = file_get_html('https://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e) 
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';
    
// Extract all text from a given cell
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';


Like I said earlier, this library is a dream for finding elements, much as the early JavaScript frameworks and selector engines were. Armed with the ability to pick content from DOM nodes with PHP, it's time to analyze websites for changes.

The Script

The following script checks two websites for changes:


// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
					// "selector" is the CSS selector for the block of the page to watch
					array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
					array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
				);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
	$url = $site["url"];
	
	// Calculate the cached page's file name and reset the per-page state
	$fileName = md5($url);
	$oldContent = "";
	$currentContent = "";
	
	// Get the URL's current page content; skip this page if the request fails
	$html = file_get_html($url);
	if(!$html) {
		continue;
	}
	
	// Find content by querying with a selector, just like a selector engine!
	// (if several elements match, the last one wins)
	foreach($html->find($site["selector"]) as $element) {
		$currentContent = $element->plaintext;
	}
	
	// If a cached file exists
	if(file_exists($savePath.$fileName)) {
		// Retrieve the old content
		$oldContent = file_get_contents($savePath.$fileName);
	}
	
	// If different, notify!
	if($oldContent && $currentContent != $oldContent) {
		// Here's where we can do a whoooooooooooooole lotta stuff
		// We could tweet to an address
		// We can send a simple email
		// We can text ourselves
		
		// Append to the digest so every changed page is listed in one email
		$emailContent .= "David, the following page has changed!\n\n".$url."\n\n";
	}
	
	// Save new content
	file_put_contents($savePath.$fileName,$currentContent);
}

// Send the email if there's content!
if($emailContent) {
	// Sendmail!
	mail("david@davidwalsh.name","Sites Have Changed!",$emailContent,"From: alerts@davidwalsh.name");
	// Debug
	echo $emailContent;
}


The code and comments are self-explanatory.  I've set the script up so that I get one "digest" alert if several of the pages change.  The script is the hard part; to put it to work, I've set up a cron job that runs it every 20 minutes.
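
The compare-and-cache step at the heart of the script can also be pulled out into a small function, which makes it easier to test without hitting a live site. What follows is only a sketch under a few assumptions: the function name `contentChanged`, the temporary cache folder, and the example.com URL are all made up for illustration and do not appear in the script above.

```php
<?php
// Hypothetical helper, not part of the original script: compares freshly
// scraped text with the cached copy for a URL, then refreshes the cache.
// Returns true only when an earlier copy existed and differs from the new one.
function contentChanged(string $cacheDir, string $url, string $currentContent): bool
{
    // Create the cache folder on first run
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }

    // Same naming scheme as the script: one cache file per URL, keyed by md5
    $cacheFile = rtrim($cacheDir, '/') . '/' . md5($url);

    // null means "never seen before", so the first run stays silent
    $oldContent = is_file($cacheFile) ? file_get_contents($cacheFile) : null;

    file_put_contents($cacheFile, $currentContent);

    return $oldContent !== null && $oldContent !== $currentContent;
}

// Demo with made-up content; a real run would pass the scraped plaintext
$cacheDir = sys_get_temp_dir() . '/cachedPages-' . uniqid();
$url = 'http://www.example.com/news'; // placeholder URL

contentChanged($cacheDir, $url, 'Old headline');          // first run: nothing to compare
if (contentChanged($cacheDir, $url, 'New headline')) {    // second run: text differs
    echo "Changed: " . $url . "\n";
}
```

One difference from the script is worth noting: this sketch treats an empty cached file as a valid earlier copy, whereas the script's `$oldContent &&` check treats it as "no cache yet". Which behavior you want depends on whether an empty block is meaningful on the page you're watching.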

This solution isn't specific to just spying on footy -- you could use this type of script on any number of sites.  This script, however, is a bit simplistic for some cases.  If you wanted to spy on a website that had extremely dynamic code (e.g. a timestamp in the markup), you would want to create a regular expression that isolates the content to just the block you're looking for. Since each website is constructed differently, I'll leave it up to you to create page-specific isolators. Have fun spying on websites though...and be sure to let me know if you hear a good, reliable footy rumor!
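
As a rough illustration of dealing with dynamic content, here is one variation on that idea: instead of isolating the stable block, it strips the volatile fragments out of the text before the comparison. The function name `normalizeContent` and both patterns are assumptions made for this sketch; the right patterns depend entirely on the page being watched.

```php
<?php
// Hypothetical helper: removes fragments that change on every request so
// they don't trigger false alerts. The patterns are examples only.
function normalizeContent(string $text): string
{
    $patterns = [
        '/\b\d{1,2}:\d{2}(:\d{2})?\b/', // clock times such as 14:05 or 14:05:33
        '/\b\d{4}-\d{2}-\d{2}\b/',      // ISO-style dates such as 2011-06-20
    ];
    $text = preg_replace($patterns, '', $text);

    // Collapse runs of whitespace so layout changes alone are ignored
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Two snapshots that differ only by their timestamp compare as equal
$monday  = "Updated 2011-06-20 14:05:33\n  Gervinho signs";
$tuesday = "Updated 2011-06-21 09:12:01 Gervinho signs";

var_dump(normalizeContent($monday) === normalizeContent($tuesday));
```

In the script, the natural place for such a call would be right after the scraped text is assigned to `$currentContent`, so that both the cached copy and the fresh copy go through the same cleanup.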

Translated from: https://davidwalsh.name/php-notifications
