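This link-collecting function is extracted from Snoopy, and it is a handy one. It handles a collected link whether it is relative or absolute: for a relative link, it builds the absolute URL from the link and the base URL of the page. A base URL with a non-default port is supported as well.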
<?php
/*===================================================================*
    Function:   _expandlinks
    Purpose:    expand each link into a fully qualified URL
    Input:      $links          the links to qualify
                $URI            the full URI to get the base from
    Output:     $expandedLinks  the expanded links
*===================================================================*/
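function _expandlinks($links,$URI)
{
    // Host of the base URI, used to recognize absolute links on the same host
    $URI_PARTS = parse_url($URI);
    $host = $URI_PARTS["host"];

    // Base path: drop the query string, a trailing file name (name.ext) and a trailing slash
    preg_match("/^[^?]+/",$URI,$match);
    $match = preg_replace("|/[^/.]+\.[^/.]+$|","",$match[0]);
    $match = preg_replace("|/$|","",$match);
    $match_part = parse_url($match);
    $match_root = $match_part["scheme"]."://".$match_part["host"];

    // Patterns are applied in order; the dots are escaped so they match literally
    $search = array( "|^http://".preg_quote($host)."|i",  // absolute link on the same host
                     "|^(/)|i",                           // root-relative link
                     "|^(?!http://)(?!mailto:)|i",        // any other relative link
                     "|/\./|",                            // "/./" segment
                     "|/[^/]+/\.\./|"                     // "/dir/../" segment
                   );
    $replace = array( "",
                      $match_root."/",
                      $match."/",
                      "/",
                      "/"
                    );

    $expandedLinks = preg_replace($search,$replace,$links);
    return $expandedLinks;
}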
// Test cases
$r = _expandlinks('asd/asd.html','http://www.361way.com/');
echo $r;
// output: http://www.361way.com/asd/asd.html
echo "\n";
$r = _expandlinks('http://www.361way.com/asd.html','http://www.361way.com/');
echo $r;
// output: http://www.361way.com/asd.html
echo "\n";
$r = _expandlinks('asd.html','http://www.361way.com:8080/');
echo $r;
// output: http://www.361way.com:8080/asd.html
?>
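The test calls above only cover plain relative and absolute links. As a minimal sketch of the remaining rules, the block below uses a standalone copy of the function to try a link containing "./" and "../" segments, a root-relative link, and an array of links. The name expandlinks_sketch, the base URL path /news/index.html and the sample links are assumptions made for illustration; they are not part of Snoopy.

```php
<?php
// Hypothetical standalone copy of the function, with the regex dots escaped.
function expandlinks_sketch($links, $URI)
{
    $host = parse_url($URI, PHP_URL_HOST);
    preg_match('/^[^?]+/', $URI, $m);                        // drop the query string
    $match = preg_replace('|/[^/.]+\.[^/.]+$|', '', $m[0]);  // drop a trailing file name
    $match = preg_replace('|/$|', '', $match);               // drop a trailing slash
    $part = parse_url($match);
    $match_root = $part['scheme'] . '://' . $part['host'];

    $search = array(
        '|^http://' . preg_quote($host) . '|i', // 1. strip scheme and host of same-host links
        '|^(/)|i',                              // 2. root-relative link
        '|^(?!http://)(?!mailto:)|i',           // 3. any other relative link
        '|/\./|',                               // 4. collapse "/./"
        '|/[^/]+/\.\./|'                        // 5. collapse "/dir/../"
    );
    $replace = array('', $match_root . '/', $match . '/', '/', '/');

    // preg_replace applies the patterns in order; an array subject yields an array.
    return preg_replace($search, $replace, $links);
}

$base = 'http://www.361way.com/news/index.html';

echo expandlinks_sketch('./img/../logo.png', $base), "\n";
// http://www.361way.com/news/logo.png

$many = expandlinks_sketch(array('/about.html', 'a.html', 'mailto:me@example.com'), $base);
print_r($many);
// [0] => http://www.361way.com/about.html
// [1] => http://www.361way.com/news/a.html
// [2] => mailto:me@example.com
```

Because the third pattern only excludes http:// and mailto:, a link starting with https:// would be treated as relative by this logic, so it is worth checking the schemes of the collected links before relying on it.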
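The tests show that the first parameter, $links, is the URL of the collected link, and the second, $URI, is the URL the base is taken from. For example, suppose a link collected from a site is a relative one such as asd.html (with the anchor text "测试") and the main site is http://www.test.com/. The function resolves the relative path against the site and returns the absolute URL http://www.test.com/asd.html.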
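How the relative path is resolved depends on the base that the function derives from $URI. The sketch below isolates just that derivation step so the two bases can be inspected. The helper name base_parts_sketch and the sample URL are assumptions made for illustration.

```php
<?php
// Hypothetical helper: isolates the base-derivation steps of _expandlinks.
function base_parts_sketch($URI)
{
    // Keep everything before the first "?" (drops the query string).
    preg_match('/^[^?]+/', $URI, $m);
    // Drop a trailing file name of the form "/name.ext", if present.
    $match = preg_replace('|/[^/.]+\.[^/.]+$|', '', $m[0]);
    // Drop a trailing slash.
    $match = preg_replace('|/$|', '', $match);
    $parts = parse_url($match);
    // Only scheme and host are used here, so a port is not carried over.
    $match_root = $parts['scheme'] . '://' . $parts['host'];
    return array('match' => $match, 'match_root' => $match_root);
}

$b = base_parts_sketch('http://www.361way.com:8080/news/index.php?id=3');
echo $b['match'], "\n";       // http://www.361way.com:8080/news
echo $b['match_root'], "\n";  // http://www.361way.com
```

With this base, a relative link such as asd.html is prefixed with $match and keeps the port, while a root-relative link such as /asd.html is prefixed with $match_root and loses it. If root-relative links on a non-default port matter, the port reported by parse_url would need to be appended to $match_root.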