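Sitemap (site map)
1. What is a sitemap for, and why use one?
A sitemap is a list of all the URLs on a website. The list should be updated automatically from time to time so that third-party search engines and feed readers that consume the sitemap can discover new URLs on your site promptly. Maintaining a sitemap is one of the first tasks in SEO: webmasters should regularly submit the updated URL list (the sitemap) to search engines so that they get complete information about the site's URLs and its latest changes. A sitemap therefore matters a great deal to a website, and updating it regularly is just as essential. Some sites add a lot of content but leave their sitemap stale, which makes it hard for software that relies on the sitemap to find the new URLs quickly.
Put simply, a sitemap is a page that collects all of a site's links, which helps Baidu and Google crawl and index the site.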
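To make the idea concrete, here is a minimal sketch of what "a list of all URLs" looks like once written out in the sitemap XML format. It uses PHP's XMLWriter in memory; the function name buildSitemap() and the example URLs are assumptions for illustration, not part of the script described in this post.

```php
<?php
// Minimal sketch: build a sitemap string in memory with XMLWriter.
// buildSitemap() and the URLs below are hypothetical, for illustration only.
function buildSitemap(array $urls): string
{
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('urlset');
    $writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach ($urls as $url) {
        $writer->startElement('url');
        $writer->writeElement('loc', $url); // <loc> is the only required child of <url>
        $writer->endElement();              // </url>
    }
    $writer->endElement();                  // </urlset>
    $writer->endDocument();
    return $writer->outputMemory();
}

$xml = buildSitemap(['http://www.example.com/', 'http://www.example.com/about']);
echo $xml;
```

Only `<loc>` is written here to keep the sketch short; `<lastmod>`, `<changefreq>` and `<priority>` are optional children of each `<url>` entry.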
2. File layout:
Configuration file - config/config.ini.php
Main sitemap file - SiteMap.class.php
3. Main file code
Main file code (SiteMap.class.php)
<?php
/**
* the script's main function is to help us to generate the target web's sitemap.xml file
*
* @category sitemap
* @version 1.0
*/
namespace Maweibinguo\SiteMap;
class SiteMap
{
const SCHEMA = 'http://www.sitemaps.org/schemas/sitemap/0.9';
/**
* @var webUrlList
* @access public
*/
public $webUrlList = array();
/**
* @var siteMapList
* @access public
*/
public $siteMapList = array();
/**
* @var isUseCookie
* @access public
*/
public $isUseCookie = false;
/**
* @var cookieFilePath
* @access public
*/
public $cookieFilePath = '';
/**
* @var xmlWriter
* @access private
*/
private $_xmlWriter = '';
/**
* init basic config
*
* @access public
*/
public function __construct()
{
$this->_xmlWriter = new \XMLWriter();
$this->_environmentTest();
}
/**
* make sure the script is running from the command line
*
* @access private
*/
private function _environmentTest()
{
$sapiType = \php_sapi_name();
if( strtolower($sapiType) != 'cli' ) {
echo 'The script must be run from the command line', "\r\n";
exit();
}
}
/**
* load the configValue for generating sitemap by configname
*
* @param string $configName
* @return string $configValue
* @access public
*/
public function loadConfig($configName)
{
/* init return value */
$configValue = '';
/* load config value */
$configPath = __DIR__ . '/config/config.ini.php';
if(file_exists( $configPath )) {
require $configPath;
} else {
echo "Can not find config file", "\r\n";
exit();
}
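/* return false when the requested name is not defined in the config file, so callers can test for === false */
$configValue = isset($$configName) ? $$configName : false;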
/* return config value */
return $configValue;
}
/**
* generate sitemap.xml for the web
*
* @param array $siteMapList
* @access public
*/
public function generateSiteMapXml($siteMapList)
{
/* init return result */
$result = false;
/* check the parameter */
if( !is_array($siteMapList) || count($siteMapList) <= 0 ) {
echo 'The SiteMap Content Is Empty',"\r\n";
exit();
}
/* make sure the target file exists and is writable */
$siteMapPath = $this->loadConfig('SITEMAPPATH');
if(!file_exists($siteMapPath)) {
/* use PHP's touch() rather than shelling out with an unescaped path */
touch($siteMapPath);
}
if( !is_writable($siteMapPath) ) {
echo "The sitemap file {$siteMapPath} is not writable","\r\n";
exit();
}
$this->_xmlWriter->openURI($siteMapPath);
$this->_xmlWriter->startDocument('1.0', 'UTF-8');
$this->_xmlWriter->setIndent(true);
$this->_xmlWriter->startElement('urlset');
$this->_xmlWriter->writeAttribute('xmlns', self::SCHEMA);
foreach($siteMapList as $siteMapKey => $siteMapItem) {
$this->_xmlWriter->startElement('url');
$this->_xmlWriter->writeElement('loc',$siteMapItem['Url']);
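/* note: <title> is not defined by the sitemaps.org 0.9 schema; it is kept here as in the original output */
$this->_xmlWriter->writeElement('title',$siteMapItem['Title']);
/* the sitemap protocol defines changefreq values in lowercase (always, hourly, daily, ...) */
$changefreq = !empty($siteMapItem['ChangeFreq']) ? strtolower($siteMapItem['ChangeFreq']) : 'daily';
$this->_xmlWriter->writeElement('changefreq',$changefreq);
$priority = !empty($siteMapItem['Priority']) ? $siteMapItem['Priority'] : 0.5;
$this->_xmlWriter->writeElement('priority',(string)$priority);
$this->_xmlWriter->writeElement('lastmod',date('Y-m-d'));
$this->_xmlWriter->endElement();
}
$this->_xmlWriter->endElement();
/* close the document and write the buffer to the file */
$this->_xmlWriter->endDocument();
$this->_xmlWriter->flush();
$result = true;
/* return result */
return $result;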
}
/**
* send a request to the target url and get the response
*
* @param string $url
* @return mixed $responseData page content on HTTP 200, false otherwise
* @access public
*/
public function sendRequest($url)
{
/* init return value */
$responseData = false;
/* check the parameter */
if( !filter_var($url, FILTER_VALIDATE_URL) ) {
return $responseData;
}
$connectTimeOut = $this->loadConfig('CURLOPT_CONNECTTIMEOUT');
if( $connectTimeOut === false ) {
return $responseData;
}
$timeOut = $this->loadConfig('CURLOPT_TIMEOUT');
if( $timeOut === false ) {
return $responseData;
}
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLOPT_HEADER, false);
curl_setopt($handle, CURLOPT_AUTOREFERER, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER , true);
curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, $connectTimeOut);
curl_setopt($handle, CURLOPT_TIMEOUT, $timeOut);
curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 5.01; Windows NT 5.0)" );
$headersItem = array( 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Connection: Keep-Alive' );
curl_setopt($handle, CURLOPT_HTTPHEADER, $headersItem);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, 1);
$cookieList = $this->loadConfig('COOKIELIST');
$isUseCookie = $cookieList['IsUseCookie'];
$cookieFilePath = $cookieList['CookiePath'];
if($isUseCookie) {
if(!file_exists($cookieFilePath)) {
/* create the cookie file with PHP's touch() instead of a shell command */
touch($cookieFilePath);
}
curl_setopt($handle, CURLOPT_COOKIEFILE, $cookieFilePath);
curl_setopt($handle, CURLOPT_COOKIEJAR, $cookieFilePath);
}
$responseData = curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode != 200) {
$responseData = false;
}
curl_close($handle);
/* return response data */
return $responseData;
}
/**
* get the sitemap content of the url, it contains url, title, priority, changefreq
*
* @param string $url
* @access public
*/
public function generateSiteMapList($url)
{
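/* remember the current url so that a link back to it is not crawled a second time */
if( !in_array($url, $this->webUrlList) ) {
$this->webUrlList[] = $url;
}
$content = $this->sendRequest($url);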
if($content !== false) {
$tagsList = $this->_parseContent($content, $url);
$urlItem = $tagsList['UrlItem'];
$title = $tagsList['Title'];
$siteMapItem = array( 'Url' => trim($url),
'Title' => trim($title) );
$priority = $this->_calculatePriority($siteMapItem['Url']);
$siteMapItem['Priority'] = $priority;
$changefreq = $this->_calculateChangefreq($siteMapItem['Url']);
$siteMapItem['ChangeFreq'] = $changefreq;
$this->siteMapList[] = $siteMapItem;
foreach($urlItem as $nextUrl) {
if( !in_array($nextUrl, $this->webUrlList) ) {
$skipUrlList = $this->loadConfig('SKIP_URLLIST');
foreach($skipUrlList as $keyWords) {
if( stripos($nextUrl, $keyWords) !== false ) {
continue 2;
}
}
$this->webUrlList[] = $nextUrl;
echo $nextUrl,"\r\n";
$this->generateSiteMapList($nextUrl);
}
}
}
}
/**
* get sitemaplist of the web
*
* @access public
* @return array $siteMapList
*/
public function getSiteMapList()
{
return $this->siteMapList;
}
/**
* calculate the priority of the targeturl
*
* @param string $targetUrl
* @return float $priority
* @access private
*/
private function _calculatePriority($targetUrl)
{
/* init priority */
$priority = 0.5;
/* calculate the priority */
if( filter_var($targetUrl, FILTER_VALIDATE_URL) ) {
$priorityList = $this->loadConfig('PRIORITYLIST');
foreach($priorityList as $priorityKey => $priorityValue) {
if(stripos($targetUrl, $priorityKey) !== false) {
$priority = $priorityValue;
break;
}
}
}
/* return priority */
return $priority;
}
/**
* calculate the changefreq of the targeturl
*
* @param string $targetUrl
* @return string $changefreq
* @access private
*/
private function _calculateChangefreq($targetUrl)
{
/* init changefreq (the sitemap protocol uses lowercase values) */
$changefreq = 'daily';
/* calculate the changefreq */
if( filter_var($targetUrl, FILTER_VALIDATE_URL) ) {
$changefreqList = $this->loadConfig('CHANGEFREQLIST');
foreach($changefreqList as $changefreqKey => $changefreqValue) {
if(stripos($targetUrl, $changefreqKey) !== false) {
$changefreq = $changefreqValue;
break;
}
}
}
/* return changefreq */
return $changefreq;
}
/**
* format url
*
* @param string $url
* @param string $originUrl
* @access private
* @return string $formatUrl
*/
private function _formatUrl($url, $originUrl)
{
/* init url */
$formatUrl = '';
/* format url */
if( !empty($url) && !empty($originUrl) ) {
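$badUrlItem = array( '\\',
'/' ,
'javascript',
'javascript:;',
'' );
/* the regex in _parseContent captures the surrounding quotes, so strip them first */
$formatUrl = trim($url);
$formatUrl = trim($formatUrl, '\'"');
$formatUrl = trim($formatUrl);
/* drop the fragment (#...), since it points into the same page */
$hashPos = strpos($formatUrl, '#');
if( $hashPos !== false ) {
$formatUrl = substr($formatUrl, 0, $hashPos);
}
/* treat the url as absolute only when it starts with http */
$isAbsolute = ( stripos($formatUrl, 'http') === 0 );
if( !$isAbsolute && !in_array($formatUrl, $badUrlItem) ) {
if(strpos($formatUrl, '/') === 0) {
$domainName = $this->loadConfig('DOMAIN_NAME');
/* DOMAIN_NAME ends with a slash, so only the leading slash is removed */
$formatUrl = $domainName . ltrim($formatUrl, '/');
} else {
$formatUrl = substr( $originUrl, 0, strrpos($originUrl, '/') ) .'/'. $formatUrl;
}
} elseif( !$isAbsolute && in_array($formatUrl, $badUrlItem) ) {
$formatUrl = '';
}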
}
/* return url */
return $formatUrl;
}
/**
* check that the url belongs to the configured domain
*
* @param string $url
* @return bool $result
* @access private
*/
private function _checkDomain($url)
{
/* init result */
$result = false;
/* check domain */
if($url) {
$domainName = $this->loadConfig('DOMAIN_NAME');
if( stripos($url, $domainName) === false ) {
return $result;
}
$result = true;
}
/* return result */
return $result;
}
/**
* parse the response content, so that we can get the urls
*
* @param string $content
* @param string $originUrl
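* @return array $tagsList
* @access public
*/
public function _parseContent($content, $originUrl)
{
/* init return data, so that pages without links or a title do not trigger undefined-variable warnings */
$tagsList = array( 'UrlItem' => array(), 'Title' => '' );
$urlItem = array();
$title = '';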
/* start parse */
if( !empty($content) && !empty($originUrl) ) {
$domainName = $this->loadConfig('DOMAIN_NAME');
/* get the attribute of href for tags <a> */
$regStrForTagA = '#<\s*a\s+href\s*=\s*(".*?"|\'.*?\')#um';
if( preg_match_all($regStrForTagA, $content, $matches) ) {
$urlItem = array_unique($matches[1]);
foreach($urlItem as $urlKey => $url) {
$formatUrl = $this->_formatUrl($url, $originUrl);
if( empty($formatUrl) ) {
unset($urlItem[$urlKey]);
continue;
}
$result = $this->_checkDomain($formatUrl);
if($result === false) {
unset($urlItem[$urlKey]);
continue;
}
$urlItem[$urlKey] = $formatUrl;
}
}
$tagsList['UrlItem'] = $urlItem;
/* get the title tags content */
$regStrForTitle = '#<\s*title\s*>(.*?)<\s*\/\s*title\s*>#um';
if( preg_match($regStrForTitle, $content, $matches) ) {
$title = $matches[1];
}
$tagsList['Title'] = $title;
}
/* return tagsList */
return $tagsList;
}
}
/* here is an example */
$startTime = microtime(true);
echo "/***********************************************************************/","\r\n";
echo "/* start to run {$startTime} */","\r\n";
echo "/***********************************************************************/","\r\n\r\n";
$siteMap = new SiteMap();
$domain = $siteMap->loadConfig('DOMAIN_NAME');
$siteMap->generateSiteMapList($domain);
$siteMapList = $siteMap->getSiteMapList();
$siteMap->generateSiteMapXml($siteMapList);
$endTime = microtime(true);
$takeTime = $endTime - $startTime;
echo "/***********************************************************************/","\r\n";
echo "/* Done, it took {$takeTime} seconds in total */","\r\n";
echo "/***********************************************************************/","\r\n";
?>
Configuration file code (config/config.ini.php)
<?php
// curl connection timeout (seconds)
$CURLOPT_CONNECTTIMEOUT = 5;
// curl total request timeout (seconds)
$CURLOPT_TIMEOUT = 10;
// domain to crawl (keep the trailing slash; the script appends paths to it)
$DOMAIN_NAME = 'http://www.example.com/';
// keywords for URLs to skip (any URL containing one of these is filtered out and not recorded)
$SKIP_URLLIST = array(
'addtocart'
);
// cookie settings
$COOKIELIST = array(
'IsUseCookie' => true,
'CookiePath' => '/tmp/sitemapcookie'
);
// where to save the sitemap file
$SITEMAPPATH = './sitemap.xml';
// priority by URL keyword (how important the page is)
$PRIORITYLIST = array(
'product' => '0.8',
'device' => '0.6',
'intelligent' => '0.4',
'course' => '0.2'
);
// changefreq by URL keyword (how often the page changes); the sitemap protocol uses lowercase values
$CHANGEFREQLIST = array(
'product' => 'always',
'device' => 'hourly',
'intelligent' => 'daily',
'course' => 'weekly',
'login' => 'monthly',
'about' => 'yearly'
);
?>
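The priority and changefreq lists are matched by substring, and the first keyword that matches wins, so the order of the entries matters. The sketch below isolates that lookup; pickByKeyword() is a hypothetical helper written for illustration (the class does the same thing inside _calculatePriority and _calculateChangefreq), and the URLs are made up.

```php
<?php
// Sketch of the first-match keyword lookup used for priority / changefreq.
// pickByKeyword() is a hypothetical helper, not part of SiteMap.class.php.
function pickByKeyword(string $url, array $rules, string $default): string
{
    foreach ($rules as $keyword => $value) {
        // case-insensitive substring match anywhere in the URL
        if (stripos($url, $keyword) !== false) {
            return $value; // the first matching keyword wins
        }
    }
    return $default;
}

$priorityList = ['product' => '0.8', 'device' => '0.6'];

// Contains both "product" and "device"; "product" is listed first, so it wins.
$p1 = pickByKeyword('http://www.example.com/product/device-1', $priorityList, '0.5');
// Contains neither keyword, so the default is used.
$p2 = pickByKeyword('http://www.example.com/contact', $priorityList, '0.5');

echo $p1, "\n", $p2, "\n";
```

Because the match is a plain substring test, a keyword such as `product` also matches URLs like `/byproducts/`; more specific keywords should be listed first.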
Roughly what the generated file looks like
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.xxx.com/</loc>
<title>fadsfsd浮动发送扥阿飞啊大 撒扥森阿斯扥!</title>
<changefreq>Yearly</changefreq>
<priority>0.5</priority>
<lastmod>2020-05-25</lastmod>
</url>
<url>
<loc>http://www.xxx.com/#</loc>
<title>花萼让团卷共扥广泛僧结婚!</title>
<changefreq>Yearly</changefreq>
<priority>0.5</priority>
<lastmod>2020-05-25</lastmod>
</url>
</urlset>
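One way to sanity-check a generated file is to read it back and list the `<loc>` values. The following is a small sketch using PHP's DOMDocument; listSitemapUrls() and the inline sample document are assumptions for illustration, and in practice the string would come from the file configured in SITEMAPPATH.

```php
<?php
// Sketch: read a sitemap back and collect its <loc> values.
// listSitemapUrls() and the sample XML are hypothetical, for illustration only.
function listSitemapUrls(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return []; // not well-formed XML
    }
    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $urls[] = trim($node->textContent);
    }
    return $urls;
}

$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url><loc>http://www.example.com/</loc><lastmod>2020-05-25</lastmod></url>
 <url><loc>http://www.example.com/about</loc><lastmod>2020-05-25</lastmod></url>
</urlset>
XML;

$urls = listSitemapUrls($sample);
echo count($urls), " urls found\n";
```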
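4. Program logic
The script uses PHP's curl extension to do the crawling.
- Read the domain from the config.ini.php configuration file and fetch that page (the whole page content).
- Store the fetched page's title, the current URL, and the URLs found on the page in an array.
- Match the current URL against the priority and change-frequency keywords in the configuration file and record the results in the same array.
- Recurse into the child URLs level by level. A problem that may occur at this step:
  PHP Fatal error: Uncaught Error: Maximum function nesting level of '256' reached, aborting
  This limit comes from Xdebug's xdebug.max_nesting_level setting; [raising it in php.ini fixes the error](http://www.04007.cn/article/757.html).
- Once everything has been collected, write the data out to the .xml file.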
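An alternative to raising the nesting limit is to replace the recursion with an explicit queue, so the call depth stays constant no matter how deep the site is. This is a sketch of that swapped-in technique, not the method used by the script above: crawlIteratively() and the $fetchLinks callback are hypothetical names, and the callback here reads from an in-memory link graph where the real script would call sendRequest() and _parseContent().

```php
<?php
// Sketch: breadth-first crawl with an explicit queue instead of recursion.
// crawlIteratively() and $fetchLinks are hypothetical names for illustration.
function crawlIteratively(string $startUrl, callable $fetchLinks, int $maxPages = 1000): array
{
    $visited = [$startUrl => true]; // URLs already queued or processed
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $order = [];

    while (!$queue->isEmpty() && count($order) < $maxPages) {
        $url = $queue->dequeue();
        $order[] = $url;
        foreach ($fetchLinks($url) as $next) {
            if (!isset($visited[$next])) {
                $visited[$next] = true;
                $queue->enqueue($next);
            }
        }
    }
    return $order;
}

// A tiny in-memory link graph standing in for real HTTP requests.
$graph = [
    '/'  => ['/a', '/b'],
    '/a' => ['/', '/c'],
    '/b' => ['/c'],
    '/c' => [],
];

$pages = crawlIteratively('/', function (string $url) use ($graph): array {
    return $graph[$url] ?? [];
});

echo implode(' ', $pages), "\n";
```

Keeping the visited set as array keys makes the membership check a hash lookup rather than the linear in_array() scan, and the $maxPages cap gives a hard stop on very large sites.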
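5. Things to note
- Remember to raise PHP's nesting level limit, otherwise the recursion cannot run to completion.
- The code must be run in command-line (CLI) mode.
- You can set up a scheduled task to run it periodically.
- Remember to give the generated file read/write permissions.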
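Other
Online generators
- XML-Sitemaps: free for up to 500 pages; hosted outside China; works very well if you pay.
- 网站地图制作 (Sitemap Maker): this is the one I recommend. I don't yet know its total limit, but it captured all 1,100 of my pages.
- 免费站点地图生成器 (Free Sitemap Generator): free for up to 5,000 pages; hosted outside China; requires registration; premium accounts allow up to 25,000.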