PHP DOM HTML: DOMDocument::loadHTML - Manual

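If you load HTML content from a website in UTF-8 encoding and the meta element with http-equiv="Content-Type" is not the first child of HEAD, the parser does not honor the declared encoding. You can work around this with the following fix, which parses the document once to find the charset and then parses it again with an explicit encoding hint: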

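function domLoadHTML($html)
{
    // First pass: parse once only to locate the charset declared in a
    // <meta http-equiv="Content-Type"> element.
    $testDOM = new DOMDocument('1.0', 'UTF-8');
    $testDOM->loadHTML($html);
    $charset = null;

    // Recursively descend through <html> and <head>, inspecting <meta> elements.
    $searchInElement = function ($item) use (&$searchInElement, &$charset) {
        if (!$item->childNodes) {
            return;
        }
        foreach ($item->childNodes as $childItem) {
            switch ($childItem->nodeName) {
                case 'html':
                case 'head':
                    $searchInElement($childItem);
                    break;
                case 'meta':
                    // Collect attributes with upper-cased names for case-insensitive lookup.
                    $attributes = array();
                    foreach ($childItem->attributes as $attr) {
                        $attributes[mb_strtoupper($attr->localName)] = $attr->nodeValue;
                    }
                    if (array_key_exists('HTTP-EQUIV', $attributes)
                        && mb_strtoupper($attributes['HTTP-EQUIV']) == 'CONTENT-TYPE'
                        && array_key_exists('CONTENT', $attributes)
                        && preg_match('~\s*;\s*charset\s*=\s*([^\s;]+)~i', $attributes['CONTENT'], $matches)
                    ) {
                        // Strip whitespace and quotes around the charset name.
                        $charset = preg_replace('~[\s\'"]~', '', $matches[1]);
                    }
                    break;
            }
        }
    };
    $searchInElement($testDOM);

    if (isset($charset)) {
        // Second pass: reparse with an encoding hint prepended to the markup.
        $dom = new DOMDocument('1.0', $charset);
        $dom->loadHTML('<?xml encoding="' . $charset . '">' . $html);
        // Copy the node list before removing: deleting from a live DOMNodeList
        // while iterating over it can skip nodes.
        foreach (iterator_to_array($dom->childNodes) as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item);
            }
        }
        $dom->encoding = $charset;
    } else {
        // No charset declaration found: keep the result of the first pass.
        $dom = $testDOM;
    }

    return $dom;
}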
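The charset-matching step is the part of this fix that is easiest to get wrong, so it can help to try it in isolation. The following is a minimal sketch, not part of the original note: extractMetaCharset is a hypothetical helper name, and the nullable return type assumes PHP 7.1 or later. It only handles the content attribute value; it does not touch the DOM.

```php
<?php
// Hypothetical helper (an illustration, not from the original note): pull the
// charset out of a Content-Type meta value such as "text/html; charset=utf-8".
function extractMetaCharset(string $content): ?string
{
    // Case-insensitive match for "; charset = value", tolerating whitespace.
    if (preg_match('~;\s*charset\s*=\s*([^\s;]+)~i', $content, $matches)) {
        // Strip quotes that some pages put around the value.
        return trim($matches[1], "'\"");
    }
    return null; // no charset parameter present
}

var_dump(extractMetaCharset('text/html; charset=utf-8'));  // string(5) "utf-8"
var_dump(extractMetaCharset("text/html;charset = 'GBK'")); // string(3) "GBK"
var_dump(extractMetaCharset('text/html'));                 // NULL
```

Returning null when no charset is present mirrors the isset($charset) check in the function above, where a missing declaration means the first parse result is kept as-is.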
