PHP: scraping a table with DOMDocument while excluding certain td cells

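I am simply trying to get the data from all <td> elements that sit inside <tr> elements. My problem is that, because of the structure of the table I am scraping, I need to exclude every element that has a colspan attribute, i.e. <td colspan=12>.

As the code below shows, getting the table data is straightforward, but because of the table structure I need to exclude all cells that carry the colspan attribute.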

$html = file_get_contents('http://www.superxv.com/fixtures/'); // get the HTML returned from this URL
$game_doc = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml errors internally instead of printing them

if (!empty($html)) { // only proceed if some HTML was actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); // discard the collected parse errors
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        //$cells2 = $rows->getElementsByTagName('th');
        echo PHP_EOL;

        //@ signs are added due to table structure
        // Get scraped columns
        echo $dayDateBye[] = $cells->item(0)->textContent;
        echo $homeTeam[] = $cells->item(1)->textContent;
        echo $awayTeam[] = $cells->item(2)->textContent;
        echo $venue[] = $cells->item(3)->textContent;
        echo $timeGMT[] = $cells->item(5)->textContent;
        echo $timeZA[] = $cells->item(10)->textContent;
        echo PHP_EOL;
    }
}

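Here you can see the table structure: it shows five or so rows of fixtures, and then the structure changes when a new week starts. The only thing I can use to identify and skip the structure change is the <td colspan=12> element. This is tricky because the td elements have no class names and can only be identified by that attribute.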

[Images: FbhYv.png, 5hEf8.png (the table structure)]

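Any input is appreciated.

Solution:

You can skip those rows by checking the length of the td list: a row whose only cell is the colspan cell has a length of 1.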

<?php
$html = file_get_contents('http://www.superxv.com/fixtures/'); // get the HTML returned from this URL
$game_doc = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml errors internally instead of printing them

if (!empty($html)) { // only proceed if some HTML was actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); // discard the collected parse errors
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');

        // Skip rows with at most one td: the colspan separator rows,
        // and header rows that contain only th cells
        if ($cells->length > 1) {
            //$cells2 = $rows->getElementsByTagName('th');
            echo PHP_EOL;

            //@ signs are added due to table structure
            // Get scraped columns (assumes each fixture row has at least 11 cells)
            echo $dayDateBye[] = $cells->item(0)->textContent;
            echo $homeTeam[] = $cells->item(1)->textContent;
            echo $awayTeam[] = $cells->item(2)->textContent;
            echo $venue[] = $cells->item(3)->textContent;
            echo $timeGMT[] = $cells->item(5)->textContent;
            echo $timeZA[] = $cells->item(10)->textContent;
            echo PHP_EOL;
        }
    }
}
?>
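
As an alternative to counting cells, the separator rows could be filtered out in the XPath query itself. The following is only a minimal sketch under stated assumptions: the inline sample table, the team names, and the $fixtures array are hypothetical stand-ins for the real page, and it assumes the attribute is spelled colspan in the actual markup.

```php
<?php
// Hypothetical sample markup standing in for the real fixtures table.
$html = <<<HTML
<table>
  <tr><th>Date</th><th>Home</th><th>Away</th></tr>
  <tr><td colspan="3">Week 1</td></tr>
  <tr><td>Fri 14 Feb</td><td>Team A</td><td>Team B</td></tr>
  <tr><td>Sat 15 Feb</td><td>Team C</td><td>Team D</td></tr>
  <tr><td colspan="3">Week 2</td></tr>
  <tr><td>Fri 21 Feb</td><td>Team B</td><td>Team C</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parse warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$fixtures = [];

// Select rows of the first table that contain at least one td
// and no td carrying a colspan attribute.
$rows = $xpath->query('(//table)[1]//tr[td and not(td[@colspan])]');

foreach ($rows as $row) {
    // Query the td children relative to the current row.
    $cells = $xpath->query('td', $row);
    $fixtures[] = [
        'date' => trim($cells->item(0)->textContent),
        'home' => trim($cells->item(1)->textContent),
        'away' => trim($cells->item(2)->textContent),
    ];
}

print_r($fixtures);
```

Filtering in the query keeps the loop body free of structural checks, and the header row is skipped as well because it contains no td elements. If the real page spells the attribute differently, the predicate would need to be adjusted to match.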

Tags: web-scraping, domdocument, html, php

Source: https://codeday.me/bug/20191025/1928820.html
