Error log: crawling forum threads

WitsMakeMen


Article link: https://blog.csdn.net/WitsMakeMen/article/details/7393872


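1: Why the URL links would not match: I found the cause, and it is not the regular expression. The page source copied from the browser differs from the source returned by the Java library call (reason unknown), as shown below.

Browser: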

<tbody id="normalthread_14777">
<tr>
<td class="icn">
<a href="thread-14777-1-1.html" title="新窗口打开" target="_blank">
<img src="static/image/common/folder_common.gif" />
</a>
</td>
<th class="common">
<a href="thread-14777-1-1.html" style="font-weight: bold;color: #2B65B7" onclick="atarget(this)" class="xst" >青岛5所高中要搬迁，19中搬迁鳌山卫，配套山大青岛校区</a>
</th>
<td class="by">
<cite>
<a href="home.php?mod=space&uid=3324" c="1">number11</a></cite>
<em><span>2012-2-29</span></em>
</td>
<td class="num"><a href="thread-14777-1-1.html" class="xi2">6</a><em>657</em></td>
<td class="by">
<cite><a href="home.php?mod=space&username=%E5%8D%81%E5%85%AD%E7%9A%84%E6%9C%88%E4%BA%AE" c="1">十六的月亮</a></cite>
<em><a href="forum.php?mod=redirect&tid=14777&goto=lastpost#lastpost">2012-3-6 20:46:40</a></em>
</td>
</tr>
</tbody>

Returned by the Java library:

<tr>
<td class="icn">
<a href="thread-15648-1-1.html" title="有新回复 - 新窗口打开" target="_blank">
<img src="static/image/common/folder_new.gif" />
</a>
</td>
<th class="new">
<a href="thread-15648-1-1.html" onclick="atarget(this)" class="xst" >山东大学校董会名誉主席梁振英当选香港第四任特首</a>
<a href="forum.php?mod=redirect&tid=15648&goto=lastpost#lastpost" class="xi1">New</a>
</th>
<td class="by">
<cite>
<a href="home.php?mod=space&uid=2109" c="1">liuzhiwu</a></cite>
<em><span class="xi1">2012-3-26</span></em>
</td>
<td class="num"><a href="thread-15648-1-1.html" class="xi2">0</a><em>1</em></td>
<td class="by">
<cite><a href="home.php?mod=space&username=liuzhiwu" c="1">liuzhiwu</a></cite>
<em><a href="forum.php?mod=redirect&tid=15648&goto=lastpost#lastpost"><span title="2012-3-26 10:28:41">4 分钟前</span></a></em>
</td>
</tr>

</tbody>

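The class attribute on the th element just before the link differs: in the browser source it is common, while the source returned by Java has new.

A regular expression written against the browser source therefore fails to match the links.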
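One way around this is to stop hard-coding the class value in the pattern. The following is only a minimal sketch, assuming the thread link is always the first anchor inside the th cell and that its href has the form thread-N-N-N.html, as in the two snippets above. The class name ThreadLinkExtractor and the method extractLinks are made up for illustration and are not from the original crawler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts thread hrefs regardless of the <th> class value.
class ThreadLinkExtractor {

    // <th class="..."> with ANY class value (common, new, ...), followed by the
    // first <a href="thread-N-N-N.html" ...>. Group 1 captures the href.
    private static final Pattern THREAD_LINK = Pattern.compile(
            "<th\\s+class=\"[^\"]*\"[^>]*>\\s*"
            + "<a\\s+href=\"(thread-\\d+-\\d+-\\d+\\.html)\"[^>]*>");

    // Returns the thread hrefs in the order they appear in the HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = THREAD_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Calling ThreadLinkExtractor.extractLinks(pageSource) on either version of the source should then yield the same kind of result. The second anchor in the Java-returned row (the "New" link) is skipped because it does not directly follow the th opening tag.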

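2: Error 2: the URL link was extracted, but the post content could not be fetched from it (http://bbs.sdu.edu.cn/forum-242-2.html)

The extracted URL was: http://bbs.sdu.edu.cn/forum.php?mod=viewthread&amp;tid=15626&amp;extra=page%3D1%26filter%3Dsortid%26sortid%3D4%26sortid%3D4

When the link is extracted with the regular expression, every & comes out escaped as &amp;, so requesting that URL does not return the page content properly.

Converting &amp; back to & with Java string handling gives the correct address: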

http://bbs.sdu.edu.cn/forum.php?mod=viewthread&tid=15626&extra=page%3D1%26filter%3Dsortid%26sortid%3D4%26sortid%3D4
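The string fix can be sketched as below. This assumes &amp; is the only HTML entity that shows up in the extracted href; other entities would need a proper HTML unescaper. The class name HrefCleaner and the method unescapeAmpersands are illustrative names, not part of the original code.

```java
// Hypothetical helper: undoes the HTML escaping of '&' in an extracted href.
class HrefCleaner {

    // Replaces every literal "&amp;" with "&". String.replace works on plain
    // text (not regex), so no characters need special quoting here.
    static String unescapeAmpersands(String href) {
        return href.replace("&amp;", "&");
    }
}
```

Percent-encoded parts such as %3D and %26 are left untouched, which matches the correct address shown above.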
