A BeautifulSoup bug: tags cannot be parsed correctly because of <!–[if lte IE 6]><![endif]–>

Ink_cherry

Article link: https://blog.csdn.net/Ink_cherry/article/details/78113762

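Problem: I wanted to scrape the articles of a Sina travel blogger. (I had seen the same blog scraped with Node.js before; the articles are good, and the page layout is well suited to practicing web scraping.)

When I started parsing the tags, find and find_all could not locate the ones I wanted. Even find('a') and find('p') found nothing, and while find('head') worked, find('body') did not.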

So I ran

print(soup.prettify())

and looked through the output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   博文_柳絮同学_新浪博客
  </title>
  <meta content="IE=EmulateIE8,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="webkit" name="renderer"/>
  <meta content="博文...." name="keywords">
   <meta content="博文..." name="description"/>
   <!--–[if lte IE 6]-->
   <script type="text/javascript">
    try{
document.execCommand("BackgroundImageCache", false, true);
}catch(e){}
   </script>
  </meta>
 </head>
</html>

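Sure enough, only find('head') works, and even the content inside head, such as the style tags, cannot be reached.

I saved the page content to a file and deleted the head entirely; after that, soup parsed the page normally.

Was head the cause? I could not find anything online about BeautifulSoup failing to match body.

As a check, I kept only an empty <head></head> tag, and the page still parsed normally.

So the problem had to be caused by something inside head.

I narrowed it down by deleting parts of the head one at a time: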

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>博文_柳絮同学_新浪博客</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE8,chrome=1" />
<meta name="renderer" content="webkit">
<meta name="keywords" content="..." />
<meta name="description" content="..." />
<!–[if lte IE 6]>
<script type="text/javascript">
try{
document.execCommand("BackgroundImageCache", false, true);
}catch(e){}
</script>
<![endif]–>
<script type="text/javascript">
window.staticTime=new Date().getTime();
</script>
<link rel="pingback" href="http://upload.move.blog.sina.com.cn/blog_rebuild/blog/xmlrpc.php" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://upload.move.blog.sina.com.cn/blog_rebuild/blog/xmlrpc.php?rsd" />
<link href="http://blog.sina.com.cn/blog_rebuild/blog/wlwmanifest.xml" type="application/wlwmanifest+xml" rel="wlwmanifest" />
<link rel="alternate" type="application/rss+xml" href="http://blog.sina.com.cn/rss/1776757314.xml" title="RSS" />
<link href="http://simg.sinajs.cn/blog7style/css/conf/blog/articlelist.css" type="text/css" rel="stylesheet" /><style id="tplstyle" type="text/css">@charset "utf-8";@import url("http://simg.sinajs.cn/blog7newtpl/css/4/4_2/t.css");
</style>
<style id="positionstyle"  type="text/css">
.sinabloghead .blogtoparea{ left:135px;top:58px;}
.sinabloghead .blognav{ left:150px;top:60%;}
</style>
<style id="bgtyle"  type="text/css">
</style>
<style id="headtyle"  type="text/css">
</style>
<style id="navtyle"  type="text/css">
</style>
</head>

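The cause is the <!–[if lte IE 6]> ... <![endif]–> wrapper around the first script tag. This is a front-end idiom for targeting specific browsers (a conditional comment).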
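The manual elimination described above could be sketched as a small loop that removes one piece of the head at a time and re-runs a parse check. This is only an illustration: `find_culprits`, `parses_ok`, and the sample chunks are hypothetical names, and the stand-in check below merely pretends that any text containing `[if` breaks the parser. A real check might build a soup and test whether find('body') returns something.

```python
def find_culprits(chunks, parses_ok):
    """Return the chunks whose removal alone makes parses_ok(...) succeed.

    chunks: list of strings making up the <head> content, in order.
    parses_ok: callable that takes the joined text and returns True when
               the page parses as expected (hypothetical; supply your own).
    """
    culprits = []
    for i, chunk in enumerate(chunks):
        # Rebuild the head without the i-th chunk and re-test it.
        remaining = chunks[:i] + chunks[i + 1:]
        if parses_ok("".join(remaining)):
            culprits.append(chunk)
    return culprits


# Illustrative data only; not the real page.
head_chunks = [
    "<title>demo</title>",
    "<!--[if lte IE 6]><script></script><![endif]-->",
    "<style></style>",
]

# Stand-in check: assume text containing "[if" fails to parse.
bad = find_culprits(head_chunks, lambda text: "[if" not in text)
print(bad)
```

Removing one chunk at a time is enough here because a single block is responsible; if several blocks each broke parsing, no single removal would pass and a different strategy would be needed.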

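According to what I found online, BeautifulSoup versions below 3.0.6 cannot parse this: any page containing this block, or similar browser-detection code, triggers the parsing bug.

The fix is to strip these blocks with a regular expression before parsing.

https://stackoverflow.com/questions/132488/regex-to-remove-conditional-comments

That page gives several regular expressions; use one of them or write your own.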


"<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->"
"(?s)<!--\[if\s.*?<!\[endif\]-->"
"<!--\[if IE\]>.*?<!\[endif\]-->"
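
As a minimal sketch of how one of these patterns might be applied in Python, the snippet below strips conditional-comment blocks from a string with `re.sub` before the text is handed to a parser. The function name `strip_conditional_comments` and the sample page are assumptions made for illustration. The pattern expects the standard `<!--` and `-->` delimiters; the markers in the head listing above appear with a different dash character (`<!–` and `–>`), so if the raw page really contains that character rather than it being a copy artifact, the pattern would need to be adapted.

```python
import re

# Non-greedy match from "<!--[if " to the nearest "<![endif]-->".
# The (?s) flag lets "." match newlines, so multi-line blocks are covered.
CONDITIONAL_COMMENT = re.compile(r"(?s)<!--\[if\s.*?<!\[endif\]-->")


def strip_conditional_comments(html):
    """Return html with every conditional-comment block removed."""
    return CONDITIONAL_COMMENT.sub("", html)


# Hypothetical sample page, for illustration only.
sample = (
    "<head>\n"
    "<!--[if lte IE 6]>\n"
    "<script>var legacy = true;</script>\n"
    "<![endif]-->\n"
    "<title>demo</title>\n"
    "</head><body><p>hello</p></body>"
)

cleaned = strip_conditional_comments(sample)
print(cleaned)
```

The cleaned string would then be passed to BeautifulSoup in place of the raw page source.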
