Parsing a JSON String That Contains HTML with BeautifulSoup (handling the case where Sina Weibo follow information cannot be parsed)

Latest recommended article published on 2022-09-08 01:34:01

绝对不要看眼睛里的郁金香

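【Background】

A JSON string has already been obtained from a URL.

Taking Sina Weibo as an example, the JSON is pulled out of the response with a regular expression: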

import json
import re
import urllib2

def get_servertime(self):
    url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=dW5kZWZpbmVk&client=ssologin.js(v1.3.18)&_=1329806375939'
    data = urllib2.urlopen(url).read()
    # The response is JSONP, i.e. callback({...}); capture the JSON between the parentheses.
    p = re.compile(r'\((.*)\)')
    try:
        json_data = p.search(data).group(1)
        data = json.loads(json_data)
        servertime = str(data['servertime'])
        nonce = data['nonce']
        return servertime, nonce
    except Exception:
        print 'Get servertime error!'
        return None
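For readers on Python 3, the same extraction step can be sketched with only the standard library. The helper name `extract_jsonp_payload` and the sample response string are illustrative assumptions and do not come from the original code; only the `servertime` and `nonce` keys are taken from the function above.

```python
import json
import re

# Matches the text between the first "(" and the last ")" of a JSONP reply.
JSONP_RE = re.compile(r'\((.*)\)', re.DOTALL)


def extract_jsonp_payload(text):
    """Return the decoded JSON object wrapped in callback(...), or None."""
    match = JSONP_RE.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Hypothetical response body, shaped like a prelogin callback reply.
sample = 'sinaSSOController.preloginCallBack({"servertime": 1329806375, "nonce": "ABC123"})'
payload = extract_jsonp_payload(sample)
print(payload["servertime"], payload["nonce"])
```

Returning None for both "no parentheses" and "invalid JSON" mirrors the single error path of the original function.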

{"code":"A00006",data:"\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em>&nbsp<a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report('56c89b680102dynu_1932099')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em>&nbsp<a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report('56c89b680102dynu_1932970')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div>"}

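It is clear that the response contains only a code key and a data key, and that the value of the data key is HTML source with backslash escapes.

The goal now is to pick elements out of the HTML in the data value by id or class, for example SG_revert_Cont, so I would like to process that HTML with BeautifulSoup.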
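If the response really arrives with the unquoted data key shown above, it is not strict JSON and `json.loads` on the whole string would raise an error, so the data value has to be cut out by other means. A minimal sketch using a regular expression follows; the helper name `extract_data_value` and the shortened sample are assumptions for illustration only.

```python
import re

# Captures everything between  data:"  and the final  "}  of the response.
DATA_RE = re.compile(r'data:"(.*)"\}\s*$', re.DOTALL)


def extract_data_value(response_text):
    """Return the still-escaped HTML held in the data key, or None."""
    match = DATA_RE.search(response_text)
    return match.group(1) if match else None


# Shortened stand-in for the real response (raw string keeps the backslashes).
sample = r'{"code":"A00006",data:"\t<li id=\"cmt_1932099\"><\/li>"}'
escaped_html = extract_data_value(sample)
print(escaped_html)
```

The returned text still carries every backslash escape; it is only separated from the surrounding JSON.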

【Solution Process】

1. First obtain the backslash-escaped HTML source:

\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em>&nbsp<a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report('56c89b680102dynu_1932099')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em>&nbsp<a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report('56c89b680102dynu_1932970')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div>

Parsing this string directly with BeautifulSoup and then calling

soup.findAll(attrs={"class":"SG_revert_Cont"})

does not return the desired content.

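2. After some trial and error, I finally found a way: first convert the backslash-escaped HTML string into a normal string,

and then wrap it in an ordinary HTML head and tail, i.e.: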

import logging
# BeautifulSoup must already be imported, e.g. `from BeautifulSoup import BeautifulSoup`
# (version 3) or `from bs4 import BeautifulSoup` (version 4).

# Turn each two-character escape sequence back into the character it stands for.
dataStr = dataStr.replace("\\t", "\t")
dataStr = dataStr.replace("\\r\\n", "\r\n")
dataStr = dataStr.replace("\\/", "/")
dataStr = dataStr.replace('\\"', '"')
logging.debug("after html parse: \n%s", dataStr)

# Fake document wrapper: <head> is closed before <body> opens.
fakeHead = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Fake Title</title>
</head>
<body>
"""
fakeTail = """
</body>
</html>
"""

dataStr = fakeHead + dataStr + fakeTail

soup = BeautifulSoup(dataStr)

After that, calling:

soup.findAll(attrs={"class":"SG_revert_Cont"})

returns the commentList we need. It can then be handled like any ordinary soup object, and I can extract the information I need.
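The difference between the two attempts can be reproduced in a small, self-contained sketch. To keep it runnable without third-party packages, it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it illustrates the idea and is not the code used in this post. The names `ClassFinder`, `find_by_class` and `unescape_fragment`, as well as the shortened sample, are hypothetical.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collects (tag, attrs) for every start tag carrying the wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hits = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.hits.append((tag, dict(attrs)))


def find_by_class(html, wanted):
    finder = ClassFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.hits


def unescape_fragment(escaped):
    """Apply the same four replacements as the steps above."""
    return (escaped.replace("\\t", "\t")
                   .replace("\\r\\n", "\r\n")
                   .replace("\\/", "/")
                   .replace('\\"', '"'))


# Shortened sample in the escaped form (raw string keeps the backslashes).
escaped = r'\t<li id=\"cmt_1\">\r\n\t\t<div class=\"SG_revert_Cont\"><p>hi<\/p><\/div>\r\n\t<\/li>'
wrapped = ("<html><head><title>Fake Title</title></head><body>"
           + unescape_fragment(escaped)
           + "</body></html>")

print(find_by_class(escaped, "SG_revert_Cont"))  # escaped input: nothing found
print(find_by_class(wrapped, "SG_revert_Cont"))  # cleaned input: the div is found
```

In the escaped input the class attribute value still includes the backslash and quote characters, so it never equals the plain name SG_revert_Cont. This is a plausible explanation for the failed first attempt, though the original post does not state the cause.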

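【Summary】

If you have obtained HTML source in backslash-escaped form, and only a fragment of a page at that, but still want to parse its content conveniently,

consider (1) converting it into ordinary HTML source without the backslash escapes, and then (2) adding a fake HTML head and tail so that it looks like an ordinary HTML document. BeautifulSoup can then process it and produce the expected soup, from which the information can be extracted.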
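The individual replace calls do not touch the \uXXXX sequences, so escaped Chinese text stays escaped. One possible alternative for step (1), not used in the original post, is to let the `json` module decode the fragment as the body of a JSON string, which handles all escapes in one pass. This is a sketch under the assumption that the fragment is exactly what sat between the quotes of the data value; the helper name `decode_json_escapes` is hypothetical.

```python
import json


def decode_json_escapes(escaped):
    """Decode a JSON string body (the text between the quotes) in one pass."""
    # strict=False tolerates raw control characters such as real line breaks.
    return json.loads('"' + escaped + '"', strict=False)


# Shortened sample in the escaped form (raw string keeps the backslashes).
escaped = r'\t<a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\">\u5343\u5bfb<\/a>\r\n'
html = decode_json_escapes(escaped)
print(html)
```

The decoded text can then be wrapped in the fake head and tail and parsed as described in step (2).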
