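【Background】
A JSON string has already been fetched from a URL.
Taking Sina Weibo as an example, the JSON can be pulled out of the response with a regular expression: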
    import json
    import re
    import urllib2  # Python 2; in Python 3 use urllib.request instead

    def get_servertime(self):
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=dW5kZWZpbmVk&client=ssologin.js(v1.3.18)&_=1329806375939'
        data = urllib2.urlopen(url).read()
        # The response is JSONP, i.e. callback({...}); capture the text inside the parentheses.
        p = re.compile(r'\((.*)\)')
        try:
            json_data = p.search(data).group(1)
            data = json.loads(json_data)
            servertime = str(data['servertime'])
            nonce = data['nonce']
            return servertime, nonce
        except (AttributeError, ValueError, KeyError):
            # No match, invalid JSON, or a missing key.
            print('Get servertime error!')
            return None
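The regex step can be tried without a network call. The sketch below applies the same pattern to a made-up JSONP response; it assumes Python 3, and the field values in `sample` are illustrative, not real Sina output.

```python
import json
import re

# Hypothetical JSONP response; the values are made up for illustration.
sample = ('sinaSSOController.preloginCallBack('
          '{"retcode":0,"servertime":1329806375,"nonce":"ABC123"})')

# Capture everything between the first "(" and the last ")".
match = re.search(r'\((.*)\)', sample)
payload = json.loads(match.group(1))

servertime = str(payload['servertime'])
nonce = payload['nonce']
print(servertime, nonce)
```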
In my case, the returned JSON string looks like this:
{"code":"A00006",data:"\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em> <a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report('56c89b680102dynu_1932099')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em> <a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report('56c89b680102dynu_1932970')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div>"}
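As you can see, it contains just a code key and a data key, and the value of the data key is HTML source in backslash-escaped form.
I now need to pick out elements from that HTML by id or class (for example SG_revert_Cont), so I would like to process the HTML with BeautifulSoup.
【Solution process】
1. First, get the backslash-escaped HTML source: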
\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em> <a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report('56c89b680102dynu_1932099')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em> <a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report('56c89b680102dynu_1932970')\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open('http:\/\/blog.sina.com.cn\/myshow2010')\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div>
I then parsed it with BeautifulSoup, but calling
    soup.findAll(attrs={"class":"SG_revert_Cont"})
did not return the content I needed.
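A likely reason for the failure (an interpretation, not something confirmed against BeautifulSoup itself): in the escaped text the attribute is written as class=\"SG_revert_Cont\", so a parser does not see the bare class name as the attribute value. The sketch below illustrates the effect with Python 3's standard `html.parser` instead of BeautifulSoup; `ClassCollector` and `classes_in` are hypothetical helper names, and the two inputs are shortened samples.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Records the raw value of every class attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class':
                self.classes.append(value)


def classes_in(html):
    """Return the class attribute values found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Shortened samples: the same fragment with and without backslash escapes.
escaped = r'<div class=\"SG_revert_Cont\"><p>hi<\/p><\/div>'
plain = '<div class="SG_revert_Cont"><p>hi</p></div>'

print('SG_revert_Cont' in classes_in(escaped))
print('SG_revert_Cont' in classes_in(plain))
```

With the escaped input the bare class name is not among the collected values; with the unescaped input it is.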
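2. After some tinkering, I finally found a way: first convert the backslash-escaped HTML string into a normal string,
then add an ordinary HTML head and tail around it, i.e.: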
    # Assumes logging and BeautifulSoup are already imported,
    # and that dataStr holds the backslash-escaped HTML from step 1.
    dataStr = dataStr.replace("\\t", "\t")
    dataStr = dataStr.replace("\\r\\n", "\r\n")
    dataStr = dataStr.replace("\\/", "/")
    dataStr = dataStr.replace('\\"', '"')
    logging.debug("after html parse: \n%s", dataStr)
    fakeHead = """
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Fake Title</title>
    </head>
    <body>
    """
    fakeTail = """
    </body>
    </html>
    """
    # Wrap the fragment so that it looks like a complete HTML page.
    dataStr = fakeHead + dataStr + fakeTail
    soup = BeautifulSoup(dataStr)
At this point, calling
    soup.findAll(attrs={"class":"SG_revert_Cont"})
returns the commentList we need. From there it can be handled like any normal soup object, and I can extract the information I need.
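A possible alternative to the chained replace calls: the escaped text follows JSON string-escaping rules, so wrapping it in double quotes and passing it to `json.loads` decodes \t, \r\n, \/ and \" in one step, and also decodes the \uXXXX sequences, which the replace calls leave untouched. This is only a sketch, assuming Python 3 and that the text is a well-formed JSON string body; `unescape_json_text` is a hypothetical helper name, and the sample is a simplified fragment modeled on the data above.

```python
import json


def unescape_json_text(escaped):
    """Decode the body of a JSON string literal (given without its quotes)."""
    # strict=False tolerates raw control characters such as literal newlines.
    return json.loads('"' + escaped + '"', strict=False)


# Simplified sample; the raw string keeps the backslashes as received.
escaped = r'\t<span class=\"SG_revert_Tit\">\u5343\u5bfb<\/span>\r\n'
html = unescape_json_text(escaped)
print(repr(html))
```

The decoded string can then be wrapped with the fake head and tail and handed to BeautifulSoup in the same way as before.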
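【Summary】
If what you have is backslash-escaped HTML source that is only a fragment of a page, and you want to parse its content conveniently,
consider (1) converting it into ordinary HTML without the backslashes, then (2) adding a fake HTML head and tail so that it passes for an ordinary HTML page. BeautifulSoup can then process it and produce the expected soup, from which the information can be extracted.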
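The two steps in the summary can be packaged as one helper. This is a minimal sketch, assuming Python 3; `to_full_html`, `FAKE_HEAD` and `FAKE_TAIL` are hypothetical names, and the head and tail are a trimmed-down version of the fake wrapper used in step 2.

```python
# Trimmed-down fake wrapper (no DOCTYPE or xmlns), for illustration only.
FAKE_HEAD = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    '<title>Fake Title</title>'
    '</head><body>\n'
)
FAKE_TAIL = '\n</body></html>'

# The same four replacements as in step 2.
REPLACEMENTS = (('\\t', '\t'), ('\\r\\n', '\r\n'), ('\\/', '/'), ('\\"', '"'))


def to_full_html(escaped):
    """Step 1: undo the backslash escapes. Step 2: wrap in a fake page."""
    for old, new in REPLACEMENTS:
        escaped = escaped.replace(old, new)
    return FAKE_HEAD + escaped + FAKE_TAIL


page = to_full_html(r'\t<div class=\"SG_revert_Cont\"><p>hi<\/p><\/div>\r\n')
print(page)
```

The returned string is ordinary HTML, so it can be passed to BeautifulSoup and searched with findAll as shown earlier.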