【Problem】
In Python, I have already used BeautifulSoup to obtain the relevant soup node:
LINE 253 : INFO foundDescription=
BAD CREDIT
NO CREDIT
NO PROBLEM!!!
CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820
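Now I want to extract the content of this description and filter out the br and similar tags inside it.
【Solution process】
1. The crudest approach, of course, is to strip the br tags by hand with a regular expression.
But I wanted to find a better way.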
2. I then discovered that BeautifulSoup has a renderContents method, looked up its explanation in the official documentation, and tried it:
descContents = foundDescription.renderContents()
logging.info("descContents=%s", descContents)
The result was:
LINE 257 : INFO descContents=
BAD CREDIT
NO CREDIT
NO PROBLEM!!!
CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820
The br tags are still there.
3. So I gave up on that and, for now, handle this case by hand with a regular expression.
The code becomes:
descContents = crifanLib.soupContentsToUnicode(foundDescription.contents)
#descContents = foundDescription.renderContents()
logging.info("descContents=%s", descContents)
descHtmlDecoded = crifanLib.decodeHtmlEntity(descContents)
logging.info("descHtmlDecoded=%s", descHtmlDecoded)
# remove br tags written as <br>, <br/> or <br />
descHtmlFiltered = re.sub(r"<br\s*/?>", "", descHtmlDecoded)
logging.info("descHtmlFiltered=%s", descHtmlFiltered)
The result is:
LINE 262 : INFO descHtmlFiltered=
BAD CREDIT
NO CREDIT
NO PROBLEM!!!
CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820
This basically meets the need here.
I'll leave it at that for now.
When a more complex case comes up, I'll look for a better approach.
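The regex step above can be sketched as a self-contained Python 3 function. This is only a minimal sketch under assumptions: `html.unescape` from the standard library stands in for `crifanLib.decodeHtmlEntity`, the name `strip_br` is hypothetical, and the single pattern `<br\s*/?>` is assumed to cover the `<br>`, `<br/>` and `<br />` spellings.

```python
import html
import re

# Assumed to cover <br>, <br/> and <br /> in any letter case.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_br(fragment):
    """Decode HTML entities first, then delete every br tag (hypothetical helper)."""
    decoded = html.unescape(fragment)
    return BR_PATTERN.sub("", decoded)


if __name__ == "__main__":
    # Sample text modelled on the log output; the br tags are illustrative.
    sample = "BAD CREDIT<br />\nNO CREDIT<BR>\nNO PROBLEM!!!"
    print(strip_br(sample))
```

Decoding before filtering follows the order used above, so a br tag that arrives entity-encoded is removed as well. Other tags, such as `<b>`, pass through untouched.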
【Summary】
For now, the HTML tags still have to be handled with regular expressions.
【Postscript 2013-05-03】
1. Later I tried the following:
VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'a']
soup = BeautifulSoup(origHtml)
for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.hidden = True
filteredHtml = soup.renderContents()
logging.info("processed, filteredHtml=%s", filteredHtml)
The result: this only filters out the disallowed tags; it does not strip the tags while keeping the content inside them.
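To make the whitelist behaviour concrete, here is a rough sketch of the same idea using only the standard library's `html.parser` instead of BeautifulSoup's `hidden` attribute. It is a swapped-in technique rather than the code used above, and the names `WhitelistFilter` and `filter_tags` are hypothetical. Tags outside `VALID_TAGS` lose their markup but keep their text, while listed tags such as br and a come through unchanged.

```python
from html.parser import HTMLParser

VALID_TAGS = {"strong", "em", "p", "ul", "li", "br", "a"}


class WhitelistFilter(HTMLParser):
    """Re-emit the input, dropping the markup of tags not in VALID_TAGS."""

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VALID_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />; emit it once, without an end tag.
        if tag in VALID_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VALID_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def filter_tags(orig_html):
    """Return orig_html with non-whitelisted tags removed but their text kept."""
    parser = WhitelistFilter()
    parser.feed(orig_html)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, a div or span wrapper disappears while its text remains, but `<br />` is still present in the output because br is on the list. A whitelist like this therefore cannot remove the br tags by itself.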
2. In the end I had to remove the tags by hand while keeping their contents:
def filterHtmlTag(origHtml):
    """
    filter html tag, but retain its contents
    eg:
    Brooklyn, NY 11220
    Brooklyn, NY 11220
    Bayridgenissan42@yahoo.com
    stores.ebay.com
    stores.ebay.com
    www.carfaxonline.com
    www.carfaxonline.com
    """
    #logging.info("html tag, origHtml=%s", origHtml)
    filteredHtml = origHtml
    #Method 1: auto remove tag use re
    #remove br (<br>, <br/> and <br />)
    filteredHtml = re.sub(r"<br\s*/?>", "", filteredHtml, flags=re.I)
    #logging.info("remove br, filteredHtml=%s", filteredHtml)
    #remove a, keeping the link text
    filteredHtml = re.sub(r"<a\s+[^<>]+>(?P<aContent>[^<>]+?)</a>", r"\g<aContent>", filteredHtml, flags=re.I)
    #logging.info("remove a, filteredHtml=%s", filteredHtml)
    #remove b,strong
    #flags must be passed by keyword: the fourth positional argument of re.sub is count
    filteredHtml = re.sub(r"<b>(?P<bContent>[^<>]+?)</b>", r"\g<bContent>", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<strong>(?P<strongContent>[^<>]+?)</strong>", r"\g<strongContent>", filteredHtml, flags=re.I)
    #logging.info("remove b,strong, filteredHtml=%s", filteredHtml)
    return filteredHtml
3. I will keep updating this function in the future.
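One limitation of patterns built on `[^<>]+?` is that they only match a tag whose content contains no other tag, so nested markup such as a b inside an a is not fully stripped in one pass. As a possible direction for a later version, here is a minimal sketch that drops every tag and keeps only the text, using the standard library's `html.parser`. This is an alternative technique, not the function above; the names `TextOnly` and `strip_all_tags` are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True (the default) also decodes entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_all_tags(orig_html):
    """Return the text of orig_html with all tags removed."""
    parser = TextOnly()
    parser.feed(orig_html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    # The href value is a placeholder for illustration only.
    sample = '<a href="#"><b>stores.ebay.com</b></a><br />Brooklyn, NY 11220'
    print(strip_all_tags(sample))
```

Unlike the regex version, this removes all tags rather than a chosen few, so it only fits cases where no markup needs to survive.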