Goal: extract the 40 questions from a web page.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div class="TiMu" style="position:relative">
<div class="Zy_TItle clearfix">
<i class="fl">1</i>
<div class="clearfix" style="line-height: 35px; font-size: 14px;padding-right:15px;width:516px;">
<div style="width:80%;height:100%;float:left;">
【单选题】<p>下列不属于水平分隔线&lt;hr&gt;标记的属性是( )。</p>
</div>
<div style="width:20%;height:100%;float:right;">
</div>
</div>
</div>
<form action="http://fanyapbl.fy.chaoxing.com/questionError/addQuestion" method="post" id="questionErrorForm1"
target="_blank">
<ul class="Zy_ulTop">
<li class="clearfix"><i class="fl">A、</i><a href="javascript:void(0)" class="fl" style="word-break: break-all;text-decoration: none;">
<p>width</p>
</a></li>
<li class="clearfix"><i class="fl">B、</i><a href="javascript:void(0)" class="fl" style="word-break: break-all;text-decoration: none;">
<p>height</p>
</a></li>
<li class="clearfix"><i class="fl">C、</i><a href="javascript:void(0)" class="fl" style="word-break: break-all;text-decoration: none;">
<p>size</p>
</a></li>
<li class="clearfix"><i class="fl">D、</i><a href="javascript:void(0)" class="fl" style="word-break: break-all;text-decoration: none;">
<p>color</p>
</a></li>
</ul>
<div class="Py_answer clearfix">
<span>正确答案: B </span>
<span>我的答案:B</span>
</div>
</form>
<i class="fr dui"></i>
<span style="font-size:14px;top:25px;float:right;">得分:
<span style="color:#db2727;">2.5分</span></span>
</div>
</body>
</html>
<div class="TiMu">
<div class="Zy_TItle clearfix"></div>
<form></form>
<span></span>
</div>
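Page analysis:
The page markup is cluttered, but every question has a similar structure. Selecting every element with class="TiMu" yields 40 div.TiMu blocks, each with the shape below (Emmet notation first, then expanded):
div.TiMu>div.Zy_TItle+form+span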
<div class="TiMu">
<div class="Zy_TItle"></div>
<form></form>
<span></span>
</div>
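The selection step described above can be sketched as follows. This is a minimal illustration, not the script used for the real page: the two-question `SAMPLE` string and the helper names `question_blocks` and `has_expected_parts` are hypothetical, and it uses the built-in `html.parser` rather than `html5lib` so that it runs without an extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical two-question page, reduced to the skeleton described above.
SAMPLE = """
<div class="TiMu"><div class="Zy_TItle clearfix"></div><form></form><span></span></div>
<div class="TiMu"><div class="Zy_TItle clearfix"></div><form></form><span></span></div>
"""


def question_blocks(html):
    """Return every <div class="TiMu"> element, one per question."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select("div.TiMu")


def has_expected_parts(block):
    """Check that the block has a title div, a form, and a span as direct children."""
    title = block.find("div", class_="Zy_TItle", recursive=False)
    form = block.find("form", recursive=False)
    span = block.find("span", recursive=False)
    return None not in (title, form, span)


blocks = question_blocks(SAMPLE)
print(len(blocks))                                 # number of question blocks found
print(all(has_expected_parts(b) for b in blocks))  # whether every block matches the skeleton
```

On the real page, comparing `len(blocks)` with the expected 40 is a quick check that no question was missed before any text is extracted.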
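Code:
from bs4 import BeautifulSoup
import re

# Characters to strip from the extracted text: newlines, spaces, non-breaking and
# ideographic spaces, tabs, plus the same escapes written out literally (e.g. the text \u3000)
WHITESPACE = re.compile(r'\n| |\xa0|\\xa0|\u3000|\\u3000|\\u0020|\u0020|\t|\r')

# Open the saved page and parse it into a BeautifulSoup object
with open('test.html', mode='rb') as fp:
    soup = BeautifulSoup(fp, 'html5lib')

# Drop <script> and <style> elements so their contents do not leak into the text
for tag in soup(['script', 'style']):
    tag.extract()

# Each question lives in its own <div class="TiMu">; limit=1 keeps only the first one while testing
for question in soup.find_all('div', class_='TiMu', limit=1):
    # Clean the text, not the markup: stripping spaces from the HTML string
    # would fuse tag names with their attributes (<div class=...> becomes <divclass=...>)
    print(WHITESPACE.sub('', question.get_text()))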
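Printing the flattened text loses the boundaries between the stem, the options, and the answer. As one possible next step, the sketch below pulls those fields out of a single question block into a dictionary. The field names, the helpers `squash` and `parse_question`, and the trimmed `SAMPLE` markup are assumptions made for illustration; the class names come from the sample page above, and `html.parser` stands in for `html5lib` to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample question; the real page carries more attributes.
SAMPLE = """
<div class="TiMu">
<div class="Zy_TItle clearfix">
<i class="fl">1</i>
<div class="clearfix"><div>【单选题】<p>下列不属于水平分隔线&lt;hr&gt;标记的属性是( )。</p></div></div>
</div>
<form>
<ul class="Zy_ulTop">
<li class="clearfix"><i class="fl">A、</i><a class="fl"><p>width</p></a></li>
<li class="clearfix"><i class="fl">B、</i><a class="fl"><p>height</p></a></li>
</ul>
<div class="Py_answer clearfix"><span>正确答案: B </span><span>我的答案:B</span></div>
</form>
</div>
"""


def squash(text):
    """Remove all whitespace, including non-breaking and ideographic spaces."""
    return re.sub(r"[\s\xa0\u3000]", "", text)


def parse_question(block):
    """Turn one div.TiMu element into a dict (field names are illustrative)."""
    title = block.find("div", class_="Zy_TItle")
    number = squash(title.find("i", class_="fl").get_text())
    stem = squash(title.find("div").get_text())

    options = {}
    for li in block.select("ul.Zy_ulTop > li"):
        label = squash(li.find("i").get_text()).rstrip("、")
        options[label] = squash(li.find("a").get_text())

    # The first span in div.Py_answer holds the correct answer after a colon
    # (ASCII or full-width, so split on either).
    first_span = block.select_one("div.Py_answer span")
    answer = re.split(r"[::]", squash(first_span.get_text()))[-1]

    return {"number": number, "stem": stem, "options": options, "answer": answer}


block = BeautifulSoup(SAMPLE, "html.parser").find("div", class_="TiMu")
question = parse_question(block)
print(question["number"], question["options"], question["answer"])
```

Applying `parse_question` to every block returned by `find_all('div', class_='TiMu')` would give one dictionary per question, which is easier to write out as a table or JSON than the concatenated text.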