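1. The problem
I have recently been working on ID card recognition, and the OCR engine returns its result as a long HTML-formatted string: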
<div class='ocr_page' id='page_1' title='image ""; bbox 0 0 648 648; ppageno 0'>
<div class='ocr_carea' id='block_1_2' title="bbox 29 213 648 332">
<p class='ocr_par' id='par_1_2' lang='eng' title="bbox 29 213 648 332">
<span class='ocr_line' id='line_1_5' title="bbox 29 213 648 332; baseline -0.011 -55; x_size 34.444443; x_descenders 8.6111107; x_ascenders 8.6111107"><span class='ocrx_word' id='word_1_13' title='bbox 29 245 327 280; x_wconf 18'><strong><em>ASCXVTHQHUUFWWXHS</em></strong></span> <span class='ocrx_word' id='word_1_14' title='bbox 362 213 648 332; x_wconf 58'><strong><em>u</em></strong></span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_1_3' title="bbox 87 394 611 429">
<p class='ocr_par' id='par_1_3' lang='eng' ti