Python String Extraction and HTML Parsing

fjssharpsword

Article link: https://blog.csdn.net/fjssharpsword/article/details/90291218

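Scenario: a string contains both HTML markup and specific delimiter characters. The goal is to extract the substrings enclosed by those delimiters and, where a substring is an HTML tag, parse it to get the values of the relevant attributes.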

Installation: pip install beautifulsoup4

Reference code:

import re
from bs4 import BeautifulSoup
string1='CO潴留时可出现以下哪些症状或体征<http://h5.img.zhukao.com.cn/Z/%E5%85%A8%E5%9B%BD/Z_%E5%85%A8%E5%9B%BD_%E5%BF%83%E8%A1%80%E7%AE%A1%E5%86%85%E7%A7%91/images/2.gif>|A.神志淡漠|B.腱反射消'
string2='心血管内科|常见症状与体征(X型题)|多选题|对于漏出性腹腔积液特点正确的是|A.外观多为透明淡黄色|B.胸腔积液／血清LDH比值大于0．6|C.细胞计数常低于200×10<img layer-src=\'http://h5.img.zhukao.com.cn/Z/全国/Z_全国_心血管内科/images/YZ71_~6.gif\' src=\'http://h5.img.zhukao.com.cn/Z/全国/Z_全国_心血管内科/images/YZ71_~6.gif\' style=\'max-width:200px; width:auto;\'>／L|D.比重小于1．018|E.血清腹腔积液白蛋白浓度梯度多大于11g／L|正确答案：A D E解析：腹腔积液漏出液细胞计数常低于100×10／L。'
string=string1+string2
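# Non-greedy capture of everything between '<' and '>'; re.S lets '.' match newlines too.
pattern = re.compile(r'<(.*?)>', re.S)
matches = pattern.findall(string)
imgs = []
for match in matches:
    print(match)
    if match.startswith('img'):  # HTML tag: the captured text begins with "img"
        # Restore the angle brackets and name the parser explicitly to avoid the bs4 warning.
        soup = BeautifulSoup('<' + match + '>', 'html.parser')
        img = soup.find('img')
        imgs.append(img.get('src'))
    else:  # no HTML, just a URL wrapped in angle brackets
        imgs.append(match)
print(imgs)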

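The compiled regular expression captures the substring between each < and >. The non-greedy (.*?) stops at the first closing >, so every bracketed span is returned separately, and re.S lets . match newlines as well.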
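To make the effect of the non-greedy quantifier concrete, here is a minimal sketch comparing it with a greedy one. The sample string and the example.com URL are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical sample text (not from the article's data set).
sample = "question <http://example.com/a.gif> |A. first <img src='b.gif'> |B. second"

lazy = re.compile(r"<(.*?)>", re.S)    # stops at the first '>' after each '<'
greedy = re.compile(r"<(.*)>", re.S)   # runs from the first '<' to the last '>'

lazy_matches = lazy.findall(sample)
greedy_matches = greedy.findall(sample)

print(lazy_matches)    # two separate captures
print(greedy_matches)  # one capture that swallows the text in between
```

With the greedy pattern, both bracketed spans and the text between them collapse into a single match, which is why the non-greedy form is the one used here.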

Output:

http://h5.img.zhukao.com.cn/Z/%E5%85%A8%E5%9B%BD/Z_%E5%85%A8%E5%9B%BD_%E5%BF%83%E8%A1%80%E7%AE%A1%E5%86%85%E7%A7%91/images/2.gif
img layer-src='http://h5.img.zhukao.com.cn/Z/全国/Z_全国_心血管内科/images/YZ71_~6.gif' src='http://h5.img.zhukao.com.cn/Z/全国/Z_全国_心血管内科/images/YZ71_~6.gif' style='max-width:200px; width:auto;'
['http://h5.img.zhukao.com.cn/Z/%E5%85%A8%E5%9B%BD/Z_%E5%85%A8%E5%9B%BD_%E5%BF%83%E8%A1%80%E7%AE%A1%E5%86%85%E7%A7%91/images/2.gif', 'http://h5.img.zhukao.com.cn/Z/全国/Z_全国_心血管内科/images/YZ71_~6.gif']
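The run above only reads src, but the same tag object exposes every attribute. The following is a rough sketch, assuming an img fragment shaped like the one in the data; the example.com URLs and the alt lookup are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the <img> tag in the article's data.
fragment = (
    "<img layer-src='http://example.com/a.gif' "
    "src='http://example.com/a.gif' "
    "style='max-width:200px; width:auto;'>"
)

soup = BeautifulSoup(fragment, "html.parser")
tag = soup.find("img")

src = tag.get("src")              # value of an attribute that is present
layer_src = tag.get("layer-src")  # hyphenated attribute names work the same way
alt = tag.get("alt", "")          # missing attribute: the default is returned
attr_names = sorted(tag.attrs)    # tag.attrs is a dict of all attributes

print(src, layer_src, repr(alt), attr_names)
```

Using get with a default avoids a KeyError when a tag lacks the attribute, which is likely to matter if the input mixes tags of different shapes.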
