How to bulk-remove HTML fields: removing all HTML tags from a string in Python

Latest recommended article published on 2023-12-22 12:25:13

美剧商务英语口语


Using a regular expression: re.sub('<[^<]+?>', '', text)
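As a minimal sketch of the regular-expression approach, the snippet below wraps the substitution in a helper. The pattern `<[^<]+?>` is an assumption (the pattern is not visible in the text above), and the names `TAG_RE` and `strip_tags_regex` are hypothetical, chosen only for illustration.

```python
import re

# Assumed pattern: "<", then the shortest run of non-"<" characters, then ">".
TAG_RE = re.compile(r"<[^<]+?>")


def strip_tags_regex(text):
    """Remove anything that looks like an HTML tag from text (hypothetical helper)."""
    return TAG_RE.sub("", text)


sample = "<p>Hello, <b>world</b>!</p>"
print(strip_tags_regex(sample))  # Hello, world!
```

This only deletes the tags themselves. The contents of script and style elements stay in the output, and entities such as `&amp;` are not decoded, so it suits small, well-behaved fragments rather than whole pages.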

Using BeautifulSoup (solution adapted from another answer):

    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # remove all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (headlines are separated by two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)
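The whitespace cleanup at the end of that snippet can be tried without a network connection. The sketch below isolates it in a function named `tidy_text` (a hypothetical name) and assumes that headlines sharing a line are separated by two spaces.

```python
def tidy_text(text):
    """Return text with one non-empty chunk per line (hypothetical helper)."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Assumption: headlines on the same line are separated by two spaces.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Top story  Second story \n\n   Body text here  \n"
print(tidy_text(raw))
```

Splitting on a single space instead would put every word on its own line, so the separator choice matters here.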

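Using NLTK (this works only with NLTK 2.x on Python 2; since NLTK 3.0, nltk.clean_html raises NotImplementedError and recommends BeautifulSoup's get_text() instead):

    import nltk
    from urllib import urlopen

    url = "https://stackoverflow.com/questions/tagged/python"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)  # strip the markup, keep the text
    print(raw)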
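Because nltk.clean_html is not implemented in NLTK 3.x, a dependency-free fallback may be useful. The sketch below is a swapped-in technique, not one of the methods listed above: it uses the standard library's `html.parser.HTMLParser`, and the names `TextExtractor` and `strip_tags` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_text(self):
        return "".join(self._parts)


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(strip_tags("<p>Fish &amp; chips</p><script>var x = 1;</script>"))  # Fish & chips
```

Unlike the regular-expression approach, this decodes entities and drops script and style contents, at the cost of a little more code.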
