I have a text like this:
text = """
"""
Using pure Python, with no external module, I want to have this:
>>> print remove_tags(text)
Title A long text..... a link
I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, on Python 2.6+.
How can I do that?
Solution
Using a regex
Using a regex, you can strip everything inside <>:
import re

def cleanhtml(raw_html):
    # Non-greedy match: from "<" to the nearest ">"
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
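As a quick check of the regex approach, here is a minimal sketch. It assumes a non-greedy pattern `<.*?>`, and the sample string is made up for illustration (it is not the asker's original text). The name `remove_tags` is borrowed from the question.

```python
import re

# Non-greedy pattern: matches from "<" to the nearest ">".
TAG_RE = re.compile(r'<.*?>')

def remove_tags(raw_html):
    """Strip anything that looks like a tag; entities such as &amp; are left as-is."""
    return TAG_RE.sub('', raw_html)

# Hypothetical sample input, for illustration only.
sample = '<h1>Title</h1> A long text <a href="#">a link</a>'
print(remove_tags(sample))  # prints: Title A long text a link
```

Note that `.` does not match newlines by default, so a tag split across lines would be left untouched unless the pattern is compiled with `re.DOTALL`. A regex also cannot tell a real tag from a stray `<` in the text, so this is best suited to simple, well-formed input.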
Using BeautifulSoup
You could also use the additional BeautifulSoup package to extract all the raw text.
You will need to explicitly set a parser when calling BeautifulSoup.
I recommend "lxml", as mentioned in alternative answers; it is much more robust than the default one, 'html.parser' (i.e. the one available without an additional install).
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text
However, this still depends on an external library, so I recommend the first solution.
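If a real parser is preferred but external packages are ruled out, the standard library's own HTML parser may be enough. The following is only a sketch, written for Python 3; the class and function names (`_TextExtractor`, `strip_tags`) are hypothetical. On Python 2 the module is named `HTMLParser` and the `super()` call would need adjusting.

```python
from html.parser import HTMLParser  # Python 3 module name


class _TextExtractor(HTMLParser):
    """Collects the text nodes seen while parsing and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(raw_html):
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('<p>Fish &amp; chips</p>'))  # prints: Fish & chips
```

Unlike the regex, this decodes character references, but it also returns the contents of `<script>` and `<style>` elements as text, so those would need to be filtered separately if they occur.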