Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS in the middle. Assume we might see all the worst cases.
All I want is to strip all of the above tags and code and return the "text".
In simple terms:
This might contain JS, CSS, etc.
I am trying to use BeautifulSoup, but it's not removing the JS from the code. Now I am thinking of using a regex, but I'm not sure how to do it.
Edit 1:
Here is my attempt on a simple Bootstrap HTML page:
from bs4 import BeautifulSoup as bs
import requests
# MY_URL is a placeholder for the URL of the page
bs(requests.get(MY_URL).text).get_text()
Returned text:
html
Home
Le styles
body {
padding-top: 10%;
padding-left: 30%;
}
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];
_gaq.push(['_trackPageview']);
(function() {
var ga = do...............
})();
Solution
Django uses this function to strip tags from text:
import re

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))
(You won't need the force_unicode part)
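Note that a tag-stripping regex on its own only removes the tags themselves, so the bodies of script and style elements would still end up in the text, which is the problem described in the question. Below is a minimal sketch of one way around that using only the standard re module: remove the script/style blocks and HTML comments first, then strip the remaining tags. The helper name strip_tags_and_code and the sample string are made up for illustration and are not part of Django. Regexes are a fragile way to handle HTML, so treat this as a rough approach for simple pages rather than a robust parser.

```python
import re

# Hypothetical helper (not Django's code): patterns for the three clean-up steps.
# <script>...</script> and <style>...</style> blocks, including their contents.
_BLOCKS = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
# HTML comments, including IE conditional comments.
_COMMENTS = re.compile(r'<!--.*?-->', re.DOTALL)
# Any remaining tag.
_TAGS = re.compile(r'<[^>]*?>')


def strip_tags_and_code(html):
    """Return the text of `html` without tags, comments, or script/style bodies."""
    text = _BLOCKS.sub('', html)
    text = _COMMENTS.sub('', text)
    # Replace tags with a space so adjacent words do not get glued together.
    text = _TAGS.sub(' ', text)
    # Collapse the whitespace left behind by the removed markup.
    return ' '.join(text.split())


# Illustrative input only, loosely modelled on the page in the question.
sample = (
    '<html><head><style>body { padding-top: 10%; }</style></head>'
    '<body><h1>Home</h1><!-- /container -->'
    '<script>var _gaq = _gaq || [];</script><p>Sample Page 1</p></body></html>'
)
print(strip_tags_and_code(sample))
```

For the sample string this prints "Home Sample Page 1". HTML entities such as &amp;amp; are left untouched by this sketch; html.unescape from the standard library could be applied to the result if decoded text is needed.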