I am trying to convert an HTML block to text using Python.
Input:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Desired output:
Lorem
ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo
ligula eget dolor. Aenean massa
Consectetuer adipiscing elit.
Some
Link Aenean commodo ligula eget dolor. Aenean massa
Aenean
massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean
commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit
amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.
Aenean massa
Consectetuer adipiscing elit. Aenean commodo
ligula eget dolor. Aenean massa
I have tried using the html2text module without much success (I am quite new to Python :)).
Here is what I have tried:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print html2text.html2text(txt)
The "txt" object produces the HTML block above. I'd like to convert it to text and print it on the screen.
Any help with the piece of code would be much appreciated.
Solution
What am I missing? soup.get_text() gives exactly the same output you wanted...
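from bs4 import BeautifulSoup
# html is the markup string to convert; naming the parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())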
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
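The same call also works on the "txt" object from the question, since soup.find() returns a Tag and a Tag has get_text() too. A minimal sketch, assuming the page looks roughly like the made-up html string below (the markup and the "header" div are illustrative, not the asker's real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page fetched in the question.
html = """
<html><body>
<div class="header">Navigation</div>
<div class="body"><p>Lorem ipsum <b>dolor</b> sit amet.</p><p>Aenean <a href="#">Some Link</a> massa.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a Tag, not a string, so ask the Tag for its text directly.
txt = soup.find("div", {"class": "body"})
text = txt.get_text()

print(text)
```

Only the text inside the selected div is returned, and adjacent paragraphs are joined with no separator, which is why "massa.Lorem" runs together in the output above.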
EDIT - And to keep newlines, as pointed out by @t-8ch:
print(soup.get_text('\n'))
PS! To be exact, you can replace each newline with a double one -- then it is identical to your example :)
soup.get_text().replace('\n','\n\n')
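To see what the separator argument does, here is a small sketch on a made-up snippet (the html string is an assumption for illustration). get_text() joins every text node with the separator, so inline tags such as b also produce breaks. Passing strip=True additionally trims whitespace around each piece:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = "<div><p>First <b>bold</b> line.</p><p>Second line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# No separator: all text nodes are concatenated as-is.
joined = soup.get_text()

# Separator only: a newline goes between every pair of text nodes.
separated = soup.get_text("\n")

# Separator plus strip=True: each text node is stripped before joining.
stripped = soup.get_text("\n", strip=True)

print(repr(joined))
print(repr(separated))
print(repr(stripped))
```

If breaks are wanted only between block elements, the separator alone will not do that; it may be simpler to call get_text() on each paragraph tag and join the results yourself.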