How to handle tables, lists, headings, etc. in Python?

You can use a tool like python-goose, which is designed to extract articles from HTML pages.

Alternatively, I wrote the following small program, which works quite well:

import html5lib

with open('page.html') as f:
    doc = html5lib.parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]

def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children.
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # replace multiple spaces with a single space
    out = ' '.join(out.split())
    return out

def parse(element):
    # these elements can contain other blocks inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            # no direct text: descend into the child blocks
            for child in element:
                yield from parse(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        # everything else (tables, scripts, ...) is reported and skipped
        try:
            print('> ignored', element.tag)
        except Exception:
            pass

# keep only blocks longer than 80 characters
for e in filter(lambda x: len(x) > 80, parse(body)):
    print(e)
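To try the block-walking idea without installing html5lib or lxml, here is a minimal sketch that runs on an inline sample. It is an approximation under stated assumptions, not the program above: it swaps in the standard library's xml.etree.ElementTree, so it only accepts well-formed markup, and the names `walk`, `CONTAINERS`, `TEXT_BLOCKS` and `SAMPLE` are made up for illustration. The 80-character filter is left out so the short sample lines are visible.

```python
import xml.etree.ElementTree as ET

# Assumed tag sets, copied from the program above.
CONTAINERS = {'div', 'li', 'a', 'body', 'ul'}            # may contain other blocks
TEXT_BLOCKS = {'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}  # assumed to hold only inlines

def sanitize(element):
    """Collapse all text inside an element into one whitespace-normalized line."""
    return ' '.join(' '.join(element.itertext()).split())

def walk(element):
    """Yield one line of text per block, descending into container elements."""
    if element.tag in CONTAINERS:
        if element.text is None or element.text.isspace():
            # No direct text: look at the child blocks instead.
            for child in element:
                yield from walk(child)
        else:
            yield sanitize(element)
    elif element.tag in TEXT_BLOCKS:
        yield sanitize(element)
    # Any other tag (for example table) is skipped silently in this sketch.

# Hypothetical, well-formed sample document.
SAMPLE = """<body>
  <div>
    <h1>A   short
      title</h1>
    <p>First <b>paragraph</b> with
       <i>inline</i> markup.</p>
    <ul><li>direct text in a list item</li></ul>
    <table><tr><td>ignored cell</td></tr></table>
  </div>
</body>"""

blocks = list(walk(ET.fromstring(SAMPLE)))
for line in blocks:
    print(line)
```

Note that the table cell never appears in the output: like the program above, this sketch only collects headings, paragraphs and list items, so tables would need their own branch.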
