Finding elements in HTML: how can I use ElementTree to parse HTML and find text that matches a specific regex?

ElementTree, like lxml, searches HTML by tag. For example, to locate all <li> tags and get their text:

    import xml.etree.ElementTree as et

    tree = et.parse('data.html')
    html_tag = tree.getroot()

    for li in html_tag.iter('li'):
        text = li.text
        print(text)

    --output:--
    Name: $name
    Age: $age
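The same lookup can be tried without a file on disk. The snippet below is a minimal sketch, not the answer's own code: the `MARKUP` string is a hypothetical stand-in for data.html, and `li_texts` is an illustrative helper name. It assumes the markup is well-formed, because ElementTree is an XML parser and raises `ParseError` on unclosed tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for data.html; ElementTree needs well-formed markup.
MARKUP = """
<html>
  <body>
    <ul>
      <li>Name: $name</li>
      <li>Age: $age</li>
    </ul>
  </body>
</html>
"""


def li_texts(markup):
    """Return the stripped text of every <li> element in the markup."""
    root = ET.fromstring(markup)
    # itertext() also collects text nested inside child tags such as <b>,
    # which li.text alone would miss.
    return ["".join(li.itertext()).strip() for li in root.iter("li")]


print(li_texts(MARKUP))
```

Using `itertext()` instead of `li.text` is a design choice for this sketch: `li.text` only holds the text before the first child element.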

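If the target text can be in any tag, you can do the following:

    import xml.etree.ElementTree as et
    import re

    tree = et.parse('data.html')
    html_tag = tree.getroot()

    pattern = r"""
        \$     # a literal $ sign...
        .+?    # followed by one or more characters, non-greedy
               # (with .*? the match would stop right after the $)
        \b     # up to the first word boundary
    """

    for tag in html_tag.iter('*'):  # '*' => all tags
        text = (tag.text or '').strip()  # tag.text is None for tags with no text
        if text:
            match_list = re.findall(pattern, text, flags=re.X)
            print(match_list)

    --output:--
    ['$name']
    ['$age']

"How do I store these values into a data structure that I could iterate through in the future?"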

You can use the shelve module:

    $ cat data.html
    ...
    <li>Name: $name</li>
    <li>Age: $age</li>
    <li>Dogs: $dog1, $dog2</li>
    ...

    import xml.etree.ElementTree as et
    import re
    import shelve
    import collections as coll

    tree = et.parse('data.html')
    html_tag = tree.getroot()

    pattern = r"""
        \$     #Match a literal $ sign...
        .+?    #followed by any character, 1 or more times, non-greedy
        \b     #followed by the (first) word boundary
    """

    results = coll.defaultdict(list)

    for tag in html_tag.iter('*'):
        text = (tag.text or '').strip()  # tag.text is None for tags with no text
        if text:
            match_list = re.findall(pattern, text, flags=re.X)
            if match_list:
                results['data.html'].extend(match_list)

    print(results)

    # Persist the results; shelve pickles the value stored under the key.
    with shelve.open('mydb.db') as db:
        db['html vars'] = results

    # Later (for example in another run), read the values back and iterate.
    with shelve.open('mydb.db') as db:
        for key, val in db['html vars'].items():
            print("{}: {}".format(key, val))

    --output:--
    defaultdict(<class 'list'>, {'data.html': ['$name', '$age', '$dog1', '$dog2']})
    data.html: ['$name', '$age', '$dog1', '$dog2']
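The pattern can also be checked on its own, without parsing any HTML. The following is a small sketch under stated assumptions: `find_vars` is a hypothetical helper, and `SIMPLE_PATTERN` is a shorter alternative that is assumed to behave the same only for plain identifiers such as `$name` or `$dog1`.

```python
import re

# The verbose pattern: a $ sign, then as few characters as
# possible (at least one), up to the first word boundary.
VERBOSE_PATTERN = re.compile(r"""
    \$      # a literal $ sign
    .+?     # one or more characters, non-greedy
    \b      # the first word boundary
""", flags=re.X)

# Assumed alternative for plain identifiers: $ followed by word characters.
SIMPLE_PATTERN = re.compile(r"\$\w+")


def find_vars(text, pattern=VERBOSE_PATTERN):
    """Return every $variable found in text."""
    return pattern.findall(text)


print(find_vars("Dogs: $dog1, $dog2"))
print(find_vars("Dogs: $dog1, $dog2", SIMPLE_PATTERN))
```

The `+` matters here: with `.*?` the non-greedy part may match nothing, and since a word boundary sits directly between `$` and the first letter, the match would stop at the bare `$`.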

If your ultimate goal is to replace those variables in the HTML, your format already matches Python's string.Template syntax:

    import string

    with open('data.html') as f:
        template = string.Template(f.read())

    values = {
        'name': 'socal_javaguy',
        'age': 25,
        'dog1': 'Rover',
        'dog2': 'Jane',
    }

    results = template.substitute(values)
    print(results)

    --output:--
    ...
    <li>Name: socal_javaguy</li>
    <li>Age: 25</li>
    <li>Dogs: Rover, Jane</li>
    ...
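One thing to watch with `substitute()` is that it raises `KeyError` when a variable has no value. The sketch below, which uses an inline template string and a deliberately incomplete `values` dict as illustrative assumptions, shows `safe_substitute()` leaving unknown variables in place, and `$$` producing a literal dollar sign.

```python
import string

# Inline stand-in for the file contents; "$$5" is an escaped literal "$5".
template = string.Template(
    "<li>Name: $name</li><li>Dogs: $dog1, $dog2</li><li>Fee: $$5</li>"
)

values = {"name": "socal_javaguy", "dog1": "Rover"}  # 'dog2' is missing on purpose

# substitute() is strict: a missing variable raises KeyError.
try:
    template.substitute(values)
    missing = None
except KeyError as exc:
    missing = exc.args[0]

# safe_substitute() fills in what it can and leaves the rest untouched.
partial = template.safe_substitute(values)

print(missing)
print(partial)
```

Which of the two fits better depends on whether an unfilled variable in the page should be treated as an error or tolerated.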