Suppose there is a link, "http://www.someHTMLPageWithTwoForms.com", which is basically an HTML page with two forms (say Form 1 and Form 2). I have code like this:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
for field in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if field.has_key('name'):
        print field['name']
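This returns the names of all the fields in both Form 1 and Form 2 of my HTML page. Is there a way to get only the field names that belong to one particular form (say, Form 2 only)?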
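Solution:
This kind of parsing is also very easy with lxml (which I personally prefer over BeautifulSoup because it supports XPath). For example, the following snippet prints all the field names (if any) that belong to the form named "form2":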
# you can ignore this part, it's only here for the demo
from StringIO import StringIO
HTML = StringIO("""
""")
# here goes the useful code
import lxml.html
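tree = lxml.html.parse(HTML) # you can pass parse() a file-like object or a URL
root = tree.getroot()
# select only the <form> elements whose name attribute is "form2"
for form in root.xpath('//form[@name="form2"]'):
    # getchildren() yields only the form's direct children
    for field in form.getchildren():
        if 'name' in field.keys():
            print field.get('name')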
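The snippets above are Python 2. For readers on Python 3 without lxml or BeautifulSoup installed, here is a rough sketch of the same idea using only the standard library's html.parser. This is a swapped-in technique, not the lxml approach from the answer. The sample HTML, the FIELD_TAGS set, the FormFieldCollector class and the field_names_in_form helper are all made up for illustration. Because the parser tracks whether it is currently inside the target form, it also picks up fields nested in other tags (such as a <p>), which a walk over direct children would miss.

```python
from html.parser import HTMLParser

# Hypothetical sample page: form and field names are invented for this demo.
SAMPLE_HTML = """
<html><body>
  <form name="form1">
    <input type="text" name="user">
    <input type="submit" name="go1">
  </form>
  <form name="form2">
    <p><input type="text" name="email"></p>
    <select name="country"><option>FR</option></select>
    <input type="submit">
  </form>
</body></html>
"""

# Tags treated as form fields in this sketch (an assumption, adjust as needed).
FIELD_TAGS = {"input", "select", "textarea", "button"}


class FormFieldCollector(HTMLParser):
    """Collect the name attribute of fields found inside one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.inside_target = False  # True while parsing inside the wanted form
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            # Entering a form: remember whether it is the one we want.
            self.inside_target = attributes.get("name") == self.form_name
        elif self.inside_target and tag in FIELD_TAGS and attributes.get("name"):
            self.field_names.append(attributes["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside_target = False


def field_names_in_form(html, form_name):
    """Return the named fields of the form called form_name, in document order."""
    parser = FormFieldCollector(form_name)
    parser.feed(html)
    parser.close()
    return parser.field_names


print(field_names_in_form(SAMPLE_HTML, "form2"))  # ['email', 'country']
```

The unnamed submit button in form2 is skipped because it has no name attribute, mirroring the `'name' in field.keys()` check in the answer. For real-world, badly formed pages, lxml or BeautifulSoup will likely be more forgiving than this sketch.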
Tags: python, parsing
Source: https://codeday.me/bug/20190626/1294869.html