Processing VOC2007 annotation XML files with regular expressions from Python's re standard library

Code: 正则表达式处理xml文件.py


# Contents of the XML file:
"""
<annotation>
	<folder>VOC2007</folder>
	<filename>000005.jpg</filename>
	<source>
		<database>The VOC2007 Database</database>
		<annotation>PASCAL VOC2007</annotation>
		<image>flickr</image>
		<flickrid>325991873</flickrid>
	</source>
	<owner>
		<flickrid>archintent louisville</flickrid>
		<name>?</name>
	</owner>
	<size>
		<width>500</width>
		<height>375</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>chair</name>
		<pose>Rear</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>263</xmin>
			<ymin>211</ymin>
			<xmax>324</xmax>
			<ymax>339</ymax>
		</bndbox>
	</object>
	<object>
		<name>chair</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>165</xmin>
			<ymin>264</ymin>
			<xmax>253</xmax>
			<ymax>372</ymax>
		</bndbox>
	</object>
	<object>
		<name>chair</name>
		<pose>Unspecified</pose>
		<truncated>1</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>5</xmin>
			<ymin>244</ymin>
			<xmax>67</xmax>
			<ymax>374</ymax>
		</bndbox>
	</object>
	<object>
		<name>chair</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>241</xmin>
			<ymin>194</ymin>
			<xmax>295</xmax>
			<ymax>299</ymax>
		</bndbox>
	</object>
	<object>
		<name>chair</name>
		<pose>Unspecified</pose>
		<truncated>1</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>277</xmin>
			<ymin>186</ymin>
			<xmax>312</xmax>
			<ymax>220</ymax>
		</bndbox>
	</object>
</annotation>

"""


import re

# xmlPath = r'000020.xml'  # relatively few objects
xmlPath = r'000005.xml'  # relatively many objects

# Read the whole annotation file into one string; the with-block closes the file.
with open(xmlPath, encoding='utf-8') as xmlFile:
    xml = xmlFile.read()
# print(xml)
# print(type(xml))  # <class 'str'>

# One capture group per field of interest. \s* tolerates any indentation
# between <object> and <name>; .*? lazily skips the tags in between.
s = (r"<object>\s*<name>(.*?)</name>.*?"
     r"<difficult>(.*?)</difficult>.*?"
     r"<xmin>(.*?)</xmin>.*?"
     r"<ymin>(.*?)</ymin>.*?"
     r"<xmax>(.*?)</xmax>.*?"
     r"<ymax>(.*?)</ymax>.*?"
     r"</bndbox>.*?</object>")

# re.S lets "." match newlines, so one match can span a whole <object> block.
pattern = re.compile(s, re.S)
items = pattern.findall(xml)

# Each item is a tuple of the six captured strings, in pattern order.
for item in items:
    classType = item[0]
    difficult = item[1]
    xmin = int(item[2])
    ymin = int(item[3])
    xmax = int(item[4])
    ymax = int(item[5])
    info = (
        "difficult:{0:^4}classType:{1:^10}xmin:{2:^6}ymin:{3:^6}xmax:{4:^6}ymax:{5:^6}"
        .format(difficult, classType, xmin, ymin, xmax, ymax))
    print(info)
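
The same idea can be tried without any file on disk. The following is a minimal sketch, not the script above: it assumes an inline sample (`SAMPLE_XML`, two objects copied from the annotation shown earlier) and uses named groups with `finditer` so that each field is read by name rather than by tuple index. The names `OBJECT_RE` and `parse_objects` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical inline sample: two <object> entries copied from the annotation above.
SAMPLE_XML = """
<annotation>
  <object>
    <name>chair</name>
    <pose>Rear</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>263</xmin>
      <ymin>211</ymin>
      <xmax>324</xmax>
      <ymax>339</ymax>
    </bndbox>
  </object>
  <object>
    <name>chair</name>
    <pose>Unspecified</pose>
    <truncated>1</truncated>
    <difficult>1</difficult>
    <bndbox>
      <xmin>5</xmin>
      <ymin>244</ymin>
      <xmax>67</xmax>
      <ymax>374</ymax>
    </bndbox>
  </object>
</annotation>
"""

# Named groups (?P<...>) label each captured field; re.S lets "." cross newlines.
OBJECT_RE = re.compile(
    r"<object>\s*"
    r"<name>(?P<name>.*?)</name>.*?"
    r"<difficult>(?P<difficult>\d+)</difficult>.*?"
    r"<xmin>(?P<xmin>\d+)</xmin>.*?"
    r"<ymin>(?P<ymin>\d+)</ymin>.*?"
    r"<xmax>(?P<xmax>\d+)</xmax>.*?"
    r"<ymax>(?P<ymax>\d+)</ymax>.*?"
    r"</object>",
    re.S,
)


def parse_objects(xml_text):
    """Return one dict per <object> block found in xml_text."""
    objects = []
    for match in OBJECT_RE.finditer(xml_text):
        fields = match.groupdict()
        objects.append({
            "name": fields["name"],
            "difficult": int(fields["difficult"]),
            "box": tuple(int(fields[k]) for k in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects


if __name__ == "__main__":
    for obj in parse_objects(SAMPLE_XML):
        print(obj)
```

Using `\d+` for the numeric fields means a malformed coordinate simply fails to match instead of raising inside `int()`; whether that is preferable depends on how strictly the annotations should be validated.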

Console output:

PS C:\Users\chenxuqi\Desktop\新建文件夹\test>  & 'D:\Python\Python37\python.exe' 'c:\Users\chenxuqi\.vscode\extensions\ms-python.python-2020.11.358366026\pythonFiles\lib\python\debugpy\launcher' '53259' '--' 'c:\Users\chenxuqi\Desktop\新建文件夹\test\正则表达式处理xml文件.py'
difficult: 0  classType:  chair   xmin: 263  ymin: 211  xmax: 324  ymax: 339  
difficult: 0  classType:  chair   xmin: 165  ymin: 264  xmax: 253  ymax: 372  
difficult: 1  classType:  chair   xmin:  5   ymin: 244  xmax:  67  ymax: 374  
difficult: 0  classType:  chair   xmin: 241  ymin: 194  xmax: 295  ymax: 299  
difficult: 1  classType:  chair   xmin: 277  ymin: 186  xmax: 312  ymax: 220  
PS C:\Users\chenxuqi\Desktop\新建文件夹\test> 
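
The column layout in the output comes from the `^` alignment flag in the format string: `{2:^6}` centers the value in a field 6 characters wide, and when the padding is odd the extra space goes on the right. A small sketch of that behavior, reusing the same format string (the helper name `format_row` is hypothetical, introduced only for this illustration):

```python
def format_row(difficult, class_type, xmin, ymin, xmax, ymax):
    """Format one object using the same centered-field template as the script."""
    template = ("difficult:{0:^4}classType:{1:^10}"
                "xmin:{2:^6}ymin:{3:^6}xmax:{4:^6}ymax:{5:^6}")
    return template.format(difficult, class_type, xmin, ymin, xmax, ymax)


# Values of the third object in 000005.xml, as listed in the annotation above.
row = format_row("1", "chair", 5, 244, 67, 374)
print(repr(row))          # repr makes the trailing padding visible
print(repr("{:^6}".format(5)))
```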

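Reference: Cui Qingcai (崔庆才), Python 3 Web Crawler Development in Practice (python3网络爬虫开发实战)
Reference: Beijing Institute of Technology (北京理工大学), Python Web Crawling and Information Extraction (Python网络爬虫与信息提取)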
