Extracting XML to txt with Python: how to parse an XML file into a txt file using Python

Latest recommended article published 2021-12-14 10:34:50

协和临床营养科陈


Article link: https://blog.csdn.net/weixin_42541463/article/details/113495706


How do I parse an XML file so it removes all the html tags and all other unnecessary tags like ?

0001

Civil Rights History Project

Interview completed by the Southern Oral History Program

under contract to the

Smithsonian Institution’s National Museum of African American History & Culture

and the Library of Congress, 2013

Interviewee: Oliver W. Hill, Jr.

Interview Date: August 17, 2013

Location: Richmond, Virginia

Interviewer: David Cline

Videographer: John Bishop

Length: 01:13:30

Unidentified Announcer: From the Library of Congress and the Smithsonian National Museum of African American History and Culture.

David Cline: The way that I like to work, if it’s okay with you, is I work pretty chronologically. I mean, I always think that the work that people end up doing in their lives is always shaped by their families and where they come from and the communities in which they’re raised and all of that. So, I take a pretty, you know, sort of traditional family history approach, at least to where we start.

Oliver Hill: That sounds good.

David Cline: Great. And then, and certainly, in your case, I know there will be a lot of rich relevant memories pertinent to the Civil Rights Movement, but we’ll talk about other things, as well.

Oliver Hill: Okay.

0002

David Cline: And then carry on through your life and to your own career, and where you see these things sort of playing out.

Oliver Hill: Okay.

David Cline: If that sounds good to you.

Oliver Hill: Sounds good.

David Cline: It’s really—it’s your interview. You’re in charge. If you want to take a break at any point, if you want to take it in a completely different direction, it’s up to you. We’re very informal in that way.

Current output

The output still contains empty lines, although I tried removing them with a regex, and it still has these unwanted tags:

-->

0001

Civil Rights History Project

Interview completed by the Southern Oral History Program

under contract to the

Smithsonian Institution’s National Museum of African American History & Culture

and the Library of Congress, 2013

Interviewee: Oliver W. Hill, Jr.

Interview Date: August 17, 2013

Location: Richmond, Virginia

Interviewer: David Cline

Videographer: John Bishop

Length: 01:13:30

Unidentified Announcer: From the Library of Congress and the Smithsonian National Museum of African American History and Culture.

Oliver Hill: That sounds good.

David Cline: Great. And then, and certainly, in your case, I know there will be a lot of rich relevant memories pertinent to the Civil Rights Movement, but we’ll talk about other things, as well.

Oliver Hill: Okay.

0002

David Cline: And then carry on through your life and to your own career, and where you see these things sort of playing out.

Oliver Hill: Okay.

David Cline: If that sounds good to you.

Oliver Hill: Sounds good.

Current code

import os
import re

def remove_html_tags(data):
    cleanr = re.compile('<.>')
    cleantext = re.sub(cleanr, '', data)
    return cleantext

with open('2015669201.xml', 'r') as inFile:
    data = inFile.read().splitlines()

for line in data:
    cleaned_data = remove_html_tags(line)
    print(re.sub(r'\n\s*\n','\n',cleaned_data,re.MULTILINE))

How can I fix this so that the output contains only the text from the HTML file, without any HTML tags or other content?
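One possible direction, offered as a sketch rather than a definitive answer: three things in the code above are likely to cause the symptoms described. Processing the file line by line means a comment that spans several lines is never matched as a whole, which would leave a stray `-->` behind. The fourth positional argument of `re.sub` is `count`, not `flags`, so `re.MULTILINE` is not applied as a flag. And because `splitlines()` already removed the newlines, the `\n\s*\n` pattern has nothing to match, so each empty line is still printed. The version below assumes the file is well-formed XML and lets `xml.etree.ElementTree` collect the text instead of using a regex. The function name `xml_to_text` and the element names in the sample are hypothetical, since the real markup is not shown in the question.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_source: str) -> str:
    """Return the text content of an XML string, one non-empty line per row."""
    # The default ElementTree parser discards comments and processing
    # instructions, so they never reach the output.
    root = ET.fromstring(xml_source)
    raw_text = "".join(root.itertext())
    # Strip each line and keep only those that still contain text.
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample standing in for the real transcript file;
# the element names are assumptions made for illustration only.
sample = """<doc>
  <!-- a comment
       spanning two lines -->
  <page>0001</page>
  <head>Civil Rights History Project</head>

  <p>Oliver Hill: <hi>Okay.</hi></p>
</doc>"""

print(xml_to_text(sample))
```

For the real file, the same idea would start from `ET.parse('2015669201.xml').getroot()` instead of `ET.fromstring`. If the file turns out not to be well-formed XML, the parser raises `ET.ParseError`, and a more lenient HTML parser would be needed instead.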
