How to parse an XML file into a txt file with Python

How do I parse an XML file so it removes all the html tags and all other unnecessary tags like ?

0001

Civil Rights History Project

Interview completed by the Southern Oral History Program

under contract to the

Smithsonian Institution’s National Museum of African American History & Culture

and the Library of Congress, 2013

Interviewee: Oliver W. Hill, Jr.

Interview Date: August 17, 2013

Location: Richmond, Virginia

Interviewer: David Cline

Videographer: John Bishop

Length: 01:13:30

Unidentified Announcer: From the Library of Congress and the Smithsonian National Museum of African American History and Culture.

David Cline: The way that I like to work, if it’s okay with you, is I work pretty chronologically. I mean, I always think that the work that people end up doing in their lives is always shaped by their families and where they come from and the communities in which they’re raised and all of that. So, I take a pretty, you know, sort of traditional family history approach, at least to where we start.

Oliver Hill: That sounds good.

David Cline: Great. And then, and certainly, in your case, I know there will be a lot of rich relevant memories pertinent to the Civil Rights Movement, but we’ll talk about other things, as well.

Oliver Hill: Okay.

0002

David Cline: And then carry on through your life and to your own career, and where you see these things sort of playing out.

Oliver Hill: Okay.

David Cline: If that sounds good to you.

Oliver Hill: Sounds good.

David Cline: It’s really—it’s your interview. You’re in charge. If you want to take a break at any point, if you want to take it in a completely different direction, it’s up to you. We’re very informal in that way.

Current output

The output still contains the empty lines, even though I tried to remove them with a regex, and it still has unwanted tag remnants such as:

-->

0001

Civil Rights History Project

Interview completed by the Southern Oral History Program

under contract to the

Smithsonian Institution’s National Museum of African American History & Culture

and the Library of Congress, 2013

Interviewee: Oliver W. Hill, Jr.

Interview Date: August 17, 2013

Location: Richmond, Virginia

Interviewer: David Cline

Videographer: John Bishop

Length: 01:13:30

Unidentified Announcer: From the Library of Congress and the Smithsonian National Museum of African American History and Culture.

David Cline: The way that I like to work, if it’s okay with you, is I work pretty chronologically. I mean, I always think that the work that people end up doing in their lives is always shaped by their families and where they come from and the communities in which they’re raised and all of that. So, I take a pretty, you know, sort of traditional family history approach, at least to where we start.

Oliver Hill: That sounds good.

David Cline: Great. And then, and certainly, in your case, I know there will be a lot of rich relevant memories pertinent to the Civil Rights Movement, but we’ll talk about other things, as well.

Oliver Hill: Okay.

0002

David Cline: And then carry on through your life and to your own career, and where you see these things sort of playing out.

Oliver Hill: Okay.

David Cline: If that sounds good to you.

Oliver Hill: Sounds good.

David Cline: It’s really—it’s your interview. You’re in charge. If you want to take a break at any point, if you want to take it in a completely different direction, it’s up to you. We’re very informal in that way.

Current code

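import re

def remove_html_tags(data):
    # Non-greedy match for anything that looks like a tag.
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', data)
    return cleantext

with open('2015669201.xml', 'r') as inFile:
    data = inFile.read().splitlines()

for line in data:
    # Tags are stripped one line at a time, so a tag or comment that
    # spans several lines is not matched.
    cleaned_data = remove_html_tags(line)
    # Attempt to drop empty lines (this is the part that has no effect).
    print(re.sub(r'\n\s*\n','\n',cleaned_data,re.MULTILINE))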
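One likely reason the empty lines survive in the code above: `splitlines()` removes the newline characters, so a pattern that looks for `\n\s*\n` can never match inside a single line. Separately, the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so `re.MULTILINE` passed as the fourth positional argument is read as `count`, not as a flag. The sketch below contrasts the two approaches on a made-up sample string; `collapse_blank_lines` is a hypothetical helper name, not something from the original post.

```python
import re

def collapse_blank_lines(text):
    """Collapse each run of empty or whitespace-only lines into one newline."""
    # No flags are needed: the pattern has no ^ or $ for re.MULTILINE to affect.
    return re.sub(r'\n\s*\n', '\n', text)

# Made-up sample text, for illustration only.
text = "0001\n\n\nCivil Rights History Project\n\n  \nInterviewee: Oliver W. Hill, Jr.\n"

# Line by line: splitlines() has already removed every '\n', so nothing matches.
per_line = [collapse_blank_lines(line) for line in text.splitlines()]

# Whole text at once: the runs of blank lines are visible to the pattern.
whole = collapse_blank_lines(text)
print(whole)
```

Applying the substitution to the whole file content, rather than to each line, is what lets the pattern see consecutive newlines.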

How can I fix this so that the result contains only the text from the file, with no HTML tags or other markup?
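One possible approach, sketched here under the assumption that the file is well-formed XML, is to let a real parser extract the text instead of stripping tags with a regex. The standard-library `xml.etree.ElementTree` parser discards comments by default, and `itertext()` yields only text nodes, so multi-line comments leave no stray `-->` behind. The function name `xml_to_text` and the sample markup are hypothetical; the real structure of `2015669201.xml` is not shown in the question.

```python
import xml.etree.ElementTree as ET

def xml_to_text(xml_string):
    """Return the text content of an XML document, one non-empty line per row."""
    root = ET.fromstring(xml_string)
    # itertext() walks the tree and yields text only; comments are dropped
    # by the default parser, and tags never appear in the output.
    raw = ''.join(root.itertext())
    # Strip each line and keep only the ones that still contain something.
    lines = (line.strip() for line in raw.splitlines())
    return '\n'.join(line for line in lines if line)

# Hypothetical sample markup, for illustration only.
sample = """<doc>
  <!-- a comment that
       spans two lines -->
  <page>0001</page>
  <p><b>Civil Rights History Project</b></p>

  <p>Interviewee: Oliver W. Hill, Jr.</p>
</doc>"""

result = xml_to_text(sample)
print(result)
```

To write the result to a text file, the string returned by `xml_to_text` can be passed to `write()` on a file opened in `'w'` mode. If the file is not well-formed XML (for example, if it contains bare HTML entities), `ET.fromstring` raises `ParseError`, and a more lenient HTML parser would be needed instead.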
