How do I convert HTML files to TXT format? How can Python convert an HTML file into a readable txt file?

I have a lot of HTML files that look like this:

Summary:

 According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.


INDUSTRY CLASSIFICATION:

SIC Code:

0000

Sector:

N/A

Industry:

N/A

What I want to do is extract the text in the middle of the file and convert it into a human-readable format.

In this example, it is:

According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
("FrontPoint") sold more than six million shares of Human Genome
Sciences, Inc. ("HGSI") common stock while their portfolio manager
possessed material negative non-public information concerning the
HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class
Action Complaint, amending the named defendants and securities
violations. On March 22, 2011, a motion for appointment as lead
plaintiff and for approval of selection of lead counsel was filed.
The defendants responded to the First Amended Complaint by filing a
motion to dismiss on March 28, 2011.
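The target layout above looks like plain text re-wrapped to a fixed column width. A minimal sketch of that re-wrapping with the standard-library `textwrap` module follows. The function name `wrap_paragraphs` and the width of 70 are assumptions chosen for illustration, not something the question specifies.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    # Split on blank lines and drop empty chunks.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Collapse internal whitespace, then let textwrap choose the line breaks.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    demo = "On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint."
    print(wrap_paragraphs(demo, width=40))
```

Paragraph boundaries are kept as blank lines, so the two paragraphs of the summary stay separate after wrapping.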

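I know I have to do three things:

Extract the text in the middle of the file.

Replace "<br>" (the HTML line-break tag) with "\n".

Replace "&nbsp;" (the non-breaking space entity) with " " (a single space).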
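The second and third steps can be sketched with the standard library alone. This is only an illustration: the helper name `normalize_fragment` and the sample string are hypothetical. It uses a small regular expression instead of a literal `replace` so that the different spellings of the line-break tag are all covered, and it uses `html.unescape` so that other entities are decoded along with the non-breaking space.

```python
import html
import re


def normalize_fragment(fragment):
    """Turn line-break tags into newlines and HTML entities into plain characters."""
    # Step 2: <br>, <br/>, <br /> and <BR> all become a newline.
    text = re.sub(r"<br\s*/?>", "\n", fragment, flags=re.IGNORECASE)
    # Step 3: unescape decodes the non-breaking space entity to U+00A0,
    # which is then mapped to an ordinary space.
    return html.unescape(text).replace("\u00a0", " ")


if __name__ == "__main__":
    sample = "First&nbsp;line<br />Second&nbsp;line<BR>Third"  # hypothetical input
    print(normalize_fragment(sample))
```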

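I know the last two are easy, just a matter of using the replace method in Python, but I don't know how to achieve the first one.

I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.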
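One possible approach to the first step, sketched here with the standard-library `html.parser` module rather than BeautifulSoup: strip the markup down to text, then keep whatever sits between two visible headings. The actual markup of the files is not shown in the question, so this assumes the headings "Summary:" and "INDUSTRY CLASSIFICATION:" appear as plain text exactly as in the example. The names `TextExtractor`, `extract_summary`, and `sample_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, emitting a newline for each <br> or <p> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_summary(html_text, start="Summary:", end="INDUSTRY CLASSIFICATION:"):
    """Return the text between the `start` and `end` headings, or '' if `start` is absent."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Map non-breaking spaces (U+00A0) to ordinary spaces.
    text = "".join(parser.parts).replace("\u00a0", " ")
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        stop = len(text)  # no end heading: keep everything after the start heading
    return text[begin:stop].strip()


if __name__ == "__main__":
    # Hypothetical markup; the real files may be structured differently.
    sample_html = (
        "<html><body><b>Summary:</b><p>&nbsp;First paragraph.<br />"
        "Second paragraph.</p><b>INDUSTRY CLASSIFICATION:</b></body></html>"
    )
    print(extract_summary(sample_html))
```

Searching for the headings in the extracted text, rather than in the raw HTML, keeps the sketch independent of which tags wrap them. If the headings sit in distinctive elements, a BeautifulSoup selector would likely be more robust.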

Can anyone help me?

Thanks, and I'm sorry for my poor English.

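@Paul: I only want the Summary section. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to convert them into a format suitable for data mining (my teacher is trying to do this with SAS).

I don't know SAS, but I think it can probably handle a lot of txt files, so I want to convert these HTML files into plain txt files.

@Owen: I need to process a lot of HTML files, and I don't think this problem is too hard to handle, so I want to solve it directly in Python.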
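Since there are many files, the conversion would presumably run in a loop over a folder. Here is a rough sketch using `pathlib`. The function name `convert_folder`, the `*.html` pattern, the UTF-8 encoding, and the `convert` callback (whatever function turns one HTML string into text) are all assumptions for illustration.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert, encoding="utf-8"):
    """Apply convert() to every .html file in src_dir and write .txt files to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src.glob("*.html")):
        # errors="replace" keeps the loop going if a file has stray bytes.
        raw = html_path.read_text(encoding=encoding, errors="replace")
        out_path = dst / (html_path.stem + ".txt")
        out_path.write_text(convert(raw), encoding=encoding)
        written.append(out_path)
    return written
```

A call might look like `convert_folder("html_files", "txt_files", convert=my_converter)`, where both folder names and `my_converter` are placeholders. Each output file keeps the name of its source file with a `.txt` extension.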
