[Penn Chinese Treebank (CTB)] Reading the Data

CTB 8.0 contains the following types of files (the bracketed ranges are file IDs):


   Newswire: [0001-0325, 0400-0454, 0500-0540, 0600-0885, 0900-0931, 4000-4050] (suffix .nw.raw)
   Magazine articles: [0590-0596, 1001-1151] (suffix .mz.raw)
   Broadcast news: [2000-3145, 4051-4111]
   Broadcast conversations: [4112-4197]
   Weblogs: [4198-4411]
   Discussion forums: [5000-5558]
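
Because the raw format differs by genre, it helps to map a file ID to its genre before choosing a parser. The following is a minimal sketch: the ID ranges are copied from the list above, while the genre labels and the function name `genre_of` are illustrative choices, not part of CTB.

```python
from typing import Optional

# ID ranges copied from the list above; the dictionary keys are
# illustrative labels, not official CTB names.
GENRE_RANGES = {
    "newswire": [(1, 325), (400, 454), (500, 540), (600, 885), (900, 931), (4000, 4050)],
    "magazine": [(590, 596), (1001, 1151)],
    "broadcast_news": [(2000, 3145), (4051, 4111)],
    "broadcast_conversation": [(4112, 4197)],
    "weblog": [(4198, 4411)],
    "discussion_forum": [(5000, 5558)],
}


def genre_of(file_id: int) -> Optional[str]:
    """Return the genre label for a CTB file ID, or None if the ID is unlisted."""
    for genre, ranges in GENRE_RANGES.items():
        if any(low <= file_id <= high for low, high in ranges):
            return genre
    return None
```

For example, `genre_of(590)` falls in the magazine range, while an ID in a gap between ranges (such as 350) yields `None`.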

Of these, the following files can be used as gold-standard data (quoted from the corpus documentation):

The following is a list of files that are double-annotated and can be
regarded as gold standard files.

CTB-1 (69 files, 22,316 words)
chtb_001.fid - chtb_043.fid
chtb_144.fid - chtb_169.fid

CTB-3 (32 files, 12,027 words)
chtb_900.fid - chtb_931.fid

CTB-4 (7 files, 13,828 words)
chtb_1018.fid
chtb_1020.fid
chtb_1036.fid
chtb_1044.fid
chtb_1060.fid
chtb_1061.fid
chtb_1072.fid

CTB-5 (6 files, 15,052 words)
chtb_1118.fid
chtb_1119.fid
chtb_1132.fid
chtb_1141.fid
chtb_1142.fid
chtb_1148.fid

Total: 114 files, 63,223 words (12.46% of the corpus)
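
The gold-standard list above can be turned into a lookup set. This is only a sketch: the ID ranges come from the list, but representing files by integer ID and the names `GOLD_IDS` and `is_gold` are assumptions made for illustration.

```python
# File IDs taken from the gold-standard list above.
GOLD_IDS = (
    set(range(1, 44))        # CTB-1: chtb_001 - chtb_043
    | set(range(144, 170))   # CTB-1: chtb_144 - chtb_169
    | set(range(900, 932))   # CTB-3: chtb_900 - chtb_931
    | {1018, 1020, 1036, 1044, 1060, 1061, 1072}  # CTB-4
    | {1118, 1119, 1132, 1141, 1142, 1148}        # CTB-5
)


def is_gold(file_id: int) -> bool:
    """Return True if the file ID is on the double-annotated gold list."""
    return file_id in GOLD_IDS
```

The set contains 114 IDs, which agrees with the stated total of 114 files.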

 

Extracting the content of each file type

chtb_0001.nw.raw ~ chtb_0931.nw.raw

Example:

<S ID=1>
上海浦东开发与法制建设同步
</S>
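
Sentences in this format can be pulled out with a regular expression. The sketch below assumes every sentence is wrapped in `<S ID=n>` ... `</S>` exactly as in the example and that the file has already been decoded to a string; the function name `extract_s_sentences` is hypothetical.

```python
import re

# Matches <S ID=123> ... </S>, capturing the numeric ID and the inner text.
S_PATTERN = re.compile(r"<S ID=(\d+)>\s*(.*?)\s*</S>", re.DOTALL)


def extract_s_sentences(raw: str) -> list[tuple[int, str]]:
    """Return (sentence_id, text) pairs from <S ID=...> blocks."""
    return [(int(sid), text) for sid, text in S_PATTERN.findall(raw)]
```

The magazine files shown further down use the same `<S ID=...>` tag, so the same function should apply to them.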

chtb_4000.nw.raw ~ chtb_4050.nw.raw

Example:

<seg id="4">
韩国国立兽医科学检疫院检测后确认,该养鸭场发现了高致病性禽流感病毒。
</seg>
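
These files use a `<seg id="n">` wrapper instead. A minimal sketch of the extraction, assuming the ID is always a quoted integer as in the example (the function name `extract_seg_sentences` is hypothetical):

```python
import re

# Matches <seg id="4"> ... </seg>, capturing the numeric ID and the inner text.
SEG_PATTERN = re.compile(r'<seg id="(\d+)">\s*(.*?)\s*</seg>', re.DOTALL)


def extract_seg_sentences(raw: str) -> list[tuple[int, str]]:
    """Return (segment_id, text) pairs from <seg id="..."> blocks."""
    return [(int(seg_id), text) for seg_id, text in SEG_PATTERN.findall(raw)]
```

The weblog example further down has the same shape, so this function should cover it as well.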

 

chtb_1001.mz.raw ~ chtb_1151.mz.raw

Example:

<S ID=18718>
文.谢淑芬图.薛继光
</S>

 

chtb_2000.bn.raw ~ chtb_3145.bn.raw

Example:

<TEXT>
当年朝鲜战争中的韩国难民生还者拒绝接受美国总统克林顿星期四的声明,克林顿对当年美国军人打死韩国平民表示遗憾。
代表这些生还者的发言人表示:“克林顿的声明只是文过饰非。”
并誓言要将此案送交国际法庭。
克林顿总统在声明中对1950年7月老根米村附近发生的事件深表遗憾。
说那次的事件留下战争悲剧和战争创伤的痛苦记忆。
后来国防部长科恩表示:“美国将为当年伤亡的平民树立一座纪念碑,并且设立一个奖学金纪念战争死难者。”
死难者家属要求美国明确道歉,并且给予直接赔偿。
</TEXT>
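
Here the sentences carry no individual tags; judging by the example, each line inside `<TEXT>` is one sentence. The sketch below relies on that assumption, and the function name `extract_text_lines` is hypothetical.

```python
import re

# Captures everything between <TEXT> and </TEXT>.
TEXT_PATTERN = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_text_lines(raw: str) -> list[str]:
    """Return the non-empty lines found inside all <TEXT> blocks, in order."""
    sentences = []
    for block in TEXT_PATTERN.findall(raw):
        for line in block.splitlines():
            line = line.strip()
            if line:
                sentences.append(line)
    return sentences
```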

chtb_4051.bn.raw ~ chtb_4111.bn.raw

Example:

<segment id="10" start="303.376" end="305.365385407">
这个草案已经是修改过一次的。
</segment>
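
These segments also carry `start` and `end` attributes, which look like timestamps in seconds. A sketch that keeps them alongside the text, assuming the attribute order shown in the example (the `Segment` type and `extract_segments` name are illustrative):

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    seg_id: int
    start: float  # presumably seconds from the start of the recording
    end: float
    text: str


# Assumes the attributes always appear in the order id, start, end.
SEGMENT_PATTERN = re.compile(
    r'<segment id="(\d+)" start="([\d.]+)" end="([\d.]+)">\s*(.*?)\s*</segment>',
    re.DOTALL,
)


def extract_segments(raw: str) -> list[Segment]:
    """Return Segment records parsed from <segment ...> blocks."""
    return [
        Segment(int(seg_id), float(start), float(end), text)
        for seg_id, start, end, text in SEGMENT_PATTERN.findall(raw)
    ]
```

If only the sentence text is needed, take the `text` field of each record and ignore the timing fields.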

chtb_4112.bc.raw ~ chtb_4197.bc.raw

Example:

EMPTY

父母外出打工后,孩子留在了农村的家中。

他们被称为留守儿童。

外出打工的父母也是很无奈的,实际上他做出这种选择是非常无奈的。

中国一点二亿农民常年在外地务工,产生了近两千万留守儿童。

我想跟他们说,我好想他们。

他们日复一日年复一年的在孤独中期盼着被爱,在残缺中守望着亲情。

对孩子的情感交流,我觉得特别重要。

在这个条件容许的时候,多回家看看孩子。

EMPTY
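
The broadcast conversation files appear to be plain text with one sentence per line, interspersed with `EMPTY` lines. The sketch below treats `EMPTY` as a placeholder to skip, which is an assumption based only on the example above; the function name `extract_bc_sentences` is hypothetical.

```python
def extract_bc_sentences(raw: str, placeholder: str = "EMPTY") -> list[str]:
    """Return non-blank lines, dropping lines that consist only of the placeholder."""
    sentences = []
    for line in raw.splitlines():
        line = line.strip()
        if line and line != placeholder:
            sentences.append(line)
    return sentences
```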

 

chtb_4198.wb.raw ~ chtb_4411.wb.raw

Example:

<seg id="4">
在TVBS与辩论会同时播出的节目上,李敖说,他要告公视的状纸都写好了。
</seg>

 

chtb_5000.raw ~ chtb_5558.raw

Example:

<su id=p1su3> (the value after "su id=" can take various forms)


怎么有关方面就不明白呢?
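
Since the ID after `su id=` is not a plain number, the pattern has to accept any unquoted token. The sketch below assumes the sentence text runs from the opening `<su id=...>` tag up to the next tag of any kind; the example does not show whether a closing tag exists, so none is required. The function name `extract_su_sentences` is hypothetical.

```python
import re

# Captures the unquoted ID and everything up to the next "<".
SU_PATTERN = re.compile(r"<su id=([^>\s]+)>([^<]*)")


def extract_su_sentences(raw: str) -> list[tuple[str, str]]:
    """Return (su_id, text) pairs; blank lines inside a unit are dropped."""
    sentences = []
    for su_id, body in SU_PATTERN.findall(raw):
        lines = [line.strip() for line in body.splitlines() if line.strip()]
        if lines:
            sentences.append((su_id, "\n".join(lines)))
    return sentences
```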

 

 

 
