How to build a corpus: how do I build a translation corpus for Python NLTK?

Latest recommended article published on 2023-05-14 20:26:56

weixin_39943678


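Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39943678/article/details/113714444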


For translation-like datasets, NLTK can read corpora of word-aligned sentences with AlignedCorpusReader. The files must have the following format:

    first source sentence
    first target sentence
    first alignment
    second source sentence
    second target sentence
    second alignment

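That is, tokens are assumed to be separated by whitespace, and each sentence starts on its own line. For example, suppose you have the following directory structure:

    reader.py
    data/
        en-es.txt
        en-pt.txt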

where the content of en-es.txt is:

    This is an example
    Esto es un ejemplo
    0-0 1-1 2-2 3-3

and the content of en-pt.txt is:

    This is an example
    Esto é um exemplo
    0-0 1-1 2-2 3-3
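If the sentence pairs already exist in memory, files in this three-line layout can be generated rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name write_aligned_corpus is hypothetical and not part of NLTK.

```python
from pathlib import Path


def write_aligned_corpus(path, triples):
    """Write (source, target, alignment) triples, one item per line.

    Hypothetical helper for illustration; tokens inside each sentence are
    expected to be separated by single spaces already.
    """
    lines = []
    for source, target, alignment in triples:
        # Source sentence, target sentence, then the "i-j" alignment line.
        lines.extend([source, target, alignment])
    # Terminate the file with exactly one newline after the last alignment.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling write_aligned_corpus('data/en-es.txt', [('This is an example', 'Esto es un ejemplo', '0-0 1-1 2-2 3-3')]) would produce the first file shown above, provided the data directory already exists.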

This toy example can be loaded with the following script:

    # reader.py
    from nltk.corpus.reader.aligned import AlignedCorpusReader

    reader = AlignedCorpusReader('./data', '.*', '.txt', encoding='utf-8')

    # Each item has source words, target words (mots) and their alignment.
    for sentence in reader.aligned_sents():
        print(sentence.words)
        print(sentence.mots)
        print(sentence.alignment)

Output:

    ['This', 'is', 'an', 'example']
    ['Esto', 'es', 'un', 'ejemplo']
    0-0 1-1 2-2 3-3
    ['This', 'is', 'an', 'example']
    ['Esto', 'é', 'um', 'exemplo']
    0-0 1-1 2-2 3-3
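Each i-j entry in the alignment line links the source token at index i to the target token at index j, counting from zero. As a rough illustration that does not depend on NLTK, the sketch below resolves such a line into word pairs; aligned_word_pairs is a hypothetical helper name.

```python
def aligned_word_pairs(source, target, alignment):
    """Return (source_word, target_word) pairs for an 'i-j i-j ...' string."""
    words = source.split()  # whitespace-separated source tokens
    mots = target.split()   # whitespace-separated target tokens
    pairs = []
    for link in alignment.split():
        # "2-3" links source index 2 to target index 3.
        i, j = (int(k) for k in link.split("-"))
        pairs.append((words[i], mots[j]))
    return pairs


print(aligned_word_pairs("This is an example",
                         "Esto es un ejemplo",
                         "0-0 1-1 2-2 3-3"))
# [('This', 'Esto'), ('is', 'es'), ('an', 'un'), ('example', 'ejemplo')]
```

An index that falls outside either sentence raises IndexError here, which can serve as a quick sanity check on a corpus file before it is loaded.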

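The line reader = AlignedCorpusReader('./data', '.*', '.txt', encoding='utf-8') creates an instance of AlignedCorpusReader rooted at the './data' directory. The second argument, '.*', is the fileids regular expression and matches every file under that directory. The third positional parameter of the reader is sep, so '.txt' does not act as an extension filter here; to restrict the reader to .txt files, a pattern such as r'.*\.txt' can be passed as the second argument instead. The call also specifies that the files are encoded as 'utf-8'. The other parameters of AlignedCorpusReader are word_tokenizer and sent_tokenizer; by default, word_tokenizer is set to WhitespaceTokenizer() and sent_tokenizer is set to RegexpTokenizer('\n', gaps=True).

More information can be found in the NLTK documentation (links 1 and 2 in the original post).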
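To see how a file-name pattern selects corpus files, the two regular expressions '.*' and r'.*\.txt' can be compared with Python's re module alone. This is only a sketch: matching_fileids is a hypothetical helper, README.md is an invented file name, and the assumption (not verified here) is that NLTK requires the pattern to match the whole relative file name.

```python
import re


def matching_fileids(pattern, fileids):
    """Return the names that the pattern matches in full (assumed behaviour)."""
    return [name for name in fileids if re.fullmatch(pattern, name)]


candidates = ["en-es.txt", "en-pt.txt", "README.md"]

# '.*' accepts every file name, whatever its extension.
print(matching_fileids(r".*", candidates))
# [ 'en-es.txt', 'en-pt.txt', 'README.md' ] with all three names kept

# Escaping the dot restricts the selection to names ending in ".txt".
print(matching_fileids(r".*\.txt", candidates))
```

Under that assumption, passing r'.*\.txt' as the second argument of AlignedCorpusReader would skip stray files in the data directory that are not in the aligned format.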
