Python: grouping and classifying words and characters

weixin_39788986

Article link: https://blog.csdn.net/weixin_39788986/article/details/111432771


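This article describes how to use Python to process data in the hunspell dictionary format, with the goal of grouping words by tag. The code examples show grouping with collections.defaultdict and an attempt to handle two-letter tag combinations, which then runs into trouble with three-letter combinations. Even with the tags sorted, the code still does not handle every three-letter combination correctly.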


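I need to split each line on the slash and then report the tags. This is the hunspell dictionary format. I tried to find a class on GitHub that does this, but could not find one.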

# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES

Code:

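from collections import defaultdict

myl = defaultdict(list)

with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            # Lines without a slash have no tags and raise IndexError here.
            tags = l.split('/')[1]
            # File the word under the full tag string exactly as written ...
            myl[tags].append(l.split('/')[0])
            # ... and again under each single tag character.
            for t in tags:
                myl[t].append(l.split('/')[0])
        except IndexError:
            pass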

Output:

defaultdict(list,
            {'S': ['test', 'test', 'girl', 'house', 'wind'],
             'SE': ['girl'],
             'E': ['girl', 'house', 'man', 'man', 'wind'],
             '': ['home'],
             'SE123': ['house'],
             '1': ['house'],
             '2': ['house'],
             '3': ['house'],
             'ES': ['wind']})

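The SE group should contain three words: 'girl', 'wind', and 'house'. There should be no ES group, because it is the same as 'SE' and is covered by it, and SE123 should stay as it is. How can I achieve this?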
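
One way to read the requirement above is "list every word whose tags include all of the requested tags, regardless of order". The following is a minimal sketch under that reading, not a hunspell parser: it assumes every flag is a single character (hunspell affix files can declare other flag encodings), and the helper names parse_dic_line and words_with are made up for this illustration.

```python
def parse_dic_line(line):
    """Split a 'word/TAGS' line into (word, set of single-character tags).

    A line without a slash, or with nothing after it, gets an empty tag set.
    """
    word, _, tags = line.strip().partition('/')
    return word, set(tags)


def words_with(entries, wanted):
    """Return the words whose tag set contains every tag in `wanted`, in file order."""
    wanted = set(wanted)
    return [word for word, tags in entries if wanted <= tags]


# Same content as test.txt in the question, inlined so the sketch is self-contained.
lines = ["test/S", "boy", "girl/SE", "home/", "house/SE123",
         "man/E", "country", "wind/ES"]
entries = [parse_dic_line(line) for line in lines]

print(words_with(entries, "SE"))     # ['girl', 'house', 'wind']
print(words_with(entries, "ES"))     # same words: tag order does not matter
print(words_with(entries, "SE123"))  # ['house']
```

Because the tags are compared as sets, 'SE' and 'ES' are the same query, so no separate ES group is needed, and 'house' is still reachable under its full tag string.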

Update:

I managed to add the two-letter combinations (bigrams), but how do I add 3-, 4-, and 5-grams?

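from collections import defaultdict
import nltk

myl = defaultdict(list)

with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            # Lines without a slash have no tags and raise IndexError here.
            tags = l.split('/')[1]
            # Group under the full tag string, with its characters sorted.
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            # Group under each single tag character.
            for t in tags:
                myl[t].append(l.split('/')[0])
            # Group under each pair of adjacent tag characters, sorted within the pair.
            bigrm = list(nltk.bigrams(list(tags)))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except IndexError:
            pass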

I thought it might help to sort the tags at the source:

with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                # No slash: copy the line unchanged.
                nline = l
            else:
                # Rewrite the line with its tag characters sorted.
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')

This creates a new file, test1.txt, with sorted tags. But the trigram problem is still unsolved.
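
If a precomputed index is preferred over n-grams, one option is to file each word under every sorted combination of its tags, which covers sizes 1, 2, 3 and so on in one loop. This is a sketch under assumptions rather than a tested solution for real dictionaries: flags are taken to be single characters, the function names and the max_size parameter are invented for illustration, and a word with n distinct tags produces 2**n - 1 keys, so max_size may be needed for heavily tagged words.

```python
from collections import defaultdict
from itertools import combinations


def tag_key(tags):
    """Canonical key for a set of tags: unique characters in sorted order."""
    return ''.join(sorted(set(tags)))


def index_by_tag_combinations(lines, max_size=None):
    """Map every sorted tag combination to the words that carry all of its tags."""
    groups = defaultdict(list)
    for line in lines:
        word, _, tags = line.strip().partition('/')
        unique = sorted(set(tags))
        limit = len(unique) if max_size is None else min(max_size, len(unique))
        for size in range(1, limit + 1):
            # combinations() picks tags from any position, not only neighbours.
            for combo in combinations(unique, size):
                groups[''.join(combo)].append(word)
    return groups


# Same content as test.txt, inlined so the sketch is self-contained.
sample = ["test/S", "boy", "girl/SE", "home/", "house/SE123",
          "man/E", "country", "wind/ES"]
groups = index_by_tag_combinations(sample)

print(groups[tag_key("SE")])     # ['girl', 'house', 'wind']
print(groups[tag_key("SE123")])  # ['house']
print(groups["S"])               # ['test', 'girl', 'house', 'wind']
```

Keys are stored in sorted order, so the SE group lives under 'ES' and SE123 under '123ES'; looking groups up through tag_key hides that detail from the caller.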

I downloaded a sample file:

The report from the grep command is correct.

!grep 'P.*U' index1.dic
CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU

The Python report, which uses bigrams on the file with sorted tags, does not contain all of the words listed above.

myl['PU']
['aware',
 'aware',
 'conscious',
 'faithful',
 'grateful',
 'lawful',
 'natural',
 'obtrusive',
 'prepared',
 'reasonable',
 'responsive',
 'righteous',
 'scrupulous',
 'selfish',
 'truthful',
 'wholesome',
 'willing']
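
The missing words are consistent with how bigrams work: a bigram pairs only neighbouring characters, so a word tagged PRTU contributes PR, RT and TU but never PU, while PU and PUY do contain the adjacent pair. The sketch below illustrates the difference without nltk; adjacent_pairs stands in for nltk.bigrams over the tag string, and both function names are made up for this example. (Separately, the grep pattern is applied to the whole line, which is why CPU/M and GPU show up in its output.)

```python
from itertools import combinations


def adjacent_pairs(tags):
    """Pairs of neighbouring characters, i.e. what bigrams over the tag string yield."""
    return [a + b for a, b in zip(tags, tags[1:])]


def all_pairs(tags):
    """Every two-tag combination, whether or not the tags are neighbours."""
    return [a + b for a, b in combinations(sorted(tags), 2)]


print(adjacent_pairs("PRTU"))  # ['PR', 'RT', 'TU']
print(all_pairs("PRTU"))       # ['PR', 'PT', 'PU', 'RT', 'RU', 'TU']
print(adjacent_pairs("PUY"))   # ['PU', 'UY']
```

Only all_pairs produces 'PU' for a PRTU word, which is why combinations rather than longer n-grams are the closer match to what the grep pattern finds in the tags.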
