I have a text file in the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... and so on, for about 1 million items.
But some of the word_forms contain an apostrophe while others don't, and I would like to count them as the same word. That is, I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one, with the frequencies added:
cupboard cup blabla2 22
My first idea for a solution in Python 2.7 was to read the text file and store the words in two different dictionaries, one for words with an apostrophe and one for words without. Then I would go over the dictionary of words with an apostrophe, test whether the word with the apostrophe removed is already in the other dictionary, and if so add the frequencies; if not, simply add the entry with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])
def Reader(filename):
    """Yields the lines of a file one at a time instead of reading it whole, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line
def get_word_dict(filename):
    """Separates the word list into two dictionaries, one for words with an
    apostrophe and one for words without. Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory."""
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries: word_dict for words without
    # an apostrophe, word_dict_striped for words with an apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and in the morphological
                    # analysis, and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped
def merge_word_dict(word_dict, word_dict_striped):
    """Takes two dictionaries and merges them, adding up the frequencies
    when there is a common key.
    Does not run in reasonable time on the whole list."""
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict[word] = word_dict_striped[word]
        print "Number of words:", len(word_dict)
        for x in word_dict:
            print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in reasonable time up to the storage of the two dictionaries, whether I write the lemmas line by line into the text files (avoiding any storage in memory) or store them as program objects in the dictionaries. But the merge of the two dictionaries takes forever!
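One likely culprit for the never-ending merge: in Python 2, `word in word_dict.keys()` builds a fresh list of all keys and scans it linearly on every lookup, turning the merge into an O(n²) operation, whereas `word in word_dict` is a constant-time hash lookup. A small sketch of the difference (written for Python 3, so `list(d.keys())` stands in for the Python 2 list behaviour):

```python
# Comparing membership testing against a list of keys (what Python 2's
# `word in word_dict.keys()` does) with a direct dict membership test.
import timeit

d = {str(i): i for i in range(10000)}

# builds and scans a 10000-element list on every call
slow = timeit.timeit(lambda: "9999" in list(d.keys()), number=1000)
# single hash lookup on every call
fast = timeit.timeit(lambda: "9999" in d, number=1000)

print(slow > fast)  # the direct membership test should be far faster
```

Dropping `.keys()` in the membership test is a one-character-class fix that should make the merge loop roughly linear in the number of words.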
For the merge function I rewrote it to update one dictionary instead of building a new one, only incrementing the frequency count. I saw some solutions that merge dictionaries with addition, using Counter: Python: Elegantly merge dictionaries with sum() of values, Merge and sum of two dictionaries, How to sum dict elements, How to merge two Python dictionaries in a single expression?, Is there any pythonic way to combine two dicts (adding values for keys that appear in both)? But they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields in the dictionary as well.
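The Counter-based recipes in those links drop everything except the count, but the same idea extends if each dictionary value keeps all the fields. A minimal sketch, in a single pass over the data with one dict keyed by the apostrophe-free form (field handling as described in the question: frequencies summed, other fields taken from the apostrophe-free line when both variants occur):

```python
# One dict keyed by the apostrophe-free word form; frequencies are summed,
# the other fields come from the variant without the apostrophe.
lines = [
    "cup'board\tcup\tblabla\t12",
    "cupboard\tcup\tblabla2\t10",
]

merged = {}
for line in lines:
    word, root, morph, freq = line.split("\t")
    key = word.replace("'", "")
    if key in merged:
        _, old_root, old_morph, old_freq = merged[key]
        if "'" not in word:
            # prefer the fields of the apostrophe-free variant
            merged[key] = (key, root, morph, old_freq + int(freq))
        else:
            merged[key] = (key, old_root, old_morph, old_freq + int(freq))
    else:
        merged[key] = (key, root, morph, int(freq))

print(merged["cupboard"])  # ('cupboard', 'cup', 'blabla2', 22)
```

Keying by the stripped form from the start also removes the need for two separate dictionaries and a separate merge step.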
I am open to any ideas or reframings of the problem, since my goal is to run this program a single time on the file to obtain the merged list. Thanks in advance!
Can't you simply remove all apostrophes by replacing them with an empty string? Something like: word_form = items[0].replace("'","").
But then I would have two identical word lines, and their frequencies would not be added, right?
For a given word, are there at most two lines that can be combined, or possibly more? Do the ones that need to be combined have to be adjacent? If two lines are merged, is everything else (except the count) guaranteed to be the same?
Yes, for a given word at most two lines can be combined, just one version with an apostrophe and one without. And no, the ones to combine are not necessarily adjacent. And no, when two lines merge, column 3 is actually different, but ideally the column 3 of the line without the apostrophe should be the one kept (as in the example).
Oh, one more thing: do apostrophes ever occur anywhere other than in the first word? (And, as Sven said, is it fine to just replace them with an empty string from the start?)
No, apostrophes only occur in the first column. Thank you for your attention to this question.
I'm assuming you're not particularly attached to Python and that this is a one-off job. I'll post an answer that finishes the job if the next part works, but I'd like to try removing the apostrophes and then sorting the file to make things simpler. First run sed "s/'//" filename > newfile, then sort newfile > newfile2. newfile2 contains the sorted words (you can delete newfile); hopefully it won't take too long to finish :)
Sorry for the silly question, but do you mean running those commands in a console?
Correct... sorry, I was assuming you're on bash. If so, then yes, in a console/terminal/whatever. If you're on Windows, then... give me a minute.
Here's something that more or less does what you need. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        # same word once stripped: the apostrophe variant sorts first
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)
print "File read and sorted"

combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i + 1]):
        total = get_num(lines[i]) + get_num(lines[i + 1])
        new_parts = lines[i + 1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1
print "Entries combined"

print "Writing to file..."
with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")
print "Finished"
It sorts the words and messes up the spacing a bit. If that matters, let me know and it can be adjusted.
The other thing is that it sorts the whole file. With only a million lines it probably won't take too long, but again, let me know if that's a problem.
Thank you so much for this answer, given in under a minute! I modified it a bit so that entries without an apostrophe are inserted even when there is no apostrophe entry to merge them with, and I realized I had to run the program several times, because there are cases where more than two lines need to be merged (my mistake, I didn't know there were any), but now there is a program that does it all!
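Since more than two lines can share the same stripped word form, re-running the pairwise program isn't strictly necessary: sorting by the apostrophe-free key and then grouping merges any number of variants in a single pass. A sketch with `itertools.groupby` (the third input line is a hypothetical extra variant, not from the original data):

```python
# Sort by the apostrophe-free word form, then merge each group of lines,
# however many variants it contains.
from itertools import groupby

lines = [
    "cup'board\tcup\tblabla\t12",
    "cupboard\tcup\tblabla2\t10",
    "cup''board\tcup\tblabla3\t5",  # hypothetical third variant
]

def strip_key(line):
    return line.split("\t")[0].replace("'", "")

merged = []
for key, group in groupby(sorted(lines, key=strip_key), key=strip_key):
    group = list(group)
    total = sum(int(l.split("\t")[3]) for l in group)
    # prefer the fields of an apostrophe-free variant when one exists
    best = next((l for l in group if "'" not in l.split("\t")[0]), group[0])
    fields = best.split("\t")
    merged.append("\t".join([key, fields[1], fields[2], str(total)]))

print(merged)  # ['cupboard\tcup\tblabla2\t27']
```

`groupby` only groups adjacent equal keys, which is why the sort by the same key function must come first.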