I have a text file in the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... and so on, for about 1 million items.
But some of the word_forms contain an apostrophe while others don't, and I would like to count them as the same word. That is, I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one, with the frequencies added:
cupboard cup blabla2 22
My first idea for a solution in Python 2.7 was to read the text file and store the words in two different dictionaries, one for words with an apostrophe and one for words without. Then I would go over the dictionary of words with an apostrophe, test whether the word with the apostrophe removed is already in the other dictionary, and if so add the frequencies; if not, simply add the entry with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])
def Reader(filename):
    """Yields the lines of a file one at a time instead of reading it whole, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line
def get_word_dict(filename):
    """Separates the word list into two dictionaries, one for words with an
    apostrophe and one for words without. Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory."""
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries: word_dict for words without
    # an apostrophe, word_dict_striped for words with an apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and in the morphological
                    # analysis, and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped
def merge_word_dict(word_dict, word_dict_striped):
    """Takes two dictionaries and merges them, adding up the frequencies
    when there is a common key.
    Does not run in reasonable time on the whole list."""
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict[word] = word_dict_striped[word]
        print "Number of words:", len(word_dict)
        for x in word_dict:
            print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in reasonable time up to the storage of the two dictionaries, whether I write the lemmas line by line into the text files (avoiding any storage in memory) or store them as program objects in the dictionaries. But the merge of the two dictionaries takes forever!
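One likely culprit for the never-ending merge: in Python 2, `word in word_dict.keys()` builds a fresh list of all keys and scans it linearly on every lookup, turning the merge into an O(n²) operation, whereas `word in word_dict` is a constant-time hash lookup. A small sketch of the difference (written for Python 3, so `list(d.keys())` stands in for the Python 2 list behaviour):

```python
# Comparing membership testing against a list of keys (what Python 2's
# `word in word_dict.keys()` does) with a direct dict membership test.
import timeit

d = {str(i): i for i in range(10000)}

# builds and scans a 10000-element list on every call
slow = timeit.timeit(lambda: "9999" in list(d.keys()), number=1000)
# single hash lookup on every call
fast = timeit.timeit(lambda: "9999" in d, number=1000)

print(slow > fast)  # the direct membership test should be far faster
```

Dropping `.keys()` in the membership test is a one-character-class fix that should make the merge loop roughly linear in the number of words.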
For the merge function I rewrote it to update one dictionary instead of building a new one, only incrementing the frequency count. I saw some solutions that merge dictionaries with addition, using Counter: Python: Elegantly merge dictionaries with sum() of values, Merge and sum of two dictionaries, How to sum dict elements, How to merge two Python dictionaries in a single expression?, Is there any pythonic way to combine two dicts (adding values for keys that appear in both)? But they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields in the dictionary as well.
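The Counter-based recipes in those links drop everything except the count, but the same idea extends if each dictionary value keeps all the fields. A minimal sketch, in a single pass over the data with one dict keyed by the apostrophe-free form (field handling as described in the question: frequencies summed, other fields taken from the apostrophe-free line when both variants occur):

```python
# One dict keyed by the apostrophe-free word form; frequencies are summed,
# the other fields come from the variant without the apostrophe.
lines = [
    "cup'board\tcup\tblabla\t12",
    "cupboard\tcup\tblabla2\t10",
]

merged = {}
for line in lines:
    word, root, morph, freq = line.split("\t")
    key = word.replace("'", "")
    if key in merged:
        _, old_root, old_morph, old_freq = merged[key]
        if "'" not in word:
            # prefer the fields of the apostrophe-free variant
            merged[key] = (key, root, morph, old_freq + int(freq))
        else:
            merged[key] = (key, old_root, old_morph, old_freq + int(freq))
    else:
        merged[key] = (key, root, morph, int(freq))

print(merged["cupboard"])  # ('cupboard', 'cup', 'blabla2', 22)
```

Keying by the stripped form from the start also removes the need for two separate dictionaries and a separate merge step.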
I am open to any ideas or reframings of the problem, since my goal is to run this program a single time on the file to obtain the merged list. Thanks in advance!
Can't you simply remove all apostrophes by replacing them with an empty string? Something like: word_form = items[0].replace("'","").
But then I would have two identical word lines, and their frequencies would not be added, right?
For a given word, are there at most two lines that can be combined, or possibly more? Do the ones that need to be combined have to be adjacent? If two lines are merged, is everything else (except the count) guaranteed to be the same?
Yes, for a given word at most two lines can be combined, just one version with an apostrophe and one without. And no, the ones to combine are not necessarily adjacent. And no, when two lines merge, column 3 is actually different, but ideally the column 3 of the line without the apostrophe should be the one kept (as in the example).
Oh, one more thing: do apostrophes ever occur anywhere other than in the first word? (And, as Sven said, is it fine to just replace them with an empty string from the start?)
No, apostrophes only occur in the first column. Thank you for your attention to this question.
I'm assuming you're not particularly attached to Python and that this is a one-off job. I'll post an answer that finishes the job if the next part works, but I'd like to try removing the apostrophes and then sorting the file to make things simpler. First run sed "s/'//" filename > newfile, then sort newfile > newfile2. newfile2 contains the sorted words (you can delete newfile); hopefully it won't take too long to finish :)
Sorry for the silly question, but do you mean running those commands in a console?
Correct... sorry, I was assuming you're on bash. If so, then yes, in a console/terminal/whatever. If you're on Windows, then... give me a minute.
Here's something that more or less does what you need. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        # same word once stripped: the apostrophe variant sorts first
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)
print "File read and sorted"

combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i + 1]):
        total = get_num(lines[i]) + get_num(lines[i + 1])
        new_parts = lines[i + 1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1
print "Entries combined"

print "Writing to file..."
with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")
print "Finished"
It sorts the words and messes up the spacing a bit. If that matters, let me know and it can be adjusted.
The other thing is that it sorts the whole file. With only a million lines it probably won't take too long, but again, let me know if that's a problem.
Thank you so much for this answer, given in under a minute! I modified it a bit so that entries without an apostrophe are inserted even when there is no apostrophe entry to merge them with, and I realized I had to run the program several times, because there are cases where more than two lines need to be merged (my mistake, I didn't know there were any), but now there is a program that does it all!
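Since more than two lines can share the same stripped word form, re-running the pairwise program isn't strictly necessary: sorting by the apostrophe-free key and then grouping merges any number of variants in a single pass. A sketch with `itertools.groupby` (the third input line is a hypothetical extra variant, not from the original data):

```python
# Sort by the apostrophe-free word form, then merge each group of lines,
# however many variants it contains.
from itertools import groupby

lines = [
    "cup'board\tcup\tblabla\t12",
    "cupboard\tcup\tblabla2\t10",
    "cup''board\tcup\tblabla3\t5",  # hypothetical third variant
]

def strip_key(line):
    return line.split("\t")[0].replace("'", "")

merged = []
for key, group in groupby(sorted(lines, key=strip_key), key=strip_key):
    group = list(group)
    total = sum(int(l.split("\t")[3]) for l in group)
    # prefer the fields of an apostrophe-free variant when one exists
    best = next((l for l in group if "'" not in l.split("\t")[0]), group[0])
    fields = best.split("\t")
    merged.append("\t".join([key, fields[1], fields[2], str(total)]))

print(merged)  # ['cupboard\tcup\tblabla2\t27']
```

`groupby` only groups adjacent equal keys, which is why the sort by the same key function must come first.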