Parsing a Tree-Structured File in Python (Algorithm Optimization)

weixin_33774308

Original article: https://my.oschina.net/jeffyu/blog/61217

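Background:
This is an algorithmic optimization of the approach described in the blog post "Parsing a Tree-Structured File in Python".

Core idea:

Create a list that stores parent-node information. Whenever a line that begins with tabs followed by name is read, store that parent node's information in prefixList[number of tabs]. In other words, prefixList[i] holds the parent node whose line is indented by i tabs.

When a line that begins with tabs followed by ptr is read, a leaf node has been reached. Its parents are indented by 0 to (number of tabs - 1) tabs, so its prefix must be prefixList[0] + ... + prefixList[number of tabs - 1], and the final result is: prefix + current leaf information.

When another line that begins with tabs followed by name is read, one of the parent nodes has changed for the leaves that follow. It is enough to overwrite the corresponding prefixList[number of tabs], because no later node needs the old value stored there.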
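The idea above can be sketched as a small function that works on lines already held in memory. This is only an illustration under stated assumptions, not the author's code: the function name `flatten_tree`, the two regular expressions, and the use of a dict instead of a fixed-size list are all hypothetical choices made for the example.

```python
import re

# Hypothetical patterns for the two line types described above.
NAME_RE = re.compile(r"^(\t*)name:\s*\[(.*?)\]")
PTR_RE = re.compile(r"^(\t*)ptr->(.*?)-->\[(.*?)\]")


def flatten_tree(lines):
    """Return one 'a.b.c.leaf[value]' string per ptr line."""
    prefix = {}   # prefix[i] = name of the latest parent indented by i tabs
    result = []
    for line in lines:
        m = NAME_RE.match(line)
        if m:
            # Overwrite the slot for this depth; the old value is never needed again.
            prefix[len(m.group(1))] = m.group(2)
            continue
        m = PTR_RE.match(line)
        if m:
            depth = len(m.group(1))
            # Parents of a leaf at this depth sit at depths 0 .. depth - 1.
            parents = [prefix[i] for i in range(depth)]
            path = ".".join(parents + [m.group(2)])
            result.append("%s[%s]" % (path, m.group(3)))
    return result


sample = [
    "name: [1]",
    "{",
    "\tname:[11]",
    "\t{",
    "\t\tptr->111-->[value0]",
    "\t}",
    "\tname:[12]",
    "\t{",
    "\t\tptr->121-->[value1]",
    "\t}",
    "}",
]
print(flatten_tree(sample))
```

Each line is visited exactly once, and the only state carried between lines is the small `prefix` mapping, so the function never has to look back at earlier input.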

Implementation:

To simulate a debug trace, create a text file named 1.txt with the following content:

service[hi]
name: [1]
{
	name:[11]
	{	
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}
service[Jeff]
name: [1]
{
	name:[11]
	{	
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}

The parsing program is as follows:

1. common.py

'''
Created on 2012-5-28

@author: Jeff_Yu
'''

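def getValue(string,key1,key2):
    """
    get the value between key1 and key2 in string
    """
    # start right after key1, whatever its length
    index1 = string.find(key1) + len(key1)
    # look for key2 only after the end of key1
    index2 = string.find(key2, index1)

    value = string[index1:index2]
    return value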

def getFiledNum(string,key,begin):
    """
    get the number of key in string from begin position
    """
    keyNum = 0
    start = begin

    while True:
        index = string.find(key, start)
        if index == -1:
            break

        keyNum = keyNum + 1
        start = index + 1

    return keyNum
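For single-character keys such as the tab, '[' and ']', the two helpers can also be expressed with built-in string methods. The sketch below is an alternative under that assumption, not the author's code; the names `count_tabs` and `between` are hypothetical.

```python
def count_tabs(line):
    """Count every tab character in the line."""
    return line.count("\t")


def between(line, key1, key2):
    """Return the text between the first key1 and the next key2 after it."""
    _, _, rest = line.partition(key1)      # drop everything up to and including key1
    value, _, _ = rest.partition(key2)     # keep what precedes key2
    return value


line = "\t\t\tptr->1111-->[value0]\n"
print(count_tabs(line))            # 3
print(between(line, "[", "]"))     # value0
print(between(line, ">", "-->"))   # 1111
```

One difference to keep in mind: if key2 is missing, `between` returns everything after key1, whereas a slice based on `find` returning -1 would silently drop the last character.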

2. main.py

'''
Created on 2012-6-1

@author: Jeff_Yu
'''

import common

fileNameRead = "1.txt"
fileNameWrite = '%s%s' % ("Result_", fileNameRead)
writeList = []
# prefixList[i] holds the latest parent name indented by i tabs ("0" is a placeholder)
prefixList = ["0"] * 30

with open(fileNameRead, 'r') as fr:
    for data in fr:
        # a service line starts a new tree: reset the parent prefixes
        if data.startswith("service"):
            prefixList = ["0"] * 30
            writeList.append('%s\n' % data.rstrip('\n'))
            continue

        # parent node: remember its name at its tab depth
        if "name" in data:
            tabNumOfData = common.getFiledNum(data, '\t', 0)
            value = common.getValue(data, '[', ']')
            prefixList[tabNumOfData] = value + "."

        # leaf node: its prefix is every parent name at depths 0 .. tabNumOfLeaf - 1
        if "ptr" in data:
            tabNumOfLeaf = common.getFiledNum(data, '\t', 0)
            valueOfLeaf = common.getValue(data, '[', ']')
            nameOfLeaf = common.getValue(data, '>', '-->')
            prefixString = "".join(prefixList[:tabNumOfLeaf])

            # append the finished line to writeList
            writeList.append('%s%s[%s]\n' % (prefixString, nameOfLeaf, valueOfLeaf))

# write writeList to the result file in a single call
with open(fileNameWrite, 'w') as fw:
    fw.writelines(writeList)

Parsing result, Result_1.txt:

service[hi]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]
service[Jeff]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]

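The real trace file is more complex than this one. Because it contains company information, the actual implementation is not posted here, but its core idea is the same as above.

This version is far more efficient: parsing a 5 MB file used to take more than 2 minutes and now takes about 1 second.

What this version optimizes:

1. String concatenation was changed to the form all = '%s%s%s%s' % (str0, str1, str2, str3).

2. The content to be written is collected in a list and written in one go at the end with f.writelines(list).

3. The algorithm reduces the number of file reads: useful information is saved as soon as it is read, so the file never has to be read backwards.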
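Points 1 and 2 can be illustrated with a short sketch. The record tuples and the file name `Result_demo.txt` are made up for the example, and the snippet makes no timing claim of its own: whether each change gives a measurable speed-up depends on the Python version and the input.

```python
import os
import tempfile

# Hypothetical parsed records: (prefix, leaf name, leaf value).
records = [("1.11.111.", "1111", "value0"), ("1.12.", "121", "value3")]

# Build each output line with one formatting operation instead of chained '+'.
lines = ['%s%s[%s]\n' % (prefix, name, value) for prefix, name, value in records]

# Write everything with a single writelines() call once parsing is done.
path = os.path.join(tempfile.mkdtemp(), "Result_demo.txt")
with open(path, "w") as fw:
    fw.writelines(lines)

with open(path) as fr:
    content = fr.read()
print(content)
```

Collecting lines in a list keeps the parsing loop free of file I/O, which is the design choice the second point describes.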

Reposted from: https://my.oschina.net/jeffyu/blog/61217