Extracting Information from Regularly Structured Text

qq_281617953

Category: Algorithm Practice. Tags: re, text information

Article link: https://blog.csdn.net/tortelee/article/details/80915395

The text looks roughly like this:

Energy Usage:
  ----------------------------------------------------------------
             Usage   Avg.     Kw-hr      Avg.      Peak      Cost
  Pump      Factor Effic.       /m3        Kw        Kw      /day
  ----------------------------------------------------------------
  6         100.00  75.00      0.34     81.19    156.78      0.00
  ----------------------------------------------------------------
                                         Demand Charge:      0.00
                                         Total Cost:         0.00
   
   
  Node Results at 0:00:00 hrs:
  ----------------------------------------------
                     Demand      Head  Pressure
  Node                  L/s         m         m
  ----------------------------------------------
  2                   20.00     27.81     27.81
  3                   20.00     31.85     31.85
  4                   15.00     32.01     32.01
  5                   30.00     37.15     37.15
  6                 -322.92      0.00      0.00  Reservoir
  1                  237.92     15.00     15.00  Tank
   
   
  Node Results at 1:00:00 hrs:
  ----------------------------------------------
                     Demand      Head  Pressure
  Node                  L/s         m         m
  ----------------------------------------------
  2                   20.00    105.42    105.42
  3                   20.00    105.42    105.42
  4                    0.00    105.51    105.51
  5                    0.00    105.60    105.60
  6                  -40.00      0.00      0.00  Reservoir
  1                    0.00     20.00     20.00  Tank
   
   
  Node Results at 2:00:00 hrs:
  ----------------------------------------------
                     Demand      Head  Pressure
  Node                  L/s         m         m
  ----------------------------------------------
  2                   20.00    105.42    105.42
  3                   20.00    105.42    105.42
  4                    0.00    105.51    105.51
  5                    0.00    105.60    105.60
  6                  -40.00      0.00      0.00  Reservoir
  1                    0.00     20.00     20.00  Tank
   
   
  Node Results at 3:00:00 hrs:
  ----------------------------------------------
                     Demand      Head  Pressure
  Node                  L/s         m         m
  ----------------------------------------------
  2                   20.00    101.58    101.58
  3                   20.00    101.60    101.60
  4                   15.00    101.63    101.63
  5                   30.00    101.85    101.85
  6                  -85.00      0.00      0.00  Reservoir
  1                    0.00     20.00     20.00  Tank
   
  Analysis ended Tue Jul  3 16:22:49 2018

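Goals: 1. Extract the Pressure values, taking only the first four rows of each "Node Results" block.

2. Determine the time at which each pressure value occurs.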

Approach:

1. First, use an re regular expression to locate the lines of the text that carry the information.

pattern = re.compile(r'NodeResultsat')
match(pattern,joinedlines)

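2. Extract the time from its fixed position in the matched line.

3. Extract the pressure: recognize the numbers on the row and append them to a list, so that the last number can be fetched directly by index.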

pressure = get_pressure(joinedlines1)
ss.append(pressure[3])
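
Steps 2 and 3 can also be sketched with a capture group and str.split(), without stripping the spaces first. This is a minimal sketch, not the code used in this post: the names HEADER_RE, parse_hour and parse_pressure are made up for illustration, the header shape "Node Results at H:MM:SS hrs:" is assumed from the sample report above, and the helpers return numbers where the original code keeps strings.

```python
import re

# Assumed header shape, taken from the sample report: "Node Results at H:MM:SS hrs:"
HEADER_RE = re.compile(r'\s*Node Results at (\d+):\d{2}:\d{2} hrs')

def parse_hour(line):
    """Return the hour of a block header as an int, or None for any other line."""
    m = HEADER_RE.match(line)
    return int(m.group(1)) if m else None

def parse_pressure(row):
    """Return the Pressure column (the 4th field) of a data row as a float."""
    fields = row.split()  # split() with no argument splits on any run of whitespace
    return float(fields[3])

print(parse_hour('  Node Results at 3:00:00 hrs:'))
print(parse_pressure('  2                   20.00    101.58    101.58'))
```

Because the hour comes from a capture group, one-digit and two-digit hours are handled by the same expression.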

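The details of this code can be ignored; the point is to show one way of extracting information from text. First find the regularity in the text, then use re to locate the relevant lines, and finally use string-handling methods to pull out the values.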
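
The same three moves (find the regularity, locate with re, extract with string methods) can be put together in a single pass over the lines. The sketch below is one possible arrangement, not the code of this post: first_pressures, skip and take are hypothetical names, and it assumes that every block header is followed by exactly four rule and heading lines before the data rows, as in the sample above.

```python
import re

# Assumed header shape, taken from the sample report
HEADER_RE = re.compile(r'\s*Node Results at (\d+):\d{2}:\d{2} hrs')

def first_pressures(lines, skip=4, take=4):
    """Map each block's hour to the Pressure values of its first `take` rows."""
    result = {}
    it = iter(lines)
    for line in it:
        m = HEADER_RE.match(line)
        if not m:
            continue
        for _ in range(skip):           # skip the rule and column-heading lines
            next(it, None)
        pressures = []
        for _ in range(take):           # read the data rows that follow
            fields = next(it, '').split()
            if len(fields) >= 4:
                pressures.append(float(fields[3]))
        result[int(m.group(1))] = pressures
    return result

sample = """\
  Node Results at 0:00:00 hrs:
  ----------------------------------------------
                     Demand      Head  Pressure
  Node                  L/s         m         m
  ----------------------------------------------
  2                   20.00     27.81     27.81
  3                   20.00     31.85     31.85
  4                   15.00     32.01     32.01
  5                   30.00     37.15     37.15
  6                 -322.92      0.00      0.00  Reservoir
""".splitlines()

print(first_pressures(sample))
```

Keying the result by hour keeps each time and its pressures together, which is what the second goal asks for.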

Full code:

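import re

# Read the whole report into a list of lines; "with" closes the file afterwards
with open('d://QQ数据//1.txt') as file:
    lines = file.readlines()

# Each line is matched with re after its spaces have been stripped
pattern = re.compile(r'NodeResultsat')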

def joinlines(lines,replace = True):
    # Join the pieces into one string, optionally removing every space
    new = ''.join(lines)
    if replace:
        new = new.replace(' ','')
    return new

# Check whether the string starts with the pattern
def match(pattern,test):
    return re.match(pattern,test) is not None
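# The input here is a line whose spaces have already been stripped
def get_hour(str1):
    pattern = re.compile(r'NodeResultsat')
    if match(pattern,str1):
        # 'NodeResultsat' is 13 characters long, so the hour starts at index 13
        if str1[14]==':':
            return str1[13]  # one-digit hour
        else:
            return str1[13]+str1[14]  # two-digit hour
    else:
        print('No time found on this line!')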

# The input here is the original line, with its spaces intact
# Split it into fields first; the caller then picks the one it needs
def get_pressure(str1):
    # e.g. '1    23.2   34.2   23.3' -> ['1', '23.2', '34.2', '23.3']
    # split() with no argument splits on any run of whitespace and also
    # keeps the last field when the line has no trailing newline
    s = str1.split()
    if len(s)<4:
        print('Fewer than four fields were found on this line')
    return s
#s1 = '1    32     34    12'
#print(get_pressure(s1))
def process(lines):
    num = len(lines)
    s = []
    hours = []
    for i in range(num):
        joinedlines = joinlines(lines[i])
        if match(pattern,joinedlines):
            hour = get_hour(joinedlines)  # hour of this result block
            hours.append(hour)
            s.append(i)  # remember the line number of the block header
    ss = []
    for item in s:
        # The first data row is 5 lines below the header; take four rows
        for j in range(item+5,item+9,1):
            joinedlines1 = joinlines(lines[j],replace=False)
            pressure = get_pressure(joinedlines1)
            ss.append(pressure[3])  # the 4th field is the Pressure column
    print(ss)
    print(hours)
    return ss
process(lines)
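
To answer the second goal, the flat list of pressures still has to be paired with the hours. The sketch below shows one way to do that, assuming the hours and ss lists built inside process() are both available (the function above only prints hours and returns ss). The names pressures_by_hour, hours_with_pressure, rows_per_block and tol are hypothetical, and the sample values are copied from the report at the top.

```python
def pressures_by_hour(hours, pressures, rows_per_block=4):
    """Pair each hour with its consecutive block of pressure readings."""
    table = {}
    for k, hour in enumerate(hours):
        block = pressures[k * rows_per_block:(k + 1) * rows_per_block]
        table[int(hour)] = [float(p) for p in block]
    return table

def hours_with_pressure(table, value, tol=1e-9):
    """Return the hours at which some reading equals `value` within `tol`."""
    return [h for h, vals in table.items()
            if any(abs(v - value) <= tol for v in vals)]

# Values as process() would collect them for the first two blocks of the sample
hours = ['0', '1']
ss = ['27.81', '31.85', '32.01', '37.15',
      '105.42', '105.42', '105.51', '105.60']

table = pressures_by_hour(hours, ss)
print(table)
print(hours_with_pressure(table, 105.42))
```

The pairing relies on process() appending exactly four readings per block, in the same order as the hours.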