How to extract numbers from a text file using the appropriate labels in Python (CSDN blog)

Article link: https://blog.csdn.net/weixin_29268331/article/details/114723590

    boundary
    layer 2
    datatype 0
    xy 15 525270 8663518 525400 8663518 525400 8664818 525660 8664818
    525660 8663518 525790 8663518 525790 8664818 526050 8664818
    526050 8663518 526180 8663518 526180 8665398 525980 8665598
    525470 8665598 525270 8665398 525270 8663518
    endel

I have polygon coordinates in the format shown above. Each polygon starts with "boundary" and ends with "endel". I am having trouble extracting the layer number, the number of points, and the coordinates into either a numpy array or a pandas dataframe.

To be specific to this example, I need the layer number (2), number of points (15), and the x-y coordinate pairs.

    with open('source1.txt', encoding="utf-8") as f:
        for line in f:
            line = f.readline()
            srs = line.split("\t")
            print(srs)

Doing this doesn't split the numbers, even though they are separated by tabs.

[' layer 255\n']

[' xy 5 0 0 22800000 0 22800000 22800000 0 22800000\n']

[' endel\n']

This is the result I got with that.

    with open('source1.txt', encoding="utf-8") as f:
        for line in f:
            line = f.readline()
            srs = line.split(" ")
            print(srs)

This isn't what I wanted, but I tried that too and still got a bad split.

['', '', '', '', '', '', '', '', 'layer', '255\n']

['', '', '', '', '', '', '', '', 'xy', '', '', '5', '', '', '0', '0', '', '', '22800000', '0', '', '', '22800000', '22800000', '', '', '0', '22800000\n']

['', '', '', '', '', '', '', '', 'endel\n']

I couldn't get to the numpy part, as I'm stuck on processing the strings from the file.
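For reference, the two outputs above are what `str.split` does with an explicit separator: `split("\t")` finds no tab characters, so the line contains spaces rather than tabs, and `split(" ")` treats every single space as its own separator. Calling `split()` with no argument splits on runs of any whitespace instead. Separately, calling `f.readline()` inside `for line in f` advances the file a second time on each pass, so every other line is skipped. A minimal sketch, using a made-up line and a hypothetical in-memory stand-in for the file:

```python
import io

# Hypothetical line, padded with spaces like the ones printed above
line = "        xy   5   0 0   22800000 0\n"

by_tab = line.split("\t")      # no tabs in the line, so nothing is split
by_space = line.split(" ")     # each single space separates, leaving '' entries
by_whitespace = line.split()   # runs of whitespace collapse; the newline is dropped

print(by_whitespace)  # ['xy', '5', '0', '0', '22800000', '0']

# Iterating the file object already yields each line once,
# so no extra readline() call is needed inside the loop.
sample = io.StringIO("boundary\n    layer 2\n    xy 1 0 0\nendel\n")
rows = [row.split() for row in sample]
print(rows)
```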

Edited as per request

2 Solutions

You could use some trivial code such as:

    res = []
    coords = []
    xy = False  # True while reading continuation lines of an xy record

    with open('data.txt') as f:
        for line in f:
            if 'layer' in line:
                arr = line.split()
                layer = int(arr[-1])
            elif 'xy' in line:
                arr = line.split()
                npoints = int(arr[1])
                coords = arr[2:]
                xy = True
            elif 'endel' in line:
                # each point is an x-y pair, so keep 2 * npoints values
                res.append([layer, npoints, coords[:2 * npoints]])
                xy = False
                coords = []
            elif xy:
                coords.extend(line.split())

    print(res)

Then, you can convert the resulting list to numpy array, or whatever you like, but note that coords are still strings in the code above.
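One way to finish that conversion is sketched below. It is not part of the answer above: the function name `parse_polygons` and the choice to match on the first token of each line are assumptions made for illustration, and the sample text is the polygon from the question.

```python
def parse_polygons(text):
    """Return a list of (layer, npoints, [(x, y), ...]) tuples."""
    polygons = []
    layer = npoints = None
    tokens = []
    in_xy = False
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        if key == "layer":
            layer = int(parts[1])
        elif key == "xy":
            npoints = int(parts[1])
            tokens = parts[2:]
            in_xy = True
        elif key == "endel":
            # convert to int and pair consecutive values as (x, y)
            values = [int(t) for t in tokens[:2 * npoints]]
            points = list(zip(values[0::2], values[1::2]))
            polygons.append((layer, npoints, points))
            in_xy = False
        elif in_xy:
            tokens.extend(parts)  # continuation line of the xy record
    return polygons


SAMPLE = """boundary
layer 2
datatype 0
xy 15 525270 8663518 525400 8663518 525400 8664818 525660 8664818
525660 8663518 525790 8663518 525790 8664818 526050 8664818
526050 8663518 526180 8663518 526180 8665398 525980 8665598
525470 8665598 525270 8665398 525270 8663518
endel
"""

polygons = parse_polygons(SAMPLE)
layer, npoints, points = polygons[0]
print(layer, npoints, len(points))
```

Assuming NumPy is installed, `np.array(points)` would then give an integer array of shape `(npoints, 2)`, one row per x-y pair.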

You can use a regex to parse that file into blocks of the relevant data then parse each block:

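    import re

    # file name taken from the question
    with open('source1.txt') as f:
        # re.M lets ^ match at the start of every line, so every block is
        # found; \s* tolerates leading indentation before "boundary"
        for block in re.findall(r'^\s*boundary([\s\S]+?)endel', f.read(), re.M):
            m1 = re.search(r'^\s*layer\s+(\d+)', block, re.M)
            m2 = re.search(r'^\s*datatype\s+(\d+)', block, re.M)
            m3 = re.search(r'^\s*xy\s+(\d+)\s+([\s\d]+)', block, re.M)
            if m1 and m2 and m3:
                layer = int(m1.group(1))
                datatype = int(m2.group(1))
                xy = int(m3.group(1))
                # pair consecutive tokens into (x, y) tuples
                coordinates = [(int(x), int(y)) for x, y in zip(*[iter(m3.group(2).split())] * 2)]
            else:
                print("can't parse {}".format(block))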

Any number of coordinates may follow xy, and it is easy to check that the number of pairs parsed matches the declared count: len(coordinates) == xy.
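The pairing idiom and the count check can be tried in isolation. The sketch below uses hypothetical helper names (`pair_coordinates`, `check_block`) and made-up number strings; `zip(it, it)` pulls two items from the same iterator per tuple, which is what `zip(*[iter(...)] * 2)` does as well.

```python
def pair_coordinates(number_text):
    """Turn '0 0 10 0 ...' into [(0, 0), (10, 0), ...]."""
    it = iter(number_text.split())
    # zip draws from the same iterator twice per tuple, pairing
    # consecutive tokens; an unpaired trailing token is silently dropped
    return [(int(x), int(y)) for x, y in zip(it, it)]


def check_block(declared, number_text):
    """Return the parsed pairs and whether their count matches `declared`."""
    coordinates = pair_coordinates(number_text)
    return coordinates, len(coordinates) == declared


# Five declared points, five pairs present
complete, complete_ok = check_block(5, "0 0 22800000 0 22800000 22800000 0 22800000 0 0")

# Five declared points, but only four pairs present
truncated, truncated_ok = check_block(5, "0 0 22800000 0 22800000 22800000 0 22800000")

print(complete_ok, truncated_ok)  # True False
```

A failed check usually means a continuation line was lost or the record was cut short, so it may be worth reporting the block rather than using the partial coordinates.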

As written, this reads the entire file into memory. If size is an issue (it usually is not for small to moderately sized files), you can use mmap to make the file appear to be in memory.
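A rough sketch of the mmap variant follows, under a few assumptions: Python 3, where a memory map exposes bytes, so the patterns are written as bytes patterns; a hypothetical function name `parse_mapped`; and made-up sample data written to a temporary file so the example runs on its own. The datatype field is left out for brevity.

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample data, as bytes because mmap exposes the file as bytes
SAMPLE = b"""boundary
    layer 2
    datatype 0
    xy 3 0 0 10 0
    10 10
endel
boundary
    layer 7
    datatype 0
    xy 2 5 5 6 6
endel
"""

BLOCK = re.compile(rb'^\s*boundary([\s\S]+?)endel', re.M)
LAYER = re.compile(rb'^\s*layer\s+(\d+)', re.M)
XY = re.compile(rb'^\s*xy\s+(\d+)\s+([\s\d]+)', re.M)


def parse_mapped(path):
    """Parse every boundary block of the file at `path` through a memory map."""
    results = []
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in BLOCK.finditer(mm):
                body = match.group(1)  # bytes copy of this block only
                m_layer = LAYER.search(body)
                m_xy = XY.search(body)
                if not (m_layer and m_xy):
                    continue  # skip blocks that cannot be parsed
                it = iter(m_xy.group(2).split())
                points = [(int(x), int(y)) for x, y in zip(it, it)]
                results.append((int(m_layer.group(1)), int(m_xy.group(1)), points))
    return results


# Write the sample to a temporary file, parse it, then clean up
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(SAMPLE)
    tmp.close()
    results = parse_mapped(tmp.name)
finally:
    os.remove(tmp.name)

print(results)
```

Only one block is copied out of the map at a time, so peak memory tracks the largest block rather than the whole file.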
