Regular Expressions in Data Processing

xiaoyaolangwj

Categories: Halcon, data processing tools. Tags: regular expressions, content extraction, halcon

Article link: https://blog.csdn.net/xiaoyaolangwj/article/details/116936231

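I have recently been working with LiDAR point cloud data stored in a csv file. The screenshot of the file is not preserved; judging by the Python output later in this post, each row has the form "0 , -3248 , 9143 , 193 , 1 ,".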

Processing the csv directly in Python is very convenient, and Pandas handles it easily:

import numpy as np
import pandas as pd
import open3d as o3d


pcd = o3d.geometry.PointCloud()    # an empty Open3D point cloud object

with open("2.csv", encoding="utf-8") as f:
    data = pd.read_csv(f, header=None).values.tolist()
    print(data[:5])
    np_points = np.array(data)[:, 1:4]
    print(np_points.shape)
    pcd.points = o3d.utility.Vector3dVector(np_points)
    o3d.visualization.draw_geometries([pcd])
    # o3d.io.write_point_cloud('E:/project/3.ply', pcd)

Processing the data with pandas in Python is efficient, fast, and simple.

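However, I want to learn how to parse the csv file with regular expressions in HALCON, so I need to watch how each variable in the HALCON code changes, and use the opportunity to study regular expressions properly.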

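In the HALCON code, reading the file produces a vector, VecOutLine, whose elements are string tuples, one per line read. The vector is then converted to a tuple, shown as P in the variable window. The strings in P are what we process with a regular expression to extract the numbers that make up X, Y, and Z.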

dev_update_off ()

Filename := './0.csv' // name of the point cloud file; .txt, .csv, .asc and similar formats all work

NumColumns := 5 // set this to 3 if every row holds only three numbers (xyz only)

count_seconds (Then)
open_file (Filename, 'input', FileHandle)

VecOutLine.clear()
repeat
    fread_line (FileHandle, VecOutLine.at(VecOutLine.length()), IsEOF)
until (IsEOF)
convert_vector_to_tuple (VecOutLine, P)
S := P[354540]
stop()

P1 := split(P, '\n')
P2 := split(P1, ',')

P3 := number(regexp_replace(P2,'^\\s*(.+?)\\s*\n*$', '$1'))

// tuple_number converts strings to numbers; tuple_string converts numbers to strings.
// tuple_number (P2, P3)

// P4 := number(P2)
// convert_vector_to_tuple({P2},P4)


IndexIsString := find(type_elem(P3),H_TYPE_STRING)
if (IndexIsString > -1)
     throw ('could not convert "' + P3[IndexIsString] + '" to a number')
endif

X := P3[[1:NumColumns:|P3|-1]]
Y := P3[[2:NumColumns:|P3|-1]]
Z := P3[[3:NumColumns:|P3|-1]]
stop()
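
For comparison, the same steps can be sketched in Python. This is only an illustration under assumptions: the sample text and the function name parse_points are made up for the example, and blank fields (such as the one after the trailing comma) are simply skipped, which is a choice made for this sketch rather than a statement about how HALCON's split behaves.

```python
import re

# Same idea as the HALCON pattern: strip whitespace around each field.
STRIP = re.compile(r"^\s*(.+?)\s*$")

# Hypothetical sample shaped like the rows in this post:
# index, x, y, z, flag, with a trailing comma on every line.
sample = "0 , -3248 , 9143 , 193 , 1 ,\n1 , -3248 , 9089 , 193 , 1 ,\n"


def parse_points(text, num_columns=5):
    """Split into lines and fields, strip each field, convert to int."""
    values = []
    for line in text.split("\n"):            # like P1 := split(P, '\n')
        for field in line.split(","):        # like P2 := split(P1, ',')
            cleaned = STRIP.sub(r"\1", field)
            if cleaned.strip() == "":        # skip empty or blank fields
                continue
            values.append(int(cleaned))      # raises ValueError on bad input
    # Column 0 is the row index, so x, y, z start at offsets 1, 2, 3.
    x = values[1::num_columns]
    y = values[2::num_columns]
    z = values[3::num_columns]
    return x, y, z


x, y, z = parse_points(sample)
print(x, y, z)
```

The int() call plays the role of the type check in the HALCON code: a field that is not a number stops the run with an error instead of passing through silently.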

The line that puzzled me was:

P3 := number(regexp_replace(P2,'^\\s*(.+?)\\s*\n*$', '$1'))

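Pressing F1 on it opens the help page, which shows that regexp_replace performs a replacement.

After looking up the regular expression syntax, it became clear: \ is the escape character, used to match special characters. ^ and $ anchor the start and the end of the string being matched. Parentheses () form a capturing group, whose matched value is saved for later use. Inside the parentheses we have (.+?):

+	Matches the preceding subexpression one or more times. To match a literal +, use \+.
.	Matches any single character except the newline \n. To match a literal ., use \. .
?	Matches the preceding subexpression zero or one time, or marks the preceding quantifier as non-greedy. To match a literal ?, use \?.

So . matches any character except a newline, + requires one or more of them, and the ? after + makes the quantifier non-greedy: the group takes as few characters as possible while still letting the rest of the pattern match. The group is preceded by \\s*, any amount of whitespace, and followed by another \\s* and by \n*, any number of newlines. The whole match is replaced with $1, the value of the first capturing group, so in effect each field is stripped of its leading and trailing whitespace.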

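In the Replace expression, the tag '$0' refers to the matched substring of the input, '$i' refers to the submatch of the i-th capturing group (for i <= 9), and '$$' stands for a literal '$'.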
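
To see what the non-greedy group captures, here is a small Python sketch. The input string is made up for illustration, and note that Python's re module writes group references as \1 in the replacement, not $1 as HALCON does.

```python
import re

pattern = r"^\s*(.+?)\s*\n*$"
field = "  -3248 \n"   # hypothetical field with surrounding whitespace

m = re.match(pattern, field)
whole = m.group(0)     # the entire match, what HALCON refers to as $0
first = m.group(1)     # the first capturing group, $1 in HALCON

# Replacing the whole match with group 1 strips the whitespace.
stripped = re.sub(pattern, r"\1", field)

# With a greedy group, the trailing space ends up inside the capture.
greedy = re.match(r"^\s*(.+)\s*\n*$", field).group(1)

print(repr(whole), repr(first), repr(stripped), repr(greedy))
```

The difference between first and greedy is the point of the ?: the non-greedy group stops as early as the rest of the pattern allows, so the trailing whitespace is left to \s*.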

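Note one detail above: in some languages the backslash is already an escape character inside string literals, so every backslash in the pattern has to be written twice.

For example, in Python, writing the pattern with single backslashes in a normal string was flagged as an error (the screenshot is not preserved).

Written like this, it works: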

print(re.findall("^\\s*(.+?)\\s*\\n*$", str3))
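
A quick way to convince yourself that the doubled backslashes and a raw string describe the same pattern is to compare the two literals directly. This is a minimal sketch; the test input " 9143 " is made up.

```python
import re

doubled = "^\\s*(.+?)\\s*\\n*$"   # normal string, each backslash doubled
raw = r"^\s*(.+?)\s*\n*$"         # raw string, backslashes written once

same = doubled == raw             # both literals hold the same characters
found = re.findall(raw, " 9143 ")

print(same, len("\\s"), found)
```

Both literals produce the same sequence of characters, so the regex engine sees the same pattern either way; the raw string is just easier to read.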

Here is how Python handles it:

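import re


# Read the whole file into one string; the with block closes the file.
with open("0.csv", encoding="utf-8") as f:
    str1 = f.read()
print(type(str1))
print(len(str1))
# The text is very long and every "\n" starts a new line when printed, so
# split on "\n" and keep only the first six lines for the regex experiments.
str2 = str1.split("\n")[:6]
# str2 is a list; join it back into a single string.
print(''.join(str2))
str3 = ''.join(str2)

# Notes:
#   f is an _io.TextIOWrapper; f.read() returns its content as a str.
#   split() and ''.join() convert between str and list.
#   Alternatively, skip the split and work on str1 directly, since it is
#   also a str.
print(re.match("(.+?)", str3).span())
print(re.findall("^\\s*(.+?)\\s*\\n*$", str3))
# str4 = str3.replace()   # plain string replacement is another option
# In a raw string the backslashes must not be doubled, and Python's re.sub
# refers to the first group as \1 (HALCON uses $1).
str4 = re.sub(r"^\s*(.+?)\s*$", r"\1", str3)
print(str4)
# Remove all whitespace:
strs = re.sub("\\s", "", str3)
print(strs)


str5 = re.findall("\\-?\\d+", strs)
print(str5)
num_list = list(map(int, str5))
print(num_list)
x = num_list[1::5]
print(x)
y = num_list[2::5]
z = num_list[3::5]
print(y)
print(z)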

Output:

<class 'str'>
27652417
0 , -3248 , 9143 , 193 , 1 ,1 , -3248 , 9089 , 193 , 1 ,2 , -3249 , 9040 , 188 , 1 ,3 , -3250 , 8990 , 184 , 1 ,4 , -3247 , 8929 , 194 , 1 ,5 , -3248 , 8878 , 190 , 1 ,
(0, 1)
['0 , -3248 , 9143 , 193 , 1 ,1 , -3248 , 9089 , 193 , 1 ,2 , -3249 , 9040 , 188 , 1 ,3 , -3250 , 8990 , 184 , 1 ,4 , -3247 , 8929 , 194 , 1 ,5 , -3248 , 8878 , 190 , 1 ,']
0 , -3248 , 9143 , 193 , 1 ,1 , -3248 , 9089 , 193 , 1 ,2 , -3249 , 9040 , 188 , 1 ,3 , -3250 , 8990 , 184 , 1 ,4 , -3247 , 8929 , 194 , 1 ,5 , -3248 , 8878 , 190 , 1 ,
0,-3248,9143,193,1,1,-3248,9089,193,1,2,-3249,9040,188,1,3,-3250,8990,184,1,4,-3247,8929,194,1,5,-3248,8878,190,1,
['0', '-3248', '9143', '193', '1', '1', '-3248', '9089', '193', '1', '2', '-3249', '9040', '188', '1', '3', '-3250', '8990', '184', '1', '4', '-3247', '8929', '194', '1', '5', '-3248', '8878', '190', '1']
[0, -3248, 9143, 193, 1, 1, -3248, 9089, 193, 1, 2, -3249, 9040, 188, 1, 3, -3250, 8990, 184, 1, 4, -3247, 8929, 194, 1, 5, -3248, 8878, 190, 1]
[-3248, -3248, -3249, -3250, -3247, -3248]
[9143, 9089, 9040, 8990, 8929, 8878]
[193, 193, 188, 184, 194, 190]

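Tip: when writing regular expressions in Python, put a lowercase r in front of the pattern string. This makes it a raw string, in which backslashes are not treated as escape characters, much like the @ prefix used for paths in C#.

Pandas really is the simpler option.

Back to HALCON

HALCON ships with a very good example: tuple_regexp.hdev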

* This program demonstrates the tuple operators and functions for working
* with regular expressions.
* 
* ***************************************************
* ***** Regular expression basics
* ***************************************************
tuple_regexp_match ('abba', 'ab*', Matches)   // * means zero or more of the preceding character; returns 'abb'
tuple_regexp_match ('abba', 'ba*', Matches)    // returns 'b'
tuple_regexp_match ('abba', 'b+a*', Matches)   // returns 'bba'
tuple_regexp_test ('ababab', '(ab){3}', NumMatches)   // this line and the next two return 1, 1, 0
tuple_regexp_test ('abababa', '(ab){3}', NumMatches)
tuple_regexp_test ('abababa', '^(ab){3}$', NumMatches)   // the start matches, but the end does not
tuple_regexp_replace ('abba', 'b*', 'x', Result)     // returns 'xabba': only the first match is replaced, and b* also matches the empty string
tuple_regexp_replace ('abba', 'b', 'x', Result)      // returns 'axba': only the first b is replaced
tuple_regexp_replace ('abba', ['b','replace_all'], 'x', Result)   // returns 'axxa'
* ***************************************************
* ***** Some sample expressions
* ***************************************************
tuple_regexp_replace (['SN/1234567-X','SN/2345678-Y','SN/3456789-Z'], 'SN/(\\d{7})-([A-Z])', 'Product Model $2, Serial Number $1', Result)
tuple_regexp_replace (['01/04/2000','06/30/2007'], '(\\d{2})/(\\d{2})/(\\d{4})', 'Day: $2, Month: $1, Year: $3', Result)
* ***************************************************
* ***** Working with file names
* ***************************************************
get_system ('image_dir', HalconImages)
get_system ('operating_system', OS)
if (OS{0:2} == 'Win')
    tuple_split (HalconImages, ';', HalconImagesSplit)
else
    tuple_split (HalconImages, ':', HalconImagesSplit)
endif
list_files (HalconImagesSplit[0], ['files','follow_links'], Files)
* Filter list of files by extension PNG
tuple_regexp_select (Files, '\\.png$', FilesPNG)
* Ignore images sets by removing all files which end with a digit
tuple_regexp_select (FilesPNG, ['\\d\\.png$','invert_match'], FilesNoDigit)
* Extract file names without slashes (strip directory part)
tuple_regexp_match (FilesNoDigit, '[^/\\\\]*.png', ShortNames)
* Transform file names, e.g., for creating processed output files
tuple_regexp_replace (ShortNames, '(.*)\\.png$', 'out_$1.jpg', ConvertedNames)
* Count number of files with multi-word names (name contains hyphen or underscore)
tuple_regexp_test (ShortNames, '_|-', NumCombined)
* ***************************************************
* ***** Using regular expressions in HDevelop expressions
* ***************************************************
* Again count number of files with digit and calculate percentage
if (|ShortNames| > 0)
    Result := 100.0 * regexp_test(ShortNames,'\\d') / |ShortNames| + '% of PNG file names contain a digit'
endif
* Return letters 2-n of all files starting with 'a'
Result := regexp_match(regexp_select(ShortNames,'^a'),'^a(.*)')
* The operator =~ is short for regexp_test and useful for boolean expressions
if (ShortNames =~ '^z')
    Result := 'A filename starting with z exists'
endif

Let's go through it line by line; see the comments in the code.
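
The 'abba' results from the basics section can be cross-checked with Python's re module. This is a rough sketch that assumes both engines behave the same for these simple patterns; count=1 is used to imitate HALCON's default of replacing only the first match.

```python
import re

# tuple_regexp_match: first match of the pattern in the string
match_ab = re.search(r"ab*", "abba").group()
match_ba = re.search(r"ba*", "abba").group()
match_bba = re.search(r"b+a*", "abba").group()

# tuple_regexp_test: does the pattern match anywhere?
tests = [
    bool(re.search(r"(ab){3}", "ababab")),
    bool(re.search(r"(ab){3}", "abababa")),
    bool(re.search(r"^(ab){3}$", "abababa")),
]

# tuple_regexp_replace: first match only, unless replace_all is given
first_empty = re.sub(r"b*", "x", "abba", count=1)
first_b = re.sub(r"b", "x", "abba", count=1)
all_b = re.sub(r"b", "x", "abba")

print(match_ab, match_ba, match_bba, tests, first_empty, first_b, all_b)
```

The b* case is the one worth remembering: the pattern matches the empty string at the very start of 'abba', so the replacement text is inserted in front of the first character.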

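Now focus on lines 19 and 20 of the example, the two tuple_regexp_replace calls under "Some sample expressions".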

tuple_regexp_replace (['SN/1234567-X','SN/2345678-Y','SN/3456789-Z'], 'SN/(\\d{7})-([A-Z])', 'Product Model $2, Serial Number $1', Result)

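The parentheses () split the match into two parts. The pattern matches the literal 'SN/', then the first capturing group takes seven digits, and after the hyphen the second capturing group takes one letter from A to Z. The replacement works much like format placeholders in C#, except that group numbering starts at 1 ($0 is the whole match). Data extraction and replacement are done in a single step.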
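
The same two replacements can be sketched in Python, again with \1 and \2 in place of HALCON's $1 and $2. The variable names are chosen for this example only.

```python
import re

serials = ["SN/1234567-X", "SN/2345678-Y", "SN/3456789-Z"]
serial_pattern = re.compile(r"SN/(\d{7})-([A-Z])")
# Group 2 (the letter) goes first in the output, group 1 (the digits) second.
products = [
    serial_pattern.sub(r"Product Model \2, Serial Number \1", s)
    for s in serials
]

dates = ["01/04/2000", "06/30/2007"]
date_pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")
# Swap month and day by referring to the groups in a different order.
readable = [
    date_pattern.sub(r"Day: \2, Month: \1, Year: \3", d) for d in dates
]

print(products)
print(readable)
```

Because the groups can be referenced in any order, one replacement both extracts the parts and rearranges them.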

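The code is explained in another post of mine, "3D点云基础知识(二)-bilibili视频资源整理（一）" (3D Point Cloud Basics (2): bilibili video resources, part 1). Here I only analyze the regular expression part. The topic is also touched on in a related article.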
