LC_ALL=en_US.UTF-8 made awk 40x slower!

The mystery of an awk program's performance gap between servers
Adjusting a server's locale settings resolved a 40x performance difference of the same awk program across two servers, and led to a closer look at awk's implementation and the cost of its UTF-8 handling.

  I happened to notice that, on one server, a very simple awk program ran about 40 times slower than its C equivalent. That felt abnormal, though at first I assumed awk really was that slow. Puzzled, I tried the same awk program with the same test data on another server, and there it ran about as fast as C. In other words, the same awk program differed by 40x between two machines whose hardware was essentially identical. After two confusing hours of digging, I finally found this passage in the gawk manual:
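The kind of test that exposed the gap can be sketched roughly as follows (the file path and size here are made up for illustration; the original program and data are not shown in this post):

```shell
# Generate a large test file (size chosen arbitrarily for illustration).
seq 1 1000000 > /tmp/awk_bench.txt

# Time a trivial awk pass over it; on the slow server this kind of
# one-liner took roughly 40x as long as the equivalent C loop.
time awk 'END { print NR }' /tmp/awk_bench.txt
```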

"For other single-character record separators, using 'LC_ALL=C' will give you much better performance when reading records. Otherwise, gawk has to make several function calls, per input character, to find the record terminator."
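The manual's note is about single-character record separators (RS). A minimal illustration of the case it describes, using ':' as an assumed separator:

```shell
# Three records separated by ':'. In a multibyte (UTF-8) locale, gawk
# scans for the terminator character by character; forcing LC_ALL=C
# avoids the per-character function calls the manual mentions.
printf 'a:b:c' | LC_ALL=C awk 'BEGIN { RS=":" } END { print NR }'
# → 3
```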

Comparing the two machines' locales revealed the difference. On the slow machine:

[root@slow-server]# locale
LANG=en_US.UTF-8
LC_XXXX=en_US.UTF-8

...
LC_ALL=en_US.UTF-8

On the fast machine:

[root@fast-server]# locale
LANG=en_US
LC_XXXX=en_US

...
LC_ALL=   <empty>

I immediately tested this by changing the locale on slow-server:

export LC_ALL=C

The program instantly ran 40x faster, matching fast-server.
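Note that exporting LC_ALL=C changes the locale for the whole session. If the UTF-8 locale is needed elsewhere, LC_ALL can instead be set for a single invocation (the file name here is illustrative):

```shell
# Override the locale for this one command only; the shell's own
# environment keeps its en_US.UTF-8 settings.
LC_ALL=C awk 'END { print NR }' /tmp/awk_bench.txt
```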

 

This looks like a weakness in the awk implementation. Even for UTF-8 input it should not be this much slower: with proper buffering, a 2-3x slowdown would be the most you'd expect. There is no good reason why gawk "has to make several function calls per input character".
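A rough side-by-side sketch of the effect, assuming the en_US.UTF-8 locale is installed on the system (the record counts must match; only the elapsed times should differ):

```shell
# Build input that uses a non-newline single-character RS, the exact
# case the gawk manual calls out.
seq 1 1000000 | tr '\n' ':' > /tmp/rs_test.txt

time LC_ALL=C           awk 'BEGIN { RS=":" } END { print NR }' /tmp/rs_test.txt
time LC_ALL=en_US.UTF-8 awk 'BEGIN { RS=":" } END { print NR }' /tmp/rs_test.txt
```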
