Practicing the awk Command

 

This is more of a note and a re-post; the material is excerpted and translated from http://linuxcommand.org/lc3_adv_awk.php

History and purpose:

awk is a standard program found on virtually every Linux distribution. There are two open-source implementations, one called mawk and the other gawk; on any given distribution, the awk command is linked to either mawk or gawk.

Why awk? awk was designed to be used as a filter: it reads data from standard input, transforms it, and writes the result to standard output. It is especially good at processing columnar data. In recent years awk may seem dated, displaced by newer interpreted languages such as Perl and Python, but it still has plenty going for it:

  1. It is easy to learn: the language is not complicated and its syntax resembles C
  2. It is genuinely good at solving a wide range of problems
  3. Most importantly, it is available wherever the shell is, which makes it very handy for quick data processing

How it works:

As noted, awk is usually used as a filter and reads one record at a time. By default a record is one line, delimited by newline characters. awk then automatically splits each record into fields: the first field is referred to as $1, the second as $2, and so on, while $0 refers to the entire record.

A pattern/action pair is a test applied to each record together with the action to perform when the test succeeds. If the pattern evaluates to true, the action is executed. Once all patterns have been tried, the next record is read and the process repeats.

For example: ls -l /usr/bin | awk '{print $0}'

The single quotes are important: they prevent the shell from interpreting the expression. Inside awk, $ refers to fields, not to shell variables or script parameters.

The following command produces no output:

ls -l /usr/bin | awk 'NF > 11 {print $0}'

because no line has more than 11 fields.

Special patterns:

There are two special patterns: BEGIN and END.

The BEGIN pattern runs its action before the first record is read; it is useful for initializing variables and printing headers for the output stream.

The END pattern runs its action after the last record has been processed, which makes it very useful for printing summaries.

ls -l /usr/bin | awk '
BEGIN {print "Directory Report"; print "================"};
NF > 9 {print $9, "is a symbolic link to", $NF}
END {print "=============";print "End Of Report"}'

The example above contains three pattern/action pairs: the first prints the report heading, the second finds the symbolic links and prints their targets, and the third prints the closing banner.

The same thing can also be done as a standalone awk script:

Create a new file and edit it as follows:

#!/usr/bin/awk -f

# Print a directory report

BEGIN {
    print "Directory Report"
    print "================"
}

NF > 9 {
    print $9, "is a symbolic link to", $NF
}

END {
    print "============="
    print "End Of Report"
}

Note the -f option that follows the interpreter path on the shebang line.
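Assuming the script is saved under a name such as report.awk (the file name here is just for illustration), it can be made executable and used as a filter:

chmod +x report.awk
ls -l /usr/bin | ./report.awk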

Script formatting:

BEGIN { # the opening brace of the action must be on the same line as the pattern
  # blank lines are ignored

  # a backslash can be used to continue a long line
    print $1, # a list of print parameters can be split after the commas...
    $2,       # ...and comments must appear at the end of a line
    $3
  # multiple statements on one line can be separated by semicolons
  print "String 1"; print "String 2"
}

Common patterns:
Relational expressions:

  1. $1==100  
  2. $1>=100
  3. $1*$2<100
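As a quick illustration of the third pattern, here is a minimal sketch run against a hypothetical whitespace-separated file sales.txt in which field 1 is a quantity and field 2 a unit price:

# print the records whose total (quantity times unit price) is below 100
awk '$1 * $2 < 100 {print $0}' sales.txt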

Regular expressions:

awk supports extended regular expressions similar to those used by egrep. The common form is:

expression ~ /regexp/

For example, to match records whose third field does not begin with 5, 6, or 7: $3 ~ /^[^567]/

ls -l /usr/bin | awk '
BEGIN {print "Directory Report" ;print "================"};
$9~/^[^ab]/ {print $0}
END {print "end"}
'  

Logical operators:

Patterns can be combined with the logical operators || (OR) and && (AND), for example:

$1 > 100 && $NF == "Debit"

This matches records whose first field is greater than 100 and whose last field is "Debit".
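A minimal sketch of that pattern in use, assuming a hypothetical file ledger.txt whose first field is an amount and whose last field is either "Debit" or "Credit":

# print the debit records over 100
awk '$1 > 100 && $NF == "Debit" {print $0}' ledger.txt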

Range patterns (pattern, pattern):

A pair of patterns separated by a comma selects a consecutive range of records, from the record matching the first pattern through the record matching the second.

ls -l /usr/bin | awk '
NR==1;NR==100 {print NR,$9}'  # two rules: print record 1 in full, and print NR and $9 for record 100

ls -l /usr/bin | awk '
NR==1,NR==100 {print NR,$9}'  # a range: print NR and $9 for records 1 through 100

Field separators and records:

awk ' BEGIN {FS=":"}; {print $1,":",$5}' < /etc/passwd

This command uses a BEGIN rule to set FS to ":" and then prints the first and fifth fields. There are two pattern/action pairs separated by a semicolon: the first (the BEGIN rule) only sets the field separator, while the second is an action with no pattern at all, which means it is applied to every record.

When RS is set to the empty string, awk switches to multi-line record mode: records are separated by blank lines, and newline characters act as field separators in addition to whatever FS is set to. So for records whose lines contain no other whitespace, the following two settings behave the same:

BEGIN { FS = "\n"; RS = "" }
BEGIN { RS = "" }
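A small sketch of multi-line record handling, assuming a hypothetical file addresses.txt that holds blank-line-separated records with one field per line:

# addresses.txt contains records such as:
#   John Smith
#   123 Main Street
#   Springfield, IL 62701
# print the name and the last line of each record
awk 'BEGIN {FS = "\n"; RS = ""} {print $1, "->", $NF}' addresses.txt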

Built-in variables:

FS - Field separator

RS - Record separator

NF - Number of fields in the current record

NR - Number of the current record

OFS - Output field separator

ORS - Output record separator
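A short sketch that uses several of these variables together on /etc/passwd, whose fields are separated by colons (field 1 is the login name, field 6 the home directory):

# print the record number, login name, and home directory, separated by tabs
awk 'BEGIN {FS = ":"; OFS = "\t"} {print NR, $1, $6}' /etc/passwd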

Arrays:

awk supports one-dimensional arrays. Array elements may be strings or numbers, and array indexes may likewise be numbers or strings.

a[1] = 5        # Numeric index
a["five"] = 5   # String index

Multidimensional arrays can be simulated:

a[j,k] = "foo"
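Internally awk joins the indexes with the built-in SUBSEP character to form a single string key; a minimal sketch of storing such keys and walking them with split():

awk 'BEGIN {
    a["x", 1] = "foo"
    a["y", 2] = "bar"
    for (key in a) {               # iterate over the combined keys
        split(key, parts, SUBSEP)  # recover the individual indexes
        print parts[1], parts[2], a[key]
    }
}'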

Deleting elements:

delete a[i]     # delete a single element
delete a        # delete array

Arithmetic and logical operators:

AWK supports a pretty complete set of arithmetic and logical operators:

Operators
Assignment:  = += -= *= /= %= ^= ++ --
Relational:  < > <= >= == !=
Arithmetic:  + - * / % ^
Matching:    ~ !~
Array:       in
Logical:     || &&
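A tiny sketch exercising a few of the less familiar operators (the values are invented for illustration):

awk 'BEGIN {
    seen["apple"] = 1
    if ("apple" in seen) print "apple is an index"          # array membership
    if ("pear" !~ /^a/) print "pear does not start with a"  # regexp non-match
    print 2 ^ 10                                            # exponentiation: 1024
}'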

Flow control:

awk has a full set of flow-control statements; an awk program can even be thought of as one big case statement inside a loop (although the syntax is closer to C). Actions often contain elaborate conditional and flow-control logic, for example:

a = a + 1
{a = a + 1; b = b * a}
ls -l /usr/bin | awk '
BEGIN {line_count = 0 ;page_length = 60}
{line_count++
    if (line_count < page_length) print
    else {print "\f" $0;line_count = 0}}'

The essence of the example above is that conditional logic is nested inside the action of an outer pattern/action pair.

(var in array)           # test whether var is an index of the array
if (array[var] != "")    # test whether the element at that index is non-empty
((var1,var2) in array)   # test for the combined index array[var1, var2]
for ( expression ;
    expression ;
    expression ) statement # C-style for loop
ls -l | awk '{s = ""; for (i = NF; i > 0; i--) s = s $i OFS; print s}'
ls -l | awk '{s = ""; for (i = NF; i > 0; i--) s = s $i " "; print s}'
ls -l | awk '{for (i = NF; i > 0; i--) printf("%s ", $i); print ""}'
# the three commands above all print the fields of each record in reverse order
ls -l | awk 'BEGIN {VL=0} {if (NF==9) print $0, VL} {VL++}' | head
# print each 9-field record together with a running counter VL
ls -la | awk 'BEGIN {cnt=100; i=0} {if (i < cnt) print $0, i} {i++}'
# print at most the first 100 records, each followed by its index i
ls -l | awk '{
    s = ""
    i = NF
    while (i > 0) {
        s = s $i OFS
        i--
    }
    print s
}'

Writing to files and pipes:

print output can be redirected to files or piped into other commands. The first example below splits the listing into three files according to file type, the second collects each regular file's size into an array keyed by name, and the third pipes the resulting report through sort.

ls -l /usr/bin | awk '
$1 ~ /^-/ {print $0 > "regfiles.txt"}
$1 ~ /^d/ {print $0 > "directories.txt"}
$1 ~ /^l/ {print $0 > "symlinks.txt"}
'

ls -l /usr/bin | awk '
$1 ~ /^-/ {a[$9] = $5}
END {for (i in a)
    {print a[i] "\t" i}
}
'

ls -l /usr/bin | awk '
$1 ~ /^-/ {a[$9] = $5}
END {for (i in a)
    {print a[i] "\t" i|"sort -nr"}
}
'
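One detail worth noting: a pipe opened with | stays open until the program exits or it is closed with close(), so calling close() lets the piped command finish and print its output before anything that follows. A minimal sketch based on the previous example:

ls -l /usr/bin | awk '
$1 ~ /^-/ {a[$9] = $5}
END {
    for (i in a)
        print a[i] "\t" i | "sort -nr"
    close("sort -nr")              # flush and close the pipe
    print "End of sorted report"
}
'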

String functions:

gsub(r, s, t)

Globally replaces any substring matching regular expression r contained within the target string t with the string s. The target string is optional. If omitted, $0 is used as the target string. The function returns the number of substitutions made.

index(s1, s2)

Returns the leftmost position of string s2 within string s1. If s2 does not appear within s1, the function returns 0.

length(s)

Returns the number of characters in string s.

match(s, r)

Returns the leftmost position of a substring matching regular expression r within string s. Returns 0 if no match is found. This function also sets the internal variables RSTART and RLENGTH.
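For instance, a small sketch that combines match() with substr() to extract the matched text through RSTART and RLENGTH:

awk 'BEGIN {
    s = "release 2.7.18 is out"
    if (match(s, /[0-9]+\.[0-9]+\.[0-9]+/))
        print substr(s, RSTART, RLENGTH)    # prints 2.7.18
}'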

split(s, a, fs)

Splits string s into fields and stores each field in an element of array a. Fields are split according to field separator fs. For example, if we wanted to break a phone number such as 800-555-1212 into 3 fields separated by the "-" character, we could do this:

phone="800-555-1212"
split(phone, fields, "-")

After doing so, the array fields will contain the following elements:

fields[1] = "800"
fields[2] = "555"
fields[3] = "1212"

sprintf(fmt, exprs)

This function behaves like printf, except instead of outputting a formatted string, it returns a formatted string containing the list of expressions to the caller. Use this function to assign a formatted string to a variable:

area_code = "800"
exchange = "555"
number = "1212"
phone_number = sprintf("(%s) %s-%s", area_code, exchange, number)

sub(r, s, t)

Behaves like gsub, except only the first leftmost replacement is made. Like gsub, the target string t is optional. If omitted, $0 is used as the target string.

substr(s, p, l)

Returns the substring contained within string s starting at position p with length l.
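A quick sketch exercising several of these functions on the phone-number string used earlier:

awk 'BEGIN {
    s = "800-555-1212"
    n = gsub(/-/, ".", s)     # replace every "-" with "."
    print s, n                # 800.555.1212 2
    print index(s, "555")     # 5
    print length(s)           # 12
    print substr(s, 1, 3)     # 800
}'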

Math functions:

AWK has the usual set of arithmetic functions. A word of caution about math in AWK: it has limitations in terms of both number size and precision of floating point operations. This is particularly true of mawk. For tasks involving extensive calculation, gawk would be preferred. The gawk documentation provides a good discussion of the issues involved.

atan2(y, x)

Returns the arctangent of y/x in radians.

cos(x)

Returns the cosine of x, with x in radians.

exp(x)

Returns the exponential of x, that is e^x.

int(x)

Returns the integer portion of x. For example if x = 1.9, 1 is returned.

log(x)

Returns the natural logarithm of x. x must be positive.

rand()

Returns a random floating point value n such that 0 <= n < 1. This is a value between 0 and 1 where a value of 0 is possible but not 1. In AWK, random numbers always follow the same sequence of values unless the seed for the random number generator is first set using the srand() function (see below).

sin(x)

Returns the sine of x, with x in radians.

sqrt(x)

Returns the square root of x.

srand(x)

Sets the seed for the random number generator to x. If x is omitted, then the time of day is used as the seed. To generate a random integer in the range of 1 to n, we can use code like this:

srand()
# Generate a random integer between 1 and 6 inclusive
dice_roll = int(6 * rand()) + 1

Writing your own functions:

# random_table.awk - generate table of random numbers

function rand_integer(max) {
    return int(max * rand()) + 1
}

BEGIN {
    srand()
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 5; j++) {
            printf("    %5d", rand_integer(99999))
        }
        printf("\n", "")
    }
}

Generate the data file:

awk -f random_table.awk > random_table.dat

Convert the file into CSV format:

awk 'BEGIN {OFS=","} {print $1,$2,$3,$4,$5}' random_table.dat

Print the total for each row:

awk '
    {
        t = $1 + $2 + $3 + $4 + $5
        printf("%s = %6d\n", $0, t)
    }
' random_table.dat

 

Print the total for each column:

awk '
    {
        for (i = 1; i <= 5; i++) {
            t[i] += $i
        }
        print
    }
    END {
        print "  ==="
        for (i = 1; i <= 5; i++) {
            printf("  %7d", t[i])
        }
        printf("\n", "")
     }
' random_table.dat

 

Print the minimum and maximum value in column 1:

awk '
    BEGIN {min = 99999}
    $1 > max {max = $1}
    $1 < min {min = $1}
    END {print "Max =", max, "Min =", min}
' random_table.dat

One Last Example

For our last example, we'll create a program that processes a list of pathnames and extracts the extension from each file name to keep a tally of how many files have that extension:

# file_types.awk - sorted list of file name extensions and counts

BEGIN {FS = "."}

{types[$NF]++}

END {
    for (i in types) {
        printf("%6d %s\n", types[i], i) | "sort -nr"
    }
}

To find the 10 most popular file extensions in our home directory, we can use the program like this:

find ~ -name "*.*" | awk -f file_types.awk | head

 
