[Missing Semester L4] Data Wrangling on Linux

These lecture notes are mainly about converting data from one format to another, which comes up often when inspecting logs: pipes and other tools are chained together until the data ends up in the desired shape.

Log-processing example

journalctl queries the logs collected by the systemd-journald service, which is the log-collection service provided by the systemd init system. Its raw output is usually far too verbose, though, so further processing is needed to filter out the information of interest.
For example:
ssh myserver journalctl | grep sshd
However, pulling the entire remote log down and then grepping it locally still takes quite a while; this can be optimized further:
ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less
Note the single quotes here: the filtering now happens on the remote end, while the local less just makes the result easier to scroll through. The filtered data could also be saved to a local file first.
There is still a lot of noise in the output, though, which is where the tool sed comes in.

sed

sed is a stream editor: it makes it easy to transform data as it flows through, and one of its most common uses is substitution driven by regular expressions:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed 's/.*Disconnected from //'

The s command has the form s/REGEX/SUBSTITUTION/, i.e. a regular expression followed by the text to replace each match with.
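For example, applied to a single fabricated sshd log line (the hostname, user, and address are made up for illustration):

```shell
# A fabricated sshd log line, only to demonstrate the substitution.
line='Jan 17 03:22:01 host sshd[1234]: Disconnected from invalid user admin 10.0.0.1 port 22 [preauth]'

# .* greedily matches everything up to and including "Disconnected from ",
# and the empty SUBSTITUTION deletes that whole prefix.
echo "$line" | sed 's/.*Disconnected from //'
# prints: invalid user admin 10.0.0.1 port 22 [preauth]
```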

Regular expressions

They are used heavily in data processing. Common constructs include:

  • . means “any single character” except newline
  • * zero or more of the preceding match
  • + one or more of the preceding match
  • [abc] any one character of a, b, and c
  • (RX1|RX2) either something that matches RX1 or RX2
  • ^ the start of the line
  • $ the end of the line
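A few quick checks with grep -E illustrate these metacharacters on made-up sample strings:

```shell
# ^ and $ anchor the match to the whole line; [ao] is a character class.
printf 'cat\ncot\ncut\n'  | grep -E '^c[ao]t$'     # matches cat and cot
# + requires one or more of the preceding atom.
printf 'ab\naab\nb\n'     | grep -E '^a+b$'        # matches ab and aab
# (RX1|RX2) is alternation.
printf 'yes\nno\nmaybe\n' | grep -E '^(yes|no)$'   # matches yes and no
```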

However, sed cannot do the non-greedy matching expressed by ?, so perl has to be used instead:
perl -pe 's/.*?Disconnected from //'

sed can also print lines (the p command), apply multiple substitutions, search, insert text (the i command), edit files in place (-i), and more; see man sed for details.
Regular expressions really are tricky to write, so an online tool can help sanity-check whether a regex is correct.


Continuing with the same data, the extracted login usernames can now be sorted and counted.

sort

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c

sort orders its input; uniq -c collapses runs of identical adjacent lines into a single line prefixed with the repetition count.
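For example, with a few fabricated usernames (uniq -c only collapses adjacent duplicates, hence the sort first):

```shell
# Count how many times each name occurs.
printf 'bob\nalice\nbob\nbob\nalice\n' | sort | uniq -c
# prints (counts are left-padded by uniq):
#    2 alice
#    3 bob
```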
The counts can then be sorted again to keep only the most frequent entries:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10

-n sorts numerically rather than lexicographically, and -k1,1 sorts by the first whitespace-separated field. The order is ascending by default; -r reverses it.
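On some fabricated "count name" pairs, of the kind uniq -c produces:

```shell
# -n sorts field 1 numerically; lexicographic order would put 10 before 2.
printf '10 root\n2 admin\n33 guest\n' | sort -n -k1,1
# prints: 2 admin, then 10 root, then 33 guest

# -r reverses, so the most frequent entry comes first:
printf '10 root\n2 admin\n33 guest\n' | sort -rn -k1,1 | head -n1
# prints: 33 guest
```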

To get a single comma-separated line of usernames instead of one per line, process further:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10
 | awk '{print $2}' | paste -sd,

paste merges all input lines into one (-s), using a comma as the delimiter (-d,). (The author notes this did not work for them on macOS.)
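A self-contained sketch of the last two stages, again on fabricated counts (the explicit - operand tells paste to read stdin, which also helps portability on BSD/macOS):

```shell
# Keep only the second field (the name), then join all lines with commas.
printf '2 admin\n10 root\n33 guest\n' | awk '{print $2}' | paste -sd, -
# prints: admin,root,guest
```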

awk

awk is well suited to processing text streams. The block in {} says what to do with each matching line; by default, every line matches. $0 refers to the whole line, and $1 through $n refer to the individual fields, split on whitespace by default.
In the command above, it prints the second whitespace-separated field of each line, which here is the username.
More complex processing is possible too, e.g. counting only the usernames that appear exactly once, start with c, and end with e: | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
awk is really a programming language in its own right and can largely replace grep and sed.
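Applied to some fabricated uniq -c output, the filter above behaves like this:

```shell
# count, then username, as uniq -c would print them.
printf '1 cage\n3 core\n1 dave\n1 candle\n' |
  awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }'
# prints cage and candle: core occurs 3 times, dave fails the regex.
```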

Other data-analysis tools

bc can do arithmetic like a calculator:
| paste -sd+ | bc -l
R can also be brought in, which makes more involved data analysis and plotting convenient:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | awk '{print $1}' | R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
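When R is not at hand, a rough stand-in for part of that summary can be written in awk. This is not the lecture's command, just a sketch that computes the mean of the counts:

```shell
# Mean of the first field over all input lines (fabricated counts).
printf '2\n10\n33\n' | awk '{ sum += $1; n += 1 } END { print sum / n }'
# prints: 15
```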

gnuplot can also produce simple plots:

ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1 | tail -n10
 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'

In addition, xargs can combine commands, acting as a filter that turns its input into arguments for another command. Many commands do not accept their arguments via a pipe, and that is where xargs comes in.
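A safe way to see what xargs builds is to prefix the real command with echo (the file names here are made up):

```shell
# xargs turns its input lines into arguments for the given command.
printf 'a.txt\nb.txt\n' | xargs echo rm
# prints: rm a.txt b.txt  (echo shows the command instead of deleting)
```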
The shell can also handle binary data, such as images.

Exercises

1. Complete the regex tutorial.
2. Find the number of words (in /usr/share/dict/words) that contain at least three as and don't have a 's ending. What are the three most common last two letters of those words? sed's y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

4. Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot.
5. Look for boot messages that are not shared between your past three reboots (see journalctl's -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
6. Find an online data set and fetch it using curl, then extract out just two columns of numerical data. If you're fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.
