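Operating Systems (Personal Notes) 4: Extracting Vocabulary from the Brown Corpus into a Dictionary; Process Monitoring and Management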

LM306

Published 2023-01-11 16:57:32


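Category: Operating Systems. Tags: linux, bash

Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.

Article link: https://blog.csdn.net/LM306/article/details/128647535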


OS Programme Lecture #4

1. BASH Programming (on a Unix system): Read one million words from text files:

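A more complex script

Extract the vocabulary and the word frequencies from the Brown Corpus, the first machine-readable corpus.

The script automatically iterates over every file in the brown folder and extracts each word in the corpus along with how often it is used.

It strips symbols such as ', `, [, ] and $, and stores the word counts in a hashmap data structure (also called a "dictionary").

Once all data files have been read, the script prints the word frequencies in a final for loop.

Run man sed to look up the sed syntax used here and try to understand it.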
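Before running the full script, it can help to try the first sed substitution on a single line. The sketch below assumes the corpus files use a word/tag token format; the sample line and the function name strip_tags are made up for illustration and are not part of the course material.

```bash
#!/usr/bin/env bash
# strip_tags is a hypothetical helper name; the sample input is invented
# to mimic a "word/tag" line.
strip_tags() {
    # \([^ ]*\) captures the token up to its last slash;
    # /[^ ]* matches the slash and the tag, which are dropped.
    # The underscore is used as the sed delimiter so "/" needs no escaping.
    sed 's_\([^ ]*\)/[^ ]*_\1_g'
}

sample='The/at jury/nn said/vbd it/pps was/bedz fair/jj ./.'
echo "$sample" | strip_tags    # prints: The jury said it was fair .
```

The remaining sed calls in the script follow the simpler pattern s/X//g, which deletes every occurrence of one symbol X.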

Create WordFrequencies.sh with the following code:

#!/bin/bash
# Count how often each word occurs across the Brown Corpus files.
declare -A hashmap    # associative array: word -> count

for file in brown/*[0-9]; do
    echo "Reading $file"
    # Drop the trailing /tag from each token (word/tag -> word)
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    # Remove the symbols ' ` [ ] $ one at a time
    sed "s/'//g" t1.txt > t2.txt
    sed "s/\`//g" t2.txt > t3.txt
    sed "s/\[//g" t3.txt > t4.txt
    sed "s/\]//g" t4.txt > t5.txt
    sed "s/\\$//g" t5.txt > t6.txt

    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            #echo $line
            for word in $line; do    # unquoted on purpose: split the line into words
                if [ ${#word} -gt 0 ]; then
                    #echo ${word}
                    if [ "${hashmap[$word]+_}" ]; then    # key already present?
                        hashmap[$word]=$(( ${hashmap[$word]} + 1 ))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < "t6.txt"
done

# Print every word with its count (the order is unspecified)
for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
done

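Run it! (Be patient: the script takes a long time to finish.)

Comment out some of the lines in the code above and observe how the program's output changes, to deepen your understanding.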
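The counting logic can also be explored in isolation on a short string instead of the whole corpus. The following is a minimal sketch; the array name counts and the sample text are made up for illustration. It relies on the same ${array[key]+_} test as the script, which expands to _ only when the key already exists.

```bash
#!/usr/bin/env bash
declare -A counts   # hypothetical name; plays the role of hashmap in the script

text="the cat saw the dog and the bird"   # invented sample text
for word in $text; do                     # unquoted: split on whitespace
    if [ "${counts[$word]+_}" ]; then     # key exists: increment its count
        counts[$word]=$(( ${counts[$word]} + 1 ))
    else                                  # first occurrence of this word
        counts[$word]=1
    fi
done

# ${#counts[@]} is the number of keys, i.e. distinct words
echo "the=${counts[the]} cat=${counts[cat]} distinct=${#counts[@]}"
# prints: the=3 cat=1 distinct=6
```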

Then try the following code to answer the homework questions. Reference code:

#!/bin/bash
declare -A hashmap    # associative array: word -> count

for file in brown/*[0-9]; do
    echo "Reading $file"
    # Drop the trailing /tag from each token (word/tag -> word)
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    # Remove the symbols ' ` [ ] $ one at a time
    sed "s/'//g" t1.txt > t2.txt
    sed "s/\`//g" t2.txt > t3.txt
    sed "s/\[//g" t3.txt > t4.txt
    sed "s/\]//g" t4.txt > t5.txt
    sed "s/\\$//g" t5.txt > t6.txt

    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            #echo $line
            for word in $line; do    # unquoted on purpose: split the line into words
                if [ ${#word} -gt 0 ]; then
                    #echo ${word}
                    if [ "${hashmap[$word]+_}" ]; then    # key already present?
                        hashmap[$word]=$(( ${hashmap[$word]} + 1 ))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < "t6.txt"
    #break    # uncomment to stop after the first file while testing
done

numWords=0    # number of distinct words (keys in hashmap)
topWord=""    # most frequent word seen so far
topFreq=0     # count of topWord
sumFreq=0     # sum of all counts

for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
    numWords=$((numWords + 1))
    if [ "$topFreq" -lt "${hashmap[$i]}" ]; then
        topWord=$i
        topFreq=${hashmap[$i]}
    fi
    sumFreq=$((sumFreq + ${hashmap[$i]}))
done

# bc -l does the division with decimals; $(( )) would truncate to an integer
avgFreq=$(echo "$sumFreq / $numWords" | bc -l)

echo "What is the total number of words? Answer=$numWords"
echo "What is the most frequent word? Answer=$topWord"
echo "What is the number of hits of the most frequent word? Answer=$topFreq"
echo "Average word frequency=$avgFreq"
echo "Does the memory used grow as your script reads more data, and why? Answer=Yes, because the variable 'hashmap' grows with more data."
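The average is computed with bc -l because Bash arithmetic only handles integers. A small sketch with made-up numbers (7 occurrences spread over 2 distinct words) shows the difference. It assumes bc is installed and skips the second half otherwise.

```bash
#!/usr/bin/env bash
sumFreq=7      # made-up values for illustration
numWords=2

intAvg=$(( sumFreq / numWords ))    # integer division truncates: 7 / 2 -> 3
echo "integer average: $intAvg"

if command -v bc >/dev/null 2>&1; then
    # -l loads bc's math library, which also switches on decimal places
    realAvg=$(echo "$sumFreq / $numWords" | bc -l)
    echo "bc -l average: $realAvg"  # a decimal value starting with 3.5
fi
```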

2. Process Management:

(1) BASH - Process execution

First, write a script loop.sh that loops forever. Reference code:

#!/bin/bash
num=1
while true; do
    square=$((num * num))
    echo "$num $square"
    num=$((num + 1))
done
# Never reached: the loop above only ends when the process is interrupted or killed
echo "Program terminated ..."

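Press Ctrl+C to stop the script.

In the first console, run ps aux.

Open a new terminal and run ps aux | grep bash.

Go back to the first console and run loop.sh.

Switch to the new console, run ps aux | grep bash, then run ps aux | awk '$8 == "R+"', and compare the results.

Go back to the first console and kill the loop.sh process.

Switch to the new console and run ps aux | grep bash and ps aux | awk '$8 == "R+"' again.

Compare the results!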

The ps aux command is a tool to monitor processes running on your Linux system.

A process is associated with any program running on your system, and is used to manage and monitor a program’s memory usage, processor time, and I/O resources.
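The awk filter used in the steps above depends on the process state being the eighth column of the ps aux output. One way to check that is to look the column up by name in the header line. In this sketch column_of is a hypothetical helper, and the header string is the one typically printed by ps aux on Linux, written out here as an assumption so the example runs without calling ps.

```bash
#!/usr/bin/env bash
# column_of FIELD "HEADER": print the 1-based position of FIELD in HEADER.
# (hypothetical helper, not a standard command)
column_of() {
    echo "$2" | awk -v want="$1" '{ for (i = 1; i <= NF; i++) if ($i == want) print i }'
}

# Assumed header; compare it with the first line of `ps aux` on your system.
header="USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND"
column_of STAT "$header"    # prints: 8
```

On a live system the header can be read with ps aux | head -n 1. According to the Linux ps manual, R in the STAT column marks a running or runnable process and a trailing + marks the foreground process group, so R+ selects running foreground processes such as loop.sh.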

(2) BASH - process termination with the kill command

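Use kill to terminate a running process:

In the first console, run the infinite-loop script loop.sh.

Switch to the second console, find the process running loop.sh, and note its PID.

Now terminate that process with kill: in the second console, run kill PID.

Go back to the first console. Has the script been terminated?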
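The same experiment can be tried in a single terminal with a background job, which makes the effect of kill easy to observe. The sketch below uses sleep as a harmless stand-in for loop.sh (an assumption, chosen so the example cannot flood the terminal). By default kill sends SIGTERM, and the shell reports a process ended by a signal as exit status 128 plus the signal number.

```bash
#!/usr/bin/env bash
sleep 30 &            # stand-in for ./loop.sh, started in the background
pid=$!                # $! holds the PID of the most recent background job
echo "started PID $pid"

kill "$pid"           # sends SIGTERM (signal 15) by default

status=0
wait "$pid" || status=$?       # collect the exit status of the killed job
echo "exit status: $status"    # 128 + 15 = 143 for SIGTERM

# kill -0 sends no signal; it only checks whether the PID still exists
if kill -0 "$pid" 2>/dev/null; then
    echo "still running"
else
    echo "terminated"
fi
```

If a process ignores SIGTERM, kill -9 PID sends SIGKILL, which cannot be caught or ignored.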

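3. Homework:

(1) Get familiar with the WordFrequencies.sh script. Based on running the reference code above, try to answer the following questions:

- Does the memory used grow as your script reads more data? (Open the system monitor and watch the memory usage.)

- What is the total number of words?

- What is the most frequent word?

- What is the number of hits of the most frequent word?

- What is the average word frequency?

(2) Run the WordFrequencies.sh script from (1). While it runs, open another terminal to monitor the process, then go back to the console running it and kill the process. Record the steps and paste a screenshot of the ps aux | grep bash output into your lab report.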

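This courseware is provided for beginners' reference and study only; copyright ultimately belongs to the operating systems lab instructors.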
