Linux: Learning the Bare Minimum of awk and sed (Continuously Updated)

1 Introduction

My work mostly focuses on natural language processing, which means I spend a lot of time dealing with text files. Personally (and maybe unfairly), I find the data-cleaning stage of language processing really annoying. Every source formats its data in its own style. Whenever I encountered a new text format, I used to write a Python script to process it, which was not difficult but was really time-consuming. Such scripts are hard to maintain and cannot be reused for another task (that is, another format). All of this made me dislike data cleaning, and maybe even Python scripting, until one day my senpai Xiang recommended awk and sed to me.

2 Sample Text Files

This sample is from the GitHub AI Challenge and is used here purely for learning purposes.
We have a sample file like this:

<srcset setid="newsdev2017" srclang="en">
<doc sysid="ref" docid="abcnews.199680" genre="news" origlang="en">
<seg id="1">New Questions Over California Water Project</seg>
<seg id="2">Critics and a state lawmaker say they want more explanations on who's paying for a proposed $16 billion water project backed by Gov. Jerry Brown, after a leading California water district said Brown's administration was offering government funding to finish the planning for the two giant water tunnels.</seg>
<seg id="3">Critics said the government funding described by the Los Angeles-based Metropolitan Water District on Thursday could run counter to long-standing state assurances that various local water districts, not California itself, would pay for Brown's vision of digging twin 35-mile-long tunnels to carry water from the Sacramento River south, mainly for Central and Southern California.</seg>
<seg id="4">The $248 million in preliminary spending for the tunnels, which have yet to win regulatory approval, already is the topic of an ongoing federal audit.</seg>
<seg id="5">On Wednesday, state lawmakers ordered a state audit of the tunnels-spending as well.</seg>
<seg id="6">On Thursday, state spokeswoman Nancy Vogel said that despite the account of the Los Angeles-based Metropolitan Water District, no money from the state's general fund would be used finishing the current planning phase of the twin tunnels.</seg>
<seg id="7">However, opponents of the tunnels and a taxpayer group were critical Thursday, and Assemblywoman Susan Eggman, one of the state lawmakers behind this week's audit order, asked the state Thursday for clarification.</seg>
<seg id="8">"It's a shell game," said David Wolfe, the Howard Jarvis Taxpayers Association's legislative director.</seg>
<seg id="9">I think it comes back to the audit (request) yesterday: There are way more questions here than there are answers.</seg>
<seg id="10">The tunnels project is endorsed by Brown and by some politically influential water districts and water customers in Central and Southern California.</seg>
<seg id="11">Supporters say the tunnels would benefit the environment and offer Californians a more secure water supply.</seg>
<seg id="12">Opponents say they fear the state will use the tunnels to divert too much water from the Sacramento River and San Francisco Bay, harming Northern California and further endangering native species there.</seg>
<seg id="13">Metropolitan and other water districts slated to get water from the tunnels have yet to commit to paying for them, out of uncertainty whether the massive spending would really bring them enough water to make the cost worthwhile.</seg>
<seg id="14">The same water districts also announced this year they would not pay to complete the current preliminary work on the tunnels unless the project first won regulatory approval.</seg>
<seg id="15">On Thursday, a monthly report published by the LA-based water district on the tunnels project said, "the state has indicated that any additional funding needs to complete the planning phase will be provided by state or federal sources."</seg>
<seg id="16">After all that local water districts had spent on the project, including $63 million from his water district, "This is to be expected" that the state would use government money to close out planning, said Bob Muir, spokesman for the LA-based water district.</seg>
<seg id="17">He referred further questions to Vogel, the state spokeswoman.</seg>
<seg id="18">Vogel said the state intended to pull money to finish the tunnels planning from user fees for an existing, half-century-old water network, the State Water Project.</seg>
<seg id="19">Tunnel opponents, however, point to a measure state lawmakers passed in 2009 that they say bars the state from spending money on the tunnels until the water agencies that would benefit commit to paying for them.</seg>
<seg id="20">"Project contractors pledged to pay for this project and they've used financial gimmicks to get around this obligation," said Patricia Schifferle, an environmental consultant and longtime opponent of the proposed tunnels.</seg>
<seg id="21">It raises questions as to where this money was suddenly found.</seg>
</doc>
<doc sysid="ref" docid="abcnews.199704" genre="news" origlang="en">
<seg id="1">Business Groups Appeal to China Over Cybersecurity Law</seg>
<seg id="2">A coalition of international business groups has appealed to China to change proposed cybersecurity rules they warn will harm trade and isolate the country.</seg>
<seg id="3">The letter to Chinese Premier Li Keqiang is signed by 46 groups from the United States, Europe, Asia and Latin America, reflecting the global scale of concern the rules might limit or cut off access to China's market for information security products.</seg>
<seg id="4">Signers include the U.S. Chamber of Commerce, the European Services Forum and groups from Japan, Korea and Mexico.</seg>
<seg id="5">The proposed rules would require providers to show Chinese authorities how security products work and to store information about Chinese citizens within the country.</seg>
<seg id="6">The latest letter, dated Wednesday, says that might make data theft easier and reduce access to Chinese customers.</seg>
</doc>
</srcset>

3 Command Line sed

3.1 Delete Certain Pattern in a Line

Most of these commands rely on regular expressions. A good place to study regular expressions is https://regexr.com/; see the RegEx Reference in its left column.

For example, suppose there is a file with the following contents:

website.com/path/to/file/234432517.gif" width="620">
website.com/path/to/file/143743e53.gif" width="620">
website.com/path/to/file/123473232.gif" width="620">
website.com/path/to/file/634132317.gif" width="620">
website.com/path/to/file/432432173.gif" width="620">

What we need is to keep only the website part of each line, so we can run:

sed 's/" width="[0-9]\+">//g' file

Notice that you cannot write a bare + in place of \+ here. There are two regex flavors to keep apart: PCRE (Perl Compatible Regular Expressions) and BRE (Basic Regular Expressions). sed does not understand PCRE and uses BRE by default, where a bare + is a literal plus sign. It does not know \d, and \s is a GNU extension rather than part of POSIX. I personally prefer -E, which tells sed to use Extended Regular Expressions (ERE); with it, the command for the problem above becomes:

sed -E 's/"\s*width="[0-9]+">//g' file
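To see the two syntaxes side by side, here is a minimal sketch. It assumes GNU sed (the \+ form in BRE is a GNU extension) and uses a throwaway file created with mktemp; the variable names tmp, bre, and ere are only for illustration. It also swaps \s for the POSIX class [[:space:]], which should behave the same here and is more portable.

```shell
# Create a throwaway sample file modeled on the example above.
tmp=$(mktemp)
printf '%s\n' \
  'website.com/path/to/file/234432517.gif" width="620">' \
  'website.com/path/to/file/143743e53.gif" width="620">' > "$tmp"

# BRE (default): "one or more" has to be written as \+ (GNU extension).
bre=$(sed 's/" width="[0-9]\+">//g' "$tmp")

# ERE (-E): a plain + means "one or more"; [[:space:]] replaces \s.
ere=$(sed -E 's/"[[:space:]]*width="[0-9]+">//g' "$tmp")

printf '%s\n' "$bre"
rm -f "$tmp"
```

Both variables should end up holding the same two URLs with the width attribute stripped.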

Another caution: never use sed -i without first testing the command without -i to be sure it works. If you do use it, at least write -i.bak (any suffix after -i works the same way) so that sed keeps a backup of the original file.
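A cautious workflow along these lines could look like the sketch below. The directory work and the file demo.txt are hypothetical names created just for the demonstration.

```shell
# Work in a scratch directory so nothing real is modified.
work=$(mktemp -d)
printf '%s\n' 'a.gif" width="620">' > "$work/demo.txt"

# 1. Dry run: print the result to the terminal, leave the file untouched.
sed -E 's/" width="[0-9]+">//g' "$work/demo.txt"

# 2. Edit in place, keeping the original as demo.txt.bak.
sed -i.bak -E 's/" width="[0-9]+">//g' "$work/demo.txt"

# 3. Inspect both versions before deleting the backup.
edited=$(cat "$work/demo.txt")
backup=$(cat "$work/demo.txt.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -rf "$work"
```

Writing the suffix directly after -i (with no space) is the form accepted by both GNU and BSD sed.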

Another example: to strip the <seg id="..."> and </seg> tags from the sample file in Section 2, we can run:

sed -E 's/(<seg id="[0-9]+">|<\/seg>)//g' src_test.sgm
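As a quick check of what this substitution does, the sketch below feeds a single line from the sample through the same expression. The variable names line and text are only illustrative.

```shell
# One line taken from the sample file in Section 2.
line='<seg id="1">New Questions Over California Water Project</seg>'

# The alternation (A|B) matches either the opening tag or the closing tag;
# the g flag removes every match on the line, not just the first.
text=$(printf '%s\n' "$line" | sed -E 's/(<seg id="[0-9]+">|<\/seg>)//g')

printf '%s\n' "$text"
```

Only the sentence itself should remain, with both tags removed.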

Sometimes we need to delete a leading space at the start of each line; then we can run:

sed -E 's/^[ ]//g' src_test.sgm
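The command above removes exactly one leading space. If lines may start with several spaces or with tabs, a variant using the POSIX class [[:space:]] could be used instead; this is a sketch of that alternative, not something from the original workflow.

```shell
# Two sample lines: one indented with three spaces, one with a tab.
trimmed=$(printf '   indented with spaces\n\tindented with a tab\n' \
  | sed -E 's/^[[:space:]]+//')

printf '%s\n' "$trimmed"
```

Because the pattern is anchored with ^, it can match at most once per line, so the g flag is not needed.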

3.2 Delete Certain Row

To delete the first line of the text:

sed '1d' src_test.sgm
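The number before d is a line address, and it can also be a range. A small sketch on made-up input:

```shell
# Four-line sample input (hypothetical).
input=$(printf '%s\n' 'a' 'b' 'c' 'd')

# Delete only line 1.
first_removed=$(printf '%s\n' "$input" | sed '1d')

# Delete lines 2 through 3 (inclusive range).
middle_removed=$(printf '%s\n' "$input" | sed '2,3d')

printf '%s\n---\n%s\n' "$first_removed" "$middle_removed"
```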

3.3 Delete Line with Certain Pattern

sed -E '/(<doc sysid=|<\/doc>|<\/srcset>)/d' src_test.sgm
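The commands from Sections 3.1 to 3.3 can be chained in a single sed call with several -e expressions. The sketch below does this on a shortened copy of the sample from Section 2; the temporary file and the variable clean are illustrative names.

```shell
# A shortened copy of the sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<srcset setid="newsdev2017" srclang="en">
<doc sysid="ref" docid="abcnews.199680" genre="news" origlang="en">
<seg id="1">New Questions Over California Water Project</seg>
<seg id="17">He referred further questions to Vogel, the state spokeswoman.</seg>
</doc>
</srcset>
EOF

# Expressions run in order for each line:
#   1d                drop the first line (<srcset ...>)
#   /.../d            drop <doc ...>, </doc> and </srcset> lines
#   s/.../            strip the <seg> tags from what is left
clean=$(sed -E \
  -e '1d' \
  -e '/(<doc sysid=|<\/doc>|<\/srcset>)/d' \
  -e 's/(<seg id="[0-9]+">|<\/seg>)//g' "$sample")

printf '%s\n' "$clean"
rm -f "$sample"
```

The result should be plain text with one sentence per line.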

3.4 Display Specific Lines in Text

sed -n '12720603,12720604p' ai_train_zh.txt 
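With -n alone, sed still reads the file to the end after printing the requested lines. On a file with millions of lines, adding a q command at the last wanted line lets it stop early. A small sketch, using a ten-line stand-in file generated with seq:

```shell
# Stand-in for a large file such as ai_train_zh.txt.
big=$(mktemp)
seq 1 10 > "$big"

# Print lines 3-4, then quit at line 4 instead of scanning the rest.
picked=$(sed -n '3,4p;4q' "$big")

printf '%s\n' "$picked"
rm -f "$big"
```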

3.5 Delete the Last Line

Here fileName is the name of the file.

sed -i '$d' fileName
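In a sed address, $ stands for the last line of the input. The single quotes matter: they stop the shell from treating $d as a variable. The sketch below previews the effect without -i on a hypothetical three-line file.

```shell
# Hypothetical three-line file.
log=$(mktemp)
printf '%s\n' 'first' 'second' 'last' > "$log"

without_last=$(sed '$d' "$log")    # everything except the final line
only_last=$(sed -n '$p' "$log")    # just the final line

printf '%s\n---\n%s\n' "$without_last" "$only_last"
rm -f "$log"
```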

4 Command Line awk

4.1 Count the Frequency of Words in a Text File

awk '{for(i=1;i<=NF;++i){++m[$i]}}END{for(k in m){print k, m[k]}}' words.txt | sort -nr -k 2

For a detailed explanation, please refer to:
https://blog.csdn.net/u013246898/article/details/80240024
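For readers who find the one-liner dense, here is the same idea spread over several lines with comments. This is a sketch on a made-up words file; the array name freq and the secondary sort key (alphabetical order for ties) are additions for readability and stable output.

```shell
# Made-up input file.
words=$(mktemp)
printf '%s\n' 'the cat sat' 'the cat ran' 'the end' > "$words"

counts=$(awk '
{
  # NF is the number of whitespace-separated fields on the current line.
  for (i = 1; i <= NF; ++i) {
    ++freq[$i]            # associative array: word -> count
  }
}
END {
  # Runs once after the last line; iteration order is unspecified,
  # which is why the output is piped through sort.
  for (word in freq) {
    print word, freq[word]
  }
}' "$words" | sort -k2,2nr -k1,1)

printf '%s\n' "$counts"
rm -f "$words"
```

Sorting numerically on the second column in reverse puts the most frequent words first.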
