RH033 Unit8 Text Processing Tools

Objectives
1) Upon completion of this unit, you should be able to:
- Use tools for extracting, analyzing and manipulating text data
Tools for Extracting Text
1) File contents: less and cat
2) File Excerpts: head and tail
3) Extract by Column: cut
4) Extract by Keyword: grep
Viewing File Contents - less and cat
1) cat: dump one or more files to STDOUT
- Multiple files are concatenated together
2) less: view a file or STDIN one page at a time
- Useful commands while viewing:
  • /text searches for text
  • n/N jumps to the next/previous match
  • v opens the file in a text editor
- less is the pager used by man
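The concatenation behavior of cat can be sketched as follows. The scratch directory and the file names part1.txt and part2.txt are illustrative assumptions, not part of the course material.

```shell
# Create two small sample files in a scratch directory (hypothetical names).
dir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$dir/part1.txt"
printf 'gamma\n' > "$dir/part2.txt"

# cat writes the files to STDOUT one after another, in the order given.
combined=$(cat "$dir/part1.txt" "$dir/part2.txt")
printf '%s\n' "$combined"

# less reads STDIN when no file is named, so long output can be paged with:
#   cat "$dir/part1.txt" "$dir/part2.txt" | less

rm -rf "$dir"
```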
Viewing File Excerpts - head and tail
1) head: Display the first 10 lines of a file
- Use -n to change the number of lines displayed
2) tail: Display the last 10 lines of a file
- Use -n to change the number of lines displayed
- Use -f to follow subsequent additions to the file
  • Very useful for monitoring the log files!
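A minimal sketch of head and tail with -n, run against a generated 20-line file. The file name numbers.txt is an assumption made for illustration, and the log path in the tail -f comment is only an example.

```shell
# Build a 20-line sample file (hypothetical name) in a scratch directory.
dir=$(mktemp -d)
seq 1 20 > "$dir/numbers.txt"

# First 3 lines instead of the default 10.
first=$(head -n 3 "$dir/numbers.txt")
# Last 2 lines instead of the default 10.
last=$(tail -n 2 "$dir/numbers.txt")

printf 'head:\n%s\ntail:\n%s\n' "$first" "$last"

# tail -f keeps running and prints lines as they are appended.
# Stop it with Ctrl-C (the path below is only an example):
#   tail -f /var/log/messages

rm -rf "$dir"
```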
Extracting Text by keyword - grep
1) Print lines of files or STDIN to STDOUT where a pattern is matched
  • grep 'john' /etc/passwd
  • date --help | grep year
2) Use -i to search case-insensitively
3) Use -n to print the line numbers of matches
4) Use -v to print lines not containing the pattern
5) Use -A x to include the x lines after each match
6) Use -B x to include the x lines before each match
7) Use -l to return the names of the files that contain the pattern
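The options above might be exercised like this. The sample file users.txt and its contents are made up for the example and only imitate the layout of /etc/passwd.

```shell
# Sample data in a scratch directory (hypothetical file, passwd-like layout).
dir=$(mktemp -d)
cat > "$dir/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
mary:x:501:501:Mary Jones:/home/mary:/bin/sh
EOF

# -i: ignore case, so the pattern JOHN matches the john line.
match_i=$(grep -i 'JOHN' "$dir/users.txt")
# -n: prefix each matching line with its line number.
match_n=$(grep -n 'mary' "$dir/users.txt")
# -v: print the lines that do NOT contain the pattern.
match_v=$(grep -v 'bash' "$dir/users.txt")
# -A 1: print each match plus the one line after it.
match_a=$(grep -A 1 '^root' "$dir/users.txt")
# -l: print only the names of files that contain the pattern.
match_l=$(grep -l 'root' "$dir/users.txt")

printf '%s\n' "$match_i" "$match_n" "$match_v" "$match_a" "$match_l"
rm -rf "$dir"
```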
Extracting Text by Column - cut
1) Display specified columns of file or STDIN data
  • cut -d: -f1 /etc/passwd
  • grep root /etc/passwd | cut -d: -f7
2) Use -d to specify the column delimiter (default is TAB)
3) Use -f to specify the column(s) to print
4) Use -c to cut by characters
  • cut -c2-5 /usr/share/dict/words
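As a small sketch of field and character extraction, the following uses a two-line passwd-style file. The file name passwd.sample and its contents are assumptions for illustration.

```shell
# Sample data in a scratch directory (hypothetical file).
dir=$(mktemp -d)
cat > "$dir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/sh
EOF

# Field 1 (the login name) from each colon-delimited line.
names=$(cut -d: -f1 "$dir/passwd.sample")
# Fields 1 and 7 together; cut keeps the delimiter between them.
shells=$(cut -d: -f1,7 "$dir/passwd.sample")
# Characters 2 through 5 of each line read from STDIN.
chars=$(printf 'abcdefgh\n' | cut -c2-5)

printf '%s\n' "$names" "$shells" "$chars"
rm -rf "$dir"
```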
Tools for Analyzing Text
1) Text Stats: wc
2) Sorting text: sort
3) Comparing files: diff and patch
4) Spell check: aspell
Gathering Text Statistics - wc (word count)
1) Counts words, lines, bytes and characters
2) Can act upon a file or STDIN
3) Use -l for only line count
4) Use -w for only word count
5) Use -c for only byte count
6) Use -m for character count (not displayed by default)
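A brief sketch of the individual counters, using a two-line sample file. The file name sample.txt is an assumption.

```shell
# Two lines, five words, 24 bytes (hypothetical sample file).
dir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$dir/sample.txt"

# Reading from STDIN (via <) keeps the file name out of the output.
lines=$(wc -l < "$dir/sample.txt")   # counts newline characters
words=$(wc -w < "$dir/sample.txt")   # counts whitespace-separated words
bytes=$(wc -c < "$dir/sample.txt")   # counts bytes

# For plain ASCII text, wc -m (characters) reports the same number as wc -c.
echo "lines=$lines words=$words bytes=$bytes"
rm -rf "$dir"
```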
Sorting Text - sort
1) Sorts text to STDOUT; the original file is unchanged
  • sort [options] file(s)
2) Common options
  • -r performs a reverse (descending) sort
  • -n performs a numeric sort
  • -f ignores (folds) case of characters in strings
  • -u (unique) removes duplicate lines in output
  • -t c uses c as a field separator
  • -k x sorts by c-delimited field x, can be used multiple times
  sort -t : -k 3 -n /etc/passwd
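The field-based numeric sort above can be tried on a small passwd-style sample. The file name passwd.sample and its three entries are assumptions chosen so the effect of -n is visible.

```shell
# Sample data in a scratch directory (hypothetical file).
dir=$(mktemp -d)
cat > "$dir/passwd.sample" <<'EOF'
john:x:500:500::/home/john:/bin/bash
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
EOF

# Numeric sort on field 3 (the UID), with ":" as the field separator.
by_uid=$(sort -t : -k 3 -n "$dir/passwd.sample" | cut -d: -f1)
# Same key with -r added: highest UID first.
by_uid_desc=$(sort -t : -k 3 -n -r "$dir/passwd.sample" | cut -d: -f1)

# A plain sort compares lines as text, so "10" comes before "9".
as_text=$(printf '10\n9\n' | sort)
# -n compares them as numbers instead.
as_number=$(printf '10\n9\n' | sort -n)

printf '%s\n' "$by_uid" "$by_uid_desc" "$as_text" "$as_number"
rm -rf "$dir"
```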
Eliminating Duplicate Lines - sort and uniq
1) sort -u: removes duplicate lines from input
2) uniq: removes duplicate adjacent lines from input
  • Use -c to count the number of occurrences
  • Use with sort for best effect:
sort userlist.txt | uniq -c
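To see why uniq is normally paired with sort, consider this sketch. The contents of userlist.txt are invented for the example.

```shell
# Duplicates that are NOT adjacent (hypothetical sample data).
dir=$(mktemp -d)
printf 'bob\nalice\nbob\nalice\nbob\n' > "$dir/userlist.txt"

# uniq alone only collapses ADJACENT duplicates, so all 5 lines survive here.
unsorted=$(uniq "$dir/userlist.txt" | wc -l)
# Sorting first makes the duplicates adjacent.
deduped=$(sort "$dir/userlist.txt" | uniq)
# sort -u gives the same result in one step.
deduped_u=$(sort -u "$dir/userlist.txt")
# uniq -c prefixes each line with its count, padded with leading spaces.
counts=$(sort "$dir/userlist.txt" | uniq -c)

printf '%s\n' "$counts"
rm -rf "$dir"
```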
Comparing Files - diff
1) Compares two files for differences
2) Use gvimdiff for graphical diff
  • Provided by vim-X11 package
Duplicating File Changes - patch
1) diff output stored in a file is called a “patchfile”
  • Use -u for “unified” diff, best in patchfiles
2) patch duplicates changes in other files (use with care!)
  • Use -b to automatically back up the changed file
diff -u foo.conf-broken foo.conf-works > foo.patch
patch -b foo.conf-broken foo.patch
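The diff and patch round trip might look like the sketch below. The file names come from the example above, but their contents are invented, and the patch step is guarded because patch is not installed on every minimal system.

```shell
# Two versions of a config file; the contents are made up for illustration.
dir=$(mktemp -d)
printf 'port=80\nhost=localhst\n' > "$dir/foo.conf-broken"
printf 'port=80\nhost=localhost\n' > "$dir/foo.conf-works"

# diff exits with status 1 when the files differ; that is not an error.
diff -u "$dir/foo.conf-broken" "$dir/foo.conf-works" > "$dir/foo.patch"
diff_status=$?

# In a unified diff, removed lines start with "-" and added lines with "+".
removed=$(grep -c '^-host=localhst' "$dir/foo.patch")
added=$(grep -c '^+host=localhost' "$dir/foo.patch")

# Apply the recorded changes; -b keeps a backup copy of the original file.
if command -v patch >/dev/null 2>&1; then
    patch -b "$dir/foo.conf-broken" "$dir/foo.patch"
    # After patching, diff finds no differences and exits with status 0.
    diff "$dir/foo.conf-broken" "$dir/foo.conf-works" && echo "files now match"
fi

rm -rf "$dir"
```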
Spell Checking with aspell
1) Interactively spell-check files:
  • aspell check letter.txt
2) Non-interactively list misspelled words in STDIN
  • aspell list < letter.txt
  • aspell list < letter.txt | wc -l
Tools for Manipulating Text - tr and sed
1) Alter (translate) characters: tr
  • Converts characters in one set to corresponding characters in another set
  • Only reads data from STDIN
        $ tr 'a-z' 'A-Z' < lowercase.txt
2) Alter strings: sed
  • stream editor
  • Performs search/replace operations on a stream of text
  • Normally does not alter source file
  • Use -i.bak to alter the source file and keep a backup with the .bak suffix
  • i flag after the final slash: case-insensitive match (GNU sed)
  • g flag after the final slash: global, replaces every match on a line
   sed 's/cat/dog/' pets
   sed 's/cat/dog/gi' pets
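A minimal sketch of tr and of the sed behaviors listed above. The file name pets comes from the examples, but its contents here are assumptions.

```shell
# Sample file in a scratch directory (contents are made up).
dir=$(mktemp -d)
printf 'my cat likes your cat\nCat food\n' > "$dir/pets"

# tr reads STDIN only; map each lowercase letter to its uppercase counterpart.
upper=$(tr 'a-z' 'A-Z' < "$dir/pets")

# Without a flag, s/// replaces only the first match on each line.
first_only=$(sed 's/cat/dog/' "$dir/pets")
# The g flag replaces every match on the line ("Cat" is untouched: wrong case).
all=$(sed 's/cat/dog/g' "$dir/pets")

# -i.bak edits the file in place and keeps the original as pets.bak.
sed -i.bak 's/cat/dog/g' "$dir/pets"
edited=$(cat "$dir/pets")
backup=$(cat "$dir/pets.bak")

printf '%s\n' "$upper" "$first_only" "$all"
rm -rf "$dir"
```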
Sed Examples
1) Quote search and replace instructions!
2) sed addresses
  • sed 's/dog/cat/g' pets
  • sed '1,50s/dog/cat/g' pets ### the replacement is performed only on lines 1 to 50
  • sed '/digby/,/duncan/s/dog/cat/g' pets ### the replacement starts on the first line that contains 'digby' and continues through the next line that contains 'duncan'
3) Multiple sed instructions
  • sed -e 's/dog/cat/' -e 's/hi/lo/' pets
  • sed -f myedits pets
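The address forms and the two ways of passing several instructions can be compared on a five-line sample. The contents of pets and myedits below are assumptions for illustration.

```shell
# Sample file in a scratch directory (contents are made up).
dir=$(mktemp -d)
cat > "$dir/pets" <<'EOF'
dog one
digby the dog
dog three
duncan the dog
dog five
EOF

# Line-number range: only lines 1 to 2 are edited.
by_line=$(sed '1,2s/dog/cat/g' "$dir/pets")
# Pattern range: from the line matching digby through the line matching duncan.
by_pattern=$(sed '/digby/,/duncan/s/dog/cat/g' "$dir/pets")

# Several instructions: -e on the command line, or one per line in a script file.
printf 's/dog/cat/\ns/one/two/\n' > "$dir/myedits"
with_e=$(sed -e 's/dog/cat/' -e 's/one/two/' "$dir/pets" | head -n 1)
with_f=$(sed -f "$dir/myedits" "$dir/pets" | head -n 1)

printf '%s\n' "$by_line" "$by_pattern" "$with_e" "$with_f"
rm -rf "$dir"
```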
Special Characters for Complex Searches - Regular Expressions
1) ^ represents beginning of line
2) $ represents end of line
3) Character classes as in bash:
  • [abc], [^abc]
  • [[:upper:]], [^[:upper:]]
4) Used by:
  • grep, sed, less, others
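The anchors and bracket expressions above can be tried with grep. The word list below is invented for the example.

```shell
# Sample word list in a scratch directory (hypothetical file).
dir=$(mktemp -d)
cat > "$dir/words.txt" <<'EOF'
cat
Cat
concat
catalog
bat
EOF

# ^ anchors the match to the start of the line.
starts=$(grep '^cat' "$dir/words.txt")
# $ anchors the match to the end of the line.
ends=$(grep 'cat$' "$dir/words.txt")
# Both anchors together match only lines that are exactly "cat".
exact=$(grep '^cat$' "$dir/words.txt")
# [bc] matches one character from the set; [^bc] would match one NOT in it.
set_match=$(grep '^[bc]at$' "$dir/words.txt")
# [[:upper:]] matches one uppercase letter.
upper=$(grep '^[[:upper:]]' "$dir/words.txt")

printf '%s\n' "$starts" "$ends" "$exact" "$set_match" "$upper"
rm -rf "$dir"
```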
End of Unit8
1) Questions and Answers
2) Summary
  • Extracting Text: cat, less, head, tail, grep, cut
  • Analyzing Text: wc, sort, uniq, diff, patch
  • Manipulating Text: tr, sed
  • Special Search Characters: ^, $, [abc], [[:alpha:]], [^[:alpha:]], etc
POSIX Character Classes
[:digit:]   Only the digits 0 to 9.
[:alnum:]   Any alphanumeric character: 0 to 9, A to Z, or a to z.
[:alpha:]   Any alphabetic character: A to Z or a to z.
[:blank:]   Space and TAB characters only.
[:xdigit:]  Hexadecimal digits: 0-9, A-F, a-f.
[:punct:]   Punctuation symbols: . , " ' ` ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
[:print:]   Any printable character, including space.
[:space:]   Any whitespace character (space, TAB, NL, FF, VT, CR). Many tools abbreviate this as \s.
[:graph:]   Any printable character except space.
[:upper:]   Any uppercase letter A to Z.
[:lower:]   Any lowercase letter a to z.
[:cntrl:]   Control characters: NUL SOH STX ETX EOT ENQ ACK BEL BS TAB NL VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS4 IS3 IS2 IS1 DEL.
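A short sketch of these classes in use with tr and grep. The sample sentence is made up. Note that tr takes the class name directly, while a regular expression needs it inside a bracket expression.

```shell
# Sample text (invented for the example).
text='Hello, World! 42 times.'

# Delete every punctuation character.
no_punct=$(printf '%s\n' "$text" | tr -d '[:punct:]')
# Map lowercase letters to uppercase using classes instead of a-z and A-Z.
shout=$(printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]')
# Keep only the digits: -c complements the set and -d deletes it.
digits=$(printf '%s' "$text" | tr -cd '[:digit:]')
# In a regular expression the class sits inside brackets: [[:digit:]].
has_digit=$(printf '%s\n' "$text" | grep -c '[[:digit:]]')

printf '%s\n' "$no_punct" "$shout" "$digits" "$has_digit"
```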