RH033 Unit8 Text Processing Tools

Latest recommended article published 2009-09-21 16:17:03

jackiechen_vip

Category: Linux. Tags: tools, processing, character, file, whitespace, sorting

Article link: https://blog.csdn.net/jackiechen_vip/article/details/4685114

Objectives

1) Upon completion of this unit, you should be able to:

- Use tools for extracting, analyzing and manipulating text data

Tools for Extracting Text

1) File contents: less and cat

2) File Excerpts: head and tail

3) Extract by Column: cut

4) Extract by Keyword: grep

Viewing File Contents - less and cat

1) cat: dump one or more files to STDOUT

- Multiple files are concatenated together

2) less: view a file or STDIN one page at a time

- Useful commands while viewing:

/text searches for text
n/N jumps to the next/previous match
v opens the file in a text editor

- less is the pager used by man
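As a small illustration of concatenation, the sketch below builds two throwaway files and dumps both with a single cat call. The file names part1.txt and part2.txt are made up for this example and are not part of the course material.

```shell
# Scratch directory so nothing on the real system is touched.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/part1.txt"
printf 'gamma\n' > "$workdir/part2.txt"

# cat writes the files to STDOUT in the order given, back to back.
combined=$(cat "$workdir/part1.txt" "$workdir/part2.txt")
printf '%s\n' "$combined"
# Prints:
# alpha
# beta
# gamma
```

Piping the same output into less (for example, cat file1 file2 | less) would page it instead of dumping it all at once.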

Viewing File Excerpts - head and tail

1) head: Display the first 10 lines of a file

- Use -n to change the number of lines displayed

2) tail: Display the last 10 lines of a file

- Use -n to change the number of lines displayed

- Use -f to follow subsequent additions to the file

Very useful for monitoring the log files!
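A minimal sketch of the options above, assuming seq is available to generate sample data. The file numbers.txt is a hypothetical name, and the log path in the final comment is only a typical location, not something the notes specify.

```shell
workdir=$(mktemp -d)
# seq prints 1..20, one number per line (sample data for this sketch).
seq 1 20 > "$workdir/numbers.txt"

first3=$(head -n 3 "$workdir/numbers.txt")    # lines 1, 2 and 3
last2=$(tail -n 2 "$workdir/numbers.txt")     # lines 19 and 20
# With no -n, head shows 10 lines; tr strips padding some wc versions add.
default_count=$(head "$workdir/numbers.txt" | wc -l | tr -d ' ')

printf '%s\n' "$first3" "$last2" "$default_count"

# tail -f keeps running until interrupted (Ctrl-C), so it is shown as a
# comment only:
#   tail -f /var/log/messages
```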

Extracting Text by keyword - grep

1) Prints lines of files or STDIN where a pattern is matched

grep 'john' /etc/passwd
date --help | grep year

2) Use -i to search case-insensitively

3) Use -n to print line numbers of matches

4) Use -v to print lines not containing the pattern

5) Use -Ax to include the x lines after each match

6) Use -Bx to include the x lines before each match

7) Use -l to print only the names of the files that contain the pattern
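The options above can be tried on a small passwd-style file. This is a sketch under stated assumptions: users.txt and notes.txt, and the accounts in them, are invented for the example so that the results are predictable.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > users.txt <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
JOHN:x:501:501::/home/jdoe:/bin/tcsh
mary:x:502:502::/home/mary:/bin/bash
EOF
printf 'nothing relevant here\n' > notes.txt

plain=$(grep 'john' users.txt)        # only the lowercase john line
nocase=$(grep -i 'john' users.txt)    # the john and JOHN lines
numbered=$(grep -n 'mary' users.txt)  # match prefixed with its line number
inverse=$(grep -v 'bash' users.txt)   # lines that do NOT contain bash
after=$(grep -A1 'root' users.txt)    # the root line plus 1 line after it
before=$(grep -B1 'mary' users.txt)   # the mary line plus 1 line before it
files=$(grep -l 'root' *.txt)         # names of files containing root

printf '%s\n' "$numbered" "$inverse" "$files"
# Prints:
# 4:mary:x:502:502::/home/mary:/bin/bash
# JOHN:x:501:501::/home/jdoe:/bin/tcsh
# users.txt
```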

Extracting Text by Column - cut

1) Displays specified columns of a file or STDIN data

cut -d: -f1 /etc/passwd
grep root /etc/passwd | cut -d: -f7

2) Use -d to specify the column delimiter (default is TAB)

3) Use -f to specify the columns to print

4) Use -c to cut by characters

cut -c2-5 /usr/share/dict/words
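To see what each cut invocation returns without depending on the local /etc/passwd, the following sketch uses a two-line sample file. The name accounts.txt and its contents are assumptions made for the example.

```shell
workdir=$(mktemp -d)
cat > "$workdir/accounts.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
mary:x:502:502::/home/mary:/bin/tcsh
EOF

# Field 1 of every line, using ':' as the delimiter.
names=$(cut -d: -f1 "$workdir/accounts.txt")
# Field 7 (the login shell) of the line that starts with root.
root_shell=$(grep '^root' "$workdir/accounts.txt" | cut -d: -f7)
# Characters 2 through 5 of each input line.
chars=$(printf 'abcdefgh\n' | cut -c2-5)

printf '%s\n' "$names" "$root_shell" "$chars"
# Prints:
# root
# mary
# /bin/bash
# bcde
```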

Tools for Analyzing Text

1) Text Stats: wc

2) Sorting text: sort

3) Comparing files: diff and patch

4) Spell check: aspell

Gathering Text Statistics - wc (word count)

1) Counts words, lines, bytes and characters

2) Can act upon a file or STDIN

3) Use -l for only line count

4) Use -w for only word count

5) Use -c for only byte count

6) Use -m for character count (not displayed by default)
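A short sketch of the counters, using a hypothetical sample.txt that contains only ASCII text (so the byte and character counts happen to be equal). Reading from STDIN with < keeps the file name out of the output.

```shell
workdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$workdir/sample.txt"

# tr -d ' ' strips the padding that some wc implementations add.
lines=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$workdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample.txt" | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
# Prints: lines=2 words=5 bytes=24
```

For text with multibyte characters, wc -m and wc -c can report different numbers, depending on the locale.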

Sorting Text - sort

1) Sort text to STDOUT – original file unchanged

sort [options] file(s)

2) Common options

-r performs a reverse (descending) sort
-n performs a numeric sort
-f ignores (folds) case of characters in strings
-u (unique) removes duplicate lines in output
-t c uses c as a field separator
-k x sorts by c-delimited field x, can be used multiple times

sort -t : -k 3 -n /etc/passwd
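The difference between a numeric and a plain sort on the UID field is easiest to see on a tiny file. The sketch below assumes a made-up accounts.txt; cut is used only to shorten the output to the user names.

```shell
workdir=$(mktemp -d)
cat > "$workdir/accounts.txt" <<'EOF'
mary:x:502:502::/home/mary:/bin/bash
root:x:0:0:root:/root:/bin/bash
john:x:90:90::/home/john:/bin/bash
EOF

# Numeric sort on field 3 (the UID): 0, 90, 502.
by_uid=$(sort -t : -k 3 -n "$workdir/accounts.txt" | cut -d: -f1)
# Without -n the key is compared as text, so "502" sorts before "90".
as_text=$(sort -t : -k 3 "$workdir/accounts.txt" | cut -d: -f1)
# Adding -r reverses the numeric order: 502, 90, 0.
descending=$(sort -t : -k 3 -n -r "$workdir/accounts.txt" | cut -d: -f1)

printf '%s\n' "$by_uid"
# Prints:
# root
# john
# mary
```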

Eliminating Duplicate Lines – sort and uniq

1) sort -u: removes duplicate lines from input

2) uniq: removes duplicate adjacent lines from input

Use -c to count the number of occurrences
Use with sort for best effect:

sort userlist.txt | uniq -c
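Because uniq only collapses adjacent duplicates, it does nothing useful on unsorted input. The sketch below shows this with an invented userlist.txt; the sed call merely strips the leading padding that uniq -c puts before each count, since its width varies between implementations.

```shell
workdir=$(mktemp -d)
printf 'alice\nbob\nalice\ncarol\nbob\nalice\n' > "$workdir/userlist.txt"

# No two equal lines are adjacent, so uniq alone removes nothing.
unsorted=$(uniq "$workdir/userlist.txt" | wc -l | tr -d ' ')
# sort -u sorts and removes duplicates in one step.
deduped=$(sort -u "$workdir/userlist.txt")
# Sorting first makes duplicates adjacent, so uniq -c can count them.
counts=$(sort "$workdir/userlist.txt" | uniq -c | sed 's/^ *//')

printf '%s\n' "$unsorted" "$counts"
# Prints:
# 6
# 3 alice
# 2 bob
# 1 carol
```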

Comparing Files – diff

1) Compares two files for differences

2) Use gvimdiff for graphical diff

Provided by vim-X11 package

Duplicating File Changes – patch

1) diff output stored in a file is called a “patchfile”

Use -u for "unified" diff, best in patchfiles

2) patch duplicates changes in other files (use with care!)

Use -b to automatically back up the changed file

diff -u foo.conf-broken foo.conf-works > foo.patch

patch -b foo.conf-broken foo.patch
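The two commands above can be followed end to end in a scratch directory. This is only a sketch: the contents of the two foo.conf files are invented, and the name of the backup file mentioned in the comment is the usual default rather than something the notes state.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'port=80\nhost=localhst\n' > foo.conf-broken
printf 'port=80\nhost=localhost\n' > foo.conf-works

# diff exits with status 1 when the files differ; that is expected here.
diff -u foo.conf-broken foo.conf-works > foo.patch || true

# The patchfile records the removed (-) and added (+) lines.
cat foo.patch

# Apply the recorded change to the broken file. With -b the original is
# kept as a backup (typically foo.conf-broken.orig).
patch -b foo.conf-broken foo.patch

# After patching, the two files should be identical.
cmp foo.conf-broken foo.conf-works && echo "files now match"
```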

Spell Checking with aspell

1) Interactively spell-check files:

aspell check letter.txt

2) Non-interactively list misspelled words in STDIN

aspell list < letter.txt
aspell list < letter.txt | wc -l

Tools for Manipulating Text – tr and sed

1) Alter (translate) Characters: tr

Converts characters in one set to corresponding characters in another set
Only reads data from STDIN

$ tr 'a-z' 'A-Z' < lowercase.txt

2) Alter Strings: sed

stream editor
Performs search/replace operations on a stream of text
Normally does not alter source file
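Use -i.bak to back up and alter the source file
A trailing i flag on the s command makes the match case-insensitive (GNU sed)
A trailing g flag (global) replaces every match on a line, not only the first

sed 's/cat/dog/' pets

sed 's/cat/dog/gi' pets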
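The following sketch runs tr and the sed variants above on a two-line sample so the effect of each flag is visible. The contents of pets are made up for the example, and the i flag is assumed to be available (it is a GNU sed extension).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'my cat chased another cat\nCat naps\n' > pets

# tr reads STDIN only, so the file is redirected in.
upper=$(tr 'a-z' 'A-Z' < pets)

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/cat/dog/' pets)
# With g and i, every match is replaced regardless of case (GNU sed).
all_nocase=$(sed 's/cat/dog/gi' pets)

printf '%s\n' "$first_only" "$all_nocase"
# Prints:
# my dog chased another cat
# Cat naps
# my dog chased another dog
# dog naps

# -i.bak edits pets in place and keeps the original as pets.bak.
sed -i.bak 's/cat/dog/g' pets
```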

Sed Examples

1) Quote search and replace instructions!

2) sed addresses

sed 's/dog/cat/g' pets
sed '1,50s/dog/cat/g' pets ### the replacement is performed only on lines 1 to 50
sed '/digby/,/duncan/s/dog/cat/g' pets ### the replacement starts on the first line that contains 'digby' and continues through the next line that contains 'duncan'

3) Multiple sed instructions

sed -e 's/dog/cat/' -e 's/hi/lo/' pets
sed -f myedits pets
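To make the address forms concrete, this sketch applies them to a five-line file. The file contents and the second substitution (s/is/was/) are invented for the example; myedits is simply a file holding one sed instruction per line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > pets <<'EOF'
rex is a dog
digby is a dog
fido is a dog
duncan is a dog
spot is a dog
EOF

# Line-number address: only lines 1 and 2 are changed.
by_line=$(sed '1,2s/dog/cat/g' pets)
# Pattern address: from the digby line through the duncan line (lines 2-4).
by_pattern=$(sed '/digby/,/duncan/s/dog/cat/g' pets)

# Two instructions given with -e, then the same two read from a file.
multi=$(sed -e 's/dog/cat/' -e 's/is/was/' pets)
printf 's/dog/cat/\ns/is/was/\n' > myedits
from_file=$(sed -f myedits pets)

printf '%s\n' "$by_pattern"
# Prints:
# rex is a dog
# digby is a cat
# fido is a cat
# duncan is a cat
# spot is a dog
```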

Special Characters for Complex Searches - Regular Expressions

1) ^ represents beginning of line

2) $ represents end of line

3) Character classes as in bash:

[abc], [^abc]
[[:upper:]], [^[:upper:]]

4) Used by:

grep, sed, less, others
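A brief sketch of the anchors and bracket expressions with grep. The file shells.txt and its three lines are assumptions chosen so that each pattern matches a different subset; -c makes grep print the number of matching lines instead of the lines themselves.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > shells.txt <<'EOF'
root:x:0:0:root:/root:/bin/bash
bashful:x:600:600::/home/bashful:/bin/tcsh
Daemon:x:2:2::/sbin:/sbin/nologin
EOF

starts=$(grep -c '^root' shells.txt)             # lines beginning with root
ends=$(grep -c 'bash$' shells.txt)               # lines ending with bash
anywhere=$(grep -c 'bash' shells.txt)            # bash anywhere on the line
set_match=$(grep -c '^[rb]' shells.txt)          # first character is r or b
upper_first=$(grep -c '^[[:upper:]]' shells.txt) # first character is uppercase
not_upper=$(grep -c '^[^[:upper:]]' shells.txt)  # first character is not uppercase

printf '%s %s %s %s %s %s\n' "$starts" "$ends" "$anywhere" "$set_match" "$upper_first" "$not_upper"
# Prints: 1 1 2 2 1 2
```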

End of Unit8

1) Questions and Answers

2) Summary

Extracting Text: cat, less, head, tail, grep, cut
Analyzing Text: wc, sort, uniq, diff, patch
Manipulating Text: tr, sed
Special Search Characters: ^, $, [abc], [[:alpha:]], [^[:alpha:]], etc

[:digit:]
Only the digits 0 to 9

[:alnum:]
Any alphanumeric character 0 to 9 OR A to Z or a to z.

[:alpha:]
Any alpha character A to Z or a to z.

[:blank:]
Space and TAB characters only.

[:xdigit:]
Hexadecimal notation 0-9, A-F, a-f.

[:punct:]
Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ ` { } | ~

[:print:]
Any printable character.

[:space:]
Any whitespace character (space, tab, NL, FF, VT, CR). Many systems abbreviate this as \s.

[:graph:]
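Any visible character: printable characters excluding whitespace (SPACE, TAB). Note that \W in most regex dialects means a non-word character, which is a different set.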

[:upper:]
Any alpha character A to Z.

[:lower:]
Any alpha character a to z.

[:cntrl:]
Control characters: NUL SOH STX ETX EOT ENQ ACK BEL BS TAB NL (LF) VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.
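The class names listed above can be used with tr as well as inside bracket expressions in grep and sed. The sketch below is a small illustration on an invented sample string; in tr the class is written as '[:digit:]', while in a regular expression it sits inside a bracket expression, as in '[[:alnum:]]'.

```shell
sample='User42: Hello, World!'

# -c complements the set and -d deletes, so only digits survive.
digits=$(printf '%s\n' "$sample" | tr -cd '[:digit:]')
# Delete every punctuation character.
no_punct=$(printf '%s\n' "$sample" | tr -d '[:punct:]')
# Translate lowercase letters to uppercase.
upper=$(printf '%s\n' "$sample" | tr '[:lower:]' '[:upper:]')
# Keep only lines made up entirely of alphanumeric characters.
alnum_only=$(printf 'abc123\nfoo bar\n' | grep '^[[:alnum:]]*$')

printf '%s\n' "$digits" "$no_punct" "$upper" "$alnum_only"
# Prints:
# 42
# User42 Hello World
# USER42: HELLO, WORLD!
# abc123
```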
