Chapter 21 (part 2): The String Library

Latest recommended article published on 2023-02-26 22:49:05

wanglang3081

Tab expansion

An empty capture like ‘()’ has a special meaning in Lua. Instead of capturing
nothing (a quite useless task), this pattern captures its position in the subject
string, as a number:

print(string.match("hello", "()ll()")) --> 3 5

(Note that the result of this example is not the same as what you get from
string.find, because the position of the second empty capture is after the
match.)

> print(string.match("Hello,bacll","()ll()")) -- the search starts at position 1 by default
3 5
> print(string.match("Hello,bacll","()ll()",5)) -- the second capture of the previous match is where the next search can start
10 12
> print(string.match("Hello,bacll","()ll()",12))
nil
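
The session above restarts the search by hand from each captured end position. As a minimal sketch of the same idea, string.gmatch can yield both position captures for every match in one loop. The helper name findAll is made up for this illustration, and it assumes the pattern passed in has no captures of its own.

```lua
-- findAll is a hypothetical helper, not part of the string library.
-- It returns a list of {start, next} pairs, one per match of pat in s.
local function findAll(s, pat)
  local result = {}
  -- the two empty captures report where the match starts and where
  -- the text after the match begins
  for first, after in string.gmatch(s, "()" .. pat .. "()") do
    result[#result + 1] = { first, after }
  end
  return result
end

for _, pair in ipairs(findAll("Hello,bacll", "ll")) do
  print(pair[1], pair[2])   --> 3 5, then 10 12
end
```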

A nice example of the use of position captures is for expanding tabs in a
string:

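a = "abc\tcde\thikj\tmnl"
print(a:gsub("()\t", "&&")) --> abc&&cde&&hikj&&mnl 3

-- Each \t has been replaced by &&. With a string replacement, gsub substitutes the whole match (here just the tab). When the replacement is a function, gsub passes it the position captured by (), which the function can use to compute the substitution: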
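local function expandTab(s, tabsize)
  tabsize = tabsize or 8   -- tab "size" (default is 8)
  local corr = 0           -- how much longer the result is than the input so far
  s = string.gsub(s, "()\t", function(tabposition)
    -- tabposition - 1 + corr is the column at which this tab lands in the result;
    -- sp is the number of spaces needed to reach the next tab stop
    local sp = tabsize - (tabposition - 1 + corr) % tabsize
    -- one tab character becomes sp spaces, so later positions shift by sp - 1
    corr = corr - 1 + sp
    return string.rep(" ", sp)   -- the version from the book
    -- (my first attempt returned string.rep(" ", tabsize), which inserts a fixed
    -- number of spaces per tab and does not line the columns up)
  end)
  return s
end

print(expandTab(a, 10)) --> abc       cde       hikj      mnl   (fields start at columns 1, 11, 21, 31)
print(a)                --> the original string, unchanged, still containing its tab characters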

21.6 Tricks of the Trade

Consider the pattern we used
to match comments in a C program: ‘/%*.-%*/’. If your program has a literal
string containing “/*”, you may get a wrong result:

test = [[char s[] = "a /* here"; /* a tricky string */]]
print(string.gsub(test, "/%*.-%*/", "<COMMENT>"))
--> char s[] = "a <COMMENT>

Strings with such contents are rare and, for your own use, that pattern will
probably do its job. But you should not distribute a program with such a flaw.

Usually, pattern matching is efficient enough for Lua programs, but you can
take precautions. You should always make the pattern as specific as possible;
loose patterns are slower than specific ones.

An extreme example is ‘(.-)%$’, to get all text in a string up to the first dollar
sign. If the subject string has a dollar sign, everything goes fine; but suppose
that the string does not contain any dollar signs. The algorithm will first try to
match the pattern starting at the first position of the string. It will go through
all the string, looking for a dollar. When the string ends, the pattern fails for the
first position of the string. Then, the algorithm will do the whole search again,
starting at the second position of the string, only to discover that the pattern
does not match there either; and so on. This will take quadratic time.

(So that is how it works: the pattern is tried from the first position; if it does not match, it is tried again from the second position, each time scanning until the pattern is found or the string ends.)

You can correct this problem simply by anchoring the pattern at the
first position of the string, with ‘^(.-)%$’. The anchor tells the algorithm to stop
the search if it cannot find a match at the first position. With the anchor, the
pattern runs in a hundredth of a second.
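
A minimal sketch of the difference, using made-up sample strings: both patterns give the same answer, and only the amount of work on a failed match differs.

```lua
local withDollar = "price: 10$ each"
local noDollar = string.rep("a", 2000)   -- no dollar sign anywhere

-- When a dollar sign exists, both patterns capture the same prefix.
print(string.match(withDollar, "(.-)%$"))    --> price: 10
print(string.match(withDollar, "^(.-)%$"))   --> price: 10

-- When there is none, both fail. The unanchored pattern retries from every
-- position; the anchored one gives up after the scan from position 1.
print(string.match(noDollar, "(.-)%$"))      --> nil
print(string.match(noDollar, "^(.-)%$"))     --> nil
```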

Beware also of empty patterns, that is, patterns that match the empty string.
For instance, if you try to match names with a pattern like ‘%a*’, you will find
names everywhere:

i, j = string.find(";$% **#$hello13", "%a*")
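print(i,j) --> 1 0
-- ‘*’ means zero or more repetitions and ‘%a’ means a letter, so the pattern matches the empty string right at position 1. For an empty match, find returns an end index one less than the start index, which is why j is 0.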
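
As a small sketch of the usual remedy, requiring at least one letter with ‘+’ makes the same search skip the punctuation and find the actual name:

```lua
local s = ";$% **#$hello13"

print(string.find(s, "%a*"))    --> 1 0    (empty match at position 1)
print(string.find(s, "%a+"))    --> 9 13   (the letters of "hello")
print(string.match(s, "%a+"))   --> hello
```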

It never makes sense to write a pattern that begins or ends with the modifier
‘-’, because it will match only the empty string. This modifier always needs
something around it to anchor its expansion (in other words, there must be a fixed item next to it).

Similarly, patterns that include ‘.*’ are tricky, because this construction can expand much more than you intended.

(‘.*’ matches any character zero or more times, so on its own it can swallow the whole rest of the string.)
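
A short illustration with a made-up subject string: the greedy ‘.*’ runs to the last ‘>’, while the lazy ‘.-’ stops at the first one.

```lua
local s = "<a> and <b>"

print(string.match(s, "<(.*)>"))   --> a> and <b
print(string.match(s, "<(.-)>"))   --> a
```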

====== I have not studied the part below yet; leaving it for later ===

Sometimes, it is useful to use Lua itself to build a pattern. We already
used this trick in our function to convert spaces to tabs. As another example,
let us see how we can find long lines in a text, say lines with more than 70
characters. Well, a long line is a sequence of 70 or more characters different
from newline. We can match a single character different from newline with
the character class ‘[^\n]’. Therefore, we can match a long line with a pattern
that repeats 70 times the pattern for one character, followed by zero or more of
these characters. Instead of writing this pattern by hand, we can create it with
string.rep:
pattern = string.rep("[^\n]", 70) .. "[^\n]*"
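
A possible way to use this pattern, sketched with a made-up three-line text; gmatch returns only the lines that are long enough:

```lua
local pattern = string.rep("[^\n]", 70) .. "[^\n]*"

-- sample text: only the middle line has 70 or more characters
local text = "short line\n" .. string.rep("x", 75) .. "\nanother short line\n"

local long = {}
for line in string.gmatch(text, pattern) do
  long[#long + 1] = line
  print(#line)   --> 75
end
```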
As another example, suppose you want to make a case-insensitive search. A
way of doing this is to change any letter x in the pattern for the class ‘[xX]’, that
is, a class including both the lower and the upper-case versions of the original
letter. We can automate this conversion with a function:
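
The function itself is not reproduced in these notes. A minimal sketch of such a conversion could look like the following; the name nocase is chosen here for illustration, and the sketch assumes the input is plain text, since a pattern that already contains escapes such as ‘%a’ would be mangled.

```lua
-- nocase: replace every letter x in s by the class [xX]
local function nocase(s)
  s = string.gsub(s, "%a", function(c)
    return "[" .. string.lower(c) .. string.upper(c) .. "]"
  end)
  return s   -- drop gsub's second result (the substitution count)
end

print(nocase("Hi there!"))                         --> [hH][iI] [tT][hH][eE][rR][eE]!
print(string.find("Say HELLO", nocase("hello")))   --> 5 9
```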

----- (read up to p. 235)

=====

21.7 Unicode

Currently, the string library does not offer any explicit support for Unicode.
Nevertheless, it is not difficult to code several useful simple tasks over Unicode
strings encoded in UTF-8 without extra libraries.

UTF-8 is the dominant encoding for Unicode on the Web. Because of its
compatibility with ASCII, UTF-8 is also the ideal encoding for Lua. That compatibility
is enough to ensure that several string-manipulation techniques that
work on ASCII strings also work on UTF-8 with no modifications.

UTF-8 represents each Unicode character using a variable number of bytes. (Every Unicode character has a corresponding UTF-8 byte sequence; it is a variable-length encoding.)

For instance, it represents ‘A’ by one byte, 65; it represents the Hebrew character
Aleph, which has code 1488 in Unicode, by the two-byte sequence 215–144.

(Aleph is the first letter of the Hebrew alphabet.)
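
The two bytes for Aleph can be checked by hand. In a two-byte sequence the first byte is 192 plus the code point divided by 64 (rounded down), and the second is 128 plus the remainder. The helper encode2 below is a hypothetical name for this arithmetic and only covers code points from 128 to 2047.

```lua
-- encode2 (illustrative helper): UTF-8 byte values of a two-byte code point
local function encode2(cp)
  assert(cp >= 128 and cp <= 2047, "not a two-byte code point")
  local lead = 192 + math.floor(cp / 64)   -- 110xxxxx: upper 5 bits
  local cont = 128 + cp % 64               -- 10xxxxxx: lower 6 bits
  return lead, cont
end

print(encode2(1488))       --> 215 144
print(string.byte("A"))    --> 65
```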

UTF-8 represents all characters in the ASCII range as in ASCII, that is, a single
byte smaller than 128.

It represents all other characters using sequences of bytes where the first byte is in the range [194; 244] and continuation bytes are in the range [128; 191]. (So a non-ASCII character is a first byte in [194; 244] followed by one or more bytes in [128; 191].)

Specifically, the range of the starting bytes for two-byte sequences is [194; 223]; for three-byte sequences it is [224; 239]; and for four-byte sequences it is [240; 244].

First I need to know how to print a number in hexadecimal:

> a=128
> print(string.format("%X",a))
80
> print(string.format("%04X",a))
0080

0-127 ASCII (1-byte encoding)

194-223 [X] (2-byte encoding)

224-239 [X][X] (3-byte encoding)

240-244 [X][X][X] (4-byte encoding)

Each [X] is a continuation byte in the range [128-191].

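UTF-8 is a variable-length encoding. So how do I know how many bytes a given character uses? The range of its first byte tells me.

For example, in the byte sequence [195][130][195][180][241][131][130][180] there are two characters encoded in 2 bytes each, followed by one character encoded in 4 bytes. From now on, when I see the hex dump of a UTF-8 string, I will know how to read it.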
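
The first-byte ranges can be turned into a small lookup. This is only a sketch: seqLength is a made-up helper name, and it does not check that the bytes after the first one really are continuation bytes.

```lua
-- seqLength (illustrative helper): number of bytes in the UTF-8 sequence
-- that starts with byte value b, or nil if b cannot start a sequence
local function seqLength(b)
  if b <= 127 then return 1 end
  if b >= 194 and b <= 223 then return 2 end
  if b >= 224 and b <= 239 then return 3 end
  if b >= 240 and b <= 244 then return 4 end
  return nil   -- continuation byte (128-191) or invalid first byte
end

local s = string.char(195, 130, 195, 180, 241, 131, 130, 180)
local i, lengths = 1, {}
while i <= #s do
  local first = string.byte(s, i)
  local n = seqLength(first)
  if not n then break end   -- stop on malformed input
  lengths[#lengths + 1] = n
  print(string.format("%02X", first), n)   --> C3 2, C3 2, F1 4
  i = i + n
end
```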

Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just
like regular strings. Literal strings can contain UTF-8 data. (Of course, you
probably will want to edit your source code as a UTF-8 file.) The concatenation
operation works correctly for UTF-8 strings. String order operators (less than,
less equal, etc.) compare UTF-8 strings following the order of the character
codes in Unicode.

The operating-system library and the I/O library are mainly interfaces to
the underlying system, so their support for UTF-8 strings depends on that
underlying system. On Linux, for instance, we can use UTF-8 for file names,
but Windows uses UTF-16. So, to manipulate Unicode file names in Windows,
we need either extra libraries or some changes in the standard Lua libraries.

Functions string.reverse, string.byte, string.char, string.upper, and
string.lower do not work for UTF-8 strings, as all of them assume that one
character is equivalent to one byte.

Functions string.format and string.rep work without problems with UTF-8
strings except for the format option '%c', which assumes that one character
is one byte. Functions string.len and string.sub work correctly with UTF-8
strings, with indices referring to byte counts (not character counts). Frequently,
this is what you need. But we can count the number of characters, too, as we
will see in a moment.
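
A few quick checks of these claims, written with decimal escapes so that the example does not depend on how the source file is encoded:

```lua
local aleph = "\215\144"   -- the two UTF-8 bytes of the Hebrew letter Aleph
local s = aleph .. "A"

print(#s)                                                --> 3 (bytes, not characters)
print(string.sub(s, 1, 2) == aleph)                      --> true
print(string.rep(aleph, 3) == aleph .. aleph .. aleph)   --> true
-- reverse swaps the bytes, so the result is no longer valid UTF-8
print(string.reverse(aleph) == "\144\215")               --> true
```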

For the pattern-matching functions, their applicability to UTF-8 strings depends
on the pattern. Literal patterns work without problems, due to the key
property of UTF-8 that the encoding of any character never appears inside the
encoding of any other character. Character classes and character sets work only
for ASCII characters. For instance, the pattern ‘%s’ works on UTF-8 strings, but
it will match only the ASCII white spaces; it will not match extra Unicode white
spaces such as a non-break space (U+00A0), a paragraph separator (U+2029),
or a Mongolian vowel separator (U+180E).

Some patterns can put the particularities of UTF-8 to good use. For instance,
if you want to count the number of characters in a string, you can use
the following expression:
#(string.gsub(s, "[\128-\191]", "")) -- relies on how UTF-8 lays out its bytes

The gsub removes the continuation bytes from the string, so that what are left
are the one-byte sequences and the starting bytes of multibyte sequences: one
byte for each character.
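
Wrapped in a function and tried on a sample word, as a sketch that assumes the input is valid UTF-8 (utf8Length is a made-up name):

```lua
local function utf8Length(s)
  -- the extra parentheses keep only gsub's first result, the stripped string
  return #(string.gsub(s, "[\128-\191]", ""))
end

local s = "r\195\169sum\195\169"   -- "résumé" written as UTF-8 bytes

print(#s)              --> 8 (bytes)
print(utf8Length(s))   --> 6 (characters)
```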

Using similar ideas, the following example shows how we can iterate over
each character in a UTF-8 string:
for c in string.gmatch(s, ".[\128-\191]*") do -- on Windows this still has problems for me; needs further checking
  print(c)
end
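
To see what the loop produces without depending on how the console displays non-ASCII text, this sketch collects the pieces into a table and prints their byte lengths instead. The sample string is made up: an Aleph between two ASCII letters.

```lua
local s = "a\215\144b"

local chars = {}
for c in string.gmatch(s, ".[\128-\191]*") do
  chars[#chars + 1] = c   -- one entry per character, whatever its byte length
end

print(#chars)   --> 3
for i, c in ipairs(chars) do
  print(i, #c)  --> 1 1, then 2 2, then 3 1
end
```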

Unfortunately, there is not much more that Lua can offer. Adequate support
for Unicode demands huge tables, which are incompatible with the small size
of Lua. Unicode has too many peculiarities. It is virtually impossible to abstract
almost any concept from specific languages. Even the concept of what is a
character is vague, because there is no one-to-one correspondence between Unicode
coded characters and graphemes (e.g., characters with diacritical marks
and “completely ignorable” characters). Other apparently basic concepts, such
as what is a letter, also change across different languages.

What is missing most in Lua, in my opinion, are functions to convert between
UTF-8 sequences and Unicode code points and to check the validity of UTF-8
strings. Probably the next version of Lua will include them. For anything
fancier, the best approach seems to be an external library, such as the slnunicode
library.