c语言%1f u001f,\ u0003和\ u00ff之间的Unicode字符(Unicode characters between \u0003 and \u00ff)...

最新推荐文章于 2024-02-07 08:04:05 发布

殷勤捧玉钟

最新推荐文章于 2024-02-07 08:04:05 发布

阅读量1.9k

点赞数

文章标签： c语言%1f u001f

博客探讨了Java代码中检查Unicode字符在'u0003'和'u00ff'之间的情况。'u0003'表示END OF TEXT，'u00ff'是LATIN SMALL LETTER Y WITH DIAERESIS。内容涉及这两个Unicode点之间可能包含的字符类型，如拉丁字符、控制字符和其他用于西欧语言的字符。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

\ u0003和\ u00ff之间的Unicode字符(Unicode characters between \u0003 and \u00ff)

我有一段Java代码检查它是在两个unicode字符之间：

LA(2) >= '\u0003' && LA(2) <= '\u00ff'

据我所知， \u0003代表END OF TEXT而且\u00ff是LATIN SMALL LETTER Y WITH DIAERESIS ，但这些点之间有什么关系？ (检查LA(2)是什么？)

例如，它是所有拉丁字符，数字字符，还是带重音符号的字符，所有ascii字符或其他内容？

I have a piece of Java code that is checking it is between two unicode characters:

LA(2) >= '\u0003' && LA(2) <= '\u00ff'

I understand that \u0003 represents END OF TEXT and \u00ff is LATIN SMALL LETTER Y WITH DIAERESIS, but what lies between these points? (what is it checking that LA(2) is?)

e.g. is it all Latin characters, or number characters, or characters with accents, all ascii characters, or something else?

原文：https://stackoverflow.com/questions/6816121

更新时间：2020-01-27 10:01

最满意答案

它是拉丁语1减去代码点U + 0000，U + 0001和U + 0002。这包括可以在美国键盘上找到的常用内容，大量的控制字符(低于U + 0020以及U + 007F和U + 009F之间)和一些其他拉丁字符，可以用来写大部分西欧语言。

It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard, plenty of control characters (below U+0020 and between U+007F and U+009F) and a few other Latin characters that can be used to write the majority of Western European languages.

2011-07-25

相关问答

使用raw_unicode_escape ： text = 'Cze\u00c5\u009b\u00c4\u0087'

text_bytes = text.encode('raw_unicode_escape')

print(text_bytes.decode('utf8')) # outputs Cześć

Use raw_unicode_escape: text = 'Cze\u00c5\u009b\u00c4\u0087'

text_bytes = text.encode('raw_uni

...

>>> u"ÿ".encode('raw-unicode-escape')

'\xff'

r"\u%04X" % ord(u"ÿ")

This did the trick for me. It returns a string object ('\\u00FF') which I can use to make a string compare. It fails for unicode characters above U+FFFF but this is not necessary in

...

尝试这个： $str = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {

return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');

}, $str);

如果是基于UTF-16的C / C ++ / Java / Json样式： $str = preg_replace_callback('/\\\\u([0-9a-fA

...

考虑： >>> print '\u2022'

\u2022

>>> print len('\u2022')

>>> import unicodedata

>>> map(unicodedata.name, '\u2022'.decode('ascii'))

['REVERSE SOLIDUS', 'LATIN SMALL LETTER U', 'DIGIT TWO', 'DIGIT ZERO', 'DIGIT TWO', 'DIGIT TWO']

>>>

VS： >>> print u'\

...

Eclipse默认使用平台默认编码(在Windows中为cp1252)在保存基于文本的文件期间以及在写入标准输出流(由System.out )期间解码字符。您需要明确地将其设置为UTF-8才能实现统治世界。请注意，这样您也不需要再使用那些\uXXXX Unicode转义来表示基于文本的源文件中的这些字符。引发这些问号是因为输出流使用的字符集不支持输入流中指定的字符。也可以看看： Unicode - 如何使角色正确？ Eclipse uses by default platform defa

...

很奇怪......根据Windows上的人物地图，我会说“À到ÿ” 这些是A，C，E，I，D，N，O，U，Y，德国夏普的一些变化(重音，cedillas)...... weird... according to the character map on Windows I'd say "À to ÿ" Those are some variations (accents, cedillas) on A, C, E, I, D, N, O, U, Y, the german Sharp s, ...

...

ISO-8859-1按照设计匹配最接近的第一个Unicode代码点。 ISO-8859-1 matches the first Unicode code points the closest, by design.

为了处理Unicode字符，Vim必须使用能够表示这些字符的'encoding' 。如果值为latin1 ，则无法对所提到的字符进行编码(此8位编码仅包括ASCII和几个西欧字符，请参见此处 )。所以，你需要 :set encoding=utf-8

这样，任何新创建的文件都将使用该编码，您应该能够插入Unicode字符并将其写入(也可以使用其他Unicode文件编码，例如:w ++enc=ucs-2le ;但如果您尝试保留为:w ++enc=latin1 ，你会得到一个CONVERSION

...

好的，所以你有一个任意的字符串(事实上它包含不可打印的字符是无关紧要的)，你想用UTF-8将它转换成一个字节数组。这很容易：) byte[] bytes = Encoding.UTF8.GetBytes(text);

或者要写StreamWriter ，通常将其包装在StreamWriter ： // Note that due to the using statement, this will close the stream at the end

// of the block

usin

...

它是拉丁语1减去代码点U + 0000，U + 0001和U + 0002。这包括可以在美国键盘上找到的常用内容，大量的控制字符(低于U + 0020以及U + 007F和U + 009F之间)和一些其他拉丁字符，可以用来写大部分西欧语言。 It's Latin 1 minus the code points U+0000, U+0001 and U+0002. This includes the usual stuff that can be found on the US keyboard,

...